stephenpanaro.com

In Pursuit of Fast KV-Cached Attention for Apple Neural Engine

2024-10-10T14:05:00Z

Building a memory-friendly KV Cache with static shapes

No one wants a slow LLM. Most LLMs run on GPUs and most methods to make them fast are tailored specifically to GPUs.

LLMs can also run on Apple Neural Engine (ANE), Apple's efficient ML processor that comes in every new iPhone and Mac. Existing GPU optimizations do not easily translate to the Neural Engine which means you end up leaving speed on the table.

Today we'll unlock some speed by adapting a popular optimization known as KV caching to the Neural Engine.

Credit to Apple: this approach is largely the same as one they use in their own models. I've written about it before, but we'll dive a bit deeper today and add some improvements.

Attention Crash Course

To follow along it's helpful to understand the basic mechanics of a transformer LLM.

tl;dr The number of tokens and the cache size determine the number of rows in 3 matrices (Q, K, V). We multiply them all together to compute attention and let the LLM predict a new token.

If this is foreign to you, keep reading. Otherwise skip ahead.

An LLM processes a sequence of tokens (word chunks) and predicts what the next token will be. The process is known as a "forward pass", "LLM call" or "prediction". Repeated forward passes build up the words, sentences, and paragraphs of the LLM's response.

_{This LLM predicts that "Fa" follows "Mi".}

Attention is a series of matrix multiplications that happens during the forward pass. It helps the model do a good job at predicting the next token.

There are three matrices involved in attention: Q, K and V. They all have the same number of columns which is not particularly interesting today. The number of rows for each is determined by the number of tokens we're processing.

Q has one row for each token where we want the LLM to make a next-token prediction. K has one row for each token we want the LLM to consider during its predictions. V is the same as K.

The simplest case is we input some fixed number of tokens into the LLM: "Do re mi". This is 3 tokens so Q, K, and V will all have 3 rows.

Q's 3 rows mean the LLM will predict 3 new tokens: what comes after "Do", and "re", and "mi". We already know that "re" comes after "Do", and "mi" comes after "re", so we'll ignore those predictions but the prediction for what comes after "mi" is new so we'll keep that.

_{The LLM predicts 3 tokens here, but typically we ignore all but the last.}

K and V's 3 rows mean the LLM will consider all 3 tokens when predicting what comes next. So the LLM will make a prediction for what comes after "mi" based on "Do", and "re", and "mi", and their positions relative to each other.

_{You usually wouldn't let an old token like "Do" look at new ones like "re", but it is technically possible.}

A more interesting case is where the LLM takes a smaller number of tokens as input, and also some K and V matrices that were computed in a prior forward pass. Following a similar example: the input token is now "fa" and we also pass along a partial K and V, each with three rows that correspond to "Do", "re", and "mi".

Q will now have 1 row, from "fa", and the LLM will only predict a new token to follow "fa".

K and V will have not 1 but 4 rows! The 3 for "Do", "re", "mi" that were passed in plus one new row that the LLM generates for "fa". This allows the LLM to make a well-informed prediction since it can still look at all 4 rows of K and V to see what came before "fa". Importantly, it produces exactly the same results as passing all 4 tokens as inputs to the LLM.

_{V is also made up of 3 re-used rows, just like K.}

This process of reusing K and V is the KV caching that we want to implement today.

Now that we know how the shape of these matrices corresponds to our input tokens, we can touch on the actual computation for attention.

First we multiply Q by K. We have to transpose K (swap its rows and columns) for the matrix multiplication to work.

Next we take the result of this multiplication (with Q's number of rows and K's number of rows as columns) and apply a function called softmax. This doesn't change the matrix's shape. This matrix is multiplied by V in the second matrix multiplication which gives us a final matrix that has the same shape as Q originally.

This final matrix then proceeds on through the rest of the LLM. There is more to attention, but this should be enough to follow along below. (If not, let me know on Twitter.)
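To make the shapes concrete, here is a minimal pure-Python sketch of the two multiplications and of the "fa" example from above. It is a toy under stated assumptions: the values in `q_all`, `k_all` and `v_all` are made up, the matrices have only 2 columns, and the scaling and masking that a real model applies are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply softmax to each row; the shape is unchanged."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    scores = matmul(q, transpose(k))  # Q's rows by K's rows
    weights = softmax_rows(scores)    # same shape as scores
    return matmul(weights, v)         # same shape as Q

# Made-up rows for "Do", "re", "mi", "fa".
q_all = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2], [0.4, 0.3]]
k_all = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
v_all = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Cached call: Q has 1 row ("fa"); K and V are 3 reused rows plus 1 new row.
cached_out = attention(q_all[3:], k_all[:3] + k_all[3:], v_all[:3] + v_all[3:])

# Uncached call: all 4 tokens are inputs, and we keep only the last row.
full_out = attention(q_all, k_all, v_all)
```

The single row in `cached_out` is the same as the last row of `full_out`, which is the property that makes caching safe.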

Neural Engine Constraints

It is often convenient to vary the internal workings of an LLM on the fly. The Neural Engine does not allow this.

For a model to run on the ANE it must have input, output, and intermediate matrices that all have static shapes. They cannot change between calls to the model. The computation graph of the model must also be static. This means no conditional branching even if the intermediate tensors have the same shapes.

Both of these constraints can be slightly relaxed in some circumstances but we will stick with the rigid definition for simplicity.

Static Shaped Attention

We need to pick static sizes for the matrices Q, K, and V. The number of columns is predetermined and constant, so we only need to choose the number of rows. Let's start by focusing on just Q and K, the first attention multiplication, for simplicity.

We'll give K 512 rows. This means the LLM can look back at 512 recent tokens (word chunks) at most in order to predict the next token. This is usable and we can scale it up if needed.

Picking a size for Q is more interesting. The size of Q is equal to the number of input tokens. This size determines how many tokens we can add at once to K for future predictions (typically >1) and how many new tokens we want to predict (typically 1).

These correspond to the two stages of KV-cached LLM processing. Pre-fill: when the LLM ingests your prompt and builds up a cache. Generation: when the LLM responds.

Since we are restricted to static sizes we need to pick a Q that works for both pre-fill and generation. This means that a call to an ANE LLM always processes the same number of tokens and always takes the same amount of time, regardless of processing stage.

If we pick a small size for Q, generation will be fast but pre-fill will be slow since it has to make many calls to process every word in your prompt. But a big size for Q means that generation does a lot of wasted work. We only care about one new token each time but have to multiply a big Q times K.

_{Neither of these is ideal.}

The extremes are no good, so we'll split the difference and give Q 64 rows. This means we can process 64 tokens in each forward pass. It will take at most 8 calls to process a full 512 token prompt (8*64=512). These 8 calls take the same amount of time as the first 8 tokens in the generation phase which seems like a reasonable balance. 64 is also a multiple of 8, which aligns with the ANE hardware.

_{The Goldilocks zone.}

If you are planning to process longer prompts and generate fewer tokens, you might consider a larger Q. Similarly if your prompts will frequently be shorter, a smaller Q will buy some speed. Either way, be sure to benchmark. Performance is often non-linear.
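As a rough way to reason about the trade-off before benchmarking, the call counts can be sketched like this. The function names are hypothetical, and this only counts forward passes and unused rows; it says nothing about how long a pass actually takes on the ANE.

```python
import math

def prefill_calls(prompt_tokens, q_rows):
    """Forward passes needed to ingest a prompt, q_rows tokens at a time."""
    return math.ceil(prompt_tokens / q_rows)

def wasted_rows_per_generation_call(q_rows):
    """Rows of Q computed during generation beyond the one token we keep."""
    return q_rows - 1

# Compare a tiny, a middle, and a huge Q for a full 512 token prompt.
for q_rows in (1, 64, 512):
    print(q_rows, prefill_calls(512, q_rows), wasted_rows_per_generation_call(q_rows))
```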

64 might still seem like a lot compared to the single new token we care about. Outside of attention we can use a different trick (reshaping from 1x64 to 8x8) to make more efficient use of the Neural Engine. This helps close the gap.

Fast Sliding Cache

To make pre-fill work, we need to be able to process 64 new tokens at a time. This means we need to compute from scratch the entire Q and also the newest 64 tokens of K on each pass through the LLM. Lucky for us, we don't need to compute the trailing 448 (512-64) tokens of K—we can reuse ones that were previously computed. These reused tokens are the K in "KV cache" and not computing them saves a whole lot of computation and time.

Typically a KV cache is implemented statefully: one long-lived K matrix that continually appends the newest token's entries each time the LLM is called.

This is a no go on ANE, so instead we use a sliding window approach. Each pass through the LLM takes the 64 new tokens and concatenates the 448 next-newest tokens to get the full 512 length K matrix.

These 448 next-newest token K matrices are passed as inputs to the LLM. This means the LLM only needs to do a single concatenation to get the full 512 K. Memory operations, like concat, are slow and only doing 1 is close to the minimum (of zero!).

_{Only 1 concat to get K!}

We do need to actually slide K though, so we return the 64 new K tokens from the model and use a secondary model to combine the old 448 and new 64 into an updated 448 K input.

We have to do this in between every LLM call during pre-fill since we want all 64 tokens to go into the cache immediately. But we only have to do it once every 64 LLM calls during generation: we can reuse the same 448 K until we have a full 64 new tokens.

_{Secondary model to update the cache every 64 tokens. The oldest 64 tokens are discarded.}

The secondary cache sliding model lets us use the ANE when it would otherwise be idle. This is actually faster than using a single model even during pre-fill. It's significantly faster during generation.
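Here is a small sketch of the two roles, assuming plain Python lists in place of real tensors. The rows are stand-in integers (token positions) rather than vectors, and `full_k` and `slide` are hypothetical names for what the main model and the secondary model each do.

```python
CACHE_ROWS = 448                    # rows passed into the main model as an input
NEW_ROWS = 64                       # rows computed fresh on each call
FULL_ROWS = CACHE_ROWS + NEW_ROWS   # 512

def full_k(cache, new):
    """Main model: a single concat builds the full K used by attention."""
    assert len(cache) == CACHE_ROWS and len(new) == NEW_ROWS
    return cache + new

def slide(cache, new):
    """Secondary model: drop the oldest 64 rows and append the newest 64."""
    return (cache + new)[NEW_ROWS:]

cache = list(range(0, 448))    # positions 0..447, oldest first
new = list(range(448, 512))    # positions 448..511

k = full_k(cache, new)
next_cache = slide(cache, new)
```

After the slide, `next_cache` holds positions 64 through 511, so the next call sees the most recent 448 tokens.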

If we really hate the idea of using two models, we can make our single model return a pre-slid K matrix. This works but is slow. You have to construct and return many K matrices that you don't actually need and since these are big matrices it takes time just to shuffle them around inside the model (remember, concat is slow).

This leaves us with a nicely optimized sliding K cache:

We return only the 64 new tokens of K. This is the minimum we can return since we need all 64 during pre-fill.
We only perform one K concat during attention. Concat is slow so less is good.
We slide our K cache when the ANE is otherwise idle. We only do this once per 64 tokens during generation.

Avoiding Memory Operations

We've minimized how often we concat, but that's not the only memory operation in attention. Transposing a matrix (flipping it diagonally) is slow too and we have to transpose K in order to multiply it with Q.

We can lean on our K cache here to minimize this memory movement. Instead of waiting until we have the full 512 length K in hand, we can transpose just the new 64 length K which is smaller and transposes faster. Only then do we concat it with the 448 length K, which comes into the model already transposed, to get our full 512 K.

To make this work, we output the transposed 64 K and update our secondary model to work with transposed Ks.

This is basically a free speed up.
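The reason this works is that transposing and then concatenating along columns gives the same matrix as concatenating along rows and then transposing. A tiny sketch with made-up numbers (3 cached rows and 2 new rows standing in for 448 and 64):

```python
def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def concat_rows(a, b):
    """Stack b's rows underneath a's rows."""
    return a + b

def concat_cols(a, b):
    """Append b's columns to the right of a's columns."""
    return [row_a + row_b for row_a, row_b in zip(a, b)]

k_cache = [[1, 2], [3, 4], [5, 6]]
k_new = [[7, 8], [9, 10]]

# Slow path: build the full K, then transpose the whole thing.
slow = transpose(concat_rows(k_cache, k_new))

# Fast path: the cache arrives already transposed, so only the small new K is transposed.
k_cache_t = transpose(k_cache)  # in the real setup this would be a model input
fast = concat_cols(k_cache_t, transpose(k_new))
```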

Up to this point we have been talking about a single K cache matrix. A real transformer model has many K caches. For instance Llama 7B has 32. This is a lot of matrices to juggle so it is common to see the KV cache stored as a single tensor that contains all of them. On ANE this requires several concatenations that would be nice to avoid. To do so we take in and return each K cache individually. The extra bookkeeping is straightforward and worth it.

The Rest of Attention

The second matrix multiplication is much less interesting than the first but it is important so let's touch on it briefly. Our goal is to multiply the result of Q*K, called W, with V.

V is the same size as K so we can use the same sliding window cache approach. We don't need to do the transpose trick with V because of how the matrix shapes work out.

For convenience we can make our secondary model process both a K and a V at the same time.

That's all there is to it. You now have all the pieces of a static shaped KV cache attention that works on Apple Neural Engine.

_{The input/output widths are to scale, but the KV cache is much much deeper.}

You should see a non-trivial speed up compared to a cache-less model that processes the same number of tokens. For example I have a Llama 2 7B model that saw approximately a 4x speedup.

A (Slow) Single-Model Approach

I also want to touch on a couple things that didn't quite pan out. I'm hopeful there are opportunities to improve and maybe these will give someone an idea.

The purist in me hates using two models to juggle the KV cache. We can avoid it.

Instead of taking the 448 length K cache as an input to the model, we can take in the 7 separate 64 length chunks that make it up. Sliding our cache then just becomes a matter of removing the oldest chunk and adding the newest one.

_{Removing the old chunk and adding the new chunk is zero-cost. But the concat is slow.}

This completely eliminates the need for a second model, but it means we have to concatenate all 8 chunks to get the full K.
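A sketch of the bookkeeping, assuming the chunks live outside the model in a Python `deque` (my choice for illustration, not something the approach requires). Each chunk is a list of 64 stand-in rows.

```python
from collections import deque

CHUNK_ROWS = 64
NUM_CACHED_CHUNKS = 7  # 7 * 64 = 448

# Chunk i is filled with the integer i so we can see which chunks survive.
chunks = deque(
    ([i] * CHUNK_ROWS for i in range(NUM_CACHED_CHUNKS)),
    maxlen=NUM_CACHED_CHUNKS,
)

def step(chunks, new_chunk):
    """Build the full K for this call, then slide the cache by one chunk."""
    # This 8-way concat is the part that would happen inside the model.
    full_k = [row for chunk in chunks for row in chunk] + new_chunk
    # Appending to a full deque silently drops the oldest chunk.
    chunks.append(new_chunk)
    return full_k

full_k = step(chunks, [7] * CHUNK_ROWS)
```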

Sadly this concat is slow. Very slow. So this approach is dead. Unless…

No Concat Attention (Spoiler: Also Slow)

Turns out you don't actually need to concat the full K before multiplying by Q.

You can multiply each K chunk by Q individually, then hang onto some extra statistics that allow you to compute the rest of attention.

_{You can trade the final concat for 7x additions, but that's slow too.}

This is called the lazy softmax trick (link) and its main selling point is it reduces memory pressure caused by attention. That reduction is traded for, as you might guess, speed. So this is also slow.
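For the curious, here is a minimal sketch of the idea for a single row of Q, assuming the "extra statistics" are a running maximum and a running sum (the usual formulation; the exact bookkeeping in any given implementation may differ). The chunked version never sees the full K at once but ends up with the same answer as the reference.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row(q, k_rows, v_rows):
    """Reference: softmax over every row of K at once."""
    scores = [dot(q, k) for k in k_rows]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    width = len(v_rows[0])
    return [sum(e * v[c] for e, v in zip(exps, v_rows)) / total for c in range(width)]

def chunked_attention_row(q, k_chunks, v_chunks):
    """Process one chunk at a time, carrying a running max, sum, and output."""
    run_max, run_sum, run_out = -math.inf, 0.0, None
    for k_rows, v_rows in zip(k_chunks, v_chunks):
        scores = [dot(q, k) for k in k_rows]
        new_max = max(run_max, max(scores))
        # Rescale what we have so far to the new max (this is 0.0 on the first chunk).
        rescale = math.exp(run_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        if run_out is None:
            run_out = [0.0] * len(v_rows[0])
        run_out = [
            rescale * o + sum(e * v[c] for e, v in zip(exps, v_rows))
            for c, o in enumerate(run_out)
        ]
        run_sum = rescale * run_sum + sum(exps)
        run_max = new_max
    return [o / run_sum for o in run_out]

# Made-up data: 4 rows of K and V, split into 2 chunks of 2 rows.
q = [0.5, -1.0]
k = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

reference = attention_row(q, k, v)
chunked = chunked_attention_row(q, [k[:2], k[2:]], [v[:2], v[2:]])
```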

Additionally even if it was fast we would need some creative solution to avoid concatenating and summing at the very end.

So I think this too is a dead end for now.

New Hopes

There's a couple places we can potentially squeeze more speed from:

The newest versions of iOS and macOS (iOS 18 and macOS 15) have a feature to enable a stateful KV cache. If you don't care about old OSes, this might be worth a look.

The fact that we recompute 64 tokens each time means we could add some form of multi-token prediction basically for free. There is some research into models that predict many tokens instead of one. There are also speculative decoding methods that could work.

Tweet me or open an issue on GitHub if you have other ideas or questions!

LLMs for your iPhone: Whole-Tensor 4 Bit Quantization

2024-03-06T00:05:00Z

Shrinking models for Apple Silicon

>New to this, but still curious? Don't worry, I wrote the Primer below just for you.

Quantization is often touted as a way to make large language models (LLMs) small enough to run on mobile phones. Despite this, very few of the latest methods are able to use the full power of Apple Silicon on iPhone and Mac. This post introduces a new method of quantization that can.

This method is compatible with all three Apple Silicon co-processors (CPU, GPU, Neural Engine) which allows it to take full advantage of the speed/battery trade-offs offered by the hardware.

When compared to standard 4 bit Apple Silicon-compatible methods, it produces consistently more accurate results. Additionally it approaches the accuracy of GPTQ, a widely-used method that is not compatible.

Finally, this method is comparatively accessible. Access to Colab free tier and an Apple Silicon MacBook is sufficient to quantize the full family of GPT-2 models up to 1.5B parameters.

_{Lower is better in these plots. You want a smaller model with better performance (lower perplexity is better). In the first plot, naive clustering occasionally performs well but is erratic. In the second, GPTQ is better but cannot run fully on Apple Silicon.}

Acknowledgements

This method extends SqueezeLLM, and remixes ideas from both SmoothQuant and AWQ. It was developed concurrently with OneBit, and shares some similar ideas. Thank you for sharing your research and code!

LLM and Quantization Primer

If you know this, you can skip it. If you don't, hopefully it helps you get oriented. Drop me a line if anything is confusing!

When you ask an LLM a question, that text gets transformed into a matrix of numbers. This matrix is then transformed repeatedly with a bunch of math until a final transformation that converts it from numbers to the first few characters of the LLM's response.

_{This is accurate for our purposes.}

Within these repeated transformations there are many times where the input matrix is multiplied with different hardcoded matrices. These hardcoded matrices can be quite large and end up accounting for most of the space that an LLM takes up. For instance LLaMa 7B, Facebook's open-source LLM, is 13.5GB and 12.9GB of that is the numbers that make up these large matrices.

Typically the matrix's values are stored as 16 bit (2 byte) or 32 bit (4 byte) floating point numbers. For LLaMa a typical matrix is 4096x4096, which means it takes about 33.5MB on its own at 16 bits. Shrinking those 16 bit elements to 4 bits brings the size of that matrix to about 8.4MB. Doing the same for every matrix brings the whole model from 13.5GB to just under 4GB.
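The per-matrix arithmetic is easy to check. This is just a sketch with a hypothetical helper name; it counts only the matrix elements themselves and uses decimal megabytes.

```python
def matrix_bytes(rows, cols, bits_per_element):
    """Storage for a rows x cols matrix at the given element width."""
    return rows * cols * bits_per_element // 8

fp16_bytes = matrix_bytes(4096, 4096, 16)
four_bit_bytes = matrix_bytes(4096, 4096, 4)

print(fp16_bytes / 1e6, "MB at 16 bits")
print(four_bit_bytes / 1e6, "MB at 4 bits")
```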

Instead of measuring this compression in bytes-per-matrix, it is measured in bits-per-element. (1 byte = 8 bits). This makes it easier to compare across matrices of different sizes and also gives us some flexibility to store a few extra values alongside our matrix. This is actually fairly common. Including them in the bits-per-element calculation makes for fair comparisons.

So, in summary, LLMs do a bunch of math with matrices in order to generate replies. These matrices are big and quantization's goal is to make them smaller without losing the LLM's ability to reply. This shrinks the model and lets us run it on less powerful devices, like your phone.

Challenges with Apple Silicon

If you take the weight from a linear layer (one of the matrices we talked about above) out of an LLM and look at the distribution of its elements, you will generally see a bell curve.

_{It's not a perfect bell curve, but it's close enough to be useful.}

Nearly all recent quantization schemes are uniform which means they take this bell curve and pick two values for it. They pick a starting point and also a step size which they then use to place equally-spaced points along the x-axis. To actually quantize the matrix they simply snap all matrix elements to the nearest point.

_{The x points are equally spaced.}
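In code, a uniform scheme boils down to something like the sketch below. The starting point, step size, and weights are made-up values, and real schemes pick the start and step from the data rather than by hand.

```python
def uniform_points(start, step, bits):
    """Equally spaced points: 2**bits of them, beginning at start."""
    return [start + i * step for i in range(2 ** bits)]

def snap(value, points):
    """Quantize a value by moving it to the nearest point."""
    return min(points, key=lambda p: abs(value - p))

points = uniform_points(start=-1.5, step=0.2, bits=4)  # 16 points from -1.5 to 1.5
weights = [-0.93, -0.12, 0.04, 0.31, 1.42]
quantized = [snap(w, points) for w in weights]
```

Only `start` and `step` need to be stored alongside the 4 bit indices, which is part of why this style is so cheap.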

This is a non-optimal use of space in our quantized linear layer. The points on the edges of the bell curve barely capture any matrix elements, but they consume the same amount of space in the LLM as points in the middle which represent many. They are simply not an effective use of bits. (Uniform schemes are fast on GPUs though, which is why everyone uses them.)

A common solution for this is to break up the matrix into chunks of either rows, columns, or groups. Having fewer elements in a chunk tends to make quantization more accurate by narrowing the bell curve (more or less) and it only costs a few extra fractions of a bit on average. This sufficiently minimizes the awkwardness of fitting equally-spaced points to bell curve-distributed matrix elements.

Unfortunately Apple Silicon does not support this chunking concept for low (<8) bit quantization. However it makes up for it by allowing models that use a non-uniform quantization scheme. On our bell curve from before, this means we can place our points anywhere we want. So we'll place them optimally.

What is optimal? For each matrix element we calculate how far it is from the nearest point. This is the element's quantization error. We place our points along the bell curve so that the sum of all elements' errors is as low as possible. (k-means clustering is a good way to do this.)

_{Notice how uniform puts points at the edges, but non-uniform is free to ignore the small number of matrix elements there.}
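A minimal one-dimensional k-means sketch shows the difference. Two assumptions to flag: the weights are random draws from a bell curve rather than a real layer, and k-means minimizes squared error, which is close to but not identical to the plain sum of distances described above. Starting k-means from the uniform points means it can only improve on them.

```python
import random

def nearest_index(value, points):
    return min(range(len(points)), key=lambda i: abs(value - points[i]))

def kmeans_1d(values, start_points, iterations=20):
    """Repeatedly assign each value to its nearest point, then move each point to its mean."""
    points = list(start_points)
    for _ in range(iterations):
        buckets = [[] for _ in points]
        for v in values:
            buckets[nearest_index(v, points)].append(v)
        # A point that captured no values stays where it is.
        points = [sum(b) / len(b) if b else p for b, p in zip(buckets, points)]
    return points

def total_squared_error(values, points):
    return sum(min((v - p) ** 2 for p in points) for v in values)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

low, high = min(weights), max(weights)
uniform = [low + i * (high - low) / 15 for i in range(16)]  # 16 equally spaced points
clustered = kmeans_1d(weights, uniform)

uniform_error = total_squared_error(weights, uniform)
clustered_error = total_squared_error(weights, clustered)
```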

Placing the points optimally like this is all we need to do when we have a lot of points to place. Most LLMs will perform very well if we place 6 bits or 8 bits worth of points (64 and 256 respectively). However when we drop to 4 bits worth, which is only 16 points, this simple optimal placement is not enough.

_{If we were to shade these to show the error they would get darker going from top to bottom. The LLM performance also typically goes from nearly perfect, to good, to bad going from top to bottom.}

Method Overview

Our goal is to improve LLM performance as much as possible when using 4 bits. We achieve this by making 3 complementary modifications to the quantization process and the LLM itself.

Modification 1: Weighting by Importance

The first modification comes directly from another paper, SqueezeLLM. The paper is fairly approachable, but we'll summarize the parts we're using.

It turns out that not every element in these matrices is equally important. In fact the top few percent are significantly more important. When we're placing our points optimally we should not treat every element equally, but let the more important elements have more sway. But how do we know what's important? We take a small number of input texts (100 is enough), send them through our LLM, and observe the impact each matrix element had on the LLM's response. The higher the total impact, the more important.

_{The triangles represent more important elements. The Naive method is optimal for a standard bell curve, but the importance aware method shifts closer to the triangles.}
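Inside the clustering, "more sway" can be as simple as replacing the plain average with a weighted one when placing each point. The sketch below shows that one step for a single cluster; the values and importance scores are invented for illustration and are not real sensitivities.

```python
def plain_center(values):
    """Where naive clustering places the point: the ordinary mean."""
    return sum(values) / len(values)

def weighted_center(values, importance):
    """Importance-aware placement: each value pulls in proportion to its importance."""
    total = sum(importance)
    return sum(v * w for v, w in zip(values, importance)) / total

cluster = [0.10, 0.12, 0.15, 0.30]
importance = [1.0, 1.0, 1.0, 20.0]  # hypothetical: the last element matters far more

naive = plain_center(cluster)
aware = weighted_center(cluster, importance)
```

The important element at 0.30 ends up with a much smaller quantization error under `aware` than under `naive`, at the cost of slightly larger errors for the three unimportant ones.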

Modification 2: Scaling for Easier Clustering

So far we've been looking at our matrix as a single bell curve. A different way of thinking about it is to look at every column of the matrix as an independent entity that just happens to be joined together in this matrix. Similar to the matrix as a whole, each column's elements are roughly bell curve-shaped. Most of the bell curves have similar centers but they all have different standard deviations (how wide or narrow they are).

_{The columns of our matrix are all centered around zero, but the standard deviation varies: some are very wide while others are fairly narrow.}

If we divide the elements of each column by the column's standard deviation we make the bell curves roughly the same shape. This makes it easier to place our points since it prevents one column from having undue influence over the rest. (You can also think of it as squishing more elements towards the middle of the curve where we typically place more points.)

_{The same columns from above after dividing every element by each column's standard deviation. This reshapes the bell curves.}

It's important that we don't change the output of our LLM, and scaling each column independently changes it. So we need to take the per-column values that we divided by and correct for them somewhere else in our model. All of the matrices that we're quantizing are used in matrix multiplications. Each column of the matrix produces exactly one column of the multiplication's output, so dividing a column by a value simply divides the matching output column by the same value. That means we can add a step in our LLM after the multiplication to multiply the removed values back in.

_{We divide before quantizing which makes quantization easier. We have to restore the scale factors at inference time, when the model is generating a response.}
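A small sketch of the round trip, with made-up matrices and no quantization step, to show that dividing the columns and multiplying them back in after the matrix multiplication leaves the output unchanged. I use the population standard deviation here; whether the sample or population version is used is an assumption on my part.

```python
import statistics

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_scales(w):
    """One standard deviation per column of w."""
    return [statistics.pstdev(col) for col in zip(*w)]

def divide_columns(w, scales):
    return [[v / s for v, s in zip(row, scales)] for row in w]

def multiply_columns(y, scales):
    return [[v * s for v, s in zip(row, scales)] for row in y]

x = [[1.0, 2.0], [0.5, -1.0]]
w = [[0.2, 3.0, -0.1], [-0.4, -5.0, 0.3]]  # columns with very different widths

scales = column_scales(w)
w_scaled = divide_columns(w, scales)  # this is the matrix we would quantize
restored = multiply_columns(matmul(x, w_scaled), scales)
expected = matmul(x, w)
```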

This does mean we have to keep a few extra values as 16 bit (the ones we multiply back in). For a 768x768 matrix, we need 768 extra values in 16 bits. This puts us at 4.02 bits on average which is a reasonable trade-off. The average number of bits decreases as the model and matrices get larger which makes this even less of a concern. (LLaMa is 4.003 or less depending on the version.)

Modification 3: Shifting the Other Matrix

So far we've been looking closely at the matrix itself. Let's now look at how it is used. As mentioned above, these matrices are used for matrix multiplication. Specifically the model is performing matrix X times matrix W, X*W, and we are quantizing W. We've talked about how the elements in W are nicely distributed with an average close to zero and bell-shaped. This is not true for X. X depends on the text that the model received as input and can vary significantly in how its elements are distributed.

Why does this matter? Imagine a simple product of two numbers: x*w. Let's say that in this case our 1 element matrix, w, has the value of 2.3 and we quantize it to 2. When we do x*2 we get a quantization error of x*0.3. The closer that x is to zero, the less error. The farther it is from zero the more we get.

This extrapolates to matrix multiplication. When we look at a column of matrix X, if most of its values are far from zero then the impact of quantization error for that column will be larger.

Similar to our first modification, we can inspect a small number of texts as they flow through the LLM. If we do this we'll see that there is consistency in which columns of X, the input matrix, have larger or smaller values in general.

_{We're focusing on the columns of the left matrix. This matrix is derived from the input to the LLM.}
_{The distribution of average values for select columns. Notice that they are not centered at zero unlike the matrix we're quantizing.}

To minimize the impact of large values in X we can apply a per-input column shift. This will move most of the values in X closer to zero on average, thereby reducing the impact of our quantization error. We shift by subtracting the average of the values we saw for that column.

_{The same columns from above after subtracting the average value from each column. They are now centered around zero.}

Similar to our second modification, we need to reverse this change in the model in order to not change its outputs. This one is a little trickier, but again easier to think about without matrices involved. If we take our x*w from earlier we can make a new shifted input y (so, y=x-shift). Now the model will do y*w which is actually (x-shift)*w. If we multiply that out we get x*w - shift*w. Since shift and w are both constants we just need to pre-compute shift*w and add it back after the matrix multiplication in the model. This undoes the impact of shifting X but reduces the error when w is quantized. (Extrapolating this to matrices is a little harder, but still doable.)

_{In this case the correction is not the shift values themselves, but the result of multiplying all the shifts by the matrix we're quantizing.}
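Working the single-number case through with the 2.3 and 2 from earlier makes the benefit visible. The input `x` and the `shift` are made-up values, and I am assuming the correction term shift*w is pre-computed from the original, unquantized w and added back after the multiplication so the result lines up with x*w.

```python
x = 10.0            # hypothetical input value, far from zero
w = 2.3             # original weight
w_quantized = 2.0   # weight after quantization
shift = 9.0         # hypothetical average value seen for this input column

exact = x * w

# Without shifting: the quantization error is scaled by x itself.
plain = x * w_quantized
plain_error = abs(exact - plain)  # about x * 0.3

# With shifting: multiply the shifted input, then restore the pre-computed shift * w.
correction = shift * w
shifted = (x - shift) * w_quantized + correction
shifted_error = abs(exact - shifted)  # about (x - shift) * 0.3
```

With the unquantized `w` in place of `w_quantized` the two paths give the same result, so the shift only changes how much quantization error leaks through.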

Depending on the model this adds between zero and two additional vectors of 768 elements in 16 bits. In the worst case this brings us up to 4.06 bits total.
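The bits-per-element numbers can be reproduced with a short calculation. This sketch assumes each extra vector has one 16 bit value per column, and it ignores the small table of 16 point locations, which barely moves the average.

```python
def average_bits(rows, cols, extra_vectors, weight_bits=4, extra_bits=16):
    """Average bits per matrix element, counting the extra 16 bit vectors."""
    weight_total = rows * cols * weight_bits
    extra_total = extra_vectors * cols * extra_bits
    return (weight_total + extra_total) / (rows * cols)

scale_only = average_bits(768, 768, extra_vectors=1)   # just the per-column scales
worst_case = average_bits(768, 768, extra_vectors=3)   # scales plus two more vectors
larger = average_bits(4096, 4096, extra_vectors=1)     # a LLaMa-sized matrix

print(scale_only, worst_case, larger)
```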

Results

Used individually, these modifications have varying efficacy. Generally the first modification, from SqueezeLLM, works well on its own. When we add in the other two modifications we see consistent improvement. This leaves us with a quantization scheme that is both more accurate and less erratic than the baseline 4-bit method we wanted to improve upon.

_{SqueezeLLM (Weighting) is surprisingly effective on its own, even at the whole-tensor level. Adding our other modifications consistently improves upon it. The improvement for gpt2-large is negligible, something interesting to follow up on.}

Conclusion

tl;dr We used an amalgamation of existing and, I think, new ideas to quantize LLM linear layers to ~4 bits on average, dramatically shrinking model size. Since we do this at the tensor level without grouping, this method is fully compatible with Apple Silicon on iPhone or Mac which opens the door for larger models on your devices.

Thanks for reading! To stay in the loop as I explore more, you can give me a follow on Twitter. If you'd like to give it a go yourself, I've got a drop-in replacement for torch's Linear layer, as well as some instructions: here. Please get in touch, ask questions, and let me know what you learn!

Appendix: Future Ideas

Part of my motivation for writing this is to find folks who are smarter than me, who can maybe check my work, and maybe even take it further. If that's you, do please reach out! There's a couple directions that I think still have more to give / would be interesting to explore:

I couldn't find a way to scale the input channels (weight matrix rows) that was helpful. Seems like there might be something there, either as a way to make clustering easier, or as a way to minimize error from the inputs.
Depending on the model, sometimes computing a weighted standard deviation based on the SqueezeLLM sensitivities performs slightly better. This makes me think that standard deviation is close but not the optimal solution.
These models seem very reactive to how the SqueezeLLM sensitivities are generated. I suspect any improvements there would help.
Explore integrating this with Mixed Bit Precision methods.

Appendix: Modification Comparison

To further support that these modifications are complementary, wikitext perplexity was measured in all possible combinations. As mentioned above, gpt2-large is an outlier but the differences are minor. The SqueezeLLM Fisher information (sensitivities) was computed using C4 in all cases.

Model	Weighting	Scaling	Shifting	Weight+Scale	Weight+Shift	Scale+Shift	Weight+Scale+Shift
gpt2	30.8947	43.0891	44.9285	28.8972	29.1401	43.5065	28.1946
gpt2-medium	21.4389	30.9959	23.8464	20.4853	19.8515	23.6801	19.904
gpt2-large	17.2172	22.5075	18.1454	17.1246	17.1282	25.2589	17.1507
gpt2-xl	16.1751	15.89	15.7223	15.148	16.0874	17.0936	15.1148

Model	float16	naive 4-bit	Weight+Scale+Shift	GPTQ
gpt2	25.1876	62.1889	28.1946	26.5
gpt2-medium	18.4739	23.7826	19.904	19.1719
gpt2-large	16.4541	27.3636	17.1507	16.6875
gpt2-xl	14.7951	15.89	15.1148	14.9297

Inside Apple's 2023 Transformer Models

2023-11-16T12:00:00Z

What can we learn from them?

Apple's latest OSes include several transformer models that are optimized for the Apple Neural Engine. We'll take a look at how they're implemented and see if there's anything we can apply to our own models. To make that easier, I've cobbled together support for viewing them in Netron—you can try it yourself here.

While everyone is talking about AI or GPT, Apple made a point to use the words "machine learning" and "transformer" when announcing new features for this year's operating systems (iOS 17 and macOS Sonoma).

Apple has been vocal about their Machine Learning accelerator, the Neural Engine (ANE), so it's no surprise that these models are designed to leverage its capabilities.

In contrast to their normal secrecy, Apple has been fairly public about how to run the transformer model architecture on the ANE. In the past year and a half they:

- Wrote a research article about how to optimize transformers for the ANE.
  - Released code to demonstrate it in the ml-ane-transformers repo.
- Published a Stable Diffusion (text to image) implementation optimized for the ANE in the ml-stable-diffusion repo.
  - They have kept this up to date too!

The models embedded in the new OS are not quite as easily inspected as a research article or GitHub project. However they are a year newer. Let's see what we can learn from them!

This is most interesting if you're familiar with transformers and how they work. However if you are just generally curious I've tried to add explainers throughout to fill in some background.

They'll look like this.

No Frills Time Series Compression That Also Works

2023-08-22T12:00:00Z

CSV + gzip will take you far.

So you have some time series data and you want to make it smaller? You may not need an algorithm designed specifically for time series. Generic compressors like gzip work quite well and are much easier to use.

Of course this depends on your data, so there’s some code you can use to try it out here.

Recently I started working on a way to save Bluetooth scale data in my iOS coffee-brewing app. I want to allow people to record from a scale during their coffee-brewing sessions and then view it afterwards. Scale data is just a bunch of timestamps and weight values. Simple, yes, but it felt like something that might take a surprising amount of space to save. So I did some napkin math:


                    1 scale session / day
                    10 minutes / session
                    10 readings / second
                    = 2.19M readings / year

                    1 reading = 1 date + 1 weight
                    = 1 uint64 + 1 float32
                    = 12 bytes
                    2.19M * 12B = 26 MB
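
If you want to sanity check the napkin math, here is a quick sketch. It is Python rather than Swift purely for brevity, and the constant names are made up for illustration.

```python
# Napkin math for one year of raw scale readings.
SESSIONS_PER_DAY = 1
MINUTES_PER_SESSION = 10
READINGS_PER_SECOND = 10
DAYS_PER_YEAR = 365

readings_per_year = (
    SESSIONS_PER_DAY * MINUTES_PER_SESSION * 60 * READINGS_PER_SECOND * DAYS_PER_YEAR
)

# One reading = one uint64 timestamp (8 bytes) + one float32 weight (4 bytes).
BYTES_PER_READING = 8 + 4

bytes_per_year = readings_per_year * BYTES_PER_READING

print(readings_per_year)           # 2190000
print(bytes_per_year / 1_000_000)  # 26.28
```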

26 MB per year is small by most measures. However in my case I keep a few extra copies of my app’s data around as backups so this is more like ~100MB/year. It’s also 40x the size of what I’m saving currently! This puts my app in danger of landing on the one Top Apps list I would not be stoked to be featured on:

_{iCloud storage usage}

So let’s avoid that. At a high-level I see two options:

Save less. 10 scale readings/second is probably more granularity than we’ll ever need. So we could just not save some of them. Of course if I’m wrong about that, they’re gone forever and then we’ll be out of luck.

Save smaller. Looking at some example data, there are a lot of plateaus where the same value repeats over and over. That seems like it could compress well.

_{Example brewing session time series}

Picking Ways to Compress

This is my first rodeo with compression. I’m starting from basics like “compression makes big things small” and “double click to unzip”. Doing a little research seems like a good idea and it pays off.

My scale data is technically “time series data” and it turns out we are not the first to want to compress it. There is a whole family of algorithms designed specifically for time series. This blog post is a great deep dive, but for our purposes today we’ll be looking at two of the algorithms it mentions:

simple-8b which compresses sequences of integers
Gorilla which compresses both integers as well as floating point numbers

Algorithms designed for exactly my problem space sound ideal. However something else catches my eye in a comment about the same blog post:

rklaehn on May 15, 2022
I have found that a very good approach is to apply some very simple transformations such as delta encoding of timestamps, and then letting a good standard compression algorithm such as zstd or deflate take care of the rest.

Using a general purpose algorithm is quite intriguing! One thing I’ve noticed is that there are no Swift implementations for simple-8b or Gorilla. This means I would have to wrap an existing implementation (a real hassle) or write a Swift one (risky, I would probably mess it up). General purpose algorithms are much more common and side-step both of those issues.

So we’ll look at both. For simplicity I’ll call simple-8b and Gorilla the “specialist algorithms” and everything else “generalist”.

Evaluating the Specialist Algorithms

Starting with the specialists seems logical. I expect they will perform better which will give us a nice baseline for comparison. But first we need to smooth out a few wrinkles.

Precision

While wiring up an open-source simple-8b implementation I realize that it requires integers, but both our timestamps and weights are floating point numbers. To solve this we’ll truncate to milliseconds and milligrams. A honey bee can flap its wings in 5 ms. A grain of salt is approximately 1 mg. Both of these feel way more precise than necessary but better to err on that side anyways.


                    49.0335097 seconds → 49033 milliseconds
                    17.509999999999998 grams → 17509 milligrams

We’ll use this level of precision for all our tests except Gorilla, which is designed for floating point numbers.
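
A minimal sketch of that truncation step, in Python for brevity. The function names are hypothetical, and it assumes the inputs arrive as floating point seconds and grams like the example above.

```python
import math

def to_milliseconds(seconds: float) -> int:
    """Truncate (not round) a timestamp in seconds to whole milliseconds."""
    return math.trunc(seconds * 1000)

def to_milligrams(grams: float) -> int:
    """Truncate (not round) a weight in grams to whole milligrams."""
    return math.trunc(grams * 1000)

print(to_milliseconds(49.0335097))  # 49033
print(to_milligrams(17.509999999999998))
```

Truncating keeps everything as integers from here on, which is what simple-8b needs. The cost is that anything finer than 1 ms or 1 mg is lost for good.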

Negative Numbers

Negative numbers show up semi-frequently in scale data because often when you pick something up off a scale it will drop below zero.

Unfortunately for us simple-8b doesn’t like negative numbers. Why? Let’s take a little detour and look at how computers store numbers. They end up as sequences of 1s and 0s like:


                0000000000010110 is 22
                0000000001111011 is 123
                0000000101011110 is 350

You’ll notice that these tend to have all their 1s on the right. In fact, only very large numbers will have 1s on the left. simple-8b does something clever where it uses 4 of the leftmost spaces to store some 1s and 0s of its own. This is fine for us. We’re not storing huge numbers so those leftmost spaces will always be 0 in our data.

Now let’s look at some negatives.


                1111111111101010 is -22
                1111111110000101 is -123
                1111111010100010 is -350

This is not great: the left half is all 1s! simple-8b has no way of knowing whether the leftmost 1 is something it put there or something we put there so it will refuse to even try to compress these.

One solution for this is something called ZigZag encoding. If you look at the first few positive numbers, normally they’ll look like this:


                0000000000000001 is 1
                0000000000000010 is 2
                0000000000000011 is 3
                0000000000000100 is 4

ZigZag encoding interleaves the negative numbers in between so now these same 0/1 sequences take on a new meaning and zig zag between negative and positive:


                0000000000000001 is -1 zig
                0000000000000010 is  1 zag
                0000000000000011 is -2 zig
                0000000000000100 is  2 zag

If we look at our negative numbers from earlier, we can see that this gets rid of our problematic left-side 1s.

#	Normal	ZigZag
`-22`	`1111111111101010`	`0000000000101011`
`-123`	`1111111110000101`	`0000000011110101`
`-350`	`1111111010100010`	`0000001010111011`

We only need this for simple-8b, but it can be used with other integer encodings too. Kinda cool!
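
Here is a small sketch of ZigZag encoding that reproduces the table above. It is Python for brevity and uses 16-bit patterns only to match the examples. The function names are my own, and a real implementation would more likely use shifts and XOR on 64-bit integers.

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2, ... becomes 0, 1, 2, 3, 4, ..."""
    return 2 * n if n >= 0 else -2 * n - 1

def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode: even codes are non-negative, odd codes are negative."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2

def bits16(n: int) -> str:
    """16-bit two's complement pattern of n, as shown in the examples."""
    return format(n & 0xFFFF, "016b")

for n in (-22, -123, -350):
    print(n, bits16(n), bits16(zigzag_encode(n)))
# -22 1111111111101010 0000000000101011
# -123 1111111110000101 0000000011110101
# -350 1111111010100010 0000001010111011
```

Small negative numbers turn into small positive numbers, so the leftmost bits stay 0 and simple-8b is happy again.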

Pre-Compression

Technically we could run our tests now, but we’re going to do two more things to eke out a little extra shrinkage.

First is delta encoding. The concept is simple: you replace each number in your data set with the difference (delta) from the previous value.


                    timestamp,mass
                    1691452800000,250
                    1691452800103,253
                    1691452800305,279
                    …

→

                    timestamp_delta,mass_delta
                    1691452800000,250
                    103,3
                    202,26
                    …

Visually these already look smaller. Amusingly enough they actually are smaller. We’ll use this for all algorithms except Gorilla which does delta encoding for us.
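
Delta encoding is only a few lines. This is a sketch in Python with hypothetical function names. It keeps the first value as-is and stores differences after that, just like the example above.

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store each value's difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Undo delta_encode with a running sum."""
    return list(accumulate(deltas))

timestamps = [1691452800000, 1691452800103, 1691452800305]
masses = [250, 253, 279]

print(delta_encode(timestamps))  # [1691452800000, 103, 202]
print(delta_encode(masses))      # [250, 3, 26]
```

Deltas can be negative (the mass drops when you lift something off the scale), which is one reason ZigZag encoding is needed before handing them to simple-8b.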

The second tweak relates to the ordering of our data. So far we’ve been talking about time series as pairs of (timestamp, mass) points. Both specialist algorithms require us to provide a single list of numbers. We have two choices to flatten our pairs:


                Choice 1: [first_timestamp, first_mass, second_timestamp, second_mass, …]
                Choice 2: [first_timestamp, second_timestamp, … last_timestamp, first_mass, second_mass, …]

Choice 2 compresses better on all algorithms (generalist too) even when we apply it after delta encoding. Again, Gorilla does its own thing–are you seeing the trend?
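
The two flattening choices look like this in code. It is a Python sketch with made-up function names, and it assumes the points are (timestamp, mass) pairs.

```python
def flatten_interleaved(points):
    """Choice 1: [t1, m1, t2, m2, ...]"""
    return [value for point in points for value in point]

def flatten_columnar(points):
    """Choice 2: [t1, t2, ..., m1, m2, ...]"""
    timestamps = [t for t, _ in points]
    masses = [m for _, m in points]
    return timestamps + masses

points = [(103, 3), (202, 26), (98, 0)]
print(flatten_interleaved(points))  # [103, 3, 202, 26, 98, 0]
print(flatten_columnar(points))     # [103, 202, 98, 3, 26, 0]
```

With Choice 2 the compressor sees all the timestamps together and then all the masses together, so similar values end up next to each other.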

Specialist Results

We’ve truncated and pre-encoded, so let’s see some results.

Algorithm	Ratio 1	Ratio 2	Ratio 3	Avg. Ratio	Avg. MB/year
simple-8b	6.92	5.4	7.18	6.5	4
gorilla	6.72	4.18	6.88	5.9	4.4
	⊢ higher is better ⊣				lower is better

I tested with three different types of scale recordings for a bit of variety, then backed out the MB/year from the average compression ratio. Going from 26 MB/year to under 5 is a great result!
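
Backing out MB/year is just the raw size divided by the average ratio. Here is a sketch in Python. It assumes the rounded 26 MB/year figure from the napkin math is the baseline, and the helper names are hypothetical.

```python
RAW_MB_PER_YEAR = 26  # assumption: the rounded napkin-math figure

def average(values):
    return sum(values) / len(values)

def mb_per_year(ratios, raw_mb=RAW_MB_PER_YEAR):
    """Estimated yearly size after compressing at the average ratio."""
    return raw_mb / average(ratios)

# Compression ratios copied from the table above.
results = {
    "simple-8b": [6.92, 5.4, 7.18],
    "gorilla": [6.72, 4.18, 6.88],
}

for name, ratios in results.items():
    print(name, round(average(ratios), 1), round(mb_per_year(ratios), 1))
# simple-8b 6.5 4.0
# gorilla 5.9 4.4
```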

Now for the Generalist Ones

Similar to the specialist algorithms, we have a few choices to make before we can run our tests on the generalists.

Formatting

For simplicity we’re going to format our data as CSV. This might seem a little odd but it has a few perks:

It’s human-readable which is nice for debugging.
It’s also fairly compact as far as text representations go.
Most languages have native libraries to make reading/writing CSVs easy. ^{(alas, Swift does not)}

We’ll use delta encoding like above–it’d be silly not to. We could really stretch the definition of CSV and stack all of the timestamps on top of all the masses into a single column, but that sacrifices a bit of readability so we won’t.
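
A sketch of the two-column, delta-encoded CSV, written in Python since it has a built-in csv module. The function name and column headers are assumptions that mirror the earlier example, and the input is assumed to be integer (timestamp, mass) pairs.

```python
import csv
import io

def to_delta_csv(points):
    """Write (timestamp, mass) integer pairs as a delta-encoded two-column CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp_delta", "mass_delta"])
    prev_t, prev_m = 0, 0  # starting from 0 leaves the first row unchanged
    for t, m in points:
        writer.writerow([t - prev_t, m - prev_m])
        prev_t, prev_m = t, m
    return out.getvalue()

points = [(1691452800000, 250), (1691452800103, 253), (1691452800305, 279)]
print(to_delta_csv(points))
# timestamp_delta,mass_delta
# 1691452800000,250
# 103,3
# 202,26
```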

Picking Algorithms

There are a lot of general purpose compression algorithms. One popular benchmark lists over 70! We’re going to pick just 5. They are:

zlib, LZMA, and LZFSE – these come built-in with iOS which makes my life easier. zlib and LZMA are also fairly common.
Zstandard (aka zstd) and Brotli – from Facebook and Google respectively, both companies with an interest in good compression

Picking Levels

We’ve narrowed it down from 70 to 5, but there’s another curveball. Unlike the specialist algorithms which have no configuration options, most generalist algorithms let you choose a level that trades off speed for better compression. You can compress fast or slow down to compress more.

For simplicity (and so I don’t have to show you a table with 40+ rows) we are not going to test all 11 Brotli levels or all 20+ zstd levels. Instead we’re going to choose levels that run at about the same speed. Apple makes this easier for us since LZFSE has no level and iOS only has zlib 5 and LZMA 6. All we have to do is pick levels for Brotli and zstd from this chart.

_{Speed benchmarks for our 5 algorithms}

We’ll use Brotli 4 and zstd 5 since those are in-line with the fastest iOS algorithm. This means that zlib and LZMA are slightly advantaged but we’ll keep that in mind.
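
To make the idea of levels concrete, here is a sketch using Python's standard library, which only ships zlib and LZMA (LZFSE, Brotli and zstd would need other libraries). The CSV rows are synthetic stand-ins, not my scale data, so the ratios it prints say nothing about the results in this post. The baseline here is also the CSV text itself, which may not match how the ratios in this post are computed. Python's zlib level 5 and LZMA preset 6 are assumed to be roughly comparable to the iOS built-ins, not identical.

```python
import lzma
import zlib

def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(raw) / len(compressed)

# Synthetic delta-encoded CSV: ~100 ms timestamp deltas and mostly-zero mass deltas.
rows = ["timestamp_delta,mass_delta", "1691452800000,250"]
rows += [f"{100 + i % 7},{0 if i % 5 else 3}" for i in range(6000)]
csv_bytes = "\n".join(rows).encode("utf-8")

candidates = {
    "zlib 5": zlib.compress(csv_bytes, 5),         # level 0-9, higher is slower/smaller
    "lzma 6": lzma.compress(csv_bytes, preset=6),  # preset 0-9
}

for name, blob in candidates.items():
    print(name, round(compression_ratio(csv_bytes, blob), 2))
```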

Generalist Results

We’ve prepped our CSV and made all our choices, so let’s see some results.

Algorithm	Ratio 1	Ratio 2	Ratio 3	Avg. Ratio	Avg. MB/year
zlib 5	8.50	5.79	8.18	7.49	3.47
lzma 6	8.12	5.55	7.49	7.1	3.7
zstd 5	7.49	5.71	7.74	6.98	3.72
brotli 4	7.84	5.52	7.53	6.96	3.74
lzfse	7.49	5.36	7.12	6.7	3.8
	⊢ higher is better ⊣				lower is better

Wow! Everything is under 4MB. Coming from 26MB this is fantastic.

Specialist v. Generalist

I’ve plotted everything side-by-side:

_{MB/year by algorithm}

Weirdly, the generalist algorithms universally beat the specialists. On top of that, you’ll recall we picked generalist levels that were fairly fast. So we can actually widen the gap if we’re willing to compress slower.

That feels like cheating, but doing the single column CSV doesn’t. Plus I’m really curious about that, so here it is:

_{MB/year by algorithm including single column CSV results}

Seems like if you’re not a CSV purist you can squeeze an extra 400KB or so. Not bad.

What Gives?

It really does not make sense to me that the generalist algorithms come out on top.

It’s possible I made a mistake somewhere. To check this, I look to see if every compressed time series can be reversed back to the original scale time series. They all can.
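
The round-trip check can be sketched like this. It is Python with zlib standing in for whichever algorithm is being tested, and the encode/decode helpers are hypothetical, not the code from my benchmark tool.

```python
import zlib

def encode(points):
    """Delta-encode (timestamp, mass) pairs into CSV text, then compress it."""
    lines = ["timestamp_delta,mass_delta"]
    prev_t = prev_m = 0
    for t, m in points:
        lines.append(f"{t - prev_t},{m - prev_m}")
        prev_t, prev_m = t, m
    return zlib.compress("\n".join(lines).encode("utf-8"), 5)

def decode(blob):
    """Decompress, skip the header row, and undo the delta encoding."""
    lines = zlib.decompress(blob).decode("utf-8").split("\n")[1:]
    points, t, m = [], 0, 0
    for line in lines:
        dt, dm = line.split(",")
        t, m = t + int(dt), m + int(dm)
        points.append((t, m))
    return points

def round_trips(points):
    """True if compressing and decompressing gives back the exact input."""
    return decode(encode(points)) == points

sample = [(1691452800000, 250), (1691452800103, 253), (1691452800305, -12)]
print(round_trips(sample))  # True
```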

My second guess is that maybe my time series data is not well-suited for simple-8b and Gorilla. I saw mention that equally spaced timestamps are preferred and my data is anything but:

`timestamps`	`deltas`
`1691685057323`	`n/a`
`1691685057413`	`90`
`1691685057504`	`91`
`1691685057622`	`118`
`1691685057732`	`110`

To see if this is the problem, I re-run the benchmarks and truncate timestamps to the nearest 0.01s, 0.1s and even 1s. This ensures that the deltas can only take a small, fixed set of values (101, 11 and 2 respectively).

_{Compression ratio by timestamp granularity}

As expected this does improve the compression ratio of the specialist algorithms. But it also gives a similar boost to the generalist ones. So it doesn’t explain the difference.
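
For reference, the timestamp truncation can be sketched like this, using the example timestamps from the table above. It is Python with hypothetical function names, and it assumes timestamps are integer milliseconds.

```python
def truncate_timestamps(timestamps_ms, granularity_ms):
    """Drop everything finer than granularity_ms from each timestamp."""
    return [t - t % granularity_ms for t in timestamps_ms]

def deltas(values):
    return [b - a for a, b in zip(values, values[1:])]

timestamps = [1691685057323, 1691685057413, 1691685057504, 1691685057622, 1691685057732]

for granularity_ms in (1, 10, 100, 1000):
    print(granularity_ms, deltas(truncate_timestamps(timestamps, granularity_ms)))
# 1 [90, 91, 118, 110]
# 10 [90, 90, 120, 110]
# 100 [100, 100, 100, 100]
# 1000 [0, 0, 0, 0]
```

The coarser the granularity, the more the deltas repeat, which is exactly the kind of data any compressor likes.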

I don’t have a third guess. Maybe it is real?

Back to Where We Started

This all started because I was anxious about inflating the size of my humble iOS app. Our baseline was adding 26MB of new data each year, which became ~100MB/year in iCloud. With a general purpose compression algorithm it looks like we can get these numbers down to ~4MB and ~16MB per year respectively. Much better.

Any of the generalist algorithms would work. In my case using one of Apple’s built-ins is an easy choice:

It’s ~1 line of code to implement them. ^{Plus a few lines to make a CSV.}
Using Brotli or zstd would increase my app’s download size by 400-700 KB. Not a lot but avoiding it is nice.

Try It at Home

One thing we didn’t touch on is that the distribution of your data can impact how well the compression works. It’s possible these results won’t translate to your data. To help check that, I’ve put my benchmarking CLI tool and a speed-test macOS/iOS app up on GitHub here.

If you can put your data in CSV format, you should be able to drop it in and try out all the algorithms mentioned in this post. If you do, let me know what sort of results you get! I'm curious to see more real-world data points.

Comments or thoughts? Find me on twitter or mastodon.