No one wants a slow LLM. Most LLMs run on GPUs and most methods to make them fast are tailored specifically to GPUs.
LLMs can also run on Apple Neural Engine (ANE), Apple's efficient ML processor that comes in every new iPhone and Mac. Existing GPU optimizations do not easily translate to the Neural Engine which means you end up leaving speed on the table.
Today we'll unlock some speed by adapting a popular optimization known as KV caching to the Neural Engine.
To follow along it's helpful to understand the basic mechanics of a transformer LLM.
An LLM processes a sequence of tokens (word chunks) and predicts what the next token will be. The process is known as a "forward pass", "LLM call" or "prediction". Repeated forward passes build up the words, sentences, and paragraphs of the LLM's response.
_{This LLM predicts that "Fa" follows "Mi".}
Attention is a series of matrix multiplications that happens during the forward pass. It helps the model do a good job at predicting the next token.
There are three matrices involved in attention: Q, K and V. They all have the same number of columns which is not particularly interesting today. The number of rows for each is determined by the number of tokens we're processing.
Q has one row for each token where we want the LLM to make a next-token prediction. K has one row for each token we want the LLM to consider during its predictions. V has the same number of rows as K.
The simplest case is we input some fixed number of tokens into the LLM: "Do re mi". This is 3 tokens so Q, K, and V will all have 3 rows.
Q's 3 rows mean the LLM will predict 3 new tokens: what comes after "Do", and "re", and "mi". We already know that "re" comes after "Do", and "mi" comes after "re", so we'll ignore those predictions but the prediction for what comes after "mi" is new so we'll keep that.
_{The LLM predicts 3 tokens here, but typically we ignore all but the last.}
K and V's 3 rows mean the LLM will consider all 3 tokens when predicting what comes next. So the LLM will make a prediction for what comes after "mi" based on "Do", and "re", and "mi", and their positions relative to each other.
_{You usually wouldn't let an old token like "Do" look at new ones like "Re", but it is technically possible.}
A more interesting case is where the LLM takes a smaller number of tokens as input, and also some K and V matrices that were computed in a prior forward pass. Following a similar example: the input token is now "fa" and we also pass along a partial K and V, each with three rows that correspond to "Do", "re", and "mi".
Q will now have 1 row, from "fa", and the LLM will only predict a new token to follow "fa".
K and V will have not 1 but 4 rows! The 3 for "Do", "re", "mi" that were passed in plus one new row that the LLM generates for "fa". This allows the LLM to make a well-informed prediction since it can still look at all 4 rows of K and V to see what came before "fa". Importantly, it produces exactly the same results as passing all 4 tokens as inputs to the LLM.
_{V is also made up of 3 reused rows, just like K.}This process of reusing K and V is the KV caching that we want to implement today.
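The equivalence claimed above can be checked numerically. This is a minimal sketch, assuming a single attention layer with random stand-in weights (`wq`, `wk`, `wv` are hypothetical names) and the usual division by the square root of the column count, which the description here leaves out:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))              # stand-in embeddings for "Do", "re", "mi", "fa"
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# Without a cache: all 4 tokens are inputs and we keep only the last row.
full = attention(x @ wq, x @ wk, x @ wv)[-1]

# With a cache: K and V rows for the first 3 tokens come from a prior pass.
k_cache, v_cache = x[:3] @ wk, x[:3] @ wv
new = x[3:]                              # just "fa"
k = np.concatenate([k_cache, new @ wk])  # 3 reused rows + 1 new row
v = np.concatenate([v_cache, new @ wv])
cached = attention(new @ wq, k, v)[0]    # Q has a single row

print(np.allclose(full, cached))
```

The cached call does the same final computation with a 1-row Q instead of a 4-row Q, which is where the savings come from.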
Now that we know how the shape of these matrices corresponds to our input tokens, we can touch on the actual computation for attention.
First we multiply Q by K. We have to transpose K (swap its rows and columns) for the matrix multiplication to work.
Next we take the result of this multiplication (with Q's number of rows and K's number of rows as columns) and apply a function called softmax. This doesn't change the matrix's shape. This matrix is multiplied by V in the second matrix multiplication which gives us a final matrix that has the same shape as Q originally.
This final matrix then proceeds on through the rest of the LLM. There is more to attention, but this should be enough to follow along below. (If not, let me know on Twitter.)
It is often convenient to vary the internal workings of an LLM on the fly. The Neural Engine does not allow this.
For a model to run on the ANE it must have input, output, and intermediate matrices that all have static shapes. They cannot change between calls to the model. The computation graph of the model must also be static. This means no conditional branching even if the intermediate tensors have the same shapes.
Both of these constraints can be slightly relaxed in some circumstances but we will stick with the rigid definition for simplicity.
We need to pick static sizes for the matrices Q, K, and V. The number of columns is predetermined and constant, so we only need to choose the number of rows. Let's start by focusing on just Q and K, the first attention multiplication, for simplicity.
We'll give K 512 rows. This means the LLM can look back at 512 recent tokens (word chunks) at most in order to predict the next token. This is usable and we can scale it up if needed.
Picking a size for Q is more interesting. The size of Q is equal to the number of input tokens. This size determines how many tokens we can add at once to K for future predictions (typically >1) and how many new tokens we want to predict (typically 1).
These correspond to the two stages of KV-cached LLM processing. Prefill: when the LLM ingests your prompt and builds up a cache. Generation: when the LLM responds.
Since we are restricted to static sizes we need to pick a Q that works for both prefill and generation. This means that a call to an ANE LLM always processes the same number of tokens and always takes the same amount of time, regardless of processing stage.
If we pick a small size for Q, generation will be fast but prefill will be slow since it has to make many calls to process every word in your prompt. But a big size for Q means that generation does a lot of wasted work. We only care about one new token each time but have to multiply a big Q times K.
_{Neither of these is ideal.}
The extremes are no good, so we'll split the difference and give Q 64 rows. This means we can process 64 tokens in each forward pass. It will take at most 8 calls to process a full 512 token prompt (8*64=512). These 8 calls take the same amount of time as the first 8 tokens in the generation phase which seems like a reasonable balance. 64 is also a multiple of 8, which aligns with the ANE hardware.
_{The Goldilocks zone.}
If you are planning to process longer prompts and generate fewer tokens, you might consider a larger Q. Similarly if your prompts will frequently be shorter, a smaller Q will buy some speed. Either way, be sure to benchmark. Performance is often nonlinear.
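The tradeoff can be put into rough numbers. This is a small sketch, assuming prefill consumes one Q's worth of prompt tokens per call and generation yields one token per call; `total_calls` is a hypothetical helper, the 100 generated tokens are an arbitrary choice, and it ignores that the last prefill call already predicts the first new token:

```python
import math

def total_calls(prompt_tokens, new_tokens, q_rows):
    # Prefill: q_rows prompt tokens per call. Generation: 1 new token per call.
    return math.ceil(prompt_tokens / q_rows) + new_tokens

for q_rows in (1, 64, 512):
    print(q_rows, total_calls(prompt_tokens=512, new_tokens=100, q_rows=q_rows))
# 1 612
# 64 108
# 512 101
```

Call counts alone are not timings: a call with a 512-row Q does far more work than one with a 64-row Q, which is why benchmarking matters.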
To make prefill work, we need to be able to process 64 new tokens at a time. This means we need to compute from scratch the entire Q and also the newest 64 tokens of K on each pass through the LLM. Lucky for us, we don't need to compute the trailing 448 (512 - 64) tokens of K. We can reuse ones that were previously computed. These reused tokens are the K in "KV cache" and not computing them saves a whole lot of computation and time.
Typically a KV cache is implemented statefully: one long-lived K matrix that continually appends the newest token's entries each time the LLM is called.
This is a no go on ANE, so instead we use a sliding window approach. Each pass through the LLM takes the 64 new tokens and concatenates the 448 next-newest tokens to get the full 512 length K matrix.
These 448 next-newest token K matrices are passed as inputs to the LLM. This means the LLM only needs to do a single concatenation to get the full 512 K. Memory operations, like concat, are slow and only doing 1 is close to the minimum (of zero!).
_{Only 1 concat to get K!}
We do need to actually slide K though, so we return the 64 new K tokens from the model and use a secondary model to combine the old 448 and new 64 into an updated 448 K input.
We have to do this in between every LLM call during prefill since we want all 64 tokens to go into the cache immediately. But we only have to do it once every 64 LLM calls during generation: we can reuse the same 448 K until we have a full 64 new tokens.
_{Secondary model to update the cache every 64 tokens. The oldest 64 tokens are discarded.}
The secondary cache sliding model lets us use the ANE when it would otherwise be idle. This is actually faster than using a single model even during prefill. It's significantly faster during generation.
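The two concatenations described above can be sketched with plain arrays. This is only an illustration of the shapes, not the actual Core ML models; `slide_cache` and the width `DIM` are assumptions made for the example:

```python
import numpy as np

CACHE_ROWS, NEW_ROWS, DIM = 448, 64, 128   # DIM is an arbitrary illustrative width

def slide_cache(old_cache, new_rows):
    # Drop the oldest rows and append the newest. Output shape equals the input cache shape.
    return np.concatenate([old_cache[new_rows.shape[0]:], new_rows], axis=0)

k_cache = np.zeros((CACHE_ROWS, DIM), dtype=np.float16)   # passed in as a model input
new_k = np.ones((NEW_ROWS, DIM), dtype=np.float16)        # computed inside the model

# Inside the main model: a single concat yields the full 512-row K.
full_k = np.concatenate([k_cache, new_k], axis=0)

# Between calls (the secondary model's job): build the next 448-row cache input.
next_cache = slide_cache(k_cache, new_k)
print(full_k.shape, next_cache.shape)
```

Every shape here is fixed ahead of time, which is what the static-shape constraint requires.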
If we really hate the idea of using two models, we can make our single model return a pre-slid K matrix. This works but is slow. You have to construct and return many K matrices that you don't actually need and since these are big matrices it takes time just to shuffle them around inside the model (remember, concat is slow).
This leaves us with a nicely optimized sliding K cache:
We've minimized how often we concat, but that's not the only memory operation in attention. Transposing a matrix (flipping it diagonally) is slow too and we have to transpose K in order to multiply it with Q.
We can lean on our K cache here to minimize this memory movement. Instead of waiting until we have the full 512 length K in hand, we can transpose just the new 64 length K which is smaller and transposes faster. Only then do we concat it with the 448 length K, which comes into the model already transposed, to get our full 512 K.
To make this work, we output the transposed 64 K and update our secondary model to work with transposed Ks.
This is basically a free speed up.
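A quick check that transposing only the new block gives the same scores as transposing the full K. This is a sketch with random data; the width `dim` is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128
q = rng.normal(size=(64, dim))
k_cache_t = rng.normal(size=(dim, 448))   # cache arrives already transposed
new_k = rng.normal(size=(64, dim))

# Transpose only the small new block, then concat along the token axis.
k_t = np.concatenate([k_cache_t, new_k.T], axis=1)   # (dim, 512)
scores = q @ k_t                                      # (64, 512)

# Reference: build the full 512-row K first, then transpose all of it.
reference = q @ np.concatenate([k_cache_t.T, new_k], axis=0).T
print(np.allclose(scores, reference))
```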
Up to this point we have been talking about a single K cache matrix. A real transformer model has many K caches. For instance Llama 7B has 32. This is a lot of matrices to juggle so it is common to see the KV cache stored as a single tensor that contains all of them. On ANE this requires several concatenations that would be nice to avoid. To do so we take in and return each K cache individually. The extra bookkeeping is straightforward and worth it.
The second matrix multiplication is much less interesting than the first but it is important so let's touch on it briefly. Our goal is to multiply the result of Q*K, called W, with V.
V is the same size as K so we can use the same sliding window cache approach. We don't need to do the transpose trick with V because of how the matrix shapes work out.
For convenience we can make our secondary model process both a K and a V at the same time.
That's all there is to it. You now have all the pieces of a static shaped KV cache attention that works on Apple Neural Engine.
_{The input/output widths are to scale, but the KV cache is much much deeper.}
You should see a nontrivial speed up compared to a cacheless model that processes the same number of tokens. For example I have a Llama 2 7B model that saw approximately a 4x speedup.
I also want to touch on a couple things that didn't quite pan out. I'm hopeful there are opportunities to improve and maybe these will give someone an idea.
The purist in me hates using two models to juggle the KV cache. We can avoid it.
Instead of taking the 448 length K cache as an input to the model, we can take in the 7 separate 64 length chunks that make it up. Sliding our cache then just becomes a matter of removing the oldest chunk and adding the newest one.
_{Removing the old chunk and adding the new chunk is zero-cost. But the concat is slow.}
This completely eliminates the need for a second model, but it means we have to concatenate all 8 chunks to get the full K.
Sadly this concat is slow. Very slow. So this approach is dead. Unless…
Turns out you don't actually need to concat the full K before multiplying by Q.
You can multiply each K chunk by Q individually, then hang onto some extra statistics that allow you to compute the rest of attention.
_{You can trade the final concat for 7x additions, but that's slow too.}
This is called the lazy softmax trick (link) and its main selling point is it reduces memory pressure caused by attention. That reduction is traded for, as you might guess, speed. So this is also slow.
Additionally even if it was fast we would need some creative solution to avoid concatenating and summing at the very end.
So I think this too is a dead end for now.
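For reference, here is the chunk-by-chunk idea in a minimal NumPy sketch. It assumes the common formulation that keeps a running row maximum, denominator, and numerator as the "extra statistics"; `chunked_attention` is a hypothetical name and this is not the code that was benchmarked on the ANE:

```python
import numpy as np

def chunked_attention(q, k_chunks, v_chunks):
    # Running statistics per Q row: max score, softmax denominator, weighted sum of V.
    running_max = np.full((q.shape[0], 1), -np.inf)
    denom = np.zeros((q.shape[0], 1))
    numer = np.zeros((q.shape[0], v_chunks[0].shape[1]))
    for k, v in zip(k_chunks, v_chunks):
        scores = q @ k.T
        new_max = np.maximum(running_max, scores.max(axis=1, keepdims=True))
        rescale = np.exp(running_max - new_max)   # re-base earlier sums onto the new max
        weights = np.exp(scores - new_max)
        denom = denom * rescale + weights.sum(axis=1, keepdims=True)
        numer = numer * rescale + weights @ v
        running_max = new_max
    return numer / denom

rng = np.random.default_rng(0)
q = rng.normal(size=(64, 32))
k = rng.normal(size=(512, 32))
v = rng.normal(size=(512, 32))
k_chunks, v_chunks = np.split(k, 8), np.split(v, 8)   # eight 64-row chunks

# Reference: ordinary softmax attention over the full, concatenated K and V.
scores = q @ k.T
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ v

lazy = chunked_attention(q, k_chunks, v_chunks)
print(np.allclose(lazy, reference))
```

The full K is never concatenated, but each chunk adds a rescale and a sum, which is the extra work described above.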
There are a couple of places we can potentially squeeze more speed from:
The newest version of iOS/macOS has a feature to enable a stateful KV cache. If you don't care about old OSes, this might be worth a look.
The fact that we recompute 64 tokens each time means we could add some form of multi-token prediction basically for free. There is some research into models that predict many tokens instead of one. There are also speculative decoding methods that could work.
Tweet me or open an issue on GitHub if you have other ideas or questions!
Quantization is often touted as a way to make large language models (LLMs) small enough to run on mobile phones. Despite this, very few of the latest methods are able to use the full power of Apple Silicon on iPhone and Mac. This post introduces a new method of quantization that can.
This method is compatible with all three Apple Silicon coprocessors (CPU, GPU, Neural Engine) which allows it to take full advantage of the speed/battery tradeoffs offered by the hardware.
When compared to standard 4 bit Apple Silicon-compatible methods, it produces consistently more accurate results. Additionally it approaches the accuracy of GPTQ, a widely-used method that is not compatible.
Finally, this method is comparatively accessible. Access to Colab free tier and an Apple Silicon MacBook is sufficient to quantize the full family of GPT-2 models up to 1.5B parameters.
_{Lower is better in these plots. You want a smaller model with better performance (lower perplexity is better). In the first plot, naive clustering occasionally performs well but is erratic. In the second, GPTQ is better but cannot run fully on Apple Silicon.}
This method extends SqueezeLLM, and remixes ideas from both SmoothQuant and AWQ. It was developed concurrently with OneBit, and shares some similar ideas. Thank you for sharing your research and code!
When you ask an LLM a question, that text gets transformed into a matrix of numbers. This matrix is then transformed repeatedly with a bunch of math until a final transformation that converts it from numbers to the first few characters of the LLM's response.
_{This is accurate for our purposes.}
Within these repeated transformations there are many times where the input matrix is multiplied with different hardcoded matrices. These hardcoded matrices can be quite large and end up accounting for most of the space that an LLM takes up. For instance LLaMa 7B, Facebook's open-source LLM, is 13.5GB and 12.9GB of that is the numbers that make up these large matrices.
Typically the matrix's values are stored as 16 bit (2 byte) or 32 bit (4 byte) floating point numbers. For LLaMa a typical matrix is 4096x4096 which means it takes roughly 33.5MB on its own at 16 bits. Shrinking those 16 bit elements to 4 bits brings the size of that matrix to roughly 8.4MB. Doing the same for every matrix brings the whole model from 13.5GB to just under 4GB.
Instead of measuring this compression in bytes-per-matrix, it is measured in bits-per-element (1 byte = 8 bits). This makes it easier to compare across matrices of different sizes and also gives us some flexibility to store a few extra values alongside our matrix. This is actually fairly common. Including them in the bits-per-element calculation makes for fair comparisons.
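The arithmetic behind these numbers is short enough to write down. This is a sketch with hypothetical helper names; megabytes are taken as 10^6 bytes, and the 4096 extra values in the last line are an invented example of values stored alongside a matrix:

```python
def matrix_megabytes(rows, cols, bits_per_element):
    # Total bits -> bytes -> megabytes (10^6 bytes).
    return rows * cols * bits_per_element / 8 / 1e6

def effective_bits(rows, cols, bits, extra_values=0, extra_bits=16):
    # Extra values stored alongside the matrix are averaged over all of its elements.
    return bits + extra_values * extra_bits / (rows * cols)

print(matrix_megabytes(4096, 4096, 16))   # 33.554432
print(matrix_megabytes(4096, 4096, 4))    # 8.388608
print(effective_bits(4096, 4096, 4, extra_values=4096))   # 4.00390625
```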
So, in summary, LLMs do a bunch of math with matrices in order to generate replies. These matrices are big and quantization's goal is to make them smaller without losing the LLM's ability to reply. This shrinks the model and lets us run it on less powerful devices, like your phone.
If you take the weight from a linear layer (one of the matrices we talked about above) out of an LLM and look at the distribution of its elements, you will generally see a bell curve.
_{It's not a perfect bell curve, but it's close enough to be useful.}
Nearly all recent quantization schemes are uniform which means they take this bell curve and pick two values for it. They pick a starting point and also a step size which they then use to place equally-spaced points along the x-axis. To actually quantize the matrix they simply snap all matrix elements to the nearest point.
_{The x points are equally spaced.}
This is a non-optimal use of space in our quantized linear layer. The points on the edges of the bell curve barely capture any matrix elements, but they consume the same amount of space in the LLM as points in the middle which represent many. They are simply not an effective use of bits. (This is fast on GPUs though which is why everyone uses them.)
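To make the "starting point plus step size" idea concrete, here is a minimal sketch on synthetic bell-curve data. `uniform_quantize` is a hypothetical helper, and real schemes differ in how they choose the range:

```python
import numpy as np

def uniform_quantize(w, bits=4):
    # Two stored values: a starting point and a step size.
    levels = 2 ** bits
    start = w.min()
    step = (w.max() - start) / (levels - 1)
    idx = np.round((w - start) / step).astype(int)   # index of the nearest point
    return start + idx * step, idx

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)        # stand-in for a bell-curve weight matrix
w_q, idx = uniform_quantize(w)

counts = np.bincount(idx, minlength=16)
print(counts)   # how many elements snapped to each of the 16 points
```

The counts for the outermost points come out far smaller than for the middle ones, which is the wasted space described above.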
A common solution for this is to break up the matrix into chunks of either rows, columns, or groups. Having fewer elements in a chunk tends to make quantization more accurate by narrowing the bell curve (more or less) and it only costs a few extra fractions of a bit on average. This sufficiently minimizes the awkwardness of fitting equally-spaced points to bell curve-distributed matrix elements.
Unfortunately Apple Silicon does not support this chunking concept for low (<8) bit quantization. However it makes up for it by allowing models that use a nonuniform quantization scheme. On our bell curve from before, this means we can place our points anywhere we want. So we'll place them optimally.
What is optimal? For each matrix element we calculate how far it is from the nearest point. This is the element's quantization error. We place our points along the bell curve so that the sum of all elements' errors is as low as possible. (k-means clustering is a good way to do this.)
_{Notice how uniform puts points at the edges, but nonuniform is free to ignore the small number of matrix elements there.}
Placing the points optimally like this is all we need to do when we have a lot of points to place. Most LLMs will perform very well if we place 6 bits or 8 bits worth of points (64 and 256 respectively). However when we drop to 4 bits worth, which is only 16 points, this simple optimal placement is not enough.
_{If we were to shade these to show the error they would get darker going from top to bottom. The LLM performance also typically goes from nearly perfect, to good, to bad going from top to bottom.}
Our goal is to improve LLM performance when using 4 bits as much as possible. We achieve this by making 3 complementary modifications to the quantization process and the LLM itself.
The first modification comes directly from another paper, SqueezeLLM. The paper is fairly approachable, but we'll summarize the parts we're using.
It turns out that every element in these matrices is not equally important. In fact the top few percent are significantly more important. When we're placing our points optimally we should not treat every element equally, but let the more important elements have more sway. But how do we know what's important? We take a small number of input texts (100 is enough), send them through our LLM, and observe the impact each matrix element had on the LLM's response. The higher the total impact, the more important.
_{The triangles represent more important elements. The Naive method is optimal for a standard bell curve, but the importance aware method shifts closer to the triangles.}
So far we've been looking at our matrix as a single bell curve. A different way of thinking about it is to look at every column of the matrix as an independent entity that just happens to be joined together in this matrix. Similar to the matrix as a whole, each column's elements are roughly bell curve-shaped. Most of the bell curves have similar centers but they all have different standard deviations (how wide or narrow they are).
_{The columns of our matrix are all centered around zero, but the standard deviation varies: some are very wide while others are fairly narrow.}
If we divide the elements of each column by the column's standard deviation we make the bell curves roughly the same shape. This makes it easier to place our points since it prevents one column from having undue influence over the rest. (You can also think of it as squishing more elements towards the middle of the curve where we typically place more points.)
_{The same columns from above after dividing every element by each column's standard deviation. This reshapes the bell curves.}
It's important that we don't change the output of our LLM, and scaling each column independently changes it. So we need to take the per-column values that we divided by and correct for them somewhere else in our model. All of the matrices that we're quantizing are used in matrix multiplications. Since we divide each column, and each column of the matrix produces exactly one column of the multiplication's output, we can add a step in our LLM after the multiplication to re-multiply the removed values back in.
_{We divide before quantizing which makes quantization easier. We have to restore the scale factors at inference time, when the model is generating a response.}
This does mean we have to keep a few extra values as 16 bit (the ones we multiply back in). For a 768x768 matrix, we need 768 extra values in 16 bits. This puts us at 4.02 bits on average which is a reasonable tradeoff. The average number of bits decreases as the model and matrices get larger which makes this even less of a concern. (LLaMa is 4.003 or less depending on the version.)
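The divide-then-restore step can be checked with random matrices. This is a sketch that leaves out the quantization itself, so the restored output matches exactly; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 768))
# Give each column of W a different spread, like the plot above.
w = rng.normal(size=(768, 768)) * rng.uniform(0.5, 2.0, size=768)

col_std = w.std(axis=0)      # one value per column, the extras kept in 16 bits
w_scaled = w / col_std       # this is the matrix that would be quantized

# At inference: multiply, then restore each output column's scale.
restored = (x @ w_scaled) * col_std
print(np.allclose(restored, x @ w))

# Average bits per element: 4 bit matrix plus 768 scale values at 16 bits.
print(4 + 768 * 16 / (768 * 768))
```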
So far we've been looking closely at the matrix itself. Let's now look at how it is used. As mentioned above, these matrices are used for matrix multiplication. Specifically the model is performing matrix X times matrix W, X*W, and we are quantizing W. We've talked about how the elements in W are nicely distributed with an average close to zero and bell-shaped. This is not true for X. X depends on the text that the model received as input and can vary significantly in how its elements are distributed.
Why does this matter? Imagine a simple product of two numbers: x*w. Let's say that in this case our 1 element matrix, w, has the value of 2.3 and we quantize it to 2. When we do x*2 we get a quantization error of x*0.3. The closer that x is to zero, the less error. The farther it is from zero the more we get.
This extrapolates to matrix multiplication. When we look at a column of matrix X, if most of its values are far from zero then the impact of quantization error for that column will be larger.
Similar to our first modification, we can inspect a small number of texts as they flow through the LLM. If we do this we'll see that there is consistency in which columns of X, the input matrix, have larger or smaller values in general.
_{We're focusing on the columns of the left matrix. This matrix is derived from the input to the LLM.}
_{The distribution of average values for select columns. Notice that they are not centered at zero unlike the matrix we're quantizing.}
To minimize the impact of large values in X we can apply a per-input-column shift. This will move most of the values in X closer to zero on average, thereby reducing the impact of our quantization error. We shift by subtracting the average of the values we saw for that column.
_{The same columns from above after subtracting the average value from each column. They are now centered around zero.}
Similar to our second modification, we need to reverse this change in the model in order to not change its outputs. This one is a little trickier, but again easier to think about without matrices involved. If we take our x*w from earlier we can make a new shifted input y (so, y = x - shift). Now the model will do y*w which is actually (x - shift)*w. If we multiply that out we get x*w - shift*w. Since shift and w are both constants we just need to precompute shift*w and add it back after the matrix multiplication in the model. This undoes the impact of shifting X but reduces the error when w is quantized. (Extrapolating this to matrices is a little harder, but still doable.)
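Extended to matrices, the correction is a single precomputed vector. This is a minimal sketch under a few assumptions: rounding to the nearest integer stands in for a real quantizer, the shift comes from the same five rows rather than separate calibration texts, and the correction term is computed from the unquantized weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(5, 768))   # activations that are not centered at zero
w = rng.normal(size=(768, 768))
w_q = np.round(w)                        # crude stand-in for a real quantizer

shift = x.mean(axis=0)                   # per-input-column average
correction = shift @ w                   # constant vector, precomputed once

exact = x @ w
plain = x @ w_q                              # quantized, no shift
shifted = (x - shift) @ w_q + correction     # quantized, with shift*w folded back in

# Without quantization the shift is invisible in the output.
print(np.allclose((x - shift) @ w + correction, exact))
# With quantization the shifted version lands closer to the exact result.
print(np.abs(plain - exact).mean(), np.abs(shifted - exact).mean())
```

With `y = x - shift`, the only remaining error is `y @ (w_q - w)`, and it shrinks because the values in `y` sit closer to zero than those in `x`.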
Depending on the model this adds between zero and two additional vectors of 768 elements in 16 bits. At a worst case this brings us up to 4.06 bits total.
Used individually, these modifications have varying efficacy. Generally the first modification, from SqueezeLLM, works well on its own. When we add in the other two modifications we see consistent improvement. This leaves us with a quantization scheme that is both more accurate and less erratic than the baseline 4-bit method we wanted to improve upon.
_{SqueezeLLM (Weighting) is surprisingly effective on its own, even at the whole-tensor level. Adding our other modifications consistently improves upon it. The improvement for gpt2-large is negligible, which is something interesting to follow up on.}
tl;dr We used an amalgamation of existing and, I think, new ideas to quantize LLM linear layers to ~4 bits on average, dramatically shrinking model size. Since we do this at the tensor level without grouping, this method is fully compatible with Apple Silicon on iPhone or Mac which opens the door for larger models on your devices.
Thanks for reading! To stay in the loop as I explore more, you can give me a follow on Twitter. If you'd like to give it a go yourself, I've got a drop-in replacement for torch's Linear layer, as well as some instructions: here. Please get in touch, ask questions, and let me know what you learn!
Part of my motivation for writing this is to find folks who are smarter than me, who can maybe check my work, and maybe even take it further. If that's you, do please reach out! There are a couple of directions that I think still have more to give / would be interesting to explore:
To further support that these modifications are complementary, wikitext perplexity was measured in all possible combinations. As mentioned above, gpt2-large is an outlier but the differences are minor. The SqueezeLLM Fisher information values (sensitivities) were computed using C4 in all cases.
| Model | Weighting | Scaling | Shifting | Weight+Scale | Weight+Shift | Scale+Shift | Weight+Scale+Shift |
|---|---|---|---|---|---|---|---|
| gpt2 | 30.8947 | 43.0891 | 44.9285 | 28.8972 | 29.1401 | 43.5065 | 28.1946 |
| gpt2-medium | 21.4389 | 30.9959 | 23.8464 | 20.4853 | 19.8515 | 23.6801 | 19.904 |
| gpt2-large | 17.2172 | 22.5075 | 18.1454 | 17.1246 | 17.1282 | 25.2589 | 17.1507 |
| gpt2-xl | 16.1751 | 15.89 | 15.7223 | 15.148 | 16.0874 | 17.0936 | 15.1148 |

| Model | float16 | naive 4-bit | Weight+Scale+Shift | GPTQ |
|---|---|---|---|---|
| gpt2 | 25.1876 | 62.1889 | 28.1946 | 26.5 |
| gpt2-medium | 18.4739 | 23.7826 | 19.904 | 19.1719 |
| gpt2-large | 16.4541 | 27.3636 | 17.1507 | 16.6875 |
| gpt2-xl | 14.7951 | 15.89 | 15.1148 | 14.9297 |
Apple's latest OSes include several transformer models that are optimized for the Apple Neural Engine. We'll take a look at how they're implemented and see if there's anything we can apply to our own models. To make that easier, I've cobbled together support for viewing them in Netron. You can try it yourself here.
While everyone is talking about AI or GPT, Apple made a point to use the words "machine learning" and "transformer" when announcing new features for this year's operating systems (iOS 17 and macOS Sonoma).
Apple has been vocal about their Machine Learning accelerator, the Neural Engine (ANE), so it's no surprise that these models are designed to leverage its capabilities.
In contrast to their normal secrecy, Apple has been fairly public about how to run the transformer model architecture on the ANE. In the past year and a half they:
The models embedded in the new OS are not quite as easily inspected as a research article or GitHub project. However they are a year newer. Let's see what we can learn from them!
This is most interesting if you're familiar with transformers and how they work. However if you are just generally curious I've tried to add explainers throughout to fill in some background.
They'll look like this.
Feel free to skip them.
We'll look at two models today. One powers the keyboard autocomplete, and the other does speech to text. Both use the transformer architecture to a degree.
_{The input and first layer of the autocomplete model, annotated.}
We won't go too deep into the models individually, rather just highlight the interesting bits.
Model: Keyboard Autocomplete
The outputs of a transformer are a bunch of probabilities for which token out of the vocab should come next. To compute these, you need to load a large mapping from token ID to embedding vector into memory.
One dimension of this mapping matrix is equal to the number of tokens in the vocabulary. For many models this is quite large. gpt2 (2019) has 50,257 tokens in its vocabulary. LLaMa and Llama2 (2023) have 32,000.
Apple's autocomplete model only has 15,000. Not only is this number smaller, it is also just underneath the Neural Engine's threshold for tensor size. This means that the final computation to determine probabilities can happen on the Neural Engine instead of paying the cost to transfer to CPU.
_{The inner_product here is the language modeling (lm) head.}
Lesson: If possible, keep your vocab under 16384. ^{[1]}
_{[1] If you don't have control of this, you can duplicate the embedding matrix and do most of the computation on ANE. Here's an example.}
Model: Speech to Text
When using transformers for text generation, a common way to speed them up is to use KV caching. This saves you a decent amount of computation.
_{An example of how the Key (K) cache is used. With traditional KV caching, the input is 1 token and the cache is the size of all past tokens.}
In most implementations, the size of the KV cache increments for each new token. The ANE requires that a model's inputs and outputs are a fixed size^{*}, which means a traditional KV cache is off the table.
_{*not strictly true, but practically}
You can use KV caching for any transformer model, not just text generation, and it seems that Apple has found a way to make it work for their speech-to-text model.
They have sidestepped the ANE constraints by using a fixed size input for their new tokens and sliding their KV cache by that same amount for each inference.
_{Apple's KV cache slides so that the inputs are always the same size. In this example there are always 2 input tokens and cache that encodes 3 tokens. This gives an effective sequence length of 5.}
This gives a meaningful speed up (2-5x in my experience). However there are two caveats.
First, you have to use IOSurface-backed inputs and outputs, otherwise all of the speed gained is lost again by time spent copying them in and out of CoreML. Second, if you are on Sonoma/iOS17, you can't have any CPU segments at the start of your model or it will be really slow. This seems like a regression so I have filed feedback.
Lesson: Use KV caching. If you're on Sonoma/iOS17, do your CPU work in a separate model.
The KV cache is actually a concatenation of caches for two different tensors: a Key (K) and Value (V). Often these are combined into one cache for simplicity, but Apple keeps them separate.
Why keep them separate? First, you can store the Key cache transposed instead of transposing it before using it. Transposing large tensors is extra work that you can avoid (this is in line with Apple's principle of "minimize memory copies"). Secondly, the KV cache is a large tensor and by separating it into two, you keep the intermediate tensors smaller.
_{Separate caches for K and V and K is transposed.}
I don't see much impact from this, but it makes sense to me since you are avoiding work.
Lesson: Maybe transpose your K cache and keep it separate from the V cache.
Model: Both
One of the optimizations Apple recommends for the Neural Engine is to use a layer norm that normalizes along an uncommonly used axis. PyTorch's layer norm doesn't support this, so Apple provides a multi-step manual implementation.
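To illustrate what normalizing along the uncommon axis means, here is a NumPy sketch. It assumes a (batch, channels, 1, sequence) layout with the normalization over the channel axis, and it omits the learned weight and bias; `layer_norm_channels` is a hypothetical name, not Apple's code:

```python
import numpy as np

def layer_norm_channels(x, eps=1e-5):
    # x: (batch, channels, 1, sequence). Normalize over axis 1 instead of the last axis.
    zero_mean = x - x.mean(axis=1, keepdims=True)
    denom = 1.0 / np.sqrt((zero_mean * zero_mean).mean(axis=1, keepdims=True) + eps)
    return zero_mean * denom

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16, 1, 5))
out = layer_norm_channels(x)

# Reference: the usual last-axis layer norm on a (batch, sequence, channels) view.
x_bsc = x[:, :, 0, :].transpose(0, 2, 1)
ref = (x_bsc - x_bsc.mean(-1, keepdims=True)) / np.sqrt(x_bsc.var(-1, keepdims=True) + 1e-5)
print(np.allclose(out[:, :, 0, :].transpose(0, 2, 1), ref))
```

The math is the same as a standard layer norm. Only the axis differs, so the tensor never has to be transposed into a channels-last layout.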
I was curious to see what Apple used for the layer norm for two reasons. First, on Ventura/iOS 16 I found that the layer_norm (specifically the reduce_mean) caused my models to lose precision in float16. Second, CoreML has native support for layer norm along the uncommon axis and I was curious if it would be used.
Interestingly enough, it seems like Apple uses the same implementation that they open sourced in ml-ane-transformers. You can even see that most of the variable names line up!
_{Almost exactly the same! I am slightly confused by the alpha in the zero_mean though.}
I was hoping for something creative here, but on the plus side it seems that layer norm is more resilient in float16 on the new OSes.
Lesson: Just use Apple's custom layer norm.
Model: Both
Both models use quantization to reduce the size of their weight parameters. Transformer models are often bottlenecked by the amount of weight parameters they have to load and then unload. The new OSes have support for runtime dequantization which helps reduce this bottleneck.
This can reduce the accuracy of your model, so keep an eye on that.
Lesson: Try quantizing your model. Two good sources: the coremltools docs and this Hugging Face/ml-stable-diffusion article.
There are a couple other things I noticed but I don't know how to take advantage of them. Despite that, they are still interesting in and of themselves—if you see a way to use them, please let me know!
Single Input: The text autocomplete model takes 3 inputs: 128 token IDs, 128 position values and 128 segment values. It passes them to the model as one concatenated input and then immediately splits them. I'm not sure what the benefit of this is, but it seems slightly odd so maybe there is one?
_{In the autocomplete model, the 3 embedding fields are passed as one input.}
Shared Weights: The text autocomplete model actually has two versions, one for CPU and one for ANE. They are slightly different (different inputs and outputs), but they both share the same weights. I don't believe this is currently possible using Apple's provided tooling, but it does open up some interesting possibilities. To achieve something similar today you have to ship two copies of the same weights.
$ head -n2 unilm_joint_ane.espresso.net
{
"storage": "unilm_joint.espresso.weights",
$ head -n2 unilm_joint_cpu.espresso.net
{
"storage": "unilm_joint.espresso.weights",
Multi-Head Softmax: Apple's implementation of the transformer in ml-ane-transformers splits a large matrix multiplication up into several smaller ones, then performs a softmax on each result. In contrast, the autocomplete model concatenates the results of the split matrix multiplications, performs one softmax, then re-splits that. I didn't see any performance difference from doing this, but I was only looking at speed.
Extra Outputs: The CPU version of the autocomplete model outputs the next token logits, but also the pre-logit embeddings. This isn't super novel, but worth mentioning since the cost of getting already-existing data out of the model seems to be fairly low if you use IOSurface-backed buffers as mentioned above. This might be counterintuitive since some of these outputs can be rather large.
Those are the eight things that stood out to me from looking at Apple's new models. Four of them are useful, four of them are just interesting.
If you'd like to look for yourself, you can find the models here on macOS Sonoma:
/System/Library/LinguisticData/RequiredAssets_en.bundle/AssetData/en.lm/unilm.bundle
find /System/Library/AssetsV2/com_apple_MobileAsset_Trial_Siri_SiriUnderstandingAsrAssistant -name "AMConformer"
I have a hacky fork of Netron here that can open them (it will only open the first 3000 operations of the Speech to Text model since it is huge).
If you find anything interesting or if I misinterpreted something I would love to know. Drop me a line!
]]>So you have some time series data and you want to make it smaller? You may not need an algorithm designed specifically for time series. Generic compressors like gzip work quite well and are much easier to use.
Of course this depends on your data, so there’s some code you can use to try it out here.
Recently I started working on a way to save Bluetooth scale data in my iOS coffee-brewing app. I want to allow people to record from a scale during their coffee-brewing sessions and then view it afterwards. Scale data is just a bunch of timestamps and weight values. Simple, yes, but it felt like something that might take a surprising amount of space to save. So I did some napkin math:
1 scale session / day
10 minutes / session
10 readings / second
= 2.19M readings / year
1 reading = 1 date + 1 weight
= 1 uint64 + 1 float32
= 12 bytes
2.19M * 12B = 26 MB
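The same napkin math as a few lines of Python, in case you want to plug in your own numbers. The 365-day year and 1 MB = 1,000,000 bytes are my assumptions; the constant names are just for illustration.

```python
SESSIONS_PER_DAY = 1
MINUTES_PER_SESSION = 10
READINGS_PER_SECOND = 10
BYTES_PER_READING = 8 + 4  # one uint64 timestamp + one float32 weight

# Assumes a 365-day year.
readings_per_year = (
    SESSIONS_PER_DAY * MINUTES_PER_SESSION * 60 * READINGS_PER_SECOND * 365
)
# Assumes 1 MB = 1,000,000 bytes.
megabytes_per_year = readings_per_year * BYTES_PER_READING / 1_000_000

print(readings_per_year)   # 2190000
print(megabytes_per_year)  # 26.28
```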
26 MB per year is small by most measures. However in my case I keep a few extra copies of my app’s data around as backups so this is more like ~100MB/year. It’s also 40x the size of what I’m saving currently! This puts my app in danger of landing on the one Top Apps list I would not be stoked to be featured on:
_{iCloud storage usage}
So let’s avoid that. At a high level I see two options:
Save less. 10 scale readings/second is probably more granularity than we’ll ever need. So we could just not save some of them. Of course if I’m wrong about that, they’re gone forever and then we’ll be out of luck.
Save smaller. Looking at some example data, there are a lot of plateaus where the same value repeats over and over. That seems like it could compress well.
_{Example brewing session time series}
This is my first rodeo with compression. I’m starting from basics like “compression makes big things small” and “double click to unzip”. Doing a little research seems like a good idea and it pays off.
My scale data is technically “time series data” and it turns out we are not the first to want to compress it. There is a whole family of algorithms designed specifically for time series. This blog post is a great deep dive, but for our purposes today we’ll be looking at two of the algorithms it mentions: simple8b and Gorilla.
Algorithms designed for exactly my problem space sound ideal. However something else catches my eye in a comment about the same blog post:
rklaehn on May 15, 2022:
I have found that a very good approach is to apply some very simple transformations such as delta encoding of timestamps, and then letting a good standard compression algorithm such as zstd or deflate take care of the rest.
Using a general purpose algorithm is quite intriguing! One thing I’ve noticed is that there are no Swift implementations for simple8b or Gorilla. This means I would have to wrap an existing implementation (a real hassle) or write a Swift one (risky, I would probably mess it up). General purpose algorithms are much more common and sidestep both of those issues.
So we’ll look at both. For simplicity I’ll call simple8b and Gorilla the “specialist algorithms” and everything else “generalist”.
Starting with the specialists seems logical. I expect they will perform better which will give us a nice baseline for comparison. But first we need to smooth out a few wrinkles.
While wiring up an open-source simple8b implementation I realize that it requires integers and both our timestamp and weight are floating point numbers. To solve this we’ll truncate to milliseconds and milligrams. A honey bee can flap its wings in 5 ms. A grain of salt is approximately 1 mg. Both of these feel way more precise than necessary but better to err on that side anyway.
49.0335097 seconds
17.509999999999998 grams
49033 milliseconds
17509 milligrams
We’ll use this level of precision for all our tests except Gorilla, which is designed for floating point numbers.
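A minimal sketch of that truncation, using the example values above. The function names are made up, and this is Python rather than the Swift the app uses. Note that it truncates instead of rounding, which is why 17.509999999999998 grams becomes 17509 and not 17510.

```python
def to_milliseconds(seconds: float) -> int:
    """Truncate (not round) a timestamp in seconds to whole milliseconds."""
    return int(seconds * 1000)


def to_milligrams(grams: float) -> int:
    """Truncate (not round) a weight in grams to whole milligrams."""
    return int(grams * 1000)


print(to_milliseconds(49.0335097))        # 49033
print(to_milligrams(17.509999999999998))  # 17509
```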
Negative numbers show up semi-frequently in scale data because often when you pick something up off a scale it will drop below zero.
Unfortunately for us simple8b doesn’t like negative numbers. Why? Let’s take a little detour and look at how computers store numbers. They end up as sequences of 1s and 0s like:
0000000000010110 is 22
0000000001111011 is 123
0000000101011110 is 350
You’ll notice that these tend to have all their 1s on the right. In fact, only very large numbers will have 1s on the left. simple8b does something clever where it uses 4 of the leftmost spaces to store some 1s and 0s of its own. This is fine for us. We’re not storing huge numbers so those leftmost spaces will always be 0 in our data.
Now let’s look at some negatives.
1111111111101010 is -22
1111111110000101 is -123
1111111010100010 is -350
This is not great, the left half is all 1s! simple8b has no way of knowing whether the leftmost 1 is something it put there or something we put there so it will refuse to even try to compress these.
One solution for this is something called ZigZag encoding. If you look at the first few positive numbers, normally they’ll look like this:
0000000000000001 is 1
0000000000000010 is 2
0000000000000011 is 3
0000000000000100 is 4
ZigZag encoding interleaves the negative numbers in between so now these same 0/1 sequences take on a new meaning and zig zag between negative and positive:
0000000000000001 is -1 (zig)
0000000000000010 is 1 (zag)
0000000000000011 is -2 (zig)
0000000000000100 is 2 (zag)
If we look at our negative numbers from earlier, we can see that this gets rid of our problematic left-side 1s.
| # | Normal | ZigZag |
| --- | --- | --- |
| -22 | 1111111111101010 | 0000000000101011 |

We only need this for simple8b, but it can be used with other integer encodings too. Kinda cool!
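Here is a small sketch of ZigZag encoding in Python. It uses the common shift-and-XOR formulation (the same mapping Protocol Buffers uses for signed integers). The `bits` parameter is an assumption about how wide your integers are, and the function names are mine.

```python
def zigzag_encode(n: int, bits: int = 64) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ...

    Assumes n fits in a signed integer that is `bits` wide.
    """
    return (n << 1) ^ (n >> (bits - 1))


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


for n in (-1, 1, -2, 2, -22):
    encoded = zigzag_encode(n)
    print(n, format(encoded, "016b"), zigzag_decode(encoded))
```

The last line printed shows -22 turning into 0000000000101011, matching the table above.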
Technically we could run our tests now, but we’re going to do two more things to eke out a little extra shrinkage.
First is delta encoding. The concept is simple: you replace each number in your data set with the difference (delta) from the previous value.
timestamp,mass
1691452800000,250
1691452800103,253
1691452800305,279
…
timestamp_delta,mass_delta
1691452800000,250
103,3
202,26
…
Visually these already look smaller. Amusingly enough they actually are smaller. We’ll use this for all algorithms except Gorilla which does delta encoding for us.
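Delta encoding is only a few lines. This is a sketch in Python with made-up function names, run on the three example rows from above. The first value is kept as-is so the original series can be rebuilt with a running sum.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value's change from the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    return list(accumulate(deltas))


timestamps = [1691452800000, 1691452800103, 1691452800305]
masses = [250, 253, 279]

print(delta_encode(timestamps))  # [1691452800000, 103, 202]
print(delta_encode(masses))      # [250, 3, 26]
```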
The second tweak relates to the ordering of our data. So far we’ve been talking about time series as pairs of (timestamp, mass) points. Both specialist algorithms require us to provide a single list of numbers. We have two choices to flatten our pairs:
Choice 1: [first_timestamp, first_mass, second_timestamp, second_mass, …]
Choice 2: [first_timestamp, second_timestamp, … last_timestamp, first_mass, second_mass, …]
Choice 2 compresses better on all algorithms (generalist too) even when we apply it after delta encoding. Again, Gorilla does its own thing–are you seeing the trend?
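The two flattening choices, sketched in Python on the example rows from the delta encoding section (function names are hypothetical). One plausible reason Choice 2 helps, though I have not verified it, is that it keeps similarly sized numbers next to each other.

```python
points = [(1691452800000, 250), (1691452800103, 253), (1691452800305, 279)]


def flatten_interleaved(points):
    """Choice 1: [t1, m1, t2, m2, ...]"""
    return [value for point in points for value in point]


def flatten_columnar(points):
    """Choice 2: [t1, t2, ..., m1, m2, ...]"""
    timestamps = [t for t, _ in points]
    masses = [m for _, m in points]
    return timestamps + masses


print(flatten_interleaved(points))
print(flatten_columnar(points))
```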
We’ve truncated and pre-encoded, so let’s see some results.
| Algorithm | Ratio 1 | Ratio 2 | Ratio 3 | Avg. Ratio | Avg. MB/year |
| --- | --- | --- | --- | --- | --- |
| simple8b | 6.92 | 5.4 | 7.18 | 6.5 | 4 |
| gorilla | 6.72 | 4.18 | 6.88 | 5.9 | 4.4 |

Ratios: higher is better. MB/year: lower is better.

I tested with three different types of scale recordings for a bit of variety, then backed out the MB/year from the average compression ratio. Going from 26 MB/year to under 5 is a great result!
Similar to the specialist algorithms, we have a few choices to make before we can run our tests on the generalists.
For simplicity we’re going to format our data as CSV. This might seem a little odd but it has a few perks:
We’ll use delta encoding like above–it’d be silly not to. We could really stretch the definition of CSV and stack all of the timestamps on top of all the masses into a single column, but that sacrifices a bit of readability so we won’t.
There are a lot of general purpose compression algorithms. One popular benchmark lists over 70! We’re going to pick just 5. They are: zlib, LZMA, zstd, Brotli and LZFSE.
We’ve narrowed it down from 70 to 5, but there’s another curveball. Unlike the specialist algorithms which have no configuration options, most generalist algorithms let you choose a level that trades off speed for better compression. You can compress fast or slow down to compress more.
For simplicity (and so I don’t have to show you a table with 40+ rows) we are not going to test all 11 Brotli levels or all 20+ zstd levels. Instead we’re going to choose levels that run at about the same speed. Apple makes this easier for us since LZFSE has no level and iOS only has zlib 5 and LZMA 6. All we have to do is pick levels for Brotli and zstd from this chart.
_{Speed benchmarks for our 5 algorithms}
We’ll use Brotli 4 and zstd 5 since those are in line with the fastest iOS algorithm. This means that zlib and LZMA are slightly advantaged but we’ll keep that in mind.
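To give a feel for what one of these measurements looks like, here is a rough sketch using the two generalists that ship in Python's standard library (zlib and lzma). It is not my benchmarking tool. The session data is synthetic and made up for the example, and I am assuming the ratio is measured against the 12 bytes per reading from the napkin math, so the numbers it prints say nothing about real scale recordings.

```python
import lzma
import zlib

BYTES_PER_READING = 12  # uint64 timestamp + float32 weight, from the napkin math


def synthetic_session(n=6000):
    """Made-up data: readings roughly 100 ms apart, with the mass alternating
    between plateaus and steady ramps. Not real scale recordings."""
    points = []
    t, m = 1691452800000, 0
    for i in range(n):
        t += 100 + (i * 7) % 5   # slightly irregular spacing
        if (i // 200) % 2 == 1:  # every other block of 200 readings ramps up
            m += 25
        points.append((t, m))
    return points


def to_delta_csv(points):
    """Two-column CSV where each row is the change from the previous row.
    The first data row ends up holding the absolute starting values."""
    lines = ["timestamp_delta,mass_delta"]
    prev_t, prev_m = 0, 0
    for t, m in points:
        lines.append(f"{t - prev_t},{m - prev_m}")
        prev_t, prev_m = t, m
    return "\n".join(lines).encode("utf-8")


points = synthetic_session()
csv_bytes = to_delta_csv(points)
baseline = BYTES_PER_READING * len(points)

compressed = {
    "zlib 5": zlib.compress(csv_bytes, 5),
    "lzma 6": lzma.compress(csv_bytes, preset=6),
}
for name, blob in compressed.items():
    print(f"{name}: {len(blob)} bytes, ratio {baseline / len(blob):.2f}")
```

Decompressing each blob and comparing it against `csv_bytes` is a cheap way to confirm nothing was lost along the way.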
We’ve prepped our CSV and made all our choices, so let’s see some results.
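| Algorithm | Ratio 1 | Ratio 2 | Ratio 3 | Avg. Ratio | Avg. MB/year |
| --- | --- | --- | --- | --- | --- |
| zlib 5 | 8.50 | 5.79 | 8.18 | 7.49 | 3.47 |
| lzma 6 | 8.12 | 5.55 | 7.49 | 7.1 | 3.7 |
| zstd 5 | 7.49 | 5.71 | 7.74 | 6.98 | 3.72 |
| brotli 4 | 7.84 | 5.52 | 7.53 | 6.96 | 3.74 |
| lzfse | 7.49 | 5.36 | 7.12 | 6.7 | 3.8 |

Ratios: higher is better. MB/year: lower is better.
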
Wow! Everything is under 4MB. Coming from 26MB this is fantastic.
I’ve plotted everything side by side:
_{MB/year by algorithm}
Weirdly, the generalist algorithms universally beat the specialists. On top of that, you’ll recall we picked generalist levels that were fairly fast. So we can actually widen the gap if we’re willing to compress slower.
That feels like cheating, but doing the single column CSV doesn’t. Plus I’m really curious about that, so here it is:
_{MB/year by algorithm including single column CSV results}
Seems like if you’re not a CSV purist you can squeeze out an extra 400KB or so. Not bad.
It really does not make sense to me that the generalist algorithms come out on top.
It’s possible I made a mistake somewhere. To check this, I look to see if every compressed time series can be reversed back to the original scale time series. They all can.
My second guess is that maybe my time series data is not well-suited for simple8b and Gorilla. I saw mention that equally spaced timestamps are preferred and my data is anything but:
| timestamps | deltas |
| --- | --- |
| 1691685057323 | n/a |

To see if this is the problem, I rerun the benchmarks and truncate timestamps to the nearest 0.01s, 0.1s and even 1s. This ensures that there is a finite sized set of delta values (101, 11 and 2 respectively).
_{Compression ratio by timestamp granularity}
As expected this does improve the compression ratio of the specialist algorithms. But it also gives a similar boost to the generalist ones. So it doesn’t explain the difference.
I don’t have a third guess. Maybe it is real?
This all started since I was anxious about inflating the size of my humble iOS app. Our baseline was adding 26MB of new data each year, which became ~100MB/year in iCloud. With a general purpose compression algorithm it looks like we can get these numbers down to ~4MB and ~16MB per year respectively. Much better.
Any of the generalist algorithms would work. In my case using one of Apple’s built-ins is an easy choice:
One thing we didn’t touch on is that the distribution of your data can impact how well the compression works. It’s possible these results won’t translate to your data. To help check that, I’ve put my benchmarking CLI tool and a speed-test macOS/iOS app up on GitHub here.
If you can put your data in CSV format, you should be able to drop it in and try out all the algorithms mentioned in this post. If you do, let me know what sort of results you get! I'm curious to see more real-world data points.
]]>