The two numbers that decide local LLMs: 100 tokens/sec and 1M context
I ran a 754-billion-parameter model on my desk. Here is why I still don't code with it.
TL;DR
Anything below roughly 100 tokens/sec decode, without native 1M token context, is hard to justify as a daily coding tool. After months of benchmarking frontier-size open models on a 512GB Mac Studio M3 Ultra, across MLX and llama.cpp on Metal, nothing I could run cleared both bars. Nothing cleared either bar, honestly. The best quality-competitive model I measured decodes at 63 tokens/sec, and the "1M context" flag on a 754B model turned out to be a warning message and a cap.
The honest exception: at 15 tokens/sec, a giant local model is a genuinely good employee for long-running invisible work, the jobs where nobody is watching the cursor. That is what mine does all day. It just doesn't sit in my editor.
The bar
Two numbers decide whether a local model can replace a cloud one for coding. The first is decode speed. Agentic coding tools burn tokens in loops: plan, edit, run, read the error, edit again. At 100+ tokens/sec that loop feels like a conversation. At 15 tokens/sec you watch a progress bar that happens to be made of words. The second is context. Real work means a repo, its docs, a long session history, and tool output, and the frontier labs now treat a million tokens as table stakes. "Native" matters as much as the number itself: a model extrapolating past its training window isn't remembering your code, it's hallucinating about it.
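To put rough numbers on how that loop feels, here is a minimal sketch of decode-only wall-clock time. The session shape (6 steps, 400 generated tokens per step) is a hypothetical chosen for illustration, not a measurement, and prefill and tool execution time are ignored.

```python
def loop_seconds(steps: int, tokens_per_step: int, decode_tok_s: float) -> float:
    """Decode-only wall-clock time for an agent loop (ignores prefill and tool time)."""
    return steps * tokens_per_step / decode_tok_s


# Hypothetical session: 6 plan/edit/run/read steps, 400 generated tokens each.
STEPS, TOKENS_PER_STEP = 6, 400

for rate in (100.0, 63.3, 15.0):
    seconds = loop_seconds(STEPS, TOKENS_PER_STEP, rate)
    print(f"{rate:6.1f} tok/s -> {seconds:6.1f} s of pure decode")
```

Under those assumptions the same loop takes 24 seconds at the bar and 160 seconds at 15 tokens/sec, before any context has been prefilled.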
This post is the evidence file for that claim, from one machine: a Mac Studio M3 Ultra with 512GB of unified memory. Every number below comes from my own benchmark logs. This is the sequel to April's Gemma 4 vs Qwen 3.5 benchmark, after the question changed from "which local model is best" to "why am I still not using any of them for work".
Capacity was never the problem
The headline spec of this machine is that it fits absurd things. GLM-5.1 is a 754B-parameter mixture-of-experts model (40B active per token). The Unsloth IQ3_XXS quant is 268GB across 7 GGUF shards, and it loads onto a single desktop with about 205GB of headroom to spare. I later ran GLM-5.2, a 743B MoE whose 4-bit quant weighed 340GB per my install notes, the same way. Weights were never the constraint. Disk was a constraint (926GB fills up fast when models are 100 to 400GB each), and bandwidth was the constraint, but memory capacity, the thing the spec sheet sells, was solved.
So the interesting question isn't "can it run". Everything runs. The question is what running feels like.
What decode actually looks like
My benchmark harness sends 26 prompts across 6 categories at each model through an OpenAI-compatible server, three repeats each, and a scoring harness grades the output. The throughput column reconstructs honest tokens/sec including reasoning traces, because thinking models make the naive number lie. Here are the headline rows of the leaderboard as of April, sorted by quality:
| Model | Quant | Quality /5 | Decode tok/s | Disk |
|---|---|---|---|---|
| SuperGemma4-26B | MLX 4-bit | 4.41 | 63.3 | 14 GB |
| GLM-5.1 (754B MoE) | MLX 3.6-bit | 4.37 | 6.6 | 382 GB |
| MiniMax-M2.7 | MLX 4-bit mxfp4 | 4.16 | 36.3 | 121 GB |
| Qwen3.5-122B-A10B | GGUF Q8_0 | 4.09 | 22.9 | 121 GB |
| Qwen3.5-397B-A17B | MLX Q4 | 2.78 | 26.0 | 224 GB |
| GLM-4.7-Flash | GGUF Q4_K_M | 2.14 | 62.3 | 18 GB |
Read the quality and decode columns against each other. Even the winner tops out at 63 tokens/sec, and every model with frontier-scale parameter counts decodes below 37. Nothing is within shouting distance of 100. And these are short-context numbers, the flattering case: GLM-5.1's decode falls from 17.5 tokens/sec with almost nothing in context to 9.9 by the time just 5,460 tokens are filled, because every generated token attends over everything before it. My own notes conclude that for real long-context work you should expect 10 to 15 tokens/sec. An agentic coding session fills that much context almost immediately.
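The "honest tokens/sec" correction behind the throughput column can be sketched as follows. This is an illustration, not the harness itself: the `Run` fields and the example numbers are hypothetical, and it assumes the naive figure counts only the visible answer tokens while the clock runs for the whole response.

```python
from dataclasses import dataclass


@dataclass
class Run:
    reasoning_tokens: int   # tokens spent in the thinking trace (hypothetical field)
    answer_tokens: int      # tokens in the visible answer (hypothetical field)
    decode_seconds: float   # wall-clock decode time for the whole response


def naive_tps(run: Run) -> float:
    """Counts only visible answer tokens, so thinking models look slower than they are."""
    return run.answer_tokens / run.decode_seconds


def honest_tps(run: Run) -> float:
    """Counts every generated token, reasoning trace included."""
    return (run.reasoning_tokens + run.answer_tokens) / run.decode_seconds


example = Run(reasoning_tokens=900, answer_tokens=300, decode_seconds=60.0)
print(f"naive:  {naive_tps(example):.1f} tok/s")
print(f"honest: {honest_tps(example):.1f} tok/s")
```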
Tuning bought 3.7x. Physics kept the rest.
I want to be fair to the stack, because the difference between a default setup and a tuned one is enormous. My first GLM-5.1 run, on a fork with default-ish flags, produced 0.83 tokens/sec prefill and 4.62 decode. By the sixth experiment of an eleven-experiment campaign, the production config did 46.26 prefill and 17.29 decode: a 55.7x prefill and 3.7x decode improvement from configuration alone. The big levers were -ub 2048 (raising the microbatch from 512 keeps the GPU busy during prompt processing), a single server slot instead of four, and mainline llama.cpp over both performance forks I tried (the TurboQuant fork was 13% slower on decode, ik_llama.cpp 8% slower on decode and 63% slower on prefill for this architecture).
# The config that survived eleven experiments:
llama-server -m GLM-5.1-UD-IQ3_XXS-00001-of-00007.gguf \
-fa on -ngl 99 -c 202752 -b 2048 -ub 2048 -t 8 -np 1 --mlock
# 46.3 tok/s prefill, 17.3 tok/s decode, 200K native context
And that is the point. 17.3 tokens/sec is not a lazy setup. It is the end state of a tuning campaign on the best-supported runtime this hardware has, and it is one sixth of the bar.
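The speedup figures can be rechecked from the four measurements quoted above. A small sketch, with only the dictionary names invented for the example:

```python
# tok/s from the first run (default-ish flags) and the tuned production config.
first_run = {"prefill": 0.83, "decode": 4.62}
production = {"prefill": 46.26, "decode": 17.29}
BAR_TOK_S = 100.0  # the decode bar this post argues for

speedup = {phase: production[phase] / first_run[phase] for phase in first_run}
fraction_of_bar = production["decode"] / BAR_TOK_S

for phase, factor in speedup.items():
    print(f"{phase}: {factor:.1f}x faster after tuning")
print(f"tuned decode is {fraction_of_bar:.0%} of the bar")
```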
The 1M context mirage
The context story is worse, because it looks like it works. GLM-5.1 uses multi-head latent attention, which makes the KV cache almost comically cheap: about 87MB per 1,000 tokens. A 200K context costs 17GB. A million-token context costs 84GB. On a 512GB machine that is nothing. So I asked llama.cpp for a million tokens of context, and it said yes. Sort of:
# llama-server, GLM-5.1, -c 1000000 on a 512GB M3 Ultra:
print_info: n_ctx_train = 202752
llama_context: n_ctx_seq (1000192) > n_ctx_train (202752)
-- possible training context overflow
llama_kv_cache: size = 85709.81 MiB (1000192 cells, 78 layers, ...)
srv load_model: the slot context (1000192) exceeds the
training context of the model (202752) - capping
slot load_model: id 0 | ... | new slot, n_ctx = 202752
It allocated the 84GB cache, warned me, then capped the slot back to 202,752 tokens: the model's training window. That number is the real ceiling, and no flag moves it. Running past it without a configured extension method means RoPE extrapolation, which degrades accuracy exactly when the long context would matter. I marked the 1M config DISCARD in my own experiment log and pinned the server at 202,752.
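The cache sizes can be cross-checked against the log. A minimal sketch, assuming the KV cache scales linearly with the number of cells; the inputs are the `llama_kv_cache` size and cell count printed above, and the result is in MiB/GiB, so it lands in the same ballpark as the rounded figures in the prose rather than matching them digit for digit.

```python
KV_CACHE_MIB = 85709.81   # from the llama_kv_cache log line
CELLS = 1_000_192         # cells allocated for -c 1000000
N_CTX_TRAIN = 202_752     # the model's training window

mib_per_token = KV_CACHE_MIB / CELLS


def kv_cache_gib(tokens: int) -> float:
    """KV cache size in GiB for a given context length (linear-scaling assumption)."""
    return tokens * mib_per_token / 1024


print(f"per 1,000 tokens: {mib_per_token * 1000:.1f} MiB")
print(f"training window:  {kv_cache_gib(N_CTX_TRAIN):.1f} GiB")
print(f"1M-token request: {kv_cache_gib(CELLS):.1f} GiB")
```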
Even if the window were real, the clock would kill it. Prefill on this model peaked at 185 tokens/sec at its best batch point and fell to 86 tokens/sec by 5,460 prompt tokens. My worst measured case was brutal in absolute terms: one request needed 12,810 prompt tokens evaluated, which took 226 seconds of prefill, nearly four minutes of silence before the first output token, and then decoded at 2.92 tokens/sec on the fork I was testing that day. Extrapolate prefill to a million tokens and you are booking hours, not minutes, before the model says a word. Memory made 1M cheap. Nothing made it fast, and the training window made it fake.
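The extrapolation is easy to reproduce. This sketch assumes a constant prefill rate, which is optimistic given that the measured rate fell as the prompt grew; the three rates are the ones quoted above, with the worst case derived from 12,810 tokens in 226 seconds.

```python
def prefill_hours(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to first token in hours at a constant prefill rate (optimistic assumption)."""
    return prompt_tokens / prefill_tok_s / 3600


worst_case_rate = 12_810 / 226  # tok/s implied by the 226-second request

for label, rate in [
    ("peak batch point", 185.0),
    ("at 5,460 prompt tokens", 86.0),
    ("worst measured request", worst_case_rate),
]:
    hours = prefill_hours(1_000_000, rate)
    print(f"{label:<24} {rate:6.1f} tok/s -> {hours:.1f} h for 1M tokens")
```

Even the peak rate works out to about an hour and a half before the first output token.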
Everything I tried to cheat the wall
I did not take this quietly. Over several months I threw most of the published inference tricks at the problem, on a 27B dense model (where 100 tokens/sec looked reachable) and on the giant MoEs (where it obviously wasn't):
| Technique | Target | Result | Verdict |
|---|---|---|---|
| Baseline (MLX) | Qwen3.5-27B distill, Q4 | 37.0 tok/s | the wall |
| Speculative decoding (4B draft) | Qwen3.5-27B distill, Q4 | 56.1 tok/s | best general case, 1.52x |
| Prompt-lookup decoding | Qwen3.5-27B distill, Q4 | 68.3 tok/s | 1.85x, grounded tasks only |
| KV cache quantization | Qwen3.5-27B distill, Q4 | 36.3 tok/s | no gain at short context |
| Custom EAGLE-3 port to MLX | Qwen3-32B 4-bit | 34.9 tok/s | 1.16x, lost to generic draft's 1.34x |
| Draft-model speculation | GLM-5.1 754B | 15.3 tok/s | 12% slower than no draft |
| Sparse attention (DSA) on Metal | GLM-5.1 754B | 10.0 tok/s | 42-45% slower below 16K, coherence never verified |
Some of these deserve a sentence of respect. Prompt-lookup decoding, the single best result, is free: it drafts from n-grams already in your prompt, hits 86% acceptance on summarization-shaped tasks, and does nothing at all for open-ended generation. The EAGLE-3 port meant hand-rolling a draft head and verify loop in MLX, and it produced my favorite negative result: profiling showed the draft head is only 4 to 6% of each step, so all the clever loop optimization in the world had nothing to optimize. The published EAGLE speedups come from tree drafting, which I never built.
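For readers who have not met prompt-lookup decoding, here is a minimal sketch of the drafting step only. It is not the implementation benchmarked above and not any runtime's API: the function name and the `ngram` and `max_draft` parameters are hypothetical, and the verification pass, where the main model accepts or rejects the drafted tokens, is left out.

```python
def prompt_lookup_draft(tokens: list[int], ngram: int = 3, max_draft: int = 5) -> list[int]:
    """Draft tokens by finding the trailing n-gram earlier in the sequence
    and proposing whatever followed it there."""
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Search backwards so the most recent earlier occurrence wins;
    # the range stops short of the tail's own position.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + max_draft]
    return []  # no earlier match: nothing to draft, decode normally


# Prompt plus generation so far: the tail [2, 3, 4] already appeared earlier.
sequence = [1, 2, 3, 4, 5, 6, 7, 2, 3, 4]
print(prompt_lookup_draft(sequence, ngram=3, max_draft=3))
```

The empty-list branch is why the technique does nothing for open-ended generation: when the output is not repeating spans of the prompt, there is no earlier match to draft from.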
The sparse attention experiment went furthest into the weeds. I wrote a custom Metal SCATTER kernel to bring DeepSeek-style dynamic sparse attention to GLM on Apple Silicon. The kernel passed all 8 backend tests. The full server segfaulted at graph scheduling, and when a later fork finally ran end to end, the per-token indexer cost more than the attention it saved at any context I actually use. My own commit message admits the GLM routing was "unverified for coherent output". That is the state of the art of long-context attention on this platform: not slow, absent. The sparse paths that make 1M context real in the frontier labs simply do not exist in llama.cpp Metal or MLX today.
The napkin math
None of those results are bad luck. Single-stream decode is memory-bandwidth-bound: every generated token re-reads the active weights. That gives you the ceiling from first principles, and my measurements landed at 38 to 72% of it, which is normal once dequantization, attention, and scheduling take their cut.
Autoregressive decode reads the active weights once per token:
ceiling = memory bandwidth / bytes read per token
M3 Ultra: ~800 GB/s effective
27B dense at Q4 (16 GB): 800 / 16 = ~50 tok/s
GLM-5.1, 40B active at IQ3_XXS: 800 / ~17.5 = ~45 tok/s
Measured: 37 tok/s (72% of ceiling) and 17.3 tok/s (38%).
The ceiling is the good news. You never reach the ceiling.
Play with that equation and the trap becomes obvious. To decode at 100+ tokens/sec you need the bytes-per-token small: a tiny model, or a very sparse MoE, or a brutal quant. To be worth 1M tokens of your codebase you need the model big and the quant gentle. The two numbers pull in opposite directions on fixed bandwidth, and every model on my leaderboard is one compromise or another between them. Speculative decoding cheats the wall a little by amortizing weight reads across accepted draft tokens. On the 27B it bought 1.5x of the 2.7x the bar demanded. On the 754B, where the bar demanded 6x, it was net negative.
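The ceiling equation can be played with directly. A minimal sketch using the rounded inputs from the napkin math (~800 GB/s, 16 GB and ~17.5 GB read per token); the efficiency values passed to `gb_budget` are the 38% and 72% quoted above, and the function names are made up for the example.

```python
BANDWIDTH_GB_S = 800.0  # ~effective M3 Ultra memory bandwidth, as assumed above
BAR_TOK_S = 100.0       # the decode bar


def ceiling_tok_s(gb_per_token: float) -> float:
    """Bandwidth-bound decode ceiling: each token re-reads the active weights once."""
    return BANDWIDTH_GB_S / gb_per_token


def gb_budget(efficiency: float) -> float:
    """Largest per-token read (GB) that still reaches the bar at a given efficiency."""
    return BANDWIDTH_GB_S * efficiency / BAR_TOK_S


# (GB read per token, measured decode tok/s)
cases = {
    "27B dense, Q4": (16.0, 37.0),
    "GLM-5.1, 40B active, IQ3_XXS": (17.5, 17.3),
}

for name, (gb, measured) in cases.items():
    print(f"{name}: ceiling {ceiling_tok_s(gb):.1f} tok/s, "
          f"needs {BAR_TOK_S / measured:.1f}x to reach the bar")

for eff in (1.0, 0.72, 0.38):
    print(f"at {eff:.0%} of ceiling, the bar allows {gb_budget(eff):.2f} GB per token")
```

Under these assumptions, even a runtime that hit 100% of the ceiling could read at most 8 GB of weights per token and still decode at 100 tokens/sec, which is about half of what either model above reads.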
The hardware itself tells you where this goes next. Every server boot prints this:
ggml_metal_device_init: tensor API disabled for
pre-M5 and pre-A19 devices
ggml_metal_device_init: has tensor = false
The M3 Ultra predates the tensor units that llama.cpp looks for on M5-class hardware. A machine that costs as much as several years of frontier-model subscriptions is already a hardware generation behind the workload it was bought for.
What a 15 tok/s model is actually for
Here is the part that keeps this from being a teardown. I use GLM-5.2 on this machine constantly, and I am happy with it. My install notes record 14.7 to 16.2 tokens/sec decode, and it does not matter, because the model never talks to me while I watch.
It runs as a local OpenAI-compatible server behind an agent gateway, loaded by launchd, with a script that waits for the multi-hundred-gigabyte download, steals port 1234 from LM Studio, points the agent's config at the new server, and messages me on Telegram when it is live. The jobs it gets are the invisible ones: long agentic runs that grind overnight, bulk triage and research sweeps, summarization pipelines over private data that I would rather not ship to anyone's API. For that class of work, tokens per second is nearly irrelevant, the electricity is cheaper than API credits, and the data never leaves the room.
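One step of a setup like this, the readiness check that runs before repointing anything or sending a notification, might look like the sketch below. It is a hypothetical helper, not the actual script: it only polls until something accepts TCP connections on the port, and the download wait, config rewrite, and Telegram message are out of scope.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0,
                  interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means the server process is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example (port 1234 is the one mentioned above):
#   if wait_for_port("127.0.0.1", 1234):
#       print("server is live")
```

A listening socket only proves the process is up; a multi-hundred-gigabyte model may still be loading behind it, so a real script would probably follow this with a request to the server itself.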
That is the honest split: cron jobs, not copilots. The moment I tried to make GLM my coding tool, every number above turned into a papercut. The moment I stopped asking it to be interactive, the same machine became the most useful background worker I own.
What would change my mind
The bar is not permanent, and I keep a watchlist. Hardware: M5-generation tensor units reach the Mac Studio class, and the "has tensor = false" line flips. Runtimes: someone implements the sparse attention paths (DSA-style indexers and friends) in Metal or MLX; today llama.cpp loads GLM's sparse-attention indexer tensors into memory and then never uses them. Models: the small-active-parameter MoEs keep improving, and one of them ships a genuinely native long window instead of a YaRN sticker; there is already a "1M" Qwen3-Coder variant sitting in my models folder, and per my research notes no Apple Silicon runtime today implements the sparse long-context paths a window like that leans on. Any two of those together probably put 100 tokens/sec and real seven-figure context in the same box.
Until then, my rule stands. Below 100 tokens/sec and without native long context, a local model is not a coding tool. It is infrastructure. Buy it for the invisible work, and let the visible work stay where the tokens are fast.