These are working notes, not a guide. I run local models on Apple Silicon mostly so I can build tooling for them, and this is the stuff I wish someone had told me in a flat, unexcited list.

This post is authored as MDX rather than plain markdown, which is mostly an excuse to confirm the pipeline handles both.

Memory budget

Unified memory is the whole game. The number that matters is not your total RAM, it is total RAM minus whatever the rest of the system is using, and that second number is larger than you think. Plan for the OS, the browser you forgot to close, and a comfortable margin so the machine does not start swapping.

A rough rule that has held up for me: take the model’s on-disk size, add the working set for your context length, and treat anything past 70 percent of free memory as the point where things get unpleasant. Past that line the model still runs. It just stops being fun.

Throughput

Tokens per second is a fine headline number and a bad planning number. It moves with context length, with batch size, with whether the model is warm, and with what else is competing for memory bandwidth. A tps figure with no context attached is closer to vibes than data.

What I actually watch is the shape of it: does throughput hold steady as the context fills, or does it fall off a cliff at some length? The cliff is the useful thing to know, because it tells you the real working limit of a model on your machine, which is almost always lower than the spec sheet implies.

The closing thought, since I promised one: the constraint is freeing. A machine that cannot run everything forces you to actually choose, and choosing is good for you.