Quantization is a product decision

When people ask which quantization they should run a local model at, they usually want a number back. Use 4-bit. Use 8-bit. The question is framed as tuning, like picking a buffer size. It is not. It is a product decision, and the right answer depends entirely on what you are using the model for.

The tradeoff is not abstract

Lower precision buys you memory and speed and costs you some quality. That much is well known. What gets lost in the benchmark tables is that the quality loss is not evenly distributed. A heavily quantized model is often perfectly fine at fluent, forgiving tasks and noticeably worse at the precise, brittle ones: exact recall, structured output, long chains of reasoning where one wrong token derails the rest.

So the real question is not how much quality am I willing to lose. It is which capability am I willing to lose, for this use.

Match the level to the job

For a model that drafts text a human will read and edit, aggressive quantization is usually free in practice. The human is the error correction. For a model emitting JSON that a program will parse, or one doing multi-step reasoning with no human in the loop, the same quantization that was free becomes expensive, because there is nothing downstream to catch the error.

I run different models at different levels on the same machine for exactly this reason. The editor assistant is quantized hard and I never notice. The one that has to produce well-formed tool calls is not, and that is a deliberate spend of memory, not a failure to optimize.

The takeaway

Stop asking for the number. Ask what the model is for, what catches its mistakes, and how expensive a wrong token is in that path. The quantization level falls out of those answers. It was a product decision the whole time; the benchmark table just made it look like a technical one.