Why I built llamactl · greg mundy

For about a year my main machine has been running a small zoo of local language models. A 7B for general questions, a smaller one for quick edits, something tuned for code, occasionally a big one I spin up and forget about. Managing that by hand got old fast.

The honest version of the problem: I kept losing track. Which models were loaded? Which one was actually serving the editor plugin right now? Which 1.7B had I left running for eight hours doing nothing but holding memory? I had ps, I had Activity Monitor, and neither of them spoke the language of models.

The shape of the problem

A local model is not quite a process and not quite a service. It has a lifecycle (loading, warm, idle, unloaded), it has a memory cost that dwarfs an ordinary process, and it has a throughput you actually care about. None of the tools I had treated it as its own kind of thing.

So I wrote down what I wanted, and it was small:

One command that tells me what is loaded, what is hot, and what is idle. Everything else is a nice-to-have.

That sentence became the spec. llamactl is what came out of it.

What it does now

The core is a status view. It queries the running model servers, collects whatever they report, and prints it in a form you can read at a glance or pipe into another tool.

llamactl status --json
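
To make the idea concrete, here is a minimal Python sketch of a status view with two output modes. It is an illustration, not llamactl's actual code: the ModelStatus record, its field names, the state strings, and the column layout are all assumptions, and the real JSON output may look quite different.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ModelStatus:
    # Hypothetical record; the field names are illustrative only.
    name: str
    state: str             # one of: "loading", "warm", "idle", "unloaded"
    memory_gb: float       # resident memory held by the model
    tokens_per_sec: float  # recent throughput, 0.0 when idle


def render(statuses, as_json=False):
    """Return the status view as a JSON string or a plain-text table."""
    if as_json:
        # Machine-readable form, suitable for piping into another tool.
        return json.dumps([asdict(s) for s in statuses], indent=2)
    # Human-readable form: one header line, then one line per model.
    header = f"{'MODEL':<16}{'STATE':<10}{'MEM(GB)':>8}{'TOK/S':>8}"
    rows = [
        f"{s.name:<16}{s.state:<10}{s.memory_gb:>8.1f}{s.tokens_per_sec:>8.1f}"
        for s in statuses
    ]
    return "\n".join([header, *rows])


if __name__ == "__main__":
    demo = [
        ModelStatus("general-7b", "warm", 5.2, 31.0),
        ModelStatus("edits-1.7b", "idle", 1.4, 0.0),
    ]
    print(render(demo))
    print(render(demo, as_json=True))
```

The one design choice worth noting is that both output modes come from the same records, so the table and the JSON cannot drift apart.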

From there it grew the obvious neighbors: start and stop so I am not hunting for the right invocation, and pin so a model I am actively using does not get evicted when something bigger loads. None of it is clever. The point was never to be clever. The point was to make the machine legible.
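
The pin behavior can be sketched in a few lines. This is one possible policy under stated assumptions, not a description of how llamactl decides: it evicts the least recently used models first, which is my choice for the example, and the Loaded record and choose_evictions function are hypothetical names. The only part taken from the description above is that a pinned model is never evicted.

```python
from dataclasses import dataclass


@dataclass
class Loaded:
    # Hypothetical record for a model currently held in memory.
    name: str
    memory_gb: float
    last_used: float   # timestamp of the most recent request
    pinned: bool = False


def choose_evictions(loaded, needed_gb, free_gb):
    """Pick models to evict so that needed_gb fits in memory.

    Assumed policy: least recently used first, pinned models skipped.
    Raises MemoryError if unpinned models cannot free enough space.
    """
    victims = []
    candidates = sorted(
        (m for m in loaded if not m.pinned),
        key=lambda m: m.last_used,
    )
    for m in candidates:
        if free_gb >= needed_gb:
            break
        victims.append(m.name)
        free_gb += m.memory_gb
    if free_gb < needed_gb:
        raise MemoryError("not enough unpinned memory to free")
    return victims


if __name__ == "__main__":
    models = [
        Loaded("code-7b", 8.0, last_used=1.0, pinned=True),
        Loaded("edits-1.7b", 4.0, last_used=2.0),
        Loaded("general-7b", 6.0, last_used=3.0),
    ]
    # The pinned model is the oldest but is never chosen.
    print(choose_evictions(models, needed_gb=9.0, free_gb=2.0))
```

Failing loudly when only pinned models remain is deliberate in this sketch: silently evicting a pinned model would defeat the purpose of pinning.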

What I would tell you before you build your own

Build the status command first and live with only that for a week. It will tell you which of the other commands you actually need. I had a list of fifteen subcommands in my head; the week cut it to four. The other eleven were me solving problems I did not have yet.