Most discussion of local LLMs in 2026 is about running them for chat — a privacy-friendly alternative to cloud assistants. That’s a fine use case. It is not the one that’s been shaping my engineering for the past year. My local LLM stack — Ollama + llama.cpp + vLLM + Open WebUI + a few supporting bits — is wired into my actual engineering pipeline, alongside Postgres and Redis, doing real work on tasks that would otherwise either burn API credits or get skipped and ship with worse quality.
The shape of the stack
The architecture is less exotic than it sounds:
- Ollama is the daily driver. It runs on my homelab’s GPU, exposes an OpenAI-compatible API, and serves every locally-routed inference I do during coding and scripting. Models are pulled on demand; a handful stay hot. The interface is so simple that most of my tool integrations treat it as “another OpenAI endpoint, with a different base URL.”
- llama.cpp is the CPU-class fallback. When I’m on a laptop without GPU access, or when I want to run a tiny model for a scripted task where latency is fine, llama.cpp runs it. The overlap with Ollama is deliberate — llama.cpp is lower-level, and I use it when I want precise control over quantisation, context length, or sampling parameters.
- vLLM is the serving layer for anything that wants real throughput. Batch inference, evaluation runs, synthetic data generation for tests — vLLM handles them in a fraction of the wall time Ollama takes, because high-throughput batched serving is what it was optimised for.
- Open WebUI is the human-facing surface. When I want to chat with a local model, compare outputs side by side, or run a multi-turn conversation that’s too casual for a script, that’s where I go.
- LM Studio and Jan fill similar gaps on occasion; I keep them installed but use them less than the above.
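That “another OpenAI endpoint” claim is concrete enough to sketch. Here’s a stdlib-only example of pointing a chat completion at the local daemon; Ollama’s default port is 11434 and it exposes the OpenAI-compatible routes under /v1, though the model name below is just a placeholder for whatever you have pulled:

```python
import json
from urllib import request

def build_chat_request(prompt: str,
                       model: str = "llama3.1:8b",
                       base_url: str = "http://localhost:11434/v1"):
    """Build an OpenAI-style chat completion request aimed at Ollama."""
    url = f"{base_url}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(payload).encode()

def chat(prompt: str) -> str:
    """Send the request to the local daemon and return the reply text."""
    url, body = build_chat_request(prompt)
    req = request.Request(url, data=body,
                         headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Swap `base_url` for a cloud provider’s and nothing else changes — which is exactly why the tool integrations don’t care which tier they’re talking to.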
The whole thing sits on a homelab machine with a GPU that cost less than a month of API spend. The economics cross over fast when you run real inference volume.
What I actually use it for
The tasks that have earned a permanent slot on the local stack, in rough order of how often they fire:
1. Review pre-pass on commits
Before any substantive commit, a local model runs a structured review pass: “read the diff, list anything that looks off, grade by severity.” The goal isn’t to replace review — it’s to catch the obvious things before I open the PR, so human review time isn’t spent on style nits or missing error handling. Running this locally is free; running it on a cloud API for every commit would be expensive and slower.
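The pre-pass itself is a short script. A sketch of its shape — the severity buckets and prompt wording here are illustrative, not my exact prompt, and the result gets sent to the local endpoint like any other chat request:

```python
import subprocess

SEVERITIES = ("blocker", "warning", "nit")  # illustrative buckets, not a standard

def staged_diff() -> str:
    """Return the diff of whatever is staged for commit."""
    return subprocess.run(
        ["git", "diff", "--cached"],
        capture_output=True, text=True, check=True,
    ).stdout

def build_review_prompt(diff: str) -> str:
    """Wrap a diff in a structured review instruction."""
    return (
        "Review the following diff. List anything that looks off, one item per "
        f"line, each prefixed with a severity ({', '.join(SEVERITIES)}). "
        "Call out missing error handling explicitly.\n\nDIFF:\n" + diff
    )
```

Wiring this into a pre-commit hook is one line of shell; the structured severity prefix makes the output trivially greppable.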
2. Synthetic data for tests
Unit tests frequently need made-up payloads: user records, product catalogues, log lines, event shapes. A local model generates these cheaply, deterministically with a seed, and without round-tripping data to a third-party service. I keep seed scripts in the test tree next to the fixtures they produce.
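The seed-script pattern is simple: generate once with a pinned seed, write the fixture next to the test, reuse it on later runs. A minimal sketch — `generate` here is a stand-in for whatever wraps the model call (the assumption is that your runtime exposes a seed option plus temperature 0, as Ollama and llama.cpp do, so reruns are stable):

```python
import json
from pathlib import Path

def load_or_generate(fixture: Path, generate, seed: int = 42):
    """Reuse a cached fixture if it exists; otherwise generate and save it.

    `generate` is any callable taking a seed. In my setup it wraps a local
    model call pinned to that seed, so regenerating a deleted fixture
    produces byte-identical data.
    """
    if fixture.exists():
        return json.loads(fixture.read_text())
    data = generate(seed)
    fixture.parent.mkdir(parents=True, exist_ok=True)
    fixture.write_text(json.dumps(data, indent=2))
    return data
```

Because the fixture file is committed, the model only runs when someone deliberately deletes it — CI never touches inference at all.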
3. Code explanation
When I’m exploring an unfamiliar codebase — someone else’s OSS project, a new framework — a local model gives me a first pass of “what does this file do, in plain English.” It’s not authoritative; it’s a starting map. I then read the code and correct my mental model where the LLM was wrong. The combination is faster than either alone.
4. Doc drafts
Writing a README, a runbook, or a post-mortem usually starts with a structured prompt and a first-draft generation, then significant human editing. The local model gets me past the blank-page problem in seconds; the editing is where I earn the byline. This post started that way.
5. Research distillation
I feed the model a bunch of bookmarks or links and ask for a synthesis. The synthesis is rarely good enough to act on directly — but it’s often good enough to know which three links are worth reading in full. This is the cheapest research-prioritisation tool I’ve ever had.
What it doesn’t do well
A few categories where local models still lose to the cloud frontier, and it’s not close:
- Long-horizon reasoning tasks. The larger frontier models are genuinely better at holding a complex multi-step plan. Local 70B+ models can approximate this; local 30B models can’t really.
- Writing in multiple languages with consistent voice. Local models tend to flatten the voice across translations. The flagship cloud models preserve it better. (This matters a lot for the trilingual version of this very site.)
- Code generation on unfamiliar frameworks. Frontier models have seen more of the long tail. Local models trained on older snapshots can fumble on, say, Astro 6 conventions.
The right split is: use the local stack for high-volume, low-horizon, cost-sensitive tasks. Use the frontier API for depth and novelty. Most engineering teams in 2026 are running one or the other. Running both, intentionally, with the right split, is the real unlock.
The cost calculus
A rough month of my usage:
- Local inference: thousands of requests. Electricity cost: trivial. GPU depreciation: meaningful but amortised over a year or more.
- Cloud frontier API: dozens to hundreds of requests, mostly the “actually hard” ones. Cost: meaningful but bounded.
Without the local stack, the cloud spend would be 10–30× higher. With it, the cloud spend is on things where that spend is clearly earning its keep. That’s the kind of two-tier pattern that will look obvious in retrospect and is, right now, being slept on by most teams.
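The crossover itself is simple arithmetic. A sketch with placeholder numbers, not my actual bills:

```python
def breakeven_months(gpu_cost: float,
                     monthly_electricity: float,
                     monthly_api_equivalent: float) -> float:
    """Months until a local GPU pays for itself versus routing the
    same inference volume to a cloud API. All inputs are estimates."""
    saved_per_month = monthly_api_equivalent - monthly_electricity
    if saved_per_month <= 0:
        return float("inf")  # at this volume, local never pays off
    return gpu_cost / saved_per_month

# e.g. a GPU costing one month of API-equivalent spend, trivial power draw:
# breakeven_months(1500, 30, 1500) -> just over one month (1500 / 1470)
```

The only input that really moves the answer is volume: at chat-toy usage the function returns infinity, at pipeline usage it returns weeks.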
The Fulcrum tie-in
A fair chunk of my local-stack usage runs through my own agent control plane (Fulcrum). Agent runs that need fast feedback get routed to local models; ones that need depth get routed to cloud. The routing policy is in the control plane, not in individual tools — which means the “which model?” question is solved once, by configuration, rather than relearned per-task.
If you’re setting up your own local stack in 2026, the thing to build first is not the inference layer — that part is easy; Ollama handles it. Build the routing layer: the thing that decides, per task, which model to call. That’s where the engineering discipline lives. That’s where the 10× cost savings come from. That’s the thing nobody’s talking about yet, and it’s the thing that will matter most in 18 months.
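A routing layer can start embarrassingly small: a pure function over task attributes, owned by configuration. The attributes and thresholds below are invented for illustration — not Fulcrum’s actual policy — but they show the shape:

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Illustrative signals; a real policy would carry more of them.
    reasoning_steps: int    # rough horizon of the task
    calls_per_day: int      # expected volume
    novel_framework: bool   # outside the local models' training snapshot?

def route(task: Task) -> str:
    """Decide which serving tier handles this task."""
    if task.novel_framework or task.reasoning_steps > 5:
        return "cloud-frontier"   # depth and novelty go to the frontier
    if task.calls_per_day > 100:
        return "local-vllm"       # high volume wants batch throughput
    return "local-ollama"         # everything else stays on the daily driver
```

The point is the shape, not the thresholds: the routing decision lives in one place and is made from data, so no individual tool ever hardcodes a model name.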