I’ve been building a local Ollama pool to delegate small, well-scoped coding chores from a main agent. Before wiring routing rules into the agent, I wanted a defensible answer to “which model for which task family.” This post covers the bench I ran, the surprises, and the methodology lessons. The full repro (bash wrappers, prompts, verifier) is single-file Python + curl + jq, so it should be easy to reproduce or extend.
## TL;DR
I ran 6 models against 3 strict, single-function prompts, auto-graded by I/O equivalence (32 test cases total). Then I ran the most discriminating prompt 3 times on every model to measure variance. The single-shot ranking and the post-variance ranking did not agree.
Headline findings:
1. The post-variance winner on narrow code-gen tasks is `gemma4:latest`. Byte-stable 22/22 across 3 runs. Single-shot ranking placed it 5th because it failed an unrelated test-scaffolding prompt that needed Python module-level reasoning.
2. `qwen2.5-coder:14b` is the right pick for prompts requiring runtime/Python semantics. Stable 20-22/22, only model that handled a stale-reference trap correctly.
3. `qwen3.5:9b` failed 2 of 3 runs on the same prompt. Produced byte-identical buggy code in two consecutive runs at `temperature=0.2`. The 21/22 score that put it #1 in the single-shot ranking was the *less common* sampling path.
4. `qwen3.5:4b` was wildly unstable. Score swung from 4/22 to 19/22 across runs at `temperature=0.2`. Useful only with a best-of-N + verifier wrapper.
5. The Qwen3 thinking variants returned empty `response` fields on 100% of constrained code-gen prompts until I set `think:false`. Default-on thinking was a complete trap.
Methodological lesson: single-shot LLM benchmarks lie in both directions. Variance flipped my “winner” and uncovered a “loser” that was actually best-in-class for a specific task family.
## Setup
- Hardware: single workstation, 16 GB VRAM (Quadro), Ollama on `127.0.0.1:11434`.
- Driver: a 60-line bash wrapper that POSTs each prompt with `temperature=0.2`, `stream=false`, and writes each response to a file.
- Verifier: a Python script that strips markdown fences, `exec()`s each model’s output, and runs a battery of valid + invalid inputs against the resulting function. Every score below is automated; the pattern is sketched just below.
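For concreteness, here is the shape of that verifier as a minimal sketch. The function names and the case format are illustrative, not the exact script (the fence string is built at runtime only so this post can show it without nesting fences):

```python
FENCE = "`" * 3  # built at runtime only to avoid nesting fences in this post

def strip_fences(text: str) -> str:
    """Drop a leading fence line (with optional language tag) and a
    trailing fence, if the model emitted them despite the instruction."""
    text = text.strip()
    if not text.startswith(FENCE):
        return text
    body = text.split("\n", 1)[1] if "\n" in text else ""
    if body.rstrip().endswith(FENCE):
        body = body.rstrip()[:-len(FENCE)]
    return body

def score(model_output: str, fn_name: str, cases: list) -> int:
    """exec() the candidate and count passing I/O cases. `cases` holds
    (args, expected) pairs; expected=ValueError means 'must raise'."""
    ns: dict = {}
    exec(strip_fences(model_output), ns)  # local bench only: output is untrusted
    fn = ns[fn_name]
    passed = 0
    for args, expected in cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except ValueError:
            if expected is ValueError:
                passed += 1
        except Exception:
            pass
    return passed
```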
## The three prompts
All three explicitly forbid markdown fences, imports outside the function body, and any preamble.
- **P1**: pytest test generator with a stale-reference trap. The function under test rebinds the module global, so the test must re-read by attribute, not hold a local. Binary pass/fail.
- **P2**: `parse_iso_duration(s: str) -> int` for ISO-8601 `PTnHnMnS` duration strings, raising `ValueError("invalid ISO duration: …")` on malformed input. 6 valid + 16 invalid cases.
- **P3**: `flatten(d: dict, sep: str = ".") -> dict` that recurses into nested dicts but leaves lists/tuples as-is, and drops empty nested dicts entirely. 10 cases including custom separators, depth > 3, mixed types, and the “only empty subtree” edge case.
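For reference, one implementation that satisfies P3’s spec as written (my own sketch, not any model’s output). The drop-empty-subtrees rule falls out naturally, because an empty dict contributes no items to the recursion:

```python
def flatten(d: dict, sep: str = ".") -> dict:
    """Flatten nested dicts into sep-joined keys; lists/tuples stay as values."""
    out = {}
    for k, v in d.items():
        if isinstance(v, dict):
            # Recurse; an empty dict yields no items, so empty subtrees
            # (and chains of them) disappear from the result entirely.
            for nk, nv in flatten(v, sep).items():
                out[f"{k}{sep}{nk}"] = nv
        else:
            out[k] = v  # lists, tuples, None, scalars kept as-is
    return out

assert flatten({"a": {"b": 1, "c": {}}, "d": [1, 2]}) == {"a.b": 1, "d": [1, 2]}
```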
## Single-shot results (N=3 prompts, 1 run each)
Score per prompt is normalized to [0,1] (P1 is pass/fail, P2 is out of 22, P3 out of 10) and the three are averaged; e.g. `qwen3.5:9b` scores (1 + 21/22 + 10/10) / 3 ≈ 0.985.
| # | Model | Size | P1 | P2 | P3 | Score |
|---|---|---|---|---|---|---|
| 1 | qwen3.5:9b (`think:false`) | 6.6 GB | yes | 21/22 | 10/10 | 0.985 |
| 2 | qwen2.5-coder:14b | 9.0 GB | yes | 20/22 | 10/10 | 0.970 |
| 3 | qwen3.5:4b (`think:false`) | **3.4 GB** | yes | 20/22 | 8/10 | 0.903 |
| 4 | qwen3:14b (`think:false`) | 9.3 GB | yes | 8/22 | 10/10 | 0.788 |
| 5 | gemma4:latest | 9.6 GB | no | **22/22** | 10/10 | 0.667 |
| 6 | deepseek-coder-v2:16b | 8.9 GB | no | 16/22 | 9/10 | 0.542 |
This ranking turned out to be misleading. Read on.
## Variance check that flipped the ranking (3 runs of P2, all 6 models)
Same prompt, same `temperature=0.2`, three independent calls per model:
| Model | Run 1 | Run 2 | Run 3 | Mean | Stability |
|---|---|---|---|---|---|
| **gemma4:latest** | **22/22** | **22/22** | **22/22** | **22.0** | perfect x 3 |
| qwen2.5-coder:14b | 22/22 | 20/22 | 20/22 | 20.7 | tight cluster |
| qwen3:14b (`think:false`) | 17/22 | 16/22 | 17/22 | 16.7 | stable, mediocre |
| deepseek-coder-v2:16b | 16/22 | 16/22 | 12/22 | 14.7 | stable, wrong on valid inputs |
| qwen3.5:9b (`think:false`) | 9/22 | 9/22 | 21/22 | 13.0 | bimodal |
| qwen3.5:4b (`think:false`) | 4/22 | 19/22 | 16/22 | 13.0 | wild |
`gemma4` was byte-stable perfect across 3 independent runs: not just hitting 22/22 once, but the only model whose answer I’d trust without re-checking. The single-shot ranking placed it 5th because it failed the unrelated P1 prompt.
`qwen3.5:9b` returned byte-identical buggy code in runs 1 and 2 (725 bytes each) and a different correct-ish answer in run 3. The 21/22 score that put it #1 in single-shot was the less common sampling path. Its dominant decoding mode is broken on this prompt.
`deepseek-coder-v2:16b` is stably wrong: 0/6 valid inputs across all 3 runs. Same regex bug every time. Rerunning won’t save it.
The bug that hit `qwen3.5:9b` twice in a row at temp 0.2 was a regex requiring all three letters: `^(\d+)?H(\d+)?M(\d+)?S$`. So `"PT5M"` fails because there’s no `H` and no `S` literal. Subtle, plausible-looking, and it ships unless you actually run the function.
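For contrast, a version that handles `"PT5M"` makes each letter optional *together with* its digits. This is my own sketch, not any model’s output, and the error-message format follows the prompt spec:

```python
import re

# Each letter group is optional as a unit, so "PT5M" matches; the buggy
# pattern made only the digits optional and kept H/M/S mandatory.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")

def parse_iso_duration(s: str) -> int:
    m = _DURATION.match(s)
    if m is None or not any(m.groups()):  # also reject a bare "PT"
        raise ValueError(f"invalid ISO duration: {s}")
    h, mi, sec = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mi * 60 + sec

assert parse_iso_duration("PT5M") == 300
```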
## Gotcha: Qwen3 thinking models silently return empty
First pass on Qwen3, with the default `think:true`:
| Model | Wall time | `response` bytes |
|---|---|---|
| qwen3:14b | **1174 s** | 1 (just `\n`) |
| qwen3.5:9b | 116 s | 1 |
| qwen3.5:4b | 81 s | 1 |
Twenty minutes of GPU time on the 14B and zero output. Ollama’s `/api/generate` returns two fields for thinking-mode models: `response` and `thinking`. My script only logged `response`. When I dumped the raw JSON, the 9B’s `thinking` field was 21 KB of this:
```
* Wait, I need to check if I can use `src` if `import src.main_improved` is used.
* Yes.
* So I will use `src.main_improved`.
* Wait, I need to check if I can use `src` if `import src` is used.
* Yes.
* So I will use `src.main_improved`.
```
`done_reason: "stop"` on a 21,000-character thinking trace with no output. The model talked itself in circles and never committed to an answer.
The fix is one parameter: `"think": false` in the request body. With it, all three Qwen3 sizes responded in 8-11 seconds and produced clean code. Worth being aware of if you’re benchmarking thinking-capable models with strict output requirements: smoke-test `think:false` first, and log both fields.
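In stdlib Python the call looks like this (the real driver is bash + curl, but the request body is identical; model and prompt here are placeholders):

```python
import json
import urllib.request

body = json.dumps({
    "model": "qwen3:14b",
    "prompt": "...",                  # the actual task prompt
    "stream": False,
    "think": False,                   # the one-parameter fix
    "options": {"temperature": 0.2},
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

# Log BOTH fields: thinking-mode models may put everything in `thinking`.
print(len(reply.get("response", "")), len(reply.get("thinking") or ""))
```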
## Same model, opposite verdicts on different prompts
`gemma4:latest` scored a perfect 22/22 on the regex parser. On P1 (the test-generation prompt), it produced this:
```python
def test_invalidate_model_cache_resets_all_keys():
    global _model_cache  # <-- bug
    _model_cache = {"model": "x", "cost_matrix": "y", "timestamp": "z"}
    invalidate_model_cache()
    assert _model_cache["model"] is None
    ...
```
The `global` binding inside the test creates a `_model_cache` in the *test* module, not in `src.main_improved`. So `invalidate_model_cache` rebinds the source module’s dict while the assertion checks an unrelated name in the test module. The test silently passes for the wrong reason.
`deepseek-coder-v2:16b` made the same mistake. A model that handles regex flawlessly cannot necessarily reason about Python’s module-level rebinding semantics in a test scaffold. This is the strongest case I have for running at least two unrelated tasks before deciding which model to route where.
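The shape that survives the trap re-reads the cache through the module object after the call. My own sketch, assuming the module layout the prompts imply:

```python
import src.main_improved as mod

def test_invalidate_model_cache_resets_all_keys():
    # Seed through the module attribute so we touch the real cache.
    mod._model_cache = {"model": "x", "cost_matrix": "y", "timestamp": "z"}
    mod.invalidate_model_cache()
    # Re-read through the module: the function rebinds mod._model_cache,
    # so any reference captured before the call would be stale.
    assert mod._model_cache["model"] is None
```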
## Markdown fence compliance
Every prompt said “no markdown fences.” Compliance on P1 and P2:
| Model | P1 fences | P2 fences |
|---|---|---|
| qwen2.5-coder:14b | yes | yes |
| deepseek-coder-v2:16b | yes | yes |
| gemma4:latest | no | no |
| qwen3:14b (`think:false`) | no | no |
| qwen3.5:9b (`think:false`) | no | no |
| qwen3.5:4b (`think:false`) | no | no |
The instruct-tuned coder models (`qwen2.5-coder`, `deepseek-coder`) wrap output in fences regardless of the instruction. The Qwen3 family and `gemma4` follow the no-fences instruction. If your delegation wrapper does not strip fences before `exec()`, you’ll see “broken” output that’s actually correct code in a string.
## Cross-prompt confirmation: gemma4 + qwen2.5-coder on P3 (3 runs each)
To check that the gemma4 specialization wasn’t a one-prompt fluke, I ran 3 more runs of P3 (the dict flatten task) on the two stable models:
| Model | P3 Run 1 | Run 2 | Run 3 | Mean |
|---|---|---|---|---|
| **gemma4:latest** | 10/10 | 10/10 | 10/10 | **10.0** |
| qwen2.5-coder:14b | 10/10 | 10/10 | 9/10 | 9.7 |
`gemma4` went 6 for 6 across both code-gen prompts: perfect, byte-stable. `qwen2.5-coder` lost a single point on P3 run 3 with `if v:` (truthy check) instead of `if v is not None`, silently dropping a `None` value. Subtle, but the kind of idiomatic Python bug a real test would catch.
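The failure mode in miniature (a hypothetical minimal repro, not the model’s actual code): P3’s spec drops only empty dicts, but a bare truthiness test throws away falsy values too.

```python
vals = {"a": None, "b": 0, "c": {}}

truthy  = {k: v for k, v in vals.items() if v}
present = {k: v for k, v in vals.items() if not (isinstance(v, dict) and not v)}

assert truthy == {}                     # None and 0 vanish with the empty dict
assert present == {"a": None, "b": 0}   # only the empty dict dropped, per spec
```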
## Best-of-N + verifier rescue for the unstable models
The qwen3.5 family failed the variance check because at `temperature=0.2` it produced byte-identical buggy outputs. The natural fix: bump the temperature for diversity, sample N times, run the verifier, keep the passer.
5 samples of P2 at `temperature=0.7`:
| Model | Run scores (best to worst) | Best-of-5 | Hit rate >=18/22 | Wall time | VRAM |
|---|---|---|---|---|---|
| qwen3.5:9b (`think:false`) | 22, 21, 14, 14, 8 | **22/22** | 2/5 (40%) | 30 s | 6.6 GB |
| qwen3.5:4b (`think:false`) | 20, 20, 20, 13, 8 | **20/22** | 3/5 (60%) | 20 s | 3.4 GB |
Both produced 5 distinct hashes, so the diversity is real, not pathological. With a verifier in the loop:
- `qwen3.5:9b` best-of-5 matches `gemma4` and `qwen2.5-coder` single-shot (22/22 ceiling) at 6.6 GB and ~30s. Comparable to running `gemma4` directly. Not worth the complexity unless `gemma4` isn’t available.
- `qwen3.5:4b` best-of-5 is the real win: 20/22 ceiling at 3.4 GB and ~20s total. Fills the mini-tier slot for laptops or any machine where 9 GB of model is too much.
Caveat: best-of-N only works for tasks with a cheap automated verifier. For “draft a commit message” or “write a docstring” there’s no programmatic way to pick the best, so this strategy doesn’t help.
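The selection loop itself is tiny. A sketch, reusing the hypothetical `score` verifier from the setup section:

```python
import hashlib

def best_of_n(generate, score_fn, n: int = 5):
    """Sample n candidates, verify each, keep the highest scorer.
    generate() returns a code string; score_fn(code) returns passed cases."""
    best_score, best_code = -1, None
    hashes = set()
    for _ in range(n):
        code = generate()  # one temperature-0.7 sample
        hashes.add(hashlib.sha256(code.encode()).hexdigest())
        s = score_fn(code)
        if s > best_score:
            best_score, best_code = s, code
    # len(hashes) == n means the diversity is real, not byte-identical retries
    return best_code, best_score, len(hashes)
```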
## Routing rules I ended up with
- Parsers, regex, recursive transformers: `gemma4:latest`. Byte-stable 22/22 across 6 runs of 2 different prompts at temp 0.2.
- Tests, fixtures, anything needing Python module/runtime semantics: `qwen2.5-coder:14b`. Stable 20-22/22, the only model that handled the test-scaffolding trap correctly.
- Mini tier (laptop, 4 GB VRAM): `qwen3.5:4b` with `think:false`, sample 5x at temp 0.7, run verifier, keep passer. 3.4 GB, ~20s total.
- Skip: `qwen3:14b` (stably mediocre, 16.7/22 mean) and `deepseek-coder-v2:16b` (stably wrong: 0/6 valid inputs, same regex bug in 3/3 runs).
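Distilled into a table the dispatcher can read (the task-family keys are my own labels; the agent-side glue is out of scope here):

```python
# Hypothetical routing table distilled from the results above.
ROUTES = {
    "parse_transform": {"model": "gemma4:latest", "temperature": 0.2},
    "tests_runtime":   {"model": "qwen2.5-coder:14b", "temperature": 0.2},
    "mini_tier":       {"model": "qwen3.5:4b", "think": False,
                        "temperature": 0.7, "best_of": 5},  # verifier required
}
```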
Note on MCP wrappers: if you’re routing through a community Ollama MCP server, check whether it exposes `think:false`. The one I tested doesn’t, and it timed out at 120s on a prompt that the underlying model handles in 30s via direct `/api/generate`. The wrapper’s description also misreported which model it was wrapping. Verify before relying on it.
## What surprised me
The general-purpose model (`gemma4`) beat the dedicated coder model (`qwen2.5-coder:14b`) on every code-gen prompt that didn’t require Python runtime reasoning. The “coder” label means trained on code, not best at every code task. I went into this assuming the coder-tuned model was the safe default and I was wrong.
The single-shot ranking placed `qwen3.5:9b` at the top with a 0.985/1.0 score. Variance check showed 2 of 3 runs were broken with byte-identical output. If I’d shipped a routing policy off that ranking, I would have sent every parser-style task to a model that fails most of the time at temperature 0.2.
Logging only the `response` field on Ollama thinking-mode calls cost me 20 minutes of GPU debugging for what looked like crashes but was actually 21 KB of circular self-argument inside `thinking`. One missing line of logging.
## Open questions
- Which model are you using for the parser/transformer slot? I want to compare against `gemma4`. Especially curious about `granite-code:3b` and `phi-4-mini` for the same prompts.
- For the mini-tier slot, has anyone shipped `qwen3.5:4b` (or smaller) in a best-of-N + verifier loop in production? What’s your hit rate and N?
- Is anyone seeing similar bimodal behavior on `qwen3.5:9b` at low temperature on other constrained-format prompts, or is this specific to my prompt template?
## Repro
Bash wrappers, prompts, and verifiers are single-file scripts: no deps beyond `curl`, `jq`, and stdlib Python. Hardware was a 16 GB consumer GPU on WSL2. Happy to share if there’s interest.