Qwen 3.6 35B A3B did the agentic spec workflow 13 models couldn't — 174 reqs on the OpenAI SDK, no scaffolding

**TL;DR.** I pointed an MCP-driven agentic spec-writing pipeline at the [excalidraw](https://github.com/excalidraw/excalidraw) codebase and ran the same brief through 13 models. The headline result for this audience: **`@Qwen/Qwen3.6-35B-A3B-Instruct`**, an MoE running locally on LM Studio at 50k-token context, produced **174 requirements across 23 top-level features**, within 12% of Claude Opus 4.7's 197. Every other OpenAI-SDK model on the table (cloud or local) sat at 13–60. Qwen 3.6 is the only open-weights model that closed the gap. Side-by-side comparison (pick any feature node and read what each model actually wrote): AI Specification Generation - Model Comparison | SPECLAN

**Why this is news (specifically for the local-LLM-on-agentic-flows folks here).** Two weeks ago I wrote up 7 local models running an MCP tool-calling spec workflow, and **every MoE variant I tested failed** — `@google/gemma-4-26b-a4b` looped on the same tool 16x, `@openai/gpt-oss-20b` hallucinated completion (claimed to write a file, didn’t), `@google/gemma-4-e4b` never produced a final response. The conclusion at the time was *“dense beats MoE for agentic tool-calling, regardless of parameter count.”* Qwen 3.6 35B A3B (3B active, 35B total — sparse MoE) just walked through the same kind of workflow, called dozens of tools across multiple turns, and self-terminated cleanly. Full disclosure: I’m building [SPECLAN](https://speclan.net), the VS Code extension running this pipeline; that’s where the trees live.
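The failure modes above (tool loops, hallucinated completion, never terminating) are all visible from the harness side. Here is a rough sketch of the kind of loop involved, in TypeScript against LM Studio's default local OpenAI-compatible endpoint, with a hypothetical tool dispatcher and model identifier (SPECLAN's actual harness differs), plus a guard that would have caught the 16x tool loop:

```typescript
import OpenAI from "openai";

// LM Studio serves an OpenAI-compatible API on localhost:1234 by default; the key is ignored.
const client = new OpenAI({ baseURL: "http://localhost:1234/v1", apiKey: "lm-studio" });

// Hypothetical stand-in for dispatching a tool call to the MCP server.
async function runTool(name: string, args: string): Promise<string> {
  return JSON.stringify({ ok: true, tool: name, echo: JSON.parse(args || "{}") });
}

async function agentLoop(
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
  tools: OpenAI.Chat.ChatCompletionTool[],
): Promise<string | null> {
  let lastCall = "";
  let repeats = 0;

  for (let turn = 0; turn < 64; turn++) {
    const res = await client.chat.completions.create({
      model: "qwen3.6-35b-a3b-instruct", // whichever identifier LM Studio shows for the loaded model
      messages,
      tools,
    });
    const msg = res.choices[0].message;
    messages.push(msg);

    // No tool calls: the model wrote a final response and self-terminated. This is the step
    // the failed MoEs never reached, or claimed to reach without doing the work.
    if (!msg.tool_calls?.length) return msg.content;

    for (const call of msg.tool_calls) {
      if (!("function" in call)) continue; // skip non-function tool call variants

      // Guard against the gemma-style failure: the same tool called with the same
      // arguments over and over instead of making progress.
      const signature = call.function.name + call.function.arguments;
      repeats = signature === lastCall ? repeats + 1 : 0;
      lastCall = signature;
      if (repeats >= 5) throw new Error(`stuck looping on ${call.function.name}`);

      messages.push({
        role: "tool",
        tool_call_id: call.id,
        content: await runTool(call.function.name, call.function.arguments),
      });
    }
  }
  throw new Error("turn budget exhausted without a final response");
}
```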

**The non-obvious finding (the SDK matters more than the model).** The 196–203 requirement band across the three Claude models (Opus, Sonnet, Haiku), regardless of size, is not “Anthropic models are better.” It’s a scaffolding artefact. The Anthropic SDK ships a built-in **persistent todo list**, a planner primitive, and scratchpad memory as first-class agent tools. Spec authoring is a list-management problem (enumerate features, write one, cross off, move to the next); the SDK solves half of it out of the box. When Haiku writes 16 top-level features, those become 16 items in the SDK’s todo list, and subsequent turns just work through them. The OpenAI SDK exposes our MCP tools only: no built-in planner, no todo list, no scratchpad. Every other model on the table (GPT 5.4, the Gemini previews, all local) starts with no list. **The clustering at 196–203 is the SDK floor, not the model ceiling. The clustering at 13–60 is the absence-of-scaffolding floor.**
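To make the scaffolding gap concrete: on the Anthropic side the todo list and planner arrive as built-in agent tools, while on the OpenAI-SDK path the model gets only whatever you hand it. A hypothetical sketch (these are not SPECLAN's tools) of what levelling the field would look like as OpenAI-style function tools:

```typescript
import OpenAI from "openai";

// Hypothetical scaffolding tools emulating the Anthropic SDK's built-in todo list
// on the OpenAI-SDK path. The harness would persist the list and serve it back.
const scaffoldingTools: OpenAI.Chat.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "todo_add",
      description: "Add a pending item, e.g. a top-level feature still to be specified.",
      parameters: {
        type: "object",
        properties: { item: { type: "string" } },
        required: ["item"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "todo_complete",
      description: "Mark an item done after its requirements have been written.",
      parameters: {
        type: "object",
        properties: { item: { type: "string" } },
        required: ["item"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "todo_list",
      description: "Return all items with their status so the model can pick the next one.",
      parameters: { type: "object", properties: {} },
    },
  },
];
```

With state like that persisted harness-side, “enumerate features, write one, cross off, next” becomes explicit tool turns instead of bookkeeping the model has to carry in its own context, which is exactly where the 13–60 cluster falls over.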

What makes Qwen 3.6 the data point that matters: it ran on the OpenAI-SDK path with **zero scaffolding** and produced 174 requirements. Same harness as `@google/gemma-4-31b` dense (60 reqs) and GPT 5.4 (43 reqs). My working hypothesis: Qwen 3.6’s training mix is heavy on agentic tool-call trajectories, so the model **internalized** the bookkeeping the Anthropic SDK externalizes as tools. I can’t prove that from the output alone, but it’s the only explanation consistent with the data: same harness, three to four times the output of the next-best unscaffolded models.

**For an HF builder picking a local daily-driver.** If you have ~64 GB+ of unified memory and you want a local model for multi-step agentic workflows where the model has to plan, enumerate, and self-terminate (spec writing, multi-step refactors, agentic test generation), `@Qwen/Qwen3.6-35B-A3B-Instruct` via LM Studio with 50k context is the working pick. Nothing else on the open-weights side is close on this workload as of 2026-04-29.
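For a quick smoke test before committing to a long agentic run: LM Studio serves an OpenAI-compatible API on localhost:1234 by default, so the stock OpenAI client works unchanged. The model identifier below is a placeholder (use whatever your LM Studio instance reports), and the 50k context is set when loading the model in LM Studio, not through the API.

```typescript
import OpenAI from "openai";

// LM Studio's local server speaks the OpenAI API; the key is ignored but required by the client.
const client = new OpenAI({ baseURL: "http://localhost:1234/v1", apiKey: "lm-studio" });

// Confirm the Qwen 3.6 35B A3B identifier shows up among the loaded models.
const models = await client.models.list();
console.log(models.data.map((m) => m.id));

// One-shot check before wiring the model into the agentic harness.
const res = await client.chat.completions.create({
  model: "qwen3.6-35b-a3b-instruct", // replace with the identifier printed above
  messages: [{ role: "user", content: "Reply with OK if you can see this." }],
});
console.log(res.choices[0].message.content);
```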

**Open question.** Has anyone reproduced this pattern with other agentic-tuned MoEs? If you’ve found another MoE that breaks the “MoE-fails-multi-step-tool-calls” rule, drop the model + harness combo — I’d rather learn from your trace than rerun mine.

Full 13-model comparison + the SDK disclaimer in detail + the Gemini-preview anomaly (a frontier cloud model authored an `Account and Billing Management` feature for a product with no billing): AI Specification Generation - Model Comparison | SPECLAN