Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP
Claude Opus + Sonnet distilled Qwen 3.6 27B optimized for higher token efficiency, preserved deep reasoning, and MTP-accelerated vLLM deployment on Blackwell-class GPUs.
This release is designed around one practical goal:
make high-quality local deployment feel fast enough, responsive enough, and efficient enough to use every day.
Highlights
- Higher token efficiency: less budget wasted on invisible long-form reasoning, more budget converted into visible answers
- Reasoning depth preserved: no obvious degradation in math, logic, or complex systems prompts in local spot checks
- MTP actually works: speculative decoding is not just included in the files; it shows a measurable benefit in the tested deployment stack
- Better interactive UX: faster visible output, lower waiting time, and a much more usable local API experience
In short:
this is not about making the model think less; it is about making it spend more of its token budget on answers users can actually see.
Why This Release
The value of this repository is not just "an NVFP4 checkpoint".
It is the combination of:
- Claude Opus + Sonnet distilled response style and reasoning organization
- NVFP4 deployment efficiency
- MTP speculative decoding
- a vLLM deployment path that is already validated locally
The goal is not a benchmark-only artifact. The goal is to help high-quality local models become meaningfully deployable sooner.
Performance Snapshot
Efficiency comparison chart (figure): non-rigorous local spot checks on the tested deployment stack. The chart is used for README communication rather than formal benchmarking.
This chart makes one point:
for local deployment, it is not enough that a model can answer well; it also has to turn waiting time and token budget into visible value.
Against vanilla GGUF
In local comparisons, the most important improvements are user-visible:
- Short-chat visible throughput: `157.5 tok/s` vs `54.7 tok/s`, about 2.9x
- Mid-generation visible throughput: `151.0 tok/s` vs `79.6 tok/s`, about 1.9x
- Short-chat visible TTFT: `0.88s` vs `9.54s`, about 91% lower wait time
- Mid-generation visible TTFT: `1.41s` vs `24.96s`, about 94% lower wait time
For local deployment, these differences matter more than aggregate backend token counts because users care about:
- when the first useful answer shows up
- how much of the budget becomes visible output
- whether the model feels interactive enough for real workflows
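If you want to reproduce this kind of spot check on your own stack, the sketch below shows one way to measure visible TTFT and visible throughput against the local server. It assumes the Quick Start deployment further down is running on `localhost:8000`; the prompt, the helper name, and the chunk-counting approximation are illustrative, not the exact methodology behind the numbers above.

```python
import json
import time
import requests

# Sketch of a local TTFT / visible-throughput spot check, assuming the server from
# the Quick Start section is running on localhost:8000. Each streamed content chunk
# is counted as roughly one visible token, which is an approximation, not the exact
# tokenizer-based accounting behind the figures quoted above.
def spot_check(prompt: str, max_tokens: int = 512) -> None:
    url = "http://localhost:8000/v1/chat/completions"
    body = {
        "model": "qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }
    start = time.time()
    first_visible = None
    visible_chunks = 0
    with requests.post(url, json=body, stream=True, timeout=600) as resp:
        for raw in resp.iter_lines():
            if not raw or not raw.startswith(b"data: ") or raw == b"data: [DONE]":
                continue
            delta = json.loads(raw[len(b"data: "):])["choices"][0]["delta"]
            # Count visible text only; with the reasoning parser enabled,
            # thinking traces arrive in a separate field.
            if delta.get("content"):
                if first_visible is None:
                    first_visible = time.time()
                visible_chunks += 1
    end = time.time()
    ttft = (first_visible or end) - start
    rate = visible_chunks / max(end - start - ttft, 1e-6)
    print(f"visible TTFT ~{ttft:.2f}s, visible throughput ~{rate:.1f} tok/s")

spot_check("Explain the three-switch light bulb puzzle and its solution.")
```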
MTP result on the tested stack
In the current local environment, MTP=3 is effective:
- average acceptance: about `1.89 / 3`
- speedup: about `1.73x`
- observed single-request decode speed: roughly `145-194 tok/s`
Important note:
- the MTP figures above are a local MTP measurement
- the `NVFP4 vs vanilla GGUF` numbers are an end-to-end deployment experience comparison
- they are related, but they are not the same measurement
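To relate the acceptance figure to the speedup figure, here is a back-of-envelope sanity check under a simplified model: if about 1.89 of the 3 drafted tokens are accepted per verification step, each target forward pass emits roughly `accepted + 1` tokens, so about 2.89x is the acceptance-implied ceiling; the measured 1.73x sits below that because drafting and multi-token verification are not free. This is an assumption for intuition, not a metric reported by vLLM.

```python
# Rough sanity check relating the reported MTP acceptance to the reported speedup.
# Simplified model (an assumption, not a metric reported by vLLM): with an average
# of `accepted` draft tokens accepted per verification step, each target forward
# pass emits roughly (accepted + 1) tokens instead of 1, so (accepted + 1) is an
# upper bound on the achievable decode speedup.
accepted = 1.89            # average accepted draft tokens per step (local measurement)
measured_speedup = 1.73    # local measurement

ceiling = accepted + 1.0                   # ~2.89x if drafting/verification were free
efficiency = measured_speedup / ceiling    # fraction of that ceiling actually realized

print(f"acceptance-implied ceiling: {ceiling:.2f}x")
print(f"measured speedup: {measured_speedup:.2f}x ({efficiency:.0%} of the ceiling)")
```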
Non-Rigorous Spot Checks
This section is intentionally labeled as non-rigorous local testing. It is here to explain the selling points, not to claim a formal benchmark victory.
Summary
- Token use is much more efficient: under similar correctness, more of the budget becomes visible answer text
- Deep reasoning does not obviously collapse: math, logic, systems design, and collaborative editing prompts remain strong
- Vanilla is often longer, not necessarily better: a lot of the extra tokens go into hidden reasoning
- Interactive quality is the big differentiator: this release feels far more usable as a local API model
Spot-check table
| Prompt | This release | lmstudio-community/Qwen3.6-27B-GGUF | Takeaway |
|---|---|---|---|
| Three-switch logic | Correct; about 6.8s; cleaner output | Correct; about 23.5s; much longer reasoning | Similar correctness, much better visible efficiency |
| Bayesian derivation | Correct; complete derivation; about 479 visible tok | Correct; about 720 visible tok + 2065 reasoning tok | Reasoning quality preserved, much lower token waste |
| Hotel paradox | Correct; structured explanation; about 7.2s TTFT | Correct; about 21.6s TTFT | Large user-visible latency gap |
| Distributed rate limiting | Complete and practical design | More exhaustive and verbose | Vanilla is more sprawling, but this release does not show clear capability collapse |
| Collaborative recovery | Covers CRDT, version vectors, GC | Longer and more detailed recovery flow | This release is more interaction-friendly |
Quick Start
Validated environment:
`vLLM`, RTX PRO 6000 96GB, WSL2, CUDA 13.0
Serve command:
```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/d/Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP \
  --served-model-name qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp \
  --quantization compressed-tensors \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --kv-cache-dtype fp8 \
  --dtype auto \
  --language-model-only \
  --reasoning-parser qwen3 \
  --chat-template /mnt/d/Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP/chat_template.jinja \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.92 \
  --max-num-batched-tokens 50000 \
  --max-num-seqs 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
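Once the server is up, a quick smoke test confirms it is serving under the expected name. This is a minimal sketch; the host, port, and served model name simply mirror the command above.

```python
import requests

# Quick smoke test: confirm the server is up and serving under the expected name
# (host, port, and served model name mirror the serve command above).
models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])
# Expected: ['qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp']
```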
If you want stable visible output instead of explicit thinking traces, disable thinking at request time:
"chat_template_kwargs": {
"enable_thinking": false
}
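As a minimal request sketch (assuming the Quick Start server on `localhost:8000`; the prompt and `max_tokens` value are illustrative), the fragment above goes directly into the request body of the OpenAI-compatible chat completions endpoint:

```python
import requests

# Minimal non-streaming request with thinking disabled via chat_template_kwargs,
# sent in the body of the OpenAI-compatible chat completions endpoint.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp",
        "messages": [{"role": "user", "content": "Summarize why TTFT matters for local deployment."}],
        "max_tokens": 256,
        "chat_template_kwargs": {"enable_thinking": False},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```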
Limitations
- The current results in this card are based on the tested `Blackwell + vLLM + NVFP4 + MTP` stack
- This is not a universal cross-hardware conclusion; re-test on your own stack
- The comparisons in this card are non-rigorous local spot checks
- This card documents a validated text-serving path, not a production-validated multimodal release
Acknowledgements
Special thanks to Unsloth and Qwen.
They are a big part of why high-quality local deployment now feels much more within reach, and why building practical local reasoning systems is becoming increasingly accessible.
Base model: `Qwen/Qwen3.6-27B`