Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP

A Claude Opus + Sonnet distilled Qwen 3.6 27B checkpoint, optimized for higher token efficiency, preserved deep reasoning, and MTP-accelerated vLLM deployment on Blackwell-class GPUs.

This release is designed around one practical goal:

make high-quality local deployment feel fast enough, responsive enough, and efficient enough to use every day.

Highlights

  • Higher token efficiency: less budget wasted on invisible long-form reasoning, more budget converted into visible answers
  • Reasoning depth preserved: no obvious degradation in math, logic, or complex systems prompts in local spot checks
  • MTP actually works: speculative decoding is not just included in the files; it shows a measurable benefit in the tested deployment stack
  • Better interactive UX: faster visible output, lower waiting time, and a much more usable local API experience

In short:

this is not about making the model think less; it is about making the model spend more of its token budget on answers users can actually see.

Why This Release

The value of this repository is not just "an NVFP4 checkpoint".

It is the combination of:

  • Claude Opus + Sonnet distilled response style and reasoning organization
  • NVFP4 deployment efficiency
  • MTP speculative decoding
  • a vLLM deployment path that is already validated locally

The goal is not a benchmark-only artifact. The goal is to help high-quality local models become meaningfully deployable sooner.

Performance Snapshot

Efficiency comparison chart

[Chart: NVFP4 vs lmstudio-community/Qwen3.6-27B-GGUF efficiency comparison]

Non-rigorous local spot checks on the tested deployment stack. The chart is used for README communication rather than formal benchmarking.

This chart makes one point:

for local deployment, it is not enough that a model can answer well; it also has to turn waiting time and token budget into visible value.

Against vanilla GGUF

In local comparisons, the most important improvements are user-visible:

  • Short-chat visible throughput: 157.5 tok/s vs 54.7 tok/s, about 2.9x
  • Mid-generation visible throughput: 151.0 tok/s vs 79.6 tok/s, about 1.9x
  • Short-chat visible TTFT: 0.88s vs 9.54s, about 91% lower wait time
  • Mid-generation visible TTFT: 1.41s vs 24.96s, about 94% lower wait time

For local deployment, these differences matter more than aggregate backend token counts because users care about:

  • when the first useful answer shows up
  • how much of the budget becomes visible output
  • whether the model feels interactive enough for real workflows

MTP result on the tested stack

In the current local environment, MTP=3 is effective:

  • average acceptance: about 1.89 / 3
  • speedup: about 1.73x
  • observed single-request decode speed: roughly 145-194 tok/s
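
As a rough sanity check on how acceptance relates to speedup, the standard speculative-decoding accounting can be sketched as below. This is a back-of-envelope illustration under the usual assumption that each verification step emits the accepted draft tokens plus one token from the target model; it is not part of the measured data.

# Back-of-envelope MTP accounting (sketch, assuming standard speculative-decoding bookkeeping)
avg_accepted = 1.89                     # reported average acceptance out of 3 draft tokens
tokens_per_step = 1 + avg_accepted      # ~2.89 tokens emitted per target forward pass
ideal_ceiling = tokens_per_step / 1.0   # vs. 1 token per step without MTP, ignoring draft cost
print(f"ideal ceiling ~{ideal_ceiling:.2f}x; observed ~1.73x after draft and verification overhead")

The observed 1.73x sits below that ceiling because the MTP draft passes and verification add their own cost.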

Important note:

  • the MTP figures above are a local MTP measurement
  • the NVFP4 vs vanilla GGUF numbers are an end-to-end deployment experience comparison
  • they are related, but they are not the same measurement

Non-Rigorous Spot Checks

This section is intentionally labeled as non-rigorous local testing. It is here to explain the selling points, not to claim a formal benchmark victory.

Summary

  • Token use is much more efficient: under similar correctness, more of the budget becomes visible answer text
  • Deep reasoning does not obviously collapse: math, logic, systems design, and collaborative editing prompts remain strong
  • Vanilla is often longer, not necessarily better: a lot of the extra tokens go into hidden reasoning
  • Interactive quality is the big differentiator: this release feels far more usable as a local API model

Spot-check table

| Prompt | This release | lmstudio-community/Qwen3.6-27B-GGUF | Takeaway |
| --- | --- | --- | --- |
| Three-switch logic | Correct; about 6.8s; cleaner output | Correct; about 23.5s; much longer reasoning | Similar correctness, much better visible efficiency |
| Bayesian derivation | Correct; complete derivation; about 479 visible tok | Correct; about 720 visible tok + 2065 reasoning tok | Reasoning quality preserved, much lower token waste |
| Hotel paradox | Correct; structured explanation; about 7.2s TTFT | Correct; about 21.6s TTFT | Large user-visible latency gap |
| Distributed rate limiting | Complete and practical design | More exhaustive and verbose | Vanilla is more sprawling, but this release does not show clear capability collapse |
| Collaborative recovery | Covers CRDT, version vectors, GC | Longer and more detailed recovery flow | This release is more interaction-friendly |

Quick Start

Validated environment:

  • vLLM
  • RTX PRO 6000 96GB
  • WSL2
  • CUDA 13.0

Serve command:

python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/d/Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP \
  --served-model-name qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp \
  --quantization compressed-tensors \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --kv-cache-dtype fp8 \
  --dtype auto \
  --language-model-only \
  --reasoning-parser qwen3 \
  --chat-template /mnt/d/Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP/chat_template.jinja \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.92 \
  --max-num-batched-tokens 50000 \
  --max-num-seqs 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
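
Once the server is up, a minimal request against the OpenAI-compatible endpoint works as a smoke test. The sketch below assumes the openai Python client, the host/port from the serve command above, and the served model name; adjust to your setup.

# Minimal smoke test for the local vLLM OpenAI-compatible server started above.
# Assumes: pip install openai, server reachable on localhost:8000, no auth required.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp",
    messages=[{"role": "user", "content": "Give a three-sentence answer to the three-switch puzzle."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)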

If you want stable visible output instead of explicit thinking traces, disable thinking at request time:

"chat_template_kwargs": {
  "enable_thinking": false
}
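
With the openai Python client, the same field can be passed per request through extra_body. This is a minimal sketch assuming vLLM's chat endpoint accepts chat_template_kwargs as an extra request field; verify against your vLLM version.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Disable explicit thinking for this request by passing chat_template_kwargs
# through extra_body (assumed request shape; confirm with your vLLM version).
resp = client.chat.completions.create(
    model="qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp",
    messages=[{"role": "user", "content": "Summarize the hotel paradox in five bullet points."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)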

Limitations

  • The current results in this card are based on the tested Blackwell + vLLM + NVFP4 + MTP stack
  • This is not a universal cross-hardware conclusion; re-test on your own stack
  • The comparisons in this card are non-rigorous local spot checks
  • This card documents a validated text-serving path, not a production-validated multimodal release

Acknowledgements

Special thanks to Unsloth and Qwen.

Their work is a big part of why high-quality local deployment now feels within reach, and why building practical local reasoning systems keeps getting more accessible.
