Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP

A Claude Opus + Sonnet distilled Qwen 3.6 27B checkpoint, optimized for higher token efficiency, preserved deep reasoning, and MTP-accelerated vLLM deployment on Blackwell-class GPUs.

This release is designed around one practical goal:

make high-quality local deployment feel fast enough, responsive enough, and efficient enough to use every day.

Highlights

  • Higher token efficiency: less budget wasted on invisible long-form reasoning, more budget converted into visible answers
  • Reasoning depth preserved: no obvious degradation in math, logic, or complex systems prompts in local spot checks
  • MTP actually works: speculative decoding is not just included in the files; it shows a measurable benefit in the tested deployment stack
  • Better interactive UX: faster visible output, lower waiting time, and a much more usable local API experience

In short:

this is not about making the model think less; it is about making the model spend more of its token budget on answers users can actually see.

Why This Release

The value of this repository is not just "an NVFP4 checkpoint".

It is the combination of:

  • Claude Opus + Sonnet distilled response style and reasoning organization
  • NVFP4 deployment efficiency
  • MTP speculative decoding
  • a vLLM deployment path that is already validated locally

The goal is not a benchmark-only artifact. The goal is to help high-quality local models become meaningfully deployable sooner.

Performance Snapshot

Efficiency comparison chart

[Chart: NVFP4 vs lmstudio-community/Qwen3.6-27B-GGUF efficiency comparison]

Non-rigorous local spot checks on the tested deployment stack. The chart is used for README communication rather than formal benchmarking.

This chart makes one point:

for local deployment, it is not enough that a model can answer well; it also has to turn waiting time and token budget into visible value.

Against vanilla GGUF

In local comparisons, the most important improvements are user-visible:

  • Short-chat visible throughput: 157.5 tok/s vs 54.7 tok/s, about 2.9x
  • Mid-generation visible throughput: 151.0 tok/s vs 79.6 tok/s, about 1.9x
  • Short-chat visible TTFT: 0.88s vs 9.54s, about 91% lower wait time
  • Mid-generation visible TTFT: 1.41s vs 24.96s, about 94% lower wait time

For local deployment, these differences matter more than aggregate backend token counts because users care about:

  • when the first useful answer shows up
  • how much of the budget becomes visible output
  • whether the model feels interactive enough for real workflows

MTP result on the tested stack

In the current local environment, MTP=3 is effective:

  • average acceptance: about 1.89 / 3
  • speedup: about 1.73x
  • observed single-request decode speed: roughly 145-194 tok/s
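
As a rough sanity check on how acceptance relates to speedup, the standard speculative-decoding accounting can be sketched as below. This is a back-of-envelope illustration under the usual assumption that each verification step emits the accepted draft tokens plus one token from the target model; it is not part of the measured data.

# Back-of-envelope MTP accounting (sketch, assuming standard speculative-decoding bookkeeping)
avg_accepted = 1.89                     # reported average acceptance out of 3 draft tokens
tokens_per_step = 1 + avg_accepted      # ~2.89 tokens emitted per target forward pass
ideal_ceiling = tokens_per_step / 1.0   # vs. 1 token per step without MTP, ignoring draft cost
print(f"ideal ceiling ~{ideal_ceiling:.2f}x; observed ~1.73x after draft and verification overhead")

The observed 1.73x sits below that ceiling because the MTP draft passes and verification add their own cost.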

Important note:

  • the MTP figures above are a local MTP measurement
  • the NVFP4 vs vanilla GGUF numbers are an end-to-end deployment experience comparison
  • they are related, but they are not the same measurement

Non-Rigorous Spot Checks

This section is intentionally labeled as non-rigorous local testing. It is here to explain the selling points, not to claim a formal benchmark victory.

Summary

  • Token use is much more efficient: under similar correctness, more of the budget becomes visible answer text
  • Deep reasoning does not obviously collapse: math, logic, systems design, and collaborative editing prompts remain strong
  • Vanilla is often longer, not necessarily better: a lot of the extra tokens go into hidden reasoning
  • Interactive quality is the big differentiator: this release feels far more usable as a local API model

Spot-check table

| Prompt | This release | lmstudio-community/Qwen3.6-27B-GGUF | Takeaway |
| --- | --- | --- | --- |
| Three-switch logic | Correct; about 6.8s; cleaner output | Correct; about 23.5s; much longer reasoning | Similar correctness, much better visible efficiency |
| Bayesian derivation | Correct; complete derivation; about 479 visible tok | Correct; about 720 visible tok + 2065 reasoning tok | Reasoning quality preserved, much lower token waste |
| Hotel paradox | Correct; structured explanation; about 7.2s TTFT | Correct; about 21.6s TTFT | Large user-visible latency gap |
| Distributed rate limiting | Complete and practical design | More exhaustive and verbose | Vanilla is more sprawling, but this release does not show clear capability collapse |
| Collaborative recovery | Covers CRDT, version vectors, GC | Longer and more detailed recovery flow | This release is more interaction-friendly |

Quick Start

Validated environment:

  • vLLM
  • RTX PRO 6000 96GB
  • WSL2
  • CUDA 13.0

Serve command:

python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/d/Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP \
  --served-model-name qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp \
  --quantization compressed-tensors \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --kv-cache-dtype fp8 \
  --dtype auto \
  --language-model-only \
  --reasoning-parser qwen3 \
  --chat-template /mnt/d/Qwen3.6-27B-Claude-Opus-Sonnet-Distilled-NVFP4-MTP/chat_template.jinja \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.92 \
  --max-num-batched-tokens 50000 \
  --max-num-seqs 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
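
Once the server is up, a minimal request against the OpenAI-compatible endpoint works as a smoke test. The sketch below assumes the openai Python client, the host/port from the serve command above, and the served model name; adjust to your setup.

# Minimal smoke test for the local vLLM OpenAI-compatible server started above.
# Assumes: pip install openai, server reachable on localhost:8000, no auth required.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp",
    messages=[{"role": "user", "content": "Give a three-sentence answer to the three-switch puzzle."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)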

If you want stable visible output instead of explicit thinking traces, disable thinking at request time:

"chat_template_kwargs": {
  "enable_thinking": false
}
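
With the openai Python client, the same field can be passed per request through extra_body. This is a minimal sketch assuming vLLM's chat endpoint accepts chat_template_kwargs as an extra request field; verify against your vLLM version.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Disable explicit thinking for this request by passing chat_template_kwargs
# through extra_body (assumed request shape; confirm with your vLLM version).
resp = client.chat.completions.create(
    model="qwen3.6-27b-claude-opus-sonnet-distilled-nvfp4-mtp",
    messages=[{"role": "user", "content": "Summarize the hotel paradox in five bullet points."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)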

Limitations

  • The current results in this card are based on the tested Blackwell + vLLM + NVFP4 + MTP stack
  • This is not a universal cross-hardware conclusion; re-test on your own stack
  • The comparisons in this card are non-rigorous local spot checks
  • This card documents a validated text-serving path, not a production-validated multimodal release

Acknowledgements

Special thanks to Unsloth and Qwen.

Their work is a big part of why high-quality local deployment now feels within reach, and why building practical local reasoning systems keeps getting more accessible.
