z-lab/Qwen3-0.6B-PARO

Pairwise Rotation Quantization for Efficient Reasoning LLM Inference


ParoQuant is a state-of-the-art INT4 quantization method for LLMs. It closes the accuracy gap with FP16 while running at near-AWQ speed. It supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX). For more information, see https://github.com/z-lab/paroquant.

z-lab/Qwen3-0.6B-PARO is a 4-bit version of Qwen/Qwen3-0.6B quantized with ParoQuant. Check out other ParoQuant models in the Hugging Face collection.

Quick Start

Installation

# NVIDIA GPU (CUDA 12.9)
pip install "paroquant[vllm]"

# NVIDIA GPU (CUDA 13.0)
pip install "paroquant[vllm]" "vllm==0.19.1" \
  --extra-index-url https://wheels.vllm.ai/0.19.1/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130

# Apple Silicon
pip install "paroquant[mlx]"

Interactive Chat

python -m paroquant.cli.chat --model z-lab/Qwen3-0.6B-PARO

OpenAI-Compatible API Server

For vLLM, you can serve ParoQuant models directly with vllm serve:

vllm serve $MODEL --port 8000

For other frameworks:

python -m paroquant.cli.serve --model $MODEL --port 8000
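
Once a server is running, any OpenAI-compatible client can query it. Below is a minimal sketch using only the Python standard library; the endpoint URL and port are assumptions matching the commands above, and the helper names (`build_payload`, `chat`) are illustrative, not part of the paroquant package.

```python
import json
import urllib.request

# Assumed endpoint of the server started above; adjust host/port to match.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("z-lab/Qwen3-0.6B-PARO", "Explain INT4 quantization in one sentence."))
```

Any other OpenAI-compatible client (e.g. the official openai Python package pointed at the same base URL) should work the same way.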

Docker (NVIDIA GPU)

The following commands mount the local cache directory into the container to persist the kernel cache across runs. Remove the -v ... flag to disable this behaviour.

# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-0.6B-PARO

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3-0.6B-PARO

Citation

@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}