Issues with Pruning and Quantization of Hugging Face LLMs on CPU

Hi everyone,

I’m experimenting with pruning and quantizing LLMs (like unsloth/gemma-2-2b-it) on CPU and running into several issues. I’m hoping to get advice or best practices. Here’s what I observed:


:one: Model size increases after pruning

  • After structured pruning (~30%), the model size doubled instead of decreasing.

  • I suspect this is because torch.nn.utils.prune keeps a weight_orig parameter plus a weight_mask buffer for every pruned module until prune.remove() is called (sketch below).
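
The usual fix is to make the pruning permanent before saving. A minimal sketch, assuming `model` is an already-loaded Transformers causal LM:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# After any prune.* call, each pruned module carries a `weight_orig` parameter
# plus a `weight_mask` buffer, which is why the checkpoint roughly doubles.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# Fold the mask into the weight and drop weight_orig/weight_mask so the
# saved file goes back to its normal size.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

model.save_pretrained("gemma-2b-pruned")  # output path is just an example
```

Note that even after prune.remove() the file only returns to its original size: zeroed weights are still stored as dense floats, so pruning alone does not shrink a standard checkpoint.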


:two: Accuracy and inference time unchanged

  • After pruning, accuracy and response time remain almost identical.

  • Only a portion of the weights were zeroed, and dense CPU kernels still compute over the zeros, so inference doesn’t get faster automatically (quick check below).
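
A quick way to confirm this, assuming `model` is the pruned model from above:

```python
# Measure actual weight sparsity. High sparsity with unchanged latency means
# the pruning worked, but dense CPU kernels still compute over the zeros.
total = zeros = 0
for param in model.parameters():
    if param.dim() == 2:  # weight matrices only
        total += param.numel()
        zeros += (param == 0).sum().item()
print(f"weight sparsity: {zeros / total:.1%}")
```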


:three: 4-bit quantization on CPU

  • Attempting 4-bit quantization fails on CPU.

  • The bitsandbytes library is GPU-optimized (its 4-bit kernels require CUDA), so CPU-only systems aren’t supported (typical failing setup below).
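
For reference, this is the typical 4-bit setup that requires a CUDA device; on a CPU-only machine the load errors out rather than quantizing:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Standard NF4 4-bit config; bitsandbytes' 4-bit kernels are CUDA-only.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-2-2b-it",
    quantization_config=bnb_config,  # fails without a CUDA GPU
)
```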


:four: INT8 quantization issues

  • INT8 quantization sometimes crashes when saving with save_pretrained().

  • It seems Transformers’ serialization doesn’t fully support int8 tensors on CPU (workaround sketch below).
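
A workaround sketch: PyTorch’s dynamic INT8 quantization runs on CPU, and serializing with torch.save sidesteps save_pretrained() (the output filename is just an example):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("unsloth/gemma-2-2b-it")

# Dynamic quantization: nn.Linear weights become INT8, activations are
# quantized on the fly at inference time. CPU-only by design.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized module graph is not a regular Transformers checkpoint,
# so save the state_dict directly; re-apply quantize_dynamic before loading it.
torch.save(qmodel.state_dict(), "gemma-2b-int8-dynamic.pt")
```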


:five: Package / environment issues

  • Missing bitsandbytes → 4-bit quantization fails.

  • Missing sentencepiece → tokenizer fails.

  • Missing langchain.text_splitter → ingestion fails.


:six: Saving pruned + quantized model

  • A pruned + quantized model sometimes fails to save, or its on-disk size doubles (likely the same mask issue as in :one:).

:seven: GPU vs CPU differences

  • On CPU, I can’t benefit from 4-bit quantization or its speed-ups.

  • The optimized kernels that deliver the memory and latency improvements are GPU-only.


Questions:

  1. Is there a recommended way to prune and quantize models on CPU without increasing size?

  2. How do people typically handle saving pruned + quantized models?

  3. Any tips to get speed/memory benefits on CPU?

  4. Are there alternative approaches for CPU-only systems to reduce memory while maintaining accuracy?

Thanks in advance for any guidance!

While the pruning problems above can themselves be worked around, the performance degradation caused by pruning is significantly greater than that from quantization. Recovering performance requires extensive fine-tuning on a GPU after pruning, making it a very difficult path. It is best avoided except for research or specialized purposes.

When aiming for memory savings and speed on CPU, simply using a smaller LLM together with quantization methods and backends optimized for CPU yields faster inference and higher accuracy.
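
For example, a sketch using llama-cpp-python with a GGUF build of the model; the repo id and filename below are assumptions, substitute whichever GGUF quantization you actually use:

```python
from llama_cpp import Llama

# llama.cpp runs 4-bit GGUF models (e.g. Q4_K_M) with CPU-optimized kernels,
# giving real memory and latency savings without a GPU.
llm = Llama.from_pretrained(
    repo_id="bartowski/gemma-2-2b-it-GGUF",  # assumed community GGUF repo
    filename="*Q4_K_M.gguf",                 # assumed 4-bit quant file
    n_ctx=4096,
)
print(llm("Explain pruning vs. quantization in one sentence.", max_tokens=64))
```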

:one: Pruning Questions

Hi community,
I am pruning the Gemma 2B model using PyTorch’s nn.utils.prune. After pruning ~20–30% of the weights, I notice some performance degradation.

  • What are the best practices to minimize accuracy loss during pruning?

  • Is structured pruning recommended over unstructured pruning for LLMs like Gemma? (both sketched below)

  • After pruning, is LoRA fine-tuning sufficient to recover performance?
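
For the structured vs. unstructured question, here is a minimal sketch of both APIs, assuming `model` is the loaded Gemma 2B (the 30% ratio is illustrative):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Unstructured: zero individual weights by L1 magnitude.
        # prune.l1_unstructured(module, name="weight", amount=0.3)

        # Structured: drop whole output rows by L2 norm (dim=0 = output
        # channels), which is the form hardware can actually exploit.
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        prune.remove(module, "weight")  # make permanent before fine-tuning
```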


Title: Pruning + Fine-Tuning Workflow for Gemma 2B
Body:
Hello,
I want to prune Gemma 2B and then fine-tune it on a small dataset.

  • Should I prune first and then fine-tune with full parameters, or is LoRA/PEFT tuning enough? (minimal LoRA setup sketched below)

  • Any advice on learning rate, batch size, or number of epochs for fine-tuning after pruning?

  • Has anyone successfully pruned + fine-tuned Gemma 2B? Any tips?
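
As a starting point, a minimal LoRA setup with the peft library; the rank, alpha, and target modules below are illustrative assumptions, not tuned recommendations:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")  # assumed id

# LoRA trains small low-rank adapters on top of the frozen (pruned) base
# model, which is far cheaper than full-parameter fine-tuning.
lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check how little is trainable
```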


:two: Quantization Questions

Title: CPU Inference for Gemma 2B After Pruning + Fine-Tuning
Body:
Hi,
I plan to perform pruning + LoRA fine-tuning + quantization for CPU inference on Gemma 2B.

  • Which quantization method works best for CPU: 8-bit or 4-bit? (benchmark sketch below)

  • Can I combine pruning + LoRA fine-tuning + 4-bit quantization without losing significant accuracy?

  • Does BitsAndBytes fully support CPU-only quantization for a 2B parameter model?
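
For the 8-bit side, a quick CPU benchmark sketch with PyTorch dynamic quantization (model id assumed; 4-bit on CPU generally means a GGUF backend such as llama.cpp rather than bitsandbytes):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
fp32 = AutoModelForCausalLM.from_pretrained(model_id)
int8 = torch.ao.quantization.quantize_dynamic(
    fp32, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tok("Hello, my name is", return_tensors="pt")

# Compare FP32 vs dynamic-INT8 generation latency on the same prompt.
for name, m in [("fp32", fp32), ("int8", int8)]:
    start = time.perf_counter()
    m.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(name, f"{time.perf_counter() - start:.1f}s")
```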


:three: Hardware / Workflow Questions

Title: Minimum GPU Requirements for Gemma 2B Workflow
Body:
Hello Hugging Face community,
I am planning a workflow for Gemma 2B:

  1. Download the model

  2. Prune ~20–30% weights

  3. Fine-tune with LoRA

  4. Quantize for CPU inference

  • What is the minimum GPU VRAM required for this workflow?

  • Can pruning be done entirely on CPU, or is GPU strongly recommended?

  • Are there any example scripts for pruning + LoRA fine-tuning + quantization for 2B+ LLMs? (rough skeleton below)
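
There is no single official script for this, but the workflow roughly composes the pieces above. A skeleton sketch; every hyperparameter shown is an assumption:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# 1. Download
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

# 2. Prune ~20-30% (plain tensor surgery, so CPU is fine for this step)
for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.25)
        prune.remove(m, "weight")

# 3. Fine-tune with LoRA (a GPU is strongly recommended for this step)
model = get_peft_model(
    model,
    LoraConfig(r=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
# ... run your Trainer / training loop here ...

# 4. Quantize for CPU inference, after merging the adapters:
# merged = model.merge_and_unload()
# then apply dynamic INT8 (see above) or convert to GGUF for llama.cpp.
```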


:light_bulb: Tips before posting:

  • Include your environment (PyTorch version, GPU/CPU, RAM); see the example snippet below

  • Include a code snippet if possible, e.g., how you’re pruning or fine-tuning
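
For example, an environment snippet like this captures the basics:

```python
import platform

import torch
import transformers

print("python        :", platform.python_version())
print("torch         :", torch.__version__)
print("transformers  :", transformers.__version__)
print("cuda available:", torch.cuda.is_available())
```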