Optimum library optimization and quantization fails

I used optimum and onnxruntime to optimize and quantize a RoBERTa SQuAD QA model, following the example from this blog.

The code used to work fine until this week, when it started to break. I suddenly get an error when quantizing the model which suggests that the quantizer cannot look up the right data type for the attention layer:
RuntimeError: Unable to find data type for weight_name='/roberta/encoder/layer.0/attention/output/dense/MatMul_output_0'

Here is a minimal code example to reproduce the error.

! pip -q install optimum[exporters,onnxruntime]

from pathlib import Path
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "deepset/roberta-base-squad2"
onnx_path = Path("onnx")
task = "question-answering"

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
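# note: newer optimum releases deprecate from_transformers=True in favor of export=True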
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(onnx_path)
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# create ORTQuantizer and define quantization configuration
quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
#quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)


# apply the quantization configuration to the model
quantizer.quantize(save_dir=onnx_path, quantization_config=qconfig)

Error message:

Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: True)
Quantizing model...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 11
      7 qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
     10 # apply the quantization configuration to the model
---> 11 quantizer.quantize(save_dir=onnx_path, quantization_config=qconfig)

File /opt/conda/lib/python3.11/site-packages/optimum/onnxruntime/quantization.py:417, in ORTQuantizer.quantize(self, quantization_config, save_dir, file_suffix, calibration_tensors_range, use_external_data_format, preprocessor)
    389     quantizer = quantizer_factory(
    390         model=onnx_model,
    391         static=quantization_config.is_static,
   (...)
    413         },
    414     )
    416 LOGGER.info("Quantizing model...")
--> 417 quantizer.quantize_model()
    419 suffix = f"_{file_suffix}" if file_suffix else ""
    420 quantized_model_path = save_dir.joinpath(f"{self.onnx_model_path.stem}{suffix}").with_suffix(".onnx")

File /opt/conda/lib/python3.11/site-packages/onnxruntime/quantization/onnx_quantizer.py:403, in ONNXQuantizer.quantize_model(self)
    401 number_of_existing_new_nodes = len(self.new_nodes)
    402 op_quantizer = CreateOpQuantizer(self, node)
--> 403 op_quantizer.quantize()
    404 for i in range(number_of_existing_new_nodes, len(self.new_nodes)):
    405     for output_name in self.new_nodes[i].output:

File /opt/conda/lib/python3.11/site-packages/onnxruntime/quantization/operators/matmul.py:78, in MatMulInteger.quantize(self)
     76 # Add cast operation to cast matmulInteger output to float.
     77 cast_op_output = matmul_integer_output + "_cast_output"
---> 78 otype = self.quantizer.get_tensor_type(node.output[0], mandatory=True)
     79 cast_node = onnx.helper.make_node(
     80     "Cast",
     81     [matmul_integer_output],
   (...)
     84     to=otype,
     85 )
     86 nodes.append(cast_node)

File /opt/conda/lib/python3.11/site-packages/onnxruntime/quantization/onnx_quantizer.py:461, in ONNXQuantizer.get_tensor_type(self, tensor_name, mandatory)
    459 if (not self.enable_subgraph_quantization) or (self.parent is None):
    460     if mandatory:
--> 461         raise RuntimeError(f"Unable to find data type for weight_name={tensor_name!r}")
    462     return None
    463 otype = self.parent.is_valid_quantize_weight(tensor_name)

RuntimeError: Unable to find data type for weight_name='/roberta/encoder/layer.0/attention/output/dense/MatMul_output_0'

The full notebook code is on github here: sutd-mlops-course-code/03_optimize_onnx.ipynb at main · ddahlmeier/sutd-mlops-course-code · GitHub

Quantizing the non-optimized model works.
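For reference, a minimal sketch of that path: skip the optimizer and quantize the plain export (model.onnx) directly, reusing the same qconfig as above. This completes without the error:

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# quantize the unoptimized export directly
quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
# with the default file suffix this writes onnx/model_quantized.onnx
quantizer.quantize(save_dir=onnx_path, quantization_config=qconfig)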

Forcing fp16 in the optimization step lets the quantizer run through without errors, but afterwards the model can no longer be loaded.

Changing

optimization_config = OptimizationConfig(optimization_level=99)

to

optimization_config = OptimizationConfig(optimization_level=99, fp16=True)

makes the optimization + quantization step work.

But when I then try to load the model again, it cannot be read:

model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model_optimized_quantized.onnx")

Error message:

The ONNX file model_optimized_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
[..]
InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from onnx/model_optimized_quantized.onnx failed:This is an invalid model. Type Error: Type 'tensor(float16)' of input parameter (roberta.embeddings.position_embeddings.weight_scale) of operator (DequantizeLinear) in node (/roberta/embeddings/position_embeddings/Gather_output_0_DequantizeLinear) is invalid.

Same problem here. Have you solved it?

No, I have not found a solution yet.

Downgrading onnxruntime to 1.16 seems to do the trick for me.
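For anyone who wants to try the same, pinning the runtime below 1.17 (i.e. to the latest 1.16.x release) should be enough:

! pip -q install "onnxruntime<1.17"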

Unfortunately, downgrading to 1.16 did not resolve the problem for me.

Downgrading worked on my end.

Somehow quantizing first and then optimizing works (with optimum 1.24).
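A minimal sketch of that reversed order, assuming the default output names (the quantizer writes model_quantized.onnx; the optimizer should then write a model_quantized_optimized.onnx alongside it):

from optimum.onnxruntime import ORTModelForQuestionAnswering, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

# step 1: quantize the plain export first (model.onnx -> model_quantized.onnx)
quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer.quantize(save_dir=onnx_path, quantization_config=qconfig)

# step 2: graph-optimize the quantized model
model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model_quantized.onnx")
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir=onnx_path, optimization_config=OptimizationConfig(optimization_level=99))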