A Coding Implementation of End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

September 24, 2025

Accelerating Transformer Models with Hugging Face Optimum and ONNX Runtime

This guide demonstrates how to speed up Transformer-based models with Hugging Face’s Optimum library while keeping accuracy close to the baseline. We start from a DistilBERT checkpoint that is already fine-tuned on the SST-2 sentiment analysis dataset, then compare several execution backends: native PyTorch, PyTorch with torch.compile, ONNX Runtime, and quantized ONNX Runtime. The entire process runs in a Google Colab environment and covers model export, optimization, quantization, and benchmarking.

Setting Up the Environment and Dependencies

First, we install the necessary Python packages and configure the environment to use Hugging Face Optimum with ONNX Runtime. We define key parameters such as model paths, batch size, maximum sequence length, and device selection (GPU if available, otherwise CPU). We also pin OMP_NUM_THREADS and MKL_NUM_THREADS to 1 so that CPU timings stay comparable between runs.

!pip install -q "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate

import os

# Pin math-library threads before importing numpy/torch so the setting is picked up
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import time
import numpy as np
import torch
from pathlib import Path
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig

MODELID = "distilbert-base-uncased-finetuned-sst-2-english"
ONNXDIR = Path("onnx-distilbert")
QUANTDIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCHSIZE = 16
MAXSEQLEN = 128
WARMUPSTEPS = 3
BENCHMARKITERS = 8

print(f"Running on device: {DEVICE} | PyTorch version: {torch.__version__}")

Data Preparation and Utility Functions

We load a subset of the SST-2 validation dataset for evaluation and initialize the tokenizer and accuracy metric. Helper functions are created to batch the input texts, evaluate model accuracy, and benchmark inference latency. These utilities enable consistent and fair comparisons across different model execution strategies.

dataset = load_dataset("glue", "sst2", split="validation[:20%]")
sentences, labels = list(dataset["sentence"]), list(dataset["label"])
accuracymetric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODELID)

def batchtokenize(texts, maxlength=MAXSEQLEN, batchsize=BATCHSIZE):
    # Yield one padded, truncated batch of PyTorch tensors at a time
    for i in range(0, len(texts), batchsize):
        yield tokenizer(texts[i:i+batchsize], padding=True, truncation=True,
                        max_length=maxlength, return_tensors="pt")

def evaluateaccuracy(predictfn, texts, labels):
    predictions = []
    for batch in batchtokenize(texts):
        predictions.extend(predictfn(batch))
    return accuracymetric.compute(predictions=predictions, references=labels)["accuracy"]

def benchmarkinference(predictfn, texts, warmup=WARMUPSTEPS, iterations=BENCHMARKITERS):
    # Warm-up passes on two batches' worth of text; these are not timed
    for _ in range(warmup):
        for batch in batchtokenize(texts[:BATCHSIZE * 2]):
            predictfn(batch)
    # Each timed iteration is one full pass over all texts, in milliseconds
    timings = []
    for _ in range(iterations):
        start = time.time()
        for batch in batchtokenize(texts):
            predictfn(batch)
        elapsedms = (time.time() - start) * 1000
        timings.append(elapsedms)
    return float(np.mean(timings)), float(np.std(timings))
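
As a complement to the timing helper above, the following sketch shows the same warm-up-then-measure pattern using time.perf_counter(), a monotonic clock intended for measuring short intervals. The name benchmark_callable and the toy workload are hypothetical stand-ins rather than part of the original notebook. On a GPU, asynchronous kernel launches would additionally require a device synchronization before reading the clock.

```python
import statistics
import time


def benchmark_callable(fn, warmup=3, iterations=8):
    """Run fn() `warmup` times untimed, then time `iterations` runs.

    Returns (mean_ms, std_ms, timings_ms). Hypothetical helper for illustration.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are discarded
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # pstdev is the population standard deviation, matching np.std's default
    return statistics.mean(timings_ms), statistics.pstdev(timings_ms), timings_ms


def toy_workload():
    """Stand-in for a predict function: a small CPU-bound loop."""
    return sum(i * i for i in range(50_000))


mean_ms, std_ms, timings = benchmark_callable(toy_workload)
print(f"toy workload: {mean_ms:.2f}±{std_ms:.2f} ms over {len(timings)} runs")
```

Returning the raw timings alongside the summary makes it easy to spot an outlier iteration that would otherwise be hidden in the mean.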

Baseline PyTorch Model Evaluation and JIT Compilation

We load the DistilBERT model fine-tuned on SST-2 and define a prediction function that runs inference on tokenized batches. The model is benchmarked for latency and accuracy. Next, we attempt to optimize the model using PyTorch’s torch.compile feature, which applies just-in-time graph optimizations to reduce overhead. If successful, we benchmark the compiled model to compare performance gains.

pytorchmodel = AutoModelForSequenceClassification.from_pretrained(MODELID).to(DEVICE).eval()

@torch.no_grad()
def pytorchpredict(batch):
    # Move the tokenized tensors to the model's device and unpack them as keyword arguments
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    outputs = pytorchmodel(**batch).logits
    return outputs.argmax(dim=-1).cpu().tolist()

ptmean, ptstd = benchmarkinference(pytorchpredict, sentences)
ptaccuracy = evaluateaccuracy(pytorchpredict, sentences, labels)
print(f"[PyTorch eager] Latency: {ptmean:.1f}±{ptstd:.1f} ms | Accuracy: {ptaccuracy:.4f}")

compiledmodel = pytorchmodel
compilesuccess = False
try:
    # torch.compile is lazy: the actual compilation happens on the first forward pass
    compiledmodel = torch.compile(pytorchmodel, mode="reduce-overhead", fullgraph=False)
    compilesuccess = True
except Exception as e:
    print(f"torch.compile not available or failed: {e}")

@torch.no_grad()
def compiledpredict(batch):
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    outputs = compiledmodel(**batch).logits
    return outputs.argmax(dim=-1).cpu().tolist()

if compilesuccess:
    cmean, cstd = benchmarkinference(compiledpredict, sentences)
    caccuracy = evaluateaccuracy(compiledpredict, sentences, labels)
    print(f"[torch.compile] Latency: {cmean:.1f}±{cstd:.1f} ms | Accuracy: {caccuracy:.4f}")
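
Because torch.compile defers compilation until the model is first called, the first invocation is typically much slower than later ones, which is why the untimed warm-up passes matter for a fair comparison. The sketch below illustrates the effect with a pure-Python stand-in that pays a simulated one-time cost. LazyCompiledFn and first_vs_steady_ms are hypothetical names, and the code does not call torch.compile itself.

```python
import time


class LazyCompiledFn:
    """Hypothetical stand-in for a lazily compiled model.

    The first call pays a simulated one-time setup cost (time.sleep);
    later calls skip it.
    """

    def __init__(self, setup_seconds=0.05):
        self.setup_seconds = setup_seconds
        self.ready = False

    def __call__(self, x):
        if not self.ready:
            time.sleep(self.setup_seconds)  # simulated compilation
            self.ready = True
        return x * 2


def first_vs_steady_ms(fn, arg, steady_iterations=5):
    """Return (first_call_ms, mean_steady_ms) for fn(arg)."""
    start = time.perf_counter()
    fn(arg)
    first_ms = (time.perf_counter() - start) * 1000.0
    steady = []
    for _ in range(steady_iterations):
        start = time.perf_counter()
        fn(arg)
        steady.append((time.perf_counter() - start) * 1000.0)
    return first_ms, sum(steady) / len(steady)


first_ms, steady_ms = first_vs_steady_ms(LazyCompiledFn(), 21)
print(f"first call: {first_ms:.1f} ms | steady state: {steady_ms:.3f} ms")
```

Reporting the first-call cost separately from the steady-state latency keeps the one-time overhead visible without letting it distort the benchmark mean.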

ONNX Runtime Integration and Dynamic Quantization

To further accelerate inference, we export the model to ONNX format and run it with ONNX Runtime, which often executes faster than eager PyTorch, especially on CPU. We then apply dynamic quantization using Optimum’s ORTQuantizer, which stores weights as 8-bit integers to reduce model size and latency, usually at a small cost in accuracy. Both the standard and quantized ONNX models are benchmarked and evaluated.

executionprovider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
onnxmodel = ORTModelForSequenceClassification.from_pretrained(
    MODELID, export=True, provider=executionprovider, cache_dir=ONNXDIR
)

@torch.no_grad()
def onnxpredict(batch):
    batch = {k: v.cpu() for k, v in batch.items()}
    outputs = onnxmodel(**batch).logits
    return outputs.argmax(dim=-1).cpu().tolist()

onnxmean, onnxstd = benchmarkinference(onnxpredict, sentences)
onnxaccuracy = evaluateaccuracy(onnxpredict, sentences, labels)
print(f"[ONNX Runtime] Latency: {onnxmean:.1f}±{onnxstd:.1f} ms | Accuracy: {onnxaccuracy:.4f}")

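from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Save the exported ONNX model so the quantizer can find model.onnx in ONNXDIR
onnxmodel.save_pretrained(ONNXDIR)

QUANTDIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ONNXDIR)
# Dynamic quantization (is_static=False) needs no calibration data.
# The avx2 preset is an assumption about the target CPU; pick the preset that matches your hardware.
quantconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False, reduce_range=True)
quantizer.quantize(save_dir=QUANTDIR, quantization_config=quantconfig)

# quantize() writes the model as model_quantized.onnx by default
quantizedonnxmodel = ORTModelForSequenceClassification.from_pretrained(
    QUANTDIR, file_name="model_quantized.onnx", provider=executionprovider
)

@torch.no_grad()
def quantizedpredict(batch):
    batch = {k: v.cpu() for k, v in batch.items()}
    outputs = quantizedonnxmodel(**batch).logits
    return outputs.argmax(dim=-1).cpu().tolist()

quantmean, quantstd = benchmarkinference(quantizedpredict, sentences)
quantaccuracy = evaluateaccuracy(quantizedpredict, sentences, labels)
print(f"[Quantized ONNX] Latency: {quantmean:.1f}±{quantstd:.1f} ms | Accuracy: {quantaccuracy:.4f}")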
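
The size reduction from quantization can be checked directly on disk. The sketch below sums the .onnx files in a directory and reports the fraction saved. The helper names onnx_size_mb and size_reduction are hypothetical, and the demo uses dummy files of known size so that it runs without a real model export.

```python
import tempfile
from pathlib import Path


def onnx_size_mb(directory):
    """Sum the sizes of all *.onnx files under `directory`, in MiB."""
    total_bytes = sum(p.stat().st_size for p in Path(directory).rglob("*.onnx"))
    return total_bytes / (1024 * 1024)


def size_reduction(original_mb, quantized_mb):
    """Fraction of the original size saved by quantization (0.0 to 1.0)."""
    if original_mb <= 0:
        raise ValueError("original_mb must be positive")
    return 1.0 - quantized_mb / original_mb


# Demo with dummy files; the sizes are placeholders, not measurements.
with tempfile.TemporaryDirectory() as tmp:
    fp32_dir, int8_dir = Path(tmp, "fp32"), Path(tmp, "int8")
    fp32_dir.mkdir()
    int8_dir.mkdir()
    (fp32_dir / "model.onnx").write_bytes(b"\0" * 4 * 1024 * 1024)
    (int8_dir / "model_quantized.onnx").write_bytes(b"\0" * 1024 * 1024)
    fp32_mb, int8_mb = onnx_size_mb(fp32_dir), onnx_size_mb(int8_dir)
    print(f"fp32: {fp32_mb:.1f} MiB | int8: {int8_mb:.1f} MiB | "
          f"saved: {size_reduction(fp32_mb, int8_mb):.0%}")
```

In the notebook, the same function could be pointed at the export and quantization directories once both contain saved models.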

Comparing Predictions and Summarizing Results

To verify consistency, we run sentiment analysis on sample sentences using both PyTorch and ONNX Runtime pipelines, comparing their outputs side by side. Finally, we compile a summary table that contrasts latency and accuracy metrics across all tested execution engines, including the compiled PyTorch model if available.

ptpipeline = pipeline("sentiment-analysis", model=pytorchmodel, tokenizer=tokenizer,
                       device=0 if DEVICE == "cuda" else -1)
onnxpipeline = pipeline("sentiment-analysis", model=onnxmodel, tokenizer=tokenizer, device=-1)

testsamples = [
    "An outstanding film with superb acting!",
    "Totally disappointing and boring.",
    "I have mixed feelings about this one."
]

print("Sample predictions (PyTorch | ONNX Runtime):")
for text in testsamples:
    ptlabel = ptpipeline(text)[0]["label"]
    onnxlabel = onnxpipeline(text)[0]["label"]
    print(f"- {text}\n  PyTorch: {ptlabel} | ONNX: {onnxlabel}")

import pandas as pd

results = [
    ["PyTorch eager", ptmean, ptstd, ptaccuracy],
    ["ONNX Runtime", onnxmean, onnxstd, onnxaccuracy],
    ["Quantized ONNX", quantmean, quantstd, quantaccuracy]
]

if compilesuccess:
    results.insert(1, ["torch.compile", cmean, cstd, caccuracy])

df = pd.DataFrame(results, columns=["Engine", "Mean Latency (ms ↓)", "Std Dev (ms)", "Accuracy"])
display(df)
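
Two derived numbers make the comparison easier to read: how often two backends agree on their predictions, and how much faster each engine is than the baseline. The sketch below computes both in plain Python. The function names are hypothetical, and the latencies and labels in the demo are made-up placeholders, not measured results.

```python
def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two prediction lists are identical."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must not be empty")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


def relative_speedups(latencies_ms, baseline):
    """Map each engine name to baseline_latency / engine_latency."""
    base = latencies_ms[baseline]
    return {name: base / ms for name, ms in latencies_ms.items()}


# Placeholder values purely for illustration.
example_latencies = {"PyTorch eager": 100.0, "ONNX Runtime": 50.0, "Quantized ONNX": 25.0}
for engine, factor in relative_speedups(example_latencies, "PyTorch eager").items():
    print(f"{engine}: {factor:.2f}x")

rate = agreement_rate(["POSITIVE", "NEGATIVE", "POSITIVE"],
                      ["POSITIVE", "NEGATIVE", "NEGATIVE"])
print(f"agreement: {rate:.2%}")
```

A speedup factor above 1.0 means the engine is faster than the baseline. Agreement below 100% between the PyTorch and quantized models is expected on borderline inputs and is worth inspecting alongside the accuracy column.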

Additional Insights and Best Practices

Deprecation Notice: The BetterTransformer optimization is deprecated in transformers version 4.49 and above, so it is excluded from this workflow.
GPU Optimization: For further acceleration on GPUs, consider leveraging FlashAttention2 or FP8 precision with TensorRT-LLM.
CPU Tuning: Adjust thread settings via OMP_NUM_THREADS and MKL_NUM_THREADS, and explore NUMA pinning for improved CPU performance.
Static Quantization: To apply calibrated static quantization, build the quantization config with is_static=True (for example through AutoQuantizationConfig) and supply a representative calibration dataset.
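
For the CPU tuning point above, a minimal sketch of setting the thread variables programmatically follows. The helper configure_cpu_threads is hypothetical, and using half of the logical cores is only an arbitrary starting point for experimentation, not a recommendation from the original notebook. These variables are generally read when the numerical libraries initialize, so the call should happen before importing numpy or torch.

```python
import os


def configure_cpu_threads(num_threads):
    """Set OMP_NUM_THREADS and MKL_NUM_THREADS; return the values applied.

    Hypothetical helper: call it before importing numpy or torch.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    value = str(num_threads)
    applied = {}
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[name] = value
        applied[name] = value
    return applied


# Example: start from half of the logical cores reported by the OS (at least one).
logical_cores = os.cpu_count() or 1
print(configure_cpu_threads(max(1, logical_cores // 2)))
```

Rerunning the benchmark with a few different thread counts is the simplest way to find the setting that suits a given machine.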

Conclusion

This tutorial shows how Hugging Face Optimum bridges the gap between research-grade PyTorch models and production-ready, optimized deployments. Exporting to ONNX Runtime and applying dynamic quantization can reduce inference latency, and the summary table shows how much accuracy each engine retains on the SST-2 validation subset. PyTorch’s torch.compile offers a further optimization path within the native framework. Together, these steps provide a solid foundation for deploying Transformer models efficiently, with possible extensions to other backends such as OpenVINO or TensorRT.
