I am trying to run this notebook locally: notebooks/examples/summarization.ipynb from the huggingface/notebooks repository on GitHub.
I am running on a Mac M2 with Python 3.14.
The only change I made was installing additional dependencies (pip install datasets transformers torch torchvision torchaudio accelerate).
During training, memory usage climbs steadily, like a memory leak. After some run time (around 1300 steps) I get an out-of-memory error for the device:
“RuntimeError: MPS backend out of memory (MPS allocated: 4.20 GiB, other allocations: 43.49 GiB, max allowed: 47.74 GiB). Tried to allocate 51.44 MiB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).”
I tried running torch.mps.empty_cache(), but it didn't help.
What could my issue be? How can I fix this memory problem so that I can train the model?
I don't want to set PYTORCH_MPS_HIGH_WATERMARK_RATIO to 0, since that looks like a hack rather than a fix for the underlying memory leak.
Thanks for your help!
There seems to be an issue related to MPS that looks like a memory leak:
MPS out-of-memory while running the Hugging Face summarization notebook on Mac M2
Direct diagnosis
This does not look like a simple “t5-small is too large” problem, and it also does not look like something torch.mps.empty_cache() is expected to fix.
The key clue is this part of the error:
RuntimeError: MPS backend out of memory
MPS allocated: 4.20 GiB
other allocations: 43.49 GiB
max allowed: 47.74 GiB
Tried to allocate 51.44 MiB on private pool.
Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations
The important part is the split:
MPS allocated: 4.20 GiB
other allocations: 43.49 GiB
That strongly suggests the model’s live PyTorch tensor memory is not the whole problem. The large number is in MPS/Metal-driver/backend-side allocations.
PyTorch has separate MPS memory counters:
- torch.mps.current_allocated_memory() reports the current GPU memory occupied by tensors; it does not include cached allocations in the MPSAllocator pools.
- torch.mps.driver_allocated_memory() reports the total GPU memory allocated by the Metal driver, including cached MPSAllocator pools and allocations from the MPS/MPSGraph frameworks.
- torch.mps.empty_cache() releases unoccupied cached memory held by the caching allocator; it does not promise to clear all MPSGraph, driver, or backend allocations.
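A quick way to see that split on your own run is to print both counters directly (these are real torch.mps functions; recommended_max_memory(), used later, additionally requires a recent PyTorch build):

import torch

if torch.backends.mps.is_available():
    # If the driver number keeps climbing while the tensor number stays flat,
    # the growth is backend-side rather than in your live tensors.
    print("tensors:", torch.mps.current_allocated_memory() / 1024**3, "GiB")
    print("driver :", torch.mps.driver_allocated_memory() / 1024**3, "GiB")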
So the likely diagnosis is:
A long, variable-shape, sequence-to-sequence Hugging Face Trainer run is causing MPS driver/backend memory to grow until it hits the MPS allocation limit.
That is different from ordinary model OOM. Ordinary model OOM usually means “the live tensors for this batch do not fit.” Your error looks more like “backend/driver allocations have grown over time.”
Why this notebook is a bad fit for a default Mac M2 MPS run
The Hugging Face summarization notebook is a teaching notebook, not a carefully constrained Apple Silicon training recipe.
It uses the XSum summarization task and t5-small. Relevant links:
- Hugging Face summarization notebook
- Hugging Face summarization task guide
- Hugging Face course: summarization
- T5-small model page
- XSum dataset page
The notebook’s default-style setup is roughly:
model_checkpoint = "t5-small"
raw_datasets = load_dataset("xsum")
metric = load("rouge")
max_input_length = 1024
max_target_length = 128
batch_size = 16
Seq2SeqTrainingArguments(
...,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
predict_with_generate=True,
fp16=True,
push_to_hub=True,
)
Those settings are heavy for local MPS training because summarization is an encoder-decoder sequence-to-sequence task:
- the input document can be long;
- the target summary is generated token by token;
- training stores encoder activations, decoder activations, gradients, optimizer state, attention intermediates, labels, and temporary tensors;
- evaluation with predict_with_generate=True uses generation, which is more memory-heavy than plain loss evaluation;
- ROUGE evaluation requires generated predictions and decoded text;
- dynamic padding creates changing tensor shapes from batch to batch.
Even though t5-small is small compared with modern LLMs, this workload is not small in the memory-behavior sense. A long-input seq2seq model with batch size 16 and source length up to 1024 is quite aggressive for MPS.
Why “around 1300 steps” matters
The fact that memory grows over time and fails after something like 1300 steps is important.
If the batch were simply too large, I would expect failure very early, often on the first few steps. A late failure suggests one of these:
- backend memory accumulation;
- allocator cache growth;
- shape-specific graph/kernel/resource accumulation;
- fragmentation;
- a real MPS backend leak;
- retained objects in a notebook process;
- evaluation/checkpointing side effects if the failure occurs near those events.
Your specific error strongly points to MPS backend/driver allocations because the other allocations number is enormous.
There are similar public reports:
- PyTorch issue: MPS memory leak in training with transformers Trainer. This is the closest match. It reports transformers Trainer on MPS hitting OOM after several hundred iterations, especially with varying data lengths. It also notes that MPS allocated memory appears unchanged while backend memory runs out.
- PyTorch issue: MPS memory leak with variable batch size / sequence length. This is relevant because summarization datasets naturally produce variable sequence lengths.
- Hugging Face Transformers issue: Trainer class causes massive memory leak when using MPS. This reports continuously growing process memory with Trainer on MPS.
- PyTorch forum: MPS backend OOM with small allocated memory and large other allocations. This has the same error shape: modest MPS allocated, huge other allocations.
- PyTorch issue: MPS LSTM loop leaks despite torch.mps.empty_cache(). This helps explain why empty_cache() does not solve this class of problem.
- PyTorch issue: driver_allocated_memory() grows unrestricted. This shows another MPS case where driver memory grows until an OOM with huge other allocations.
This is why I would treat your issue as likely MPS-backend-related, not just a notebook typo.
Why dynamic padding is suspicious
The notebook intentionally defers padding to the data collator. That is usually good practice because each batch is padded only to the longest example in that batch, not to the global maximum.
The downside is that every batch may have a different shape.
For example, step shapes may look conceptually like this:
step 1: input shape [16, 742], labels [16, 68]
step 2: input shape [16, 1018], labels [16, 114]
step 3: input shape [16, 523], labels [16, 51]
step 4: input shape [16, 895], labels [16, 103]
...
On CUDA, dynamic padding is usually a good memory/speed tradeoff. On MPS, public issues suggest that changing batch/sequence shapes can contribute to backend memory growth.
That makes dynamic padding one of the strongest suspects in your case.
The important tradeoff:
| Strategy | Benefit | Risk on MPS |
|---|---|---|
| Dynamic padding | Less padding compute per batch | Many distinct shapes |
| Fixed padding | Fewer distinct shapes | More padding tokens |
| Length bucketing | Fewer distinct shapes with less wasted padding | More setup |
For your specific issue, I would test fixed padding even though it is less elegant.
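If you later want the length-bucketing row from the table without writing a custom sampler, the Trainer has a built-in approximation: group_by_length batches examples of similar tokenized length. A minimal sketch, reusing the debug output directory from the diagnostic configuration below; whether this is enough to tame MPS shape churn is exactly what the experiments later in this answer should tell you:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-debug",
    per_device_train_batch_size=2,
    # Group samples of similar length into the same batch: fewer distinct
    # batch shapes and less wasted padding than fully dynamic batching.
    group_by_length=True,
)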
Why PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 is not the right fix
You are right to avoid this as the main answer.
The PyTorch MPS environment-variable docs describe:
- PYTORCH_MPS_HIGH_WATERMARK_RATIO is the hard allocation limit for the MPS allocator.
- Setting it to 0.0 disables the high-watermark limit.
- The docs warn that disabling the limit may cause system failure if system-wide OOM occurs.
- PYTORCH_MPS_LOW_WATERMARK_RATIO is the softer limit used for adaptive commit / garbage-collection behavior.
So this:
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
may postpone the error, but it does not fix the memory-growth slope. It removes the guardrail and can push your whole system into memory pressure or system OOM.
A safer diagnostic, not a real fix, is something like:
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=1.0
export PYTORCH_MPS_LOW_WATERMARK_RATIO=0.8
This may make failure happen earlier, but it can help show whether low-watermark cleanup behavior changes the memory curve. I would not treat it as the primary solution.
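If you prefer to set the watermark ratios from Python rather than the shell, set them before PyTorch initializes the MPS allocator; putting them above the torch import is the safe ordering. A minimal sketch:

import os

# Must be set before the MPS allocator is initialized, so keep these lines
# at the very top of the script, before importing torch.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "1.0"
os.environ["PYTORCH_MPS_LOW_WATERMARK_RATIO"] = "0.8"

import torch  # imported only after the environment variables are in place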
Why torch.mps.empty_cache() did not help
torch.mps.empty_cache() is not a general reset button.
It can release unoccupied cached memory held by the caching allocator, but it does not necessarily free:
- live tensors;
- retained Python references;
- MPSGraph framework allocations;
- driver allocations still considered active;
- shape-specific backend resources;
- command-buffer-related resources;
- actual backend leaks.
That matches the public MPS issues where memory growth continues even when empty_cache() is called repeatedly.
So this is not surprising:
torch.mps.empty_cache()
It may help in some allocator-cache situations, but it is not expected to fix a long-run MPS backend memory growth issue.
What I would do first
1. Instrument MPS memory correctly
Add a callback that logs both tensor memory and driver memory.
import torch
from transformers import TrainerCallback

def gb(x):
    return x / 1024**3

def print_mps_memory(tag=""):
    if torch.backends.mps.is_available():
        live = torch.mps.current_allocated_memory()
        driver = torch.mps.driver_allocated_memory()
        # recommended_max_memory() requires a fairly recent PyTorch build.
        recommended = torch.mps.recommended_max_memory()
        print(
            f"{tag} | "
            f"live_tensors={gb(live):.2f} GiB | "
            f"driver={gb(driver):.2f} GiB | "
            f"recommended={gb(recommended):.2f} GiB"
        )

class MPSMemoryCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % 50 == 0:
            print_mps_memory(f"step={state.global_step}")

    def on_evaluate(self, args, state, control, **kwargs):
        print_mps_memory(f"after_eval step={state.global_step}")
Then:
trainer.add_callback(MPSMemoryCallback())
Interpretation:
| Observation | Likely meaning |
|---|---|
| live_tensors grows steadily | real tensor retention, too-large graph, or Python reference retention |
| live_tensors stable but driver grows | MPS allocator / MPSGraph / Metal-driver growth |
| growth jumps after evaluation | generation / metrics / prediction accumulation |
| growth appears only after notebook reruns | stale notebook references |
| CPU stable but MPS grows | MPS-specific backend issue |
| fixed padding flattens driver growth | dynamic-shape churn is probably the trigger |
For your reported error, I would expect live tensor memory to remain much smaller than driver/backend memory.
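To fill in the "CPU stable but MPS grows" row, it also helps to log the process's resident memory next to the MPS counters. A small sketch, assuming psutil is installed (pip install psutil) and reusing the gb() helper above:

import psutil  # third-party: pip install psutil

def print_process_memory(tag=""):
    # Resident set size of this Python process. It moves with CPU-side leaks
    # and some driver allocations, independently of the torch.mps counters.
    rss = psutil.Process().memory_info().rss
    print(f"{tag} | process_rss={gb(rss):.2f} GiB")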
2. Start from a smaller, MPS-friendly training configuration
Do not start from the original notebook settings. Use a diagnostic configuration first:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-debug",
    # Lower per-step memory.
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    # Disable evaluation while diagnosing training memory.
    # (On older Transformers versions this argument is named evaluation_strategy.)
    eval_strategy="no",
    # Disable saving / pushing while diagnosing memory.
    save_strategy="no",
    push_to_hub=False,
    # Remove mixed precision as a variable.
    fp16=False,
    bf16=False,
    # Trade speed for lower activation memory.
    gradient_checkpointing=True,
    # Keep macOS data loading simple.
    dataloader_num_workers=0,
    dataloader_pin_memory=False,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_steps=50,
)
Also set:

model.config.use_cache = False

The decoder cache only helps generation; disabling it also avoids the warning Transformers emits when use_cache is combined with gradient checkpointing.
Why these settings:
- per_device_train_batch_size=1 greatly reduces per-step memory pressure.
- gradient_accumulation_steps=16 keeps the effective batch size near the original batch size 16.
- eval_strategy="no" answers the question: "Does training alone leak?"
- save_strategy="no" removes checkpointing as a confounder.
- push_to_hub=False removes git/upload behavior as a confounder.
- fp16=False removes mixed-precision ambiguity on MPS.
- gradient_checkpointing=True reduces activation memory by recomputing activations during backward.
- dataloader_num_workers=0 and dataloader_pin_memory=False simplify data loading on macOS.
3. Reduce sequence lengths first
Change:
max_input_length = 1024
max_target_length = 128
to:
max_input_length = 512
max_target_length = 64
This reduces:
- encoder activation memory;
- decoder activation memory;
- attention memory;
- temporary tensors;
- generation memory later;
- shape variety.
For a Mac M2 diagnostic run, 1024/128 is too aggressive as the first attempt.
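If you want evidence rather than a guess for the 512/64 choice, measure tokenized lengths on a small sample first. A sketch, assuming tokenizer, prefix, and raw_datasets from the notebook cells:

import numpy as np

sample = raw_datasets["train"].select(range(1000))
input_lens = [len(tokenizer(prefix + doc)["input_ids"]) for doc in sample["document"]]
target_lens = [len(tokenizer(text_target=s)["input_ids"]) for s in sample["summary"]]

# If p90 sits comfortably under 512/64, truncation loses little information.
print("inputs  p50/p90/p99:", np.percentile(input_lens, [50, 90, 99]))
print("targets p50/p90/p99:", np.percentile(target_lens, [50, 90, 99]))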
4. Test fixed padding
This is the most important diagnostic for your case.
Instead of dynamic padding, try fixed padding:
max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    # `prefix` is defined earlier in the notebook; for T5 checkpoints it is "summarize: ".
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        padding="max_length",
        truncation=True,
    )
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        padding="max_length",
        truncation=True,
    )
    # With fixed padding the labels must be masked by hand: DataCollatorForSeq2Seq
    # only substitutes -100 where it pads the labels itself, so pre-padded labels
    # would otherwise contribute pad tokens to the loss.
    model_inputs["labels"] = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs
Then rebuild the tokenized dataset:
tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    load_from_cache_file=False,
)
If fixed padding makes memory stable or much flatter, then your main trigger is probably dynamic shape churn on MPS.
If fixed padding does not help, the issue is more likely a broader MPS backend / Trainer / seq2seq loop memory-growth problem.
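You can also verify the shape stability directly before training by collating a few batches and printing their shapes. A sketch, assuming the tokenized_datasets and data_collator from the surrounding cells:

from torch.utils.data import DataLoader

# Keep only the model inputs; a manual DataLoader does not drop the raw
# text columns the way Trainer does.
check_ds = tokenized_datasets["train"].remove_columns(
    [c for c in tokenized_datasets["train"].column_names
     if c not in ("input_ids", "attention_mask", "labels")]
)

loader = DataLoader(check_ds, batch_size=4, collate_fn=data_collator)
for i, batch in enumerate(loader):
    # With padding="max_length", every line should print identical shapes.
    print(batch["input_ids"].shape, batch["labels"].shape)
    if i == 4:
        break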
5. Use a subset first
Do not debug on the full XSum training set.
small_train = tokenized_datasets["train"].select(range(10_000))
small_eval = tokenized_datasets["validation"].select(range(500))
Then:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    data_collator=data_collator,
    processing_class=tokenizer,
)
If your installed Transformers version does not accept processing_class, use:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
Run:
trainer.add_callback(MPSMemoryCallback())
trainer.train()
A complete first-pass MPS-safe training cell
This is the sort of configuration I would try first.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model.config.use_cache = False

if torch.backends.mps.is_available():
    model.to("mps")

# On recent datasets versions, XSum may need load_dataset("xsum", trust_remote_code=True).
raw_datasets = load_dataset("xsum")
prefix = "summarize: "  # the T5 task prefix used by the notebook

max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        padding="max_length",
        truncation=True,
    )
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        padding="max_length",
        truncation=True,
    )
    # Mask label padding so it is ignored by the loss (see step 4 above).
    model_inputs["labels"] = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs

tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    load_from_cache_file=False,
)

small_train = tokenized_datasets["train"].select(range(10_000))
small_eval = tokenized_datasets["validation"].select(range(500))

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-debug",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    eval_strategy="no",
    save_strategy="no",
    push_to_hub=False,
    fp16=False,
    bf16=False,
    dataloader_num_workers=0,
    dataloader_pin_memory=False,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    data_collator=data_collator,
    processing_class=tokenizer,
)

# MPSMemoryCallback is the logging callback defined in the instrumentation section.
trainer.add_callback(MPSMemoryCallback())
trainer.train()
If processing_class=tokenizer fails because of your Transformers version, replace it with:
tokenizer=tokenizer
Re-enable evaluation only after training is stable
Once training alone is stable, add evaluation carefully.
For summarization, evaluation is expensive because predict_with_generate=True runs generation. Hugging Face documents predict_with_generate as using generate() to calculate generative metrics such as ROUGE/BLEU.
Also, eval_accumulation_steps matters. The Trainer docs explain that if it is unset, predictions are accumulated on the accelerator before being moved to CPU, which is faster but uses more accelerator memory.
Use:
eval_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-eval",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=64,
    generation_num_beams=1,
    eval_accumulation_steps=1,
    fp16=False,
    bf16=False,
    save_strategy="no",
    push_to_hub=False,
)
Recommended eval strategy:
1. train with eval disabled
2. restart the Python process
3. load the trained model
4. evaluate on 100 to 500 validation examples
5. only then try larger validation runs
This avoids forcing a long training process with already-grown MPS driver memory to run generation-heavy evaluation afterward.
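A minimal sketch of steps 2 to 4, assuming the training phase wrote its weights somewhere loadable (a checkpoint-... subfolder created by save_strategy="epoch", or a directory written by trainer.save_model()), and that the tokenizer, eval dataset, collator, and eval_args cells were re-run in the fresh process:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainer

trained_dir = "t5-small-xsum-mps"  # or the checkpoint-... subfolder inside it
model = AutoModelForSeq2SeqLM.from_pretrained(trained_dir)

eval_trainer = Seq2SeqTrainer(
    model=model,
    args=eval_args,
    eval_dataset=small_eval.select(range(100)),  # start with ~100 examples
    data_collator=data_collator,
    tokenizer=tokenizer,  # processing_class=tokenizer on newer versions
)
print(eval_trainer.evaluate())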
Experiment matrix
Run these in order.
| Experiment | Padding | Batch | Lengths | Eval? | Purpose |
|---|---|---|---|---|---|
| A | dynamic | 1 | 512/64 | no | Does reduced training still grow memory? |
| B | fixed | 1 | 512/64 | no | Does shape stability fix the issue? |
| C | fixed | 2 | 512/64 | no | Can you safely increase speed? |
| D | fixed | 1 | 768/96 | no | Can you safely increase length? |
| E | fixed | 1 | 512/64 | tiny eval | Does generation/eval trigger memory jumps? |
| F | fixed | 1 | 512/64 | larger eval | How far can evaluation scale? |
Stop as soon as driver_allocated_memory() shows a steady upward slope.
What each result means
If fixed padding stabilizes memory
Then dynamic shape churn is probably the main trigger.
Use:
- fixed padding;
- shorter max lengths;
- batch size 1 or 2;
- gradient accumulation;
- separate train/eval processes;
- no fp16 until stable.
If fixed padding slows but does not stop memory growth
Then shape churn is one contributor, but there is probably broader MPS backend growth.
Use:
- shorter runs;
- restart process between phases;
- checkpoint only model weights;
- CPU or CUDA/cloud for full training;
- track PyTorch MPS issues.
If both dynamic and fixed padding leak similarly
Then this is likely a more general MPS backend / seq2seq / Trainer issue.
Try:
- the official script instead of the notebook;
- a no-Trainer loop (see the sketch after the script links below);
- CPU control run;
- newer or older PyTorch version as a test;
- cloud CUDA if you need the full notebook behavior.
Relevant scripts:
- Transformers summarization script
- Transformers no-Trainer summarization script
- Transformers example scripts guide
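For the no-Trainer control specifically, a bare manual loop is enough to see whether driver memory still climbs with no Trainer machinery involved. A sketch, assuming the model, small_train, data_collator, and print_mps_memory helper from the earlier cells:

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)
model.train()

# Drop raw text columns; Trainer normally does this automatically.
train_ds = small_train.remove_columns(
    [c for c in small_train.column_names
     if c not in ("input_ids", "attention_mask", "labels")]
)

loader = DataLoader(train_ds, batch_size=1, shuffle=True, collate_fn=data_collator)
optimizer = AdamW(model.parameters(), lr=2e-5)

for step, batch in enumerate(loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    if step % 50 == 0:
        print_mps_memory(f"no-trainer step={step}")
    if step >= 500:  # a few hundred steps is enough to see the slope
        break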
If CPU is stable but MPS leaks
Then the problem is almost certainly MPS-specific.
A CPU control run:
args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-cpu",
    use_cpu=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    eval_strategy="no",
    save_strategy="no",
    push_to_hub=False,
)
CPU will be slower, but it is useful as a control experiment.
If memory grows only in the notebook
Then notebook state is contributing.
Before rerunning:
import gc
import torch

try:
    del trainer
except NameError:
    pass

try:
    del model
except NameError:
    pass

gc.collect()
if torch.backends.mps.is_available():
    torch.mps.empty_cache()
But the stronger fix is to restart the kernel or run the training as a plain script:
python train_summarization_mps.py
Python 3.14: probably not the main cause, but simplify it
Python 3.14 is not necessarily the root cause. PyTorch publishes macOS wheels for recent Python versions, but very new interpreter releases tend to lag in wheel availability and real-world testing across the ML stack. For debugging I would use Python 3.11 or 3.12 first, because they are the most widely exercised across ML packages.
A cleaner environment:
python3.12 -m venv .venv-summarization-mps
source .venv-summarization-mps/bin/activate
python -m pip install -U pip
python -m pip install -U torch torchvision torchaudio
python -m pip install -U transformers datasets evaluate accelerate rouge-score nltk
Then print versions:
import sys
import torch
import transformers
import datasets
import accelerate
print("python:", sys.version)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("accelerate:", accelerate.__version__)
print("mps built:", torch.backends.mps.is_built())
print("mps available:", torch.backends.mps.is_available())
This does not prove Python 3.14 is bad. It just removes a variable while investigating a likely MPS backend issue.
What I would not do
Do not rely on torch.mps.empty_cache()
It is not a general leak fix.
Use it for cleanup, but do not expect it to solve driver/backend growth.
Do not set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 as the solution
That disables the hard limit and can risk system-wide OOM. It may postpone the crash, but it does not fix the underlying memory-growth pattern.
Do not start with full XSum + full ROUGE evaluation
Use subsets first.
Do not debug with fp16=True
Disable mixed precision first:
fp16=False
bf16=False
After the memory curve is understood, test mixed precision separately.
Do not assume t5-small means the workload is small
Parameter count is only one part of memory behavior. Seq2seq summarization with long inputs is memory-heavy even with a small model.
Most practical final local-MPS recipe
For actually finishing a local run on Mac M2, I would use something like this:
max_input_length = 512
max_target_length = 64

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    eval_strategy="no",
    save_strategy="epoch",
    save_only_model=True,
    push_to_hub=False,
    fp16=False,
    bf16=False,
    dataloader_num_workers=0,
    dataloader_pin_memory=False,
    logging_steps=50,
)
Then restart the Python process and evaluate separately:
eval_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-eval",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=64,
    generation_num_beams=1,
    eval_accumulation_steps=1,
    save_strategy="no",
    push_to_hub=False,
    fp16=False,
    bf16=False,
)
Best links to read
Official docs
- PyTorch MPS package docs
- PyTorch torch.mps.current_allocated_memory()
- PyTorch torch.mps.driver_allocated_memory()
- PyTorch torch.mps.empty_cache()
- PyTorch MPS environment variables
- Apple: Accelerated PyTorch training on Mac
- Hugging Face Trainer docs
- Hugging Face summarization task guide
- Hugging Face example scripts guide
Closest issues / reports
- PyTorch: MPS memory leak in training with transformers Trainer
- PyTorch: MPS memory leak with variable batch size / sequence length
- Transformers: Trainer class causes massive memory leak when using MPS
- PyTorch forum: MPS backend out of memory on Mac M2
- PyTorch: MPS LSTM leak despite empty_cache()
- PyTorch: driver_allocated_memory() grows unrestricted
- PyTorch: MPS memory leak minimal examples

Useful code references

- Hugging Face summarization notebook
- Transformers run_summarization.py
- Transformers run_summarization_no_trainer.py
Bottom line
Your issue is most likely:
MPS backend / Metal-driver memory growth during a long, variable-shape Hugging Face seq2seq training run.
The original notebook makes that likely because it combines:
- XSum summarization;
- t5-small;
- dynamic padding;
- max_input_length=1024;
- max_target_length=128;
- batch size 16;
- fp16=True;
- generation-based evaluation;
- checkpointing / Hub-push behavior;
- a long run on Apple Silicon MPS.
The best fix is not PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0.
The best practical workaround is:
- log both current_allocated_memory() and driver_allocated_memory();
- use batch size 1 or 2;
- use gradient accumulation;
- reduce lengths to 512/64 first;
- disable eval/generation during the first training test;
- disable fp16 while debugging;
- use gradient checkpointing;
- disable save/push during diagnosis;
- test fixed padding against dynamic padding;
- use a fresh Python process for long runs;
- evaluate separately after restarting;
- use CPU or CUDA/cloud if the full long MPS run still leaks.
Short version:
- MPS allocated: 4.20 GiB means live tensor memory is not enormous.
- other allocations: 43.49 GiB points to backend/driver allocations.
- empty_cache() is not expected to fix this.
- high-watermark 0.0 only removes a safety guardrail.
- dynamic-shape seq2seq training on MPS is the main suspect.