Running out of memory in the summary example

I am trying to run this notebook locally: notebooks/examples/summarization.ipynb at main · huggingface/notebooks · GitHub
I am running on a Mac M2 with Python 3.14.
The only change I made was installing additional dependencies (pip install datasets transformers torch torchvision torchaudio accelerate).
During training, memory usage climbs steadily, like a memory leak. After roughly 1300 steps I get a device out-of-memory error:
“RuntimeError: MPS backend out of memory (MPS allocated: 4.20 GiB, other allocations: 43.49 GiB, max allowed: 47.74 GiB). Tried to allocate 51.44 MiB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).”
I tried running torch.mps.empty_cache(), but it didn't change anything.
What could the issue be? How can I fix it so I can train the model?
I don't want to set PYTORCH_MPS_HIGH_WATERMARK_RATIO to 0, since that seems like just a hack; I want to fix the underlying memory leak.
Thanks for your help!

There seems to be a known issue related to MPS that looks like a memory leak.


MPS out-of-memory while running the Hugging Face summarization notebook on Mac M2

Direct diagnosis

This does not look like a simple “t5-small is too large” problem, and it also does not look like something torch.mps.empty_cache() is expected to fix.

The key clue is this part of the error:

RuntimeError: MPS backend out of memory
MPS allocated: 4.20 GiB
other allocations: 43.49 GiB
max allowed: 47.74 GiB
Tried to allocate 51.44 MiB on private pool.
Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations

The important part is the split:

MPS allocated: 4.20 GiB
other allocations: 43.49 GiB

That strongly suggests the model’s live PyTorch tensor memory is not the whole problem. The large number is in MPS/Metal-driver/backend-side allocations.

PyTorch has separate MPS memory counters:

  • torch.mps.current_allocated_memory() reports current GPU memory occupied by tensors and does not include cached allocations in MPSAllocator pools.
  • torch.mps.driver_allocated_memory() reports total GPU memory allocated by the Metal driver and includes cached MPSAllocator pools plus allocations from MPS/MPSGraph frameworks.
  • torch.mps.empty_cache() releases unoccupied cached memory held by the caching allocator; it does not promise to clear all MPSGraph, driver, or backend allocations.

So the likely diagnosis is:

A long, variable-shape, sequence-to-sequence Hugging Face Trainer run is causing MPS driver/backend memory to grow until it hits the MPS allocation limit.

That is different from ordinary model OOM. Ordinary model OOM usually means “the live tensors for this batch do not fit.” Your error looks more like “backend/driver allocations have grown over time.”


Why this notebook is a bad fit for a default Mac M2 MPS run

The Hugging Face summarization notebook is a teaching notebook, not a carefully constrained Apple Silicon training recipe.

It uses the XSum summarization task and t5-small.

The notebook’s default-style setup is roughly:

model_checkpoint = "t5-small"
raw_datasets = load_dataset("xsum")
metric = load("rouge")

max_input_length = 1024
max_target_length = 128
batch_size = 16

Seq2SeqTrainingArguments(
    ...,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

Those settings are heavy for local MPS training because summarization is an encoder-decoder sequence-to-sequence task:

  • the input document can be long;
  • the target summary is generated token by token;
  • training stores encoder activations, decoder activations, gradients, optimizer state, attention intermediates, labels, and temporary tensors;
  • evaluation with predict_with_generate=True uses generation, which is more memory-heavy than plain loss evaluation;
  • ROUGE evaluation requires generated predictions and decoded text;
  • dynamic padding creates changing tensor shapes from batch to batch.

Even though t5-small is small compared with modern LLMs, this workload is not small in the memory-behavior sense. A long-input seq2seq model with batch size 16 and source length up to 1024 is quite aggressive for MPS.
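A rough back-of-envelope estimate makes this concrete. The sketch below assumes t5-small's published dimensions (d_model=512, 6 encoder plus 6 decoder layers, 8 heads) and a hypothetical fudge factor of 8 hidden-state-sized tensors kept per layer for backward; it deliberately undercounts autograd temporaries, so real usage is higher.

```python
# Rough, illustrative per-step activation estimate for t5-small in fp32.
# The tensors_per_layer=8 factor is a guess, not a measured value.

def rough_activation_gib(batch, src_len, tgt_len,
                         d_model=512, layers=6, heads=8,
                         bytes_per=4, tensors_per_layer=8):
    # Hidden-state-sized tensors retained for backward, encoder and decoder.
    enc = batch * src_len * d_model * layers * tensors_per_layer
    dec = batch * tgt_len * d_model * layers * tensors_per_layer
    # Attention score matrices: [batch, heads, q_len, k_len] per layer
    # (encoder self-attn, decoder self-attn, cross-attn).
    attn = batch * heads * (src_len * src_len + tgt_len * tgt_len
                            + tgt_len * src_len) * layers
    return (enc + dec + attn) * bytes_per / 1024**3

# Notebook defaults versus a reduced diagnostic configuration.
print(f"batch=16, 1024/128: ~{rough_activation_gib(16, 1024, 128):.1f} GiB")
print(f"batch=1,   512/64:  ~{rough_activation_gib(1, 512, 64):.2f} GiB")
```

Even this undercount lands in the multi-GiB range per step for the notebook defaults, which is why the reduced configuration later in this answer drops both batch size and sequence lengths.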


Why “around 1300 steps” matters

The fact that memory grows over time and fails after something like 1300 steps is important.

If the batch were simply too large, I would expect failure very early, often on the first few steps. A late failure suggests one of these:

  1. backend memory accumulation;
  2. allocator cache growth;
  3. shape-specific graph/kernel/resource accumulation;
  4. fragmentation;
  5. a real MPS backend leak;
  6. retained objects in a notebook process;
  7. evaluation/checkpointing side effects if the failure occurs near those events.

Your specific error strongly points to MPS backend/driver allocations because the other allocations number is enormous.

There are similar public reports of this kind of MPS memory growth.

This is why I would treat your issue as likely MPS-backend-related, not just a notebook typo.


Why dynamic padding is suspicious

The notebook intentionally defers padding to the data collator. That is usually good practice because each batch is padded only to the longest example in that batch, not to the global maximum.

The downside is that every batch may have a different shape.

For example, step shapes may look conceptually like this:

step 1: input shape [16, 742], labels [16, 68]
step 2: input shape [16, 1018], labels [16, 114]
step 3: input shape [16, 523], labels [16, 51]
step 4: input shape [16, 895], labels [16, 103]
...

On CUDA, dynamic padding is usually a good memory/speed tradeoff. On MPS, public issues suggest that changing batch/sequence shapes can contribute to backend memory growth.

That makes dynamic padding one of the strongest suspects in your case.
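A quick, self-contained way to see the shape churn, without loading any tokenizer, is to simulate batching over synthetic example lengths: dynamic padding pads each batch to its own maximum, while fixed padding always pads to max_length.

```python
import random

random.seed(0)

# Synthetic token counts standing in for tokenized XSum documents.
lengths = [random.randint(50, 1024) for _ in range(1600)]
batch_size, max_length = 16, 1024

def batch_shapes(lengths, batch_size, dynamic):
    """Collect the distinct (batch, padded_length) shapes the model sees."""
    shapes = set()
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        padded = max(batch) if dynamic else max_length
        shapes.add((len(batch), padded))
    return shapes

dynamic_shapes = batch_shapes(lengths, batch_size, dynamic=True)
fixed_shapes = batch_shapes(lengths, batch_size, dynamic=False)

print("distinct shapes, dynamic padding:", len(dynamic_shapes))
print("distinct shapes, fixed padding:  ", len(fixed_shapes))
```

With dynamic padding almost every batch can introduce a new shape; fixed padding produces exactly one. If shape-specific backend resources accumulate on MPS, that difference matters over thousands of steps.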

The important tradeoff:

Strategy         | Benefit                                    | Risk on MPS
Dynamic padding  | Less padding compute per batch             | Many distinct shapes
Fixed padding    | Fewer distinct shapes                      | More padding tokens
Length bucketing | Fewer distinct shapes, less wasted padding | More setup

For your specific issue, I would test fixed padding even though it is less elegant.


Why PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 is not the right fix

You are right to avoid this as the main answer.

The PyTorch MPS environment-variable docs describe:

  • PYTORCH_MPS_HIGH_WATERMARK_RATIO as the hard allocation limit for the MPS allocator.
  • Setting it to 0.0 disables the high-watermark limit.
  • The docs warn that disabling the limit may cause system failure if system-wide OOM occurs.
  • PYTORCH_MPS_LOW_WATERMARK_RATIO is the softer limit used for adaptive commit / garbage-collection behavior.

So this:

export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0

may postpone the error, but it does not fix the memory-growth slope. It removes the guardrail and can push your whole system into memory pressure or system OOM.

A safer diagnostic, not a real fix, is something like:

export PYTORCH_MPS_HIGH_WATERMARK_RATIO=1.0
export PYTORCH_MPS_LOW_WATERMARK_RATIO=0.8

This may make failure happen earlier, but it can help show whether low-watermark cleanup behavior changes the memory curve. I would not treat it as the primary solution.


Why torch.mps.empty_cache() did not help

torch.mps.empty_cache() is not a general reset button.

It can release unoccupied cached memory held by the caching allocator, but it does not necessarily free:

  • live tensors;
  • retained Python references;
  • MPSGraph framework allocations;
  • driver allocations still considered active;
  • shape-specific backend resources;
  • command-buffer-related resources;
  • actual backend leaks.

That matches the public MPS issues where memory growth continues even when empty_cache() is called repeatedly.

So this is not surprising:

torch.mps.empty_cache()

It may help in some allocator-cache situations, but it is not expected to fix a long-run MPS backend memory growth issue.


What I would do first

1. Instrument MPS memory correctly

Add a callback that logs both tensor memory and driver memory.

import torch
from transformers import TrainerCallback

def gb(x):
    return x / 1024**3

def print_mps_memory(tag=""):
    if torch.backends.mps.is_available():
        live = torch.mps.current_allocated_memory()
        driver = torch.mps.driver_allocated_memory()
        recommended = torch.mps.recommended_max_memory()
        print(
            f"{tag} | "
            f"live_tensors={gb(live):.2f} GiB | "
            f"driver={gb(driver):.2f} GiB | "
            f"recommended={gb(recommended):.2f} GiB"
        )

class MPSMemoryCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % 50 == 0:
            print_mps_memory(f"step={state.global_step}")

    def on_evaluate(self, args, state, control, **kwargs):
        print_mps_memory(f"after_eval step={state.global_step}")

Then:

trainer.add_callback(MPSMemoryCallback())

Interpretation:

Observation                               | Likely meaning
live_tensors grows steadily               | real tensor retention, too-large graph, or Python reference retention
live_tensors stable but driver grows      | MPS allocator / MPSGraph / Metal-driver growth
growth jumps after evaluation             | generation / metrics / prediction accumulation
growth appears only after notebook reruns | stale notebook references
CPU stable but MPS grows                  | MPS-specific backend issue
fixed padding flattens driver growth      | dynamic-shape churn is probably the trigger

For your reported error, I would expect live tensor memory to remain much smaller than driver/backend memory.


2. Start from a smaller, MPS-friendly training configuration

Do not start from the original notebook settings. Use a diagnostic configuration first:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-debug",

    # Lower per-step memory.
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,

    # Disable evaluation while diagnosing training memory.
    eval_strategy="no",

    # Disable saving / pushing while diagnosing memory.
    save_strategy="no",
    push_to_hub=False,

    # Remove mixed precision as a variable.
    fp16=False,
    bf16=False,

    # Trade speed for lower activation memory.
    gradient_checkpointing=True,

    # Keep macOS data loading simple.
    dataloader_num_workers=0,
    dataloader_pin_memory=False,

    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_steps=50,
)

Also set:

model.config.use_cache = False

Why these settings:

  • per_device_train_batch_size=1 greatly reduces per-step memory pressure.
  • gradient_accumulation_steps=16 keeps the effective batch size near the original batch size 16.
  • eval_strategy="no" answers the question: “Does training alone leak?”
  • save_strategy="no" removes checkpointing as a confounder.
  • push_to_hub=False removes git/upload behavior as a confounder.
  • fp16=False removes mixed-precision ambiguity on MPS.
  • gradient_checkpointing=True reduces activation memory by recomputing activations during backward.
  • dataloader_num_workers=0 and dataloader_pin_memory=False simplify data loading on macOS.



3. Reduce sequence lengths first

Change:

max_input_length = 1024
max_target_length = 128

to:

max_input_length = 512
max_target_length = 64

This reduces:

  • encoder activation memory;
  • decoder activation memory;
  • attention memory;
  • temporary tensors;
  • generation memory later;
  • shape variety.

For a Mac M2 diagnostic run, 1024/128 is too aggressive as the first attempt.
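Self-attention score matrices scale with the square of sequence length, so halving the input length cuts the encoder attention matrices by roughly a factor of four. A quick sanity check of that arithmetic:

```python
def attn_elements(seq_len, heads=8, batch=16):
    # One encoder self-attention score matrix: [batch, heads, seq, seq].
    return batch * heads * seq_len * seq_len

ratio = attn_elements(1024) / attn_elements(512)
print(f"1024 vs 512 tokens: {ratio:.0f}x more attention-score elements")  # 4x
```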


4. Test fixed padding

This is the most important diagnostic for your case.

Instead of dynamic padding, try fixed padding:

max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    # `prefix` and `tokenizer` come from the notebook's earlier cells.
    inputs = [prefix + doc for doc in examples["document"]]

    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        padding="max_length",
        truncation=True,
    )

    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        padding="max_length",
        truncation=True,
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Then rebuild the tokenized dataset:

tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    load_from_cache_file=False,
)

If fixed padding makes memory stable or much flatter, then your main trigger is probably dynamic shape churn on MPS.

If fixed padding does not help, the issue is more likely a broader MPS backend / Trainer / seq2seq loop memory-growth problem.


5. Use a subset first

Do not debug on the full XSum training set.

small_train = tokenized_datasets["train"].select(range(10_000))
small_eval = tokenized_datasets["validation"].select(range(500))

Then:

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    data_collator=data_collator,
    processing_class=tokenizer,
)

If your installed Transformers version does not accept processing_class, use:

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Run:

trainer.add_callback(MPSMemoryCallback())
trainer.train()

A complete first-pass MPS-safe training cell

This is the sort of configuration I would try first.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model_checkpoint = "t5-small"
prefix = "summarize: "  # the t5 task prefix used earlier in the notebook

raw_datasets = load_dataset("xsum")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model.config.use_cache = False

if torch.backends.mps.is_available():
    model.to("mps")

max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]

    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        padding="max_length",
        truncation=True,
    )

    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        padding="max_length",
        truncation=True,
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    load_from_cache_file=False,
)

small_train = tokenized_datasets["train"].select(range(10_000))
small_eval = tokenized_datasets["validation"].select(range(500))

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-debug",

    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,

    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,

    gradient_checkpointing=True,

    eval_strategy="no",
    save_strategy="no",
    push_to_hub=False,

    fp16=False,
    bf16=False,

    dataloader_num_workers=0,
    dataloader_pin_memory=False,

    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    data_collator=data_collator,
    processing_class=tokenizer,
)

trainer.add_callback(MPSMemoryCallback())
trainer.train()

If processing_class=tokenizer fails because of your Transformers version, replace it with:

tokenizer=tokenizer

Re-enable evaluation only after training is stable

Once training alone is stable, add evaluation carefully.

For summarization, evaluation is expensive because predict_with_generate=True runs generation. Hugging Face documents predict_with_generate as using generate() to calculate generative metrics such as ROUGE/BLEU.

Also, eval_accumulation_steps matters. The Trainer docs explain that if it is unset, predictions are accumulated on the accelerator before being moved to CPU, which is faster but uses more accelerator memory.

Use:

eval_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-eval",

    per_device_eval_batch_size=1,

    predict_with_generate=True,
    generation_max_length=64,
    generation_num_beams=1,
    eval_accumulation_steps=1,

    fp16=False,
    bf16=False,

    save_strategy="no",
    push_to_hub=False,
)

Recommended eval strategy:

1. train with eval disabled
2. restart the Python process
3. load the trained model
4. evaluate on 100 to 500 validation examples
5. only then try larger validation runs

This avoids forcing a long training process with already-grown MPS driver memory to run generation-heavy evaluation afterward.


Experiment matrix

Run these in order.

Experiment | Padding | Batch | Lengths | Eval?       | Purpose
A          | dynamic | 1     | 512/64  | no          | Does reduced training still grow memory?
B          | fixed   | 1     | 512/64  | no          | Does shape stability fix the issue?
C          | fixed   | 2     | 512/64  | no          | Can you safely increase speed?
D          | fixed   | 1     | 768/96  | no          | Can you safely increase length?
E          | fixed   | 1     | 512/64  | tiny eval   | Does generation/eval trigger memory jumps?
F          | fixed   | 1     | 512/64  | larger eval | How far can evaluation scale?

Stop as soon as driver_allocated_memory() shows a steady upward slope.
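To make "steady upward slope" concrete, you can fit a line to the logged driver memory values; a clearly positive slope sustained across many steps is the stop signal. A minimal sketch (the sample numbers below are made up for illustration):

```python
def slope_gib_per_100_steps(steps, gib_values):
    # Ordinary least-squares slope of memory (GiB) versus step,
    # rescaled to GiB per 100 steps.
    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(gib_values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, gib_values))
    var = sum((x - mean_x) ** 2 for x in steps)
    return cov / var * 100

# Hypothetical log: driver_allocated_memory() sampled every 50 steps.
steps = [50, 100, 150, 200, 250, 300]
driver_gib = [6.1, 7.9, 9.8, 11.6, 13.5, 15.2]

print(f"~{slope_gib_per_100_steps(steps, driver_gib):.1f} GiB per 100 steps")
```

A slope like that extrapolates past the ~47 GiB limit well within a normal training run, which matches the "fails around step 1300" symptom.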


What each result means

If fixed padding stabilizes memory

Then dynamic shape churn is probably the main trigger.

Use:

  • fixed padding;
  • shorter max lengths;
  • batch size 1 or 2;
  • gradient accumulation;
  • separate train/eval processes;
  • no fp16 until stable.

If fixed padding slows but does not stop memory growth

Then shape churn is one contributor, but there is probably broader MPS backend growth.

Use:

  • shorter runs;
  • restart process between phases;
  • checkpoint only model weights;
  • CPU or CUDA/cloud for full training;
  • track PyTorch MPS issues.

If both dynamic and fixed padding leak similarly

Then this is likely a more general MPS backend / seq2seq / Trainer issue.

Try:

  • the official script instead of the notebook;
  • a no-Trainer loop;
  • CPU control run;
  • newer or older PyTorch version as a test;
  • cloud CUDA if you need the full notebook behavior.


If CPU is stable but MPS leaks

Then the problem is almost certainly MPS-specific.

A CPU control run:

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-cpu",
    use_cpu=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    eval_strategy="no",
    save_strategy="no",
    push_to_hub=False,
)

CPU will be slower, but it is useful as a control experiment.

If memory grows only in the notebook

Then notebook state is contributing.

Before rerunning:

import gc
import torch

try:
    del trainer
except NameError:
    pass

try:
    del model
except NameError:
    pass

gc.collect()

if torch.backends.mps.is_available():
    torch.mps.empty_cache()

But the stronger fix is to restart the kernel or run the training as a plain script:

python train_summarization_mps.py

Python 3.14: probably not the main cause, but simplify it

Python 3.14 is not necessarily the root cause. Current PyTorch installation guidance includes modern Python versions on macOS. Still, for debugging I would use Python 3.11 or 3.12 first because they are more commonly exercised across ML packages.

A cleaner environment:

python3.12 -m venv .venv-summarization-mps
source .venv-summarization-mps/bin/activate

python -m pip install -U pip
python -m pip install -U torch torchvision torchaudio
python -m pip install -U transformers datasets evaluate accelerate rouge-score nltk

Then print versions:

import sys
import torch
import transformers
import datasets
import accelerate

print("python:", sys.version)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("accelerate:", accelerate.__version__)
print("mps built:", torch.backends.mps.is_built())
print("mps available:", torch.backends.mps.is_available())

This does not prove Python 3.14 is bad. It just removes a variable while investigating a likely MPS backend issue.


What I would not do

Do not rely on torch.mps.empty_cache()

It is not a general leak fix.

Use it for cleanup, but do not expect it to solve driver/backend growth.

Do not set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 as the solution

That disables the hard limit and can risk system-wide OOM. It may postpone the crash, but it does not fix the underlying memory-growth pattern.

Do not start with full XSum + full ROUGE evaluation

Use subsets first.

Do not debug with fp16=True

Disable mixed precision first:

fp16=False
bf16=False

After the memory curve is understood, test mixed precision separately.

Do not assume t5-small means the workload is small

Parameter count is only one part of memory behavior. Seq2seq summarization with long inputs is memory-heavy even with a small model.
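A rough sense of scale, assuming t5-small's roughly 60M parameters, fp32 weights, and Adam's two fp32 moment buffers per parameter: the persistent model state is around 1 GiB, nowhere near the tens of GiB reported as "other allocations" in the error.

```python
params = 60_000_000          # approximate t5-small parameter count
bytes_per_fp32 = 4

weights = params * bytes_per_fp32
grads = params * bytes_per_fp32
adam_states = params * bytes_per_fp32 * 2   # exp_avg + exp_avg_sq

total_gib = (weights + grads + adam_states) / 1024**3
print(f"model + grads + Adam state: ~{total_gib:.2f} GiB")
```

The rest of the memory budget is activations, generation buffers, and backend allocations, which is exactly where this workload is heavy.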


Most practical final local-MPS recipe

For actually finishing a local run on Mac M2, I would use something like this:

max_input_length = 512
max_target_length = 64

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps",

    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,

    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,

    gradient_checkpointing=True,

    eval_strategy="no",

    save_strategy="epoch",
    save_only_model=True,
    push_to_hub=False,

    fp16=False,
    bf16=False,

    dataloader_num_workers=0,
    dataloader_pin_memory=False,

    logging_steps=50,
)

Then restart the Python process and evaluate separately:

eval_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-xsum-mps-eval",

    per_device_eval_batch_size=1,

    predict_with_generate=True,
    generation_max_length=64,
    generation_num_beams=1,
    eval_accumulation_steps=1,

    save_strategy="no",
    push_to_hub=False,

    fp16=False,
    bf16=False,
)



Bottom line

Your issue is most likely:

MPS backend / Metal-driver memory growth during a long, variable-shape Hugging Face seq2seq training run.

The original notebook makes that likely because it combines:

  • XSum summarization;
  • t5-small;
  • dynamic padding;
  • max_input_length=1024;
  • max_target_length=128;
  • batch size 16;
  • fp16=True;
  • generation-based evaluation;
  • checkpointing / Hub-push behavior;
  • a long run on Apple Silicon MPS.

The best fix is not PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0.

The best practical workaround is:

  • log both current_allocated_memory() and driver_allocated_memory();
  • use batch size 1 or 2;
  • use gradient accumulation;
  • reduce lengths to 512/64 first;
  • disable eval/generation during the first training test;
  • disable fp16 while debugging;
  • use gradient checkpointing;
  • disable save/push during diagnosis;
  • test fixed padding against dynamic padding;
  • use a fresh Python process for long runs;
  • evaluate separately after restarting;
  • use CPU or CUDA/cloud if the full long MPS run still leaks.

Short version:

  • MPS allocated: 4.20 GiB means live tensor memory is not enormous.
  • other allocations: 43.49 GiB points to backend/driver allocations.
  • empty_cache() is not expected to fix this.
  • high-watermark 0.0 only removes a safety guardrail.
  • dynamic-shape seq2seq training on MPS is the main suspect.