News
- 2025-10 SuperOffload will be featured in the Ray x DeepSpeed Meetup: AI at Scale.
- 2025-10 SuperOffload will be featured in the DeepSpeed & vLLM keynote at this year's PyTorch Conference.
- 2025-06 SuperOffload has been accepted at ASPLOS 2026!
Recent models, especially MoE models at the scale of tens to hundreds of billions of parameters, make fine-tuning on limited GPUs difficult. Offloading to CPU memory helps reduce GPU memory demand, but prior work typically assumes GPU-CPU connections over PCIe, which is bandwidth-limited (e.g., 32 GB/s on PCIe Gen4). Thus, prior work mainly optimizes data transfers to keep PCIe from becoming a major performance bottleneck. However, hardware vendors are introducing a new class of tightly coupled architectures, such as NVIDIA GH200, GB200, and AMD MI300A, that challenge these long-standing assumptions.
The open-source release of SuperOffload addresses this gap by providing a set of modular techniques for efficient large-model training. With SuperOffload, models such as GPT-OSS-20B, Qwen3-14B, and Phi-4 can be fully fine-tuned on a single GH200, achieving 600 TFLOPS under modest settings (sequence length 4k, batch size 4). This delivers up to 4x higher throughput compared to ZeRO-Offload.
Built on top of ZeRO Stage 3, SuperOffload enables scaling to even larger models, including Qwen3-30B-A3B and Seed-OSS-36B on two GH200s, and Llama-70B on four GH200s. All of this is supported natively through Hugging Face Transformers and DeepSpeed, with no need for custom modeling code.
As shown in Figure 2 (ZeRO-Offload), gradient-norm clipping requires computing the global gradient norm, and mixed-precision training requires a global check for NaN and Inf values. Both require the CPU to wait until all gradients have been received before performing the optimizer step and weight updates. As illustrated by the idle block in Figure 2 (ZeRO-Offload), this dependency places the optimizer step on the critical path, preventing it from overlapping with the backward pass.
To address this limitation, we propose a speculation-then-validation schedule, which largely bypasses these synchronizations while preserving the exact convergence properties of training. Our mechanism is based on a key observation: most of the time, these global checks have no effect. For example, gradient clipping is rarely triggered, especially after the initial warm-up phase when gradient variance drops significantly. As shown in Figure 3, in BLOOM (176B) training, gradient clipping rarely occurs after iteration 1000, once training becomes more stable: only 93 times between steps 1000 and 80000, or 0.12% of iterations. Similarly, mixed-precision training rarely encounters NaN or Inf values, since a healthy training run should not have numerical instability issues. The situation improves further with BF16 training and during fine-tuning, which are considerably more stable than FP16 and large-scale pre-training.
Therefore, instead of waiting for all gradients to arrive, the CPU initiates the optimizer step speculatively using the gradients available at that moment. Once the update is complete, the new parameters are copied back to the GPU and replace the old ones. During the validation phase, (1) if NaNs or Infs are detected, the iteration is skipped; (2) if gradients exceed the clipping threshold (determined after the global gradient norm has been computed across all parameter gradients), SuperOffload reverts the previous optimizer update and re-executes it using the clipped gradients. We implement this in-place rollback as a function of CPU-Adam.
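Below is a minimal PyTorch sketch of this speculation-then-validation schedule for a single parameter group. The helper names and the copy-based rollback are simplifications for illustration; SuperOffload performs the rollback in place inside CPU-Adam rather than by copying state.

```python
import copy
import torch

def speculative_step(optimizer, params):
    # Speculate: run the optimizer update with the gradients available now,
    # keeping a copy of parameters and optimizer state so the update can be
    # undone if validation later fails.
    backup = {
        "params": [p.detach().clone() for p in params],
        "state": copy.deepcopy(optimizer.state_dict()),
    }
    optimizer.step()
    return backup

def validate_and_maybe_redo(optimizer, params, backup, clip_norm=1.0):
    # Validate once *all* gradients are in: check for NaN/Inf and for
    # gradient clipping, then keep, skip, or redo the speculative update.
    grads = [p.grad for p in params if p.grad is not None]
    global_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads]))

    if not torch.isfinite(global_norm):
        _rollback(optimizer, params, backup)   # NaN/Inf: skip this iteration
        return "skipped"

    if global_norm > clip_norm:
        _rollback(optimizer, params, backup)   # clipping triggered (rare)
        for g in grads:
            g.mul_(clip_norm / (global_norm + 1e-6))
        optimizer.step()                       # re-execute with clipped grads
        return "redone"

    return "kept"  # common case: the speculative update was already correct

def _rollback(optimizer, params, backup):
    # Undo the speculative update by restoring parameters and optimizer state.
    with torch.no_grad():
        for p, saved in zip(params, backup["params"]):
            p.copy_(saved)
    optimizer.load_state_dict(backup["state"])
```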
In contrast to traditional ZeRO-Offload, where the forward pass of the next iteration waits for all updated parameters to return from the CPU, SuperOffload reduces this synchronization bubble. It does so by keeping the optimizer states and gradients of the last few buckets directly in GPU memory (when Hopper HBM capacity allows), and by ensuring that the final offloaded bucket finishes its optimizer step early enough for the next iteration to begin without stalling. The number of these "tail buckets" kept on the GPU is denoted n'. Adjusting n' trades a small amount of extra GPU memory for better overlap and less idle time at the end of each iteration.
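As a rough illustration of this trade-off, the placement decision can be thought of as follows (a sketch; the bucket abstraction and function name are assumptions, not DeepSpeed internals):

```python
def plan_bucket_placement(num_buckets: int, n_prime: int):
    """Keep the last n_prime gradient buckets (and their optimizer states)
    in GPU HBM; the remaining buckets are offloaded to the CPU as usual."""
    n_prime = min(n_prime, num_buckets)
    return ["cpu"] * (num_buckets - n_prime) + ["gpu"] * n_prime

# With 8 buckets and n_prime = 2, the two buckets produced last by the
# backward pass are updated on the GPU, so the next forward pass does not
# stall waiting for their parameters to return from the CPU.
print(plan_bucket_placement(8, 2))
# ['cpu', 'cpu', 'cpu', 'cpu', 'cpu', 'cpu', 'gpu', 'gpu']
```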
In DL training frameworks such as PyTorch and DeepSpeed, mixed-precision training is implemented through a graph rewriting process. The default precision of all ops is float32 (FP32); mixed-precision training casts certain model states (e.g., weights, gradients) from FP32 to lower precision (FP16, BF16), or vice versa. For example, gradients in the backward pass are produced in FP16/BF16/FP8, while the optimizer computes updates using FP32 gradients. When considering offloading strategies, the cost therefore comes not only from transferring tensors between the GPU and CPU but also from converting tensor data types.
Existing offloading-based solutions often apply a minimum edge-cut algorithm to the computation graph, assuming that casting and transfer costs are dominated by bandwidth. On Superchips, the high-bandwidth CPU↔GPU link shifts this cost balance, and casting becomes non-negligible. As illustrated in Figure 4, SuperOffload improves efficiency by performing casting on the GPU and transferring high-precision tensors to the CPU.
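The difference can be seen in a short PyTorch sketch (illustrative tensor shapes; `non_blocking` transfers assume pinned destination memory to be truly asynchronous):

```python
import torch

grad_bf16 = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

# (a) Cast on the CPU: move the low-precision tensor first, then upcast.
#     Attractive on PCIe-class links because fewer bytes cross the link,
#     but the cast then runs on the slower CPU.
grad_fp32_a = grad_bf16.to("cpu", non_blocking=True).float()

# (b) Cast on the GPU (the approach in Figure 4): upcast in HBM, then move
#     the FP32 tensor over the high-bandwidth NVLink-C2C link. The larger
#     transfer is affordable on Superchips, and the cast stays on the GPU.
grad_fp32_b = grad_bf16.float().to("cpu", non_blocking=True)
```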
To reduce memory-subsystem interference on the Grace CPU, we recommend partitioning cores with MPAM (Memory System Resource Partitioning and Monitoring). Check whether MPAM is enabled in your kernel:
grep MPAM /boot/config-$(uname -r)
Expected output:
CONFIG_ARM64_MPAM=y
CONFIG_ACPI_MPAM=y
CONFIG_ARM64_MPAM_DRIVER=y
CONFIG_ARM64_MPAM_RESCTRL_FS=y
Optional: verify that the resctrl filesystem is available, then mount it and create two partition groups:
ls -ld /sys/fs/resctrl
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/p1 /sys/fs/resctrl/p2
Recommended config based on our experiments:
/sys/fs/resctrl/p1/cpus_list:
0-6
/sys/fs/resctrl/p2/cpus_list:
7-71
/sys/fs/resctrl/p1/schemata:
MB:1=100
L3:1=ff0
/sys/fs/resctrl/p2/schemata:
MB:1=20
L3:1=f
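For convenience, the same partitioning can be applied programmatically. The snippet below is a small sketch that assumes resctrl is already mounted as shown earlier and must be run as root; adjust the CPU ranges to your system.

```python
from pathlib import Path

RESCTRL = Path("/sys/fs/resctrl")

# Values mirror the recommended configuration listed above.
CONFIG = {
    "p1": {"cpus_list": "0-6",  "schemata": ["MB:1=100", "L3:1=ff0"]},
    "p2": {"cpus_list": "7-71", "schemata": ["MB:1=20",  "L3:1=f"]},
}

for group, settings in CONFIG.items():
    gdir = RESCTRL / group
    gdir.mkdir(exist_ok=True)
    (gdir / "cpus_list").write_text(settings["cpus_list"] + "\n")
    # Each write updates only the named resource line (MB or L3) in schemata.
    for line in settings["schemata"]:
        (gdir / "schemata").write_text(line + "\n")
```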
SuperOffload is released as modular extensions atop ZeRO Stage 3 inside DeepSpeed with native configuration hooks exposed to Hugging Face Transformers (no model code changes). Community feedback & contributions are welcome.
To enable SuperOffload, simply add one line to your DeepSpeed configuration file:
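As a rough illustration, the configuration can also be supplied as a Python dict to deepspeed.initialize. The exact option name for SuperOffload below is an assumption made for this sketch; please consult the DeepSpeed documentation for the released flag.

```python
# Sketch of a ZeRO Stage 3 + CPU-offload config with SuperOffload enabled.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "super_offload": True,  # assumed name of the one-line switch
    },
}

# Typical usage (model construction omitted):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```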
This work is the result of a close collaboration between the University of Illinois Urbana-Champaign (UIUC) and the DeepSpeed team.
We also gratefully acknowledge William Gropp, Brett Bode, and Gregory H. Bauer from the National Center for Supercomputing Applications (NCSA), as well as Dan Ernst, Ian Karlin, Giridhar Chukkapalli, Kurt Rago, and others from NVIDIA for their valuable discussions and guidance on MPAM support on Grace CPU.
@inproceedings{superoffload,
author = {Xinyu Lian and Masahiro Tanaka and Olatunji Ruwase and Minjia Zhang},
title = "{SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips}",
year = {2026},
booktitle = {Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'26)}
}
We thank the authors of Nerfies, who kindly open-sourced the template of this website. It is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.