News
- 2025-10 SuperOffload will be featured in the Ray x DeepSpeed Meetup: AI at Scale.
- 2025-10 SuperOffload will be featured in the DeepSpeed & vLLM keynote at this year's PyTorch Conference.
- 2025-06 SuperOffload has been accepted at ASPLOS 2026!
Recent models, especially MoE models at the scale of tens to hundreds of billions of parameters, make fine-tuning on limited GPUs difficult. Offloading to CPU memory helps reduce GPU memory demand, but prior work typically assumes GPU-CPU connections over PCIe, which is bandwidth-limited (e.g., 32 GB/s on PCIe Gen4). Thus, prior work mainly optimizes data transfers to keep PCIe from becoming a major performance bottleneck. However, hardware vendors are introducing a new class of tightly coupled architectures, such as NVIDIA GH200, GB200, and AMD MI300A, that challenge these long-standing assumptions.
The open-source release of SuperOffload addresses this gap by providing a set of modular techniques for efficient large-model training. With SuperOffload, models such as GPT-OSS-20B, Qwen3-14B, and Phi-4 can be fully fine-tuned on a single GH200, achieving 600 TFLOPS under modest settings (sequence length 4k, batch size 4). This delivers up to 4x higher throughput compared to ZeRO-Offload.
Built on top of ZeRO Stage 3, SuperOffload enables scaling to even larger models, including Qwen3-30B-A3B and Seed-OSS-36B on two GH200s, and Llama-70B on four GH200s. All of this is supported natively through Hugging Face Transformers and DeepSpeed, with no need for custom modeling code.
As shown in Figure 2 (ZeRO-Offload), gradient-norm clipping requires computing the global gradient norm, and mixed-precision training requires a global check for NaN and Inf values. Both require the CPU to wait until all gradients have been received before performing the optimizer step and weight updates. As illustrated by the idle block in Figure 2 (ZeRO-Offload), this dependency places the optimizer step on the critical path, preventing it from overlapping with the backward pass.
To address this limitation, we propose a speculation-then-validation schedule, which largely bypasses these synchronizations while preserving the exact convergence properties of training. Our mechanism is based on a key observation: most of the time, these global checks have no effect. For example, gradient clipping is rarely triggered, especially after the initial warm-up phase when gradient variance drops significantly. As shown in Figure 3, in BLOOM (176B) training, gradient clipping rarely occurs after iteration 1000, once training becomes more stable: only 93 times between steps 1000 and 80000, or 0.12% of iterations. Similarly, mixed-precision training rarely encounters NaN or Inf values, since a healthy training run should not have numerical instability issues. The situation improves further with BF16 training and during fine-tuning, which are considerably more stable than FP16 and large-scale pre-training.
Therefore, instead of waiting for all gradients to arrive, the CPU initiates the optimizer step speculatively using the gradients available at that moment. Once the update is complete, the new parameters are copied back to the GPU and replace the old ones. During the validation phase, (1) if NaNs or Infs are detected, the iteration is skipped; (2) if gradients exceed the clipping threshold (determined after the global gradient norm has been computed across all parameter gradients), SuperOffload reverts the previous optimizer update and re-executes it using the clipped gradients. We implement this in-place rollback as a function of CPU-Adam.
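Below is a minimal PyTorch sketch of this speculation-then-validation schedule for a single parameter group. The helper names and the copy-based rollback are simplifications for illustration; SuperOffload performs the rollback in place inside CPU-Adam rather than by copying state.

```python
import copy
import torch

def speculative_step(optimizer, params):
    # Speculate: run the optimizer update with the gradients available now,
    # keeping a copy of parameters and optimizer state so the update can be
    # undone if validation later fails.
    backup = {
        "params": [p.detach().clone() for p in params],
        "state": copy.deepcopy(optimizer.state_dict()),
    }
    optimizer.step()
    return backup

def validate_and_maybe_redo(optimizer, params, backup, clip_norm=1.0):
    # Validate once *all* gradients are in: check for NaN/Inf and for
    # gradient clipping, then keep, skip, or redo the speculative update.
    grads = [p.grad for p in params if p.grad is not None]
    global_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads]))

    if not torch.isfinite(global_norm):
        _rollback(optimizer, params, backup)   # NaN/Inf: skip this iteration
        return "skipped"

    if global_norm > clip_norm:
        _rollback(optimizer, params, backup)   # clipping triggered (rare)
        for g in grads:
            g.mul_(clip_norm / (global_norm + 1e-6))
        optimizer.step()                       # re-execute with clipped grads
        return "redone"

    return "kept"  # common case: the speculative update was already correct

def _rollback(optimizer, params, backup):
    # Undo the speculative update by restoring parameters and optimizer state.
    with torch.no_grad():
        for p, saved in zip(params, backup["params"]):
            p.copy_(saved)
    optimizer.load_state_dict(backup["state"])
```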
In contrast to traditional ZeRO-Offload, where the forward pass of the next iteration waits for all updated parameters to return from the CPU, SuperOffload reduces this synchronization bubble. It does so by keeping the optimizer states and gradients of the last few buckets directly in GPU memory (when Hopper HBM capacity allows), and by ensuring that the final offloaded bucket finishes its optimizer step early enough for the next iteration to begin without stalling. The number of these "tail buckets" kept on the GPU is denoted n'. Adjusting n' trades a small amount of extra GPU memory for better overlap and less idle time at the end of each iteration.
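As a rough illustration of this trade-off, the placement decision can be thought of as follows (a sketch; the bucket abstraction and function name are assumptions, not DeepSpeed internals):

```python
def plan_bucket_placement(num_buckets: int, n_prime: int):
    """Keep the last n_prime gradient buckets (and their optimizer states)
    in GPU HBM; the remaining buckets are offloaded to the CPU as usual."""
    n_prime = min(n_prime, num_buckets)
    return ["cpu"] * (num_buckets - n_prime) + ["gpu"] * n_prime

# With 8 buckets and n_prime = 2, the two buckets produced last by the
# backward pass are updated on the GPU, so the next forward pass does not
# stall waiting for their parameters to return from the CPU.
print(plan_bucket_placement(8, 2))
# ['cpu', 'cpu', 'cpu', 'cpu', 'cpu', 'cpu', 'gpu', 'gpu']
```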
In DL training frameworks such as PyTorch and DeepSpeed, mixed-precision training is implemented through a graph rewriting process. The default precision of all ops is float32 (FP32); mixed-precision training casts certain model states (e.g., weights, gradients) from FP32 to lower precision (FP16, BF16), or vice versa. For example, gradients in the backward pass are produced in FP16/BF16/FP8, while the optimizer computes updates using FP32 gradients. When considering offloading strategies, the cost therefore comes not only from transferring tensors between the GPU and CPU but also from converting tensor data types.
Existing offloading-based solutions often apply a minimum edge-cut algorithm to the computation graph, assuming that casting and transfer costs are dominated by bandwidth. On Superchips, the high-bandwidth CPU↔GPU link shifts this cost balance, and casting becomes non-negligible. As illustrated in Figure 4, SuperOffload improves efficiency by performing casting on the GPU and transferring high-precision tensors to the CPU.
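The difference can be seen in a short PyTorch sketch (illustrative tensor shapes; `non_blocking` transfers assume pinned destination memory to be truly asynchronous):

```python
import torch

grad_bf16 = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

# (a) Cast on the CPU: move the low-precision tensor first, then upcast.
#     Attractive on PCIe-class links because fewer bytes cross the link,
#     but the cast then runs on the slower CPU.
grad_fp32_a = grad_bf16.to("cpu", non_blocking=True).float()

# (b) Cast on the GPU (the approach in Figure 4): upcast in HBM, then move
#     the FP32 tensor over the high-bandwidth NVLink-C2C link. The larger
#     transfer is affordable on Superchips, and the cast stays on the GPU.
grad_fp32_b = grad_bf16.float().to("cpu", non_blocking=True)
```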
To reduce memory-subsystem interference on the Grace CPU, we recommend partitioning cores with MPAM (Memory System Resource Partitioning and Monitoring). Check whether MPAM is enabled in your kernel:
grep MPAM /boot/config-$(uname -r)
Expected output:
CONFIG_ARM64_MPAM=y
CONFIG_ACPI_MPAM=y
CONFIG_ARM64_MPAM_DRIVER=y
CONFIG_ARM64_MPAM_RESCTRL_FS=y
Optional: verify that the resctrl filesystem is available, then mount it and create two partition groups:
ls -ld /sys/fs/resctrl
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/p1 /sys/fs/resctrl/p2
Recommended config based on our experiments:
/sys/fs/resctrl/p1/cpus_list:
0-6
/sys/fs/resctrl/p2/cpus_list:
7-71
/sys/fs/resctrl/p1/schemata:
MB:1=100
L3:1=ff0
/sys/fs/resctrl/p2/schemata:
MB:1=20
L3:1=f
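For convenience, the same partitioning can be applied programmatically. The snippet below is a small sketch that assumes resctrl is already mounted as shown earlier and must be run as root; adjust the CPU ranges to your system.

```python
from pathlib import Path

RESCTRL = Path("/sys/fs/resctrl")

# Values mirror the recommended configuration listed above.
CONFIG = {
    "p1": {"cpus_list": "0-6",  "schemata": ["MB:1=100", "L3:1=ff0"]},
    "p2": {"cpus_list": "7-71", "schemata": ["MB:1=20",  "L3:1=f"]},
}

for group, settings in CONFIG.items():
    gdir = RESCTRL / group
    gdir.mkdir(exist_ok=True)
    (gdir / "cpus_list").write_text(settings["cpus_list"] + "\n")
    # Each write updates only the named resource line (MB or L3) in schemata.
    for line in settings["schemata"]:
        (gdir / "schemata").write_text(line + "\n")
```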
SuperOffload is released as modular extensions atop ZeRO Stage 3 inside DeepSpeed with native configuration hooks exposed to Hugging Face Transformers (no model code changes). Community feedback & contributions are welcome.
To enable SuperOffload, simply add one line to your DeepSpeed configuration file:
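As a rough illustration, the configuration can also be supplied as a Python dict to deepspeed.initialize. The exact option name for SuperOffload below is an assumption made for this sketch; please consult the DeepSpeed documentation for the released flag.

```python
# Sketch of a ZeRO Stage 3 + CPU-offload config with SuperOffload enabled.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "super_offload": True,  # assumed name of the one-line switch
    },
}

# Typical usage (model construction omitted):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```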
This work is the result of a close collaboration between the University of Illinois Urbana-Champaign (UIUC) and the DeepSpeed team.
We also gratefully acknowledge William Gropp, Brett Bode, and Gregory H. Bauer from the National Center for Supercomputing Applications (NCSA), as well as Dan Ernst, Ian Karlin, Giridhar Chukkapalli, Kurt Rago, and others from NVIDIA for their valuable discussions and guidance on MPAM support on Grace CPU.
@inproceedings{superoffload,
author = {Xinyu Lian and Masahiro Tanaka and Olatunji Ruwase and Minjia Zhang},
title = "{SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips}",
year = {2026},
booktitle = {Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'26)}
}
We thank the authors of Nerfies, who kindly open-sourced the template of this website. It is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.