ASPLOS 2026 Tutorial
Building Efficient Large-Scale Model Systems with DeepSpeed: From Open-Source Foundations to Emerging Research
Organizers
- Olatunji Ruwase, Snowflake
- Minjia Zhang, University of Illinois Urbana-Champaign
- Masahiro Tanaka, Anyscale
- Zhipeng Wang, LinkedIn
Overview
Large foundation models such as ChatGPT, Gemini, and DeepSeek have redefined the frontier of AI systems, yet their massive scale exposes significant challenges in distributed pre-training and post-training (e.g., reinforcement learning and supervised fine-tuning), efficiency, and hardware utilization. The community increasingly relies on open-source software to bridge these gaps, enabling researchers and practitioners to experiment, prototype, and optimize at unprecedented scale. Among these frameworks, DeepSpeed (https://www.deepspeed.ai) has become one of the most widely adopted for large-model training, empowering both academic research and industrial production deployments.
In this tutorial, we will present the system, compiler, and hardware co-design techniques that extend DeepSpeed into a powerful platform for scalable and efficient training of large foundation models. We will cover how DeepSpeed’s runtime architecture supports new forms of distributed and heterogeneous execution, and how software-hardware co-design drives innovation in parallelism, offloading, and memory optimization. Through concrete systems and hands-on insights, including DeepSpeed-SuperOffload for training LLMs on emerging GPU-CPU superchips, the combination of DeepSpeed’s high-performance training with Ray’s flexibility for complex distributed workloads such as RL, DeepCompile for compiler-driven distributed optimizations, and DeepSpeed on TPU for heterogeneous hardware, we will connect core system design principles to real-world implementation. By the end, participants will leave with both a conceptual understanding of large-model system design and concrete techniques to apply in their own research, helping to build the next generation of efficient, scalable, and open AI infrastructure.
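To make the runtime and memory-optimization discussion concrete, the sketch below shows one common way to wrap a model with the DeepSpeed engine using a ZeRO stage 3 configuration with CPU offloading. It is a minimal illustration only: the toy model, batch size, learning rate, and precision settings are placeholder assumptions rather than the configurations used in the tutorial, and the script is intended to be launched with the deepspeed launcher so the distributed environment is set up.

```python
# Minimal sketch (illustrative assumptions: toy model, batch size, learning
# rate) of wrapping a model with the DeepSpeed engine using ZeRO stage 3 and
# CPU offloading of parameters and optimizer state.
# Launch with the `deepspeed` launcher, e.g.: deepspeed train_sketch.py
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)  # stand-in for a transformer model

ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# The returned engine owns parameter partitioning, communication scheduling,
# and the (offloaded) optimizer.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(32, 4096, device=engine.device, dtype=torch.bfloat16)
loss = engine(x).float().pow(2).mean()   # toy loss for illustration
engine.backward(loss)                    # gradient partitioning and reduction
engine.step()                            # offloaded optimizer update
```

A design point we will return to throughout the tutorial: because parallelism, precision, and offloading are expressed in the configuration rather than in model code, the same training script can move between single-GPU, multi-node, and offload-heavy setups with configuration changes only.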
Tutorial Content and Tentative Schedule
1. Introduction and Motivation (45 minutes)
Olatunji Ruwase
- Challenges in scaling large foundation models, including the memory wall, communication bottlenecks, and heterogeneous hardware environments
- Overview of DeepSpeed, covering core components such as runtime architecture, usability, and key system optimizations
- Open challenges and future roadmap:
- Cross-platform runtime design
- Energy-efficient training
- Fine-grained scheduling
- Integration with ML compilers
2. Ray + DeepSpeed for LLM Training (45 minutes)
Masahiro Tanaka
- Integrating Ray and DeepSpeed for reinforcement learning:
- Flexible orchestration of distributed training and inference within a unified controller architecture
- Disaggregated hybrid parallelism for multimodal models:
- Applying the optimal parallelization strategy to each model component (e.g., encoders and LLMs), enabled by the flexible integration of Ray and DeepSpeed
- Compiler-level optimization with DeepCompile:
- Automating communication scheduling and sequence parallelism via compiler analysis and distributed graph transformation
- Live demos and practical examples (a minimal Ray + DeepSpeed orchestration sketch is included below)
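As a companion to the orchestration topics above, here is a minimal sketch of hosting DeepSpeed ranks inside Ray actors under a single driver-side controller. It is not the tutorial's actual code: the worker count, port, toy model, and configuration values are illustrative assumptions, and a real RL pipeline would additionally include rollout/inference actors and weight synchronization.

```python
# Minimal sketch (not the tutorial's code): each Ray actor hosts one DeepSpeed
# rank; the driver wires up the process group and drives training steps.
# Worker count, port, and the toy model are illustrative assumptions.
import os
import ray
import torch
import deepspeed


@ray.remote(num_gpus=1)
class TrainWorker:
    def node_ip(self) -> str:
        # Used by the driver to pick a rendezvous address for torch.distributed.
        return ray.util.get_node_ip_address()

    def setup(self, rank: int, world_size: int, master_addr: str, master_port: int = 29500):
        # DeepSpeed reads the standard torch.distributed environment variables.
        os.environ.update({
            "RANK": str(rank),
            "LOCAL_RANK": "0",  # each actor is scheduled onto exactly one GPU
            "WORLD_SIZE": str(world_size),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
        })
        model = torch.nn.Linear(1024, 1024)  # stand-in for the trained policy/LLM
        ds_config = {
            "train_batch_size": 8 * world_size,
            "bf16": {"enabled": True},
            "zero_optimization": {"stage": 2},
            "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        }
        self.engine, _, _, _ = deepspeed.initialize(
            model=model, model_parameters=model.parameters(), config=ds_config
        )

    def train_step(self) -> float:
        x = torch.randn(8, 1024, device=self.engine.device, dtype=torch.bfloat16)
        loss = self.engine(x).float().pow(2).mean()
        self.engine.backward(loss)
        self.engine.step()
        return loss.item()


if __name__ == "__main__":
    ray.init()
    world_size = 2
    workers = [TrainWorker.remote() for _ in range(world_size)]
    master_addr = ray.get(workers[0].node_ip.remote())
    ray.get([w.setup.remote(r, world_size, master_addr) for r, w in enumerate(workers)])
    print("per-rank losses:", ray.get([w.train_step.remote() for w in workers]))
```

Because the controller lives on the driver, the same pattern extends naturally to pipelines in which different actor groups (e.g., encoders, LLMs, or rollout workers) run different parallelism configurations.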
Coffee Break (30 minutes)
3. DeepSpeed-Based Systems Optimization for LLM Training (45 minutes)
Minjia Zhang
- Hardware–software co-design for emerging architectures:
- SuperOffload — novel offloading and scheduling techniques for superchip platforms
- Resilience in large-scale training:
- Universal Checkpointing for flexible checkpointing, reconfigurable parallelism, and resilient training under hardware and software failures (see the checkpointing sketch after this session outline)
- DeepSpeed for sparse model training (optional):
- X-MoE for fine-grained Mixture-of-Experts training
- DeepSpeed4Science (optional):
- MegaFold — system-level optimizations for scaling biomolecular modeling workloads
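To ground the resilience discussion, the sketch below shows the engine-level save/restore calls that checkpointing-based recovery builds on. The directory layout, tag names, and helper functions are illustrative assumptions; converting a saved checkpoint into a parallelism-agnostic universal format is a separate offline step not shown here.

```python
# Minimal sketch of engine-level checkpointing with DeepSpeed. `engine` is the
# object returned by deepspeed.initialize(); directory names, tags, and the
# helper functions themselves are illustrative assumptions.

def maybe_save(engine, step, save_dir="checkpoints", every=1000):
    # All ranks call save_checkpoint; DeepSpeed writes the sharded ZeRO state.
    if step % every == 0:
        engine.save_checkpoint(save_dir, tag=f"step{step}",
                               client_state={"step": step})

def resume(engine, load_dir="checkpoints"):
    # Returns (path, client_state); client_state carries user metadata such as
    # the step counter, so training can restart where it left off.
    path, client_state = engine.load_checkpoint(load_dir)
    if path is None:  # no checkpoint found, start from scratch
        return 0
    return client_state.get("step", 0)
```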
4. Training LLMs on Alternative Accelerators and Novel Optimizers (45 minutes)
Zhipeng Wang
- Hardware–software co-design for emerging model architectures on non-GPU accelerators (using TPU as an example)
- Scaling LLM distributed training on non-GPU accelerators with DeepSpeed
- DeepSpeed support for scalable LLM training with novel optimizers (e.g., the Muon optimizer); an illustrative sketch of the Muon update follows below
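For participants unfamiliar with Muon, the sketch below illustrates its core update rule: a momentum step followed by approximate orthogonalization of the 2-D update via a Newton-Schulz iteration. The coefficients and step count follow the widely used open-source Muon implementations; this is a toy reference for a single weight matrix, not DeepSpeed's actual optimizer integration.

```python
# Illustrative Muon update for one 2-D weight matrix: momentum, then
# Newton-Schulz orthogonalization of the update. Coefficients follow common
# open-source Muon implementations; this is not DeepSpeed's integration.
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor, lr: float = 0.02, beta: float = 0.95):
    """One Muon step; param, grad, and momentum_buf share the same 2-D shape."""
    momentum_buf.mul_(beta).add_(grad)                 # heavy-ball momentum
    param.add_(newton_schulz(momentum_buf), alpha=-lr) # orthogonalized update
```

The update is dominated by matrix multiplications, which is one reason it is attractive on matrix-multiply-centric accelerators such as TPUs, connecting back to the non-GPU hardware discussion in this session.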
Target Audience and Prerequisites
Target Audience
- Researchers and practitioners working on large-scale LLM training
- Systems and architecture researchers interested in AI workloads
- Engineers building distributed training and serving systems
Prerequisites
- Basic familiarity with deep learning and transformer models
- Working knowledge of PyTorch or similar frameworks
- Prior experience with distributed training is helpful but not required
Additional Information
This page serves as a placeholder for the ASPLOS 2026 tutorial website.
More details will be posted as the program is finalized.
The final tutorial program will be available by February 17, 2026.