Welcome to SSAIL Lab

RecScale: System-Aware Scaling Laws for Deep Learning Recommendation Models

2025-10-12

This blog presents the motivation, design principles, and key results behind RecScale, a system-aware approach to scaling Deep Learning Recommendation Models (DLRMs) that addresses critical memory and communication bottlenecks in distributed training.

SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

2025-10-07

SuperOffload enables efficient full-parameter fine-tuning of GPT-OSS-20B and Qwen3-14B on a single NVIDIA GH200 Superchip, and of Llama3-70B on four NVIDIA GH200 Superchips, while delivering up to 600 TFLOPS of training throughput.

MegaFold: an Open-Sourced AlphaFold-3 Training System

2025-10-03

This blog presents a deep analysis of AlphaFold 3 (AF3) training pipelines, pinpoints their inefficiencies, and introduces MegaFold, an end-to-end training system for AF3 that addresses them.

VoltanaLLM: Feedback-Driven Frequency Control and Routing for Energy-Efficient LLM Serving

2025-09-14

This blog presents the motivation, insights, and key optimizations behind VoltanaLLM, our system for energy-efficient LLM inference. We walk through why energy matters, how conventional GPU frequency scaling falls short, the surprising behaviors we uncovered when profiling LLM serving, and the unique opportunities that prefill/decode (P/D) disaggregated serving creates. We then show how VoltanaLLM’s co-design of frequency control and routing achieves up to 36.3% GPU energy savings while maintaining near-perfect Service Level Objective (SLO) attainment.

X-MoE: Scaling DeepSeek-style MoEs on Frontier—what broke, what we fixed, and what to learn

2025-08-24

This blog presents the background and key optimizations behind X-MoE, along with our hands-on experience scaling MoE model training on Frontier, the AMD GPU supercomputer.