Blog

SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

Efficient full-parameter fine-tuning of GPT-OSS-20B and Qwen3-14B on a single NVIDIA GH200 Superchip, and of Llama3-70B on four GH200 Superchips, while delivering up to 600 TFLOPS of training throughput.

Efficient Deep Learning Training Heterogeneous Memory Superchip

By:

Xinyu Lian

Minjia Zhang

SSAIL Lab

October 7, 2025 5 minutes

MegaFold: an Open-Sourced AlphaFold-3 Training System

This blog presents a deep analysis of AlphaFold 3 (AF3) training pipelines, pinpoints their inefficiencies, and introduces MegaFold, an end-to-end training system for AF3 that addresses them.

protein-folding gpu-optimization deep-learning-systems

October 3, 2025 10 minutes

VoltanaLLM: Feedback-Driven Frequency Control and Routing for Energy-Efficient LLM Serving

This blog presents the motivation, insights, and key optimizations behind VoltanaLLM, our system for energy-efficient LLM inference. We’ll walk through why energy matters, how conventional GPU frequency scaling falls short, the surprising behaviors we uncovered when profiling LLM serving, and how prefill/decode (P/D) disaggregated serving creates unique opportunities. We then show how VoltanaLLM’s co-design of frequency control and routing achieves up to 36.3% GPU energy savings while maintaining near-perfect Service Level Objective (SLO) attainment.

llm-inference gpu-optimization gpu

September 14, 2025 9 minutes

X-MoE: Scaling DeepSeek-style MoEs on Frontier—what broke, what we fixed, and what to learn

This blog presents the background and key optimizations behind X-MoE, along with our hands-on experience scaling MoE model training on Frontier, the AMD GPU supercomputer.

moe training scaling frontier amd gpu-optimization distributed-training

By:

Yueming Yuan

SSAIL Lab

August 24, 2025 10 minutes
