GPU-Optimization

RecScale: System-Aware Scaling Laws for Deep Learning Recommendation Models

This blog presents the motivation, design principles, and key results behind RecScale, a system-aware approach to scaling Deep Learning Recommendation Models (DLRMs) that addresses critical memory and communication bottlenecks in distributed training.

October 12, 2025 10 minutes

MegaFold: an Open-Sourced AlphaFold-3 Training System

This blog presents a deep analysis of AlphaFold 3 (AF3) training pipelines, pinpoints their inefficiencies, and introduces MegaFold, an end-to-end training system for AF3 that addresses them.

October 3, 2025 10 minutes

VoltanaLLM: Feedback-Driven Frequency Control and Routing for Energy-Efficient LLM Serving

This blog presents the motivation, insights, and key optimizations behind VoltanaLLM, our system for energy-efficient LLM inference. We’ll walk through why energy matters, how conventional GPU frequency scaling falls short, the surprising behaviors we uncovered when profiling LLM serving, and how prefill/decode (P/D) disaggregated serving creates unique opportunities. We then show how VoltanaLLM’s co-design of frequency control and routing achieves up to 36.3% GPU energy savings while maintaining near-perfect Service Level Objective (SLO) attainment.

September 14, 2025 9 minutes

X-MoE: Scaling DeepSeek-style MoEs on Frontier—what broke, what we fixed, and what to learn

This blog presents the background and key optimizations behind X-MoE, along with our hands-on experience scaling MoE model training on Frontier, the AMD GPU supercomputer.

August 24, 2025 10 minutes