X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms

1University of Illinois Urbana-Champaign 2Oak Ridge National Laboratory

News

2025-06-26: X-MoE has been accepted at SC 2025 and received a Best Student Paper nomination! 🎉

Abstract

Expert-specialized Mixture-of-Experts (MoEs) represent a significant advancement in large language models, employing fine-grained experts with large top-k routing to enhance expert specialization. However, training these emerging MoE architectures poses significant challenges for existing off-the-shelf MoE training solutions, especially on heterogeneous HPC platforms. These challenges include inefficient cross-platform kernels, shifted memory bottlenecks from model parameters to activations, and expensive all-to-all communication on hierarchical networks.

To address these issues, we present X-MoE, a comprehensive training system designed specifically for expert-specialized MoEs on HPC platforms. X-MoE introduces three key innovations: (1) a padding-free sparse MoE training pipeline with cross-platform kernels that eliminates zero-padding overhead, (2) a hierarchical redundancy-bypassing dispatch algorithm that reduces communication redundancy on hierarchical networks, and (3) a hybrid parallelism strategy with sequence-sharded MoE blocks that addresses the shifted memory bottleneck. Our evaluation on the Frontier supercomputer demonstrates that X-MoE enables training of models up to 545B parameters on 1024 AMD GPUsβ€”10Γ— larger than existing solutionsβ€”while achieving up to 1.42Γ— higher training throughput.

Background

Mixture-of-Experts (MoE) models have emerged as a powerful approach to scale neural networks efficiently by activating only a subset of parameters per token. Traditional MoE architectures typically employ coarse-grained experts with relatively large hidden dimensions and small routing values (e.g., top-1 or top-2).

In contrast, expert-specialized MoEs represent a paradigm shift toward more fine-grained expertise. These architectures feature:

  • Fine-grained experts with smaller hidden dimensions that encourage specialization
  • Large top-k routing (e.g., top-8) that activates multiple specialized experts per token (see the routing sketch below)
  • Enhanced expert specialization where each expert learns to handle specific types of linguistic patterns or knowledge domains

MoE Background
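
To make this routing behavior concrete, here is a minimal PyTorch sketch of top-k gating as used by expert-specialized MoEs. The sizes, expert count, and the topk_gate helper are illustrative assumptions, not X-MoE's actual configuration or API.

import torch
import torch.nn.functional as F

def topk_gate(hidden_states, router_weight, top_k=8):
    """Minimal top-k gating sketch: each token selects `top_k` experts.

    hidden_states: (num_tokens, d_model)
    router_weight: (d_model, num_experts)
    Returns per-token expert indices and normalized routing weights.
    """
    logits = hidden_states @ router_weight                    # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    weights, expert_idx = torch.topk(probs, top_k, dim=-1)    # (num_tokens, top_k)
    weights = weights / weights.sum(dim=-1, keepdim=True)     # renormalize over chosen experts
    return expert_idx, weights

# Illustrative sizes: 16 tokens, 1024-dim hidden states, 64 fine-grained experts, top-8 routing.
tokens = torch.randn(16, 1024)
router = torch.randn(1024, 64)
expert_idx, weights = topk_gate(tokens, router, top_k=8)
print(expert_idx.shape, weights.shape)   # torch.Size([16, 8]) torch.Size([16, 8])

With top-1 or top-2 routing, a token touches at most two experts; with top-8 routing over many small experts, each token fans out to eight of them, which is what drives the activation-memory and communication costs described below.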

Recent models like DeepSeek-v3 and Qwen3-MoE have demonstrated the effectiveness of this approach, achieving superior performance while maintaining computational efficiency. However, this architectural shift introduces significant new challenges for training systems:

Lack of Efficient Cross-Platform Training Pipeline

Existing MoE training frameworks rely on dense, CUDA-specific implementations that are inefficient for expert-specialized MoEs and difficult to port to non-NVIDIA platforms, leading to inflated memory usage and degraded performance.

Memory Bottleneck Shift: High Activation Memory Usage

As expert-specialized MoEs increase the number of routed experts and shrink expert hidden dimensions, per-device memory bottlenecks shift from model parameters to activations, particularly in the dispatch and combine stages.
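
As a back-of-the-envelope illustration (with made-up sizes, not measured X-MoE numbers), the sketch below estimates the dispatch-stage activation footprint, which grows linearly with top-k because every token is replicated once per routed expert:

def dispatched_activation_bytes(tokens_per_device, d_model, top_k, bytes_per_elem=2):
    """Activation footprint of the dispatched token buffers alone (bf16 by default).

    Each token is replicated top_k times before expert computation, so this
    buffer grows with top_k even though the per-expert parameter footprint shrinks.
    """
    return tokens_per_device * top_k * d_model * bytes_per_elem

# Illustrative numbers: 8K tokens per device, 4096 hidden size.
for top_k in (2, 8):
    gib = dispatched_activation_bytes(8192, 4096, top_k) / 2**30
    print(f"top-{top_k}: ~{gib:.2f} GiB of dispatched activations per MoE layer")
# top-2: ~0.12 GiB vs. top-8: ~0.50 GiB, per layer and before any other activations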

Expensive All-to-All Communication

Expert-specialized MoEs increase the number of routed experts per token, leading to significant communication duplication. On HPC platforms with hierarchical networks, this results in inefficient use of inter-node bandwidth.
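
The sketch below quantifies this duplication under an assumed layout (64 experts, 8 experts per node, top-8 routing); the numbers and helper name are hypothetical, but they show why a flat all-to-all sends the same token to the same node multiple times:

import torch

def internode_send_counts(expert_idx, experts_per_node):
    """Compare naive inter-node sends against the minimum needed (illustrative).

    expert_idx: (num_tokens, top_k) expert assignments from the router.
    A flat all-to-all sends one copy per (token, expert) pair; at most one copy
    per distinct (token, node) pair is actually required.
    """
    dest_node = expert_idx // experts_per_node                        # (num_tokens, top_k)
    naive = dest_node.numel()                                         # one send per routed expert
    needed = sum(len(set(row.tolist())) for row in dest_node)         # distinct nodes per token
    return naive, needed

# Hypothetical layout: 1024 tokens routed top-8 among 64 experts, 8 experts per node.
expert_idx = torch.randint(0, 64, (1024, 8))
naive, needed = internode_send_counts(expert_idx, experts_per_node=8)
print(f"naive sends: {naive}, distinct (token, node) pairs: {needed}")

With only eight nodes holding experts, several of a token's top-8 experts often land on the same node, so the naive send count is noticeably larger than the number of distinct (token, node) pairs.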

X-MoE Design

X-MoE Overview

Padding-Free Sparse MoE Training

X-MoE introduces PFT (Padding-Free Token buffers), a novel sparse data structure that eliminates zero-padding throughout the MoE computation and communication stages. Cross-platform Triton-based kernels handle the resulting sparse and irregular workloads efficiently.
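
The core idea behind a padding-free buffer can be sketched in a few lines of PyTorch. This is a simplified CSR-style layout with hypothetical names, not X-MoE's actual PFT implementation or its Triton kernels:

import torch

def build_padding_free_buffer(tokens, expert_idx, num_experts):
    """Simplified padding-free token buffer (CSR-style), illustrating the PFT idea.

    tokens:     (num_tokens, d_model) activations entering the MoE block.
    expert_idx: (num_tokens, top_k) expert assignments from the router.

    Instead of padding every expert's slot to a fixed capacity (and wasting
    compute on zeros), gather exactly the (token, expert) pairs that exist and
    record per-expert offsets into one densely packed buffer.
    """
    num_tokens, top_k = expert_idx.shape
    flat_expert = expert_idx.reshape(-1)                                # (num_tokens * top_k,)
    flat_token = torch.arange(num_tokens).repeat_interleave(top_k)      # source token of each pair

    order = torch.argsort(flat_expert)                                  # group entries by expert
    packed_tokens = tokens[flat_token[order]]                           # densely packed, no padding
    counts = torch.bincount(flat_expert, minlength=num_experts)
    offsets = torch.cumsum(counts, dim=0)                               # expert e owns rows [offsets[e-1], offsets[e])
    return packed_tokens, offsets, order

# Illustrative sizes: 16 tokens, 128-dim, 8 experts, top-2 routing.
tokens = torch.randn(16, 128)
expert_idx = torch.randint(0, 8, (16, 2))
packed, offsets, order = build_padding_free_buffer(tokens, expert_idx, num_experts=8)
print(packed.shape, offsets.tolist())   # (32, 128) and monotonically increasing offsets

Because the buffer stores only the (token, expert) pairs that actually exist, no capacity factor or zero-padding is needed, and downstream kernels can iterate over offsets instead of fixed-size expert slots.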

Redundancy-Bypassing Dispatch (RBD)

A hierarchical, multi-stage dispatch process that eliminates redundant inter-node communication: pilot tokens and local replicas ensure that a token routed to multiple experts on the same remote node crosses the inter-node network only once.
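
A minimal sketch of the idea follows; the function name and data layout are illustrative assumptions, not X-MoE's actual RBD code. For each (token, destination node) pair, only one pilot copy crosses the inter-node network, and the copies needed by the node's other experts are produced from a local replica after arrival:

import torch

def plan_rbd_sends(expert_idx, experts_per_node):
    """Plan redundancy-bypassing sends for one device's tokens (illustrative sketch).

    For every token, returns the destination nodes that receive exactly one
    pilot copy, together with the per-node expert list used to replicate the
    token locally after the inter-node all-to-all.
    """
    send_plan = []
    for tok, experts in enumerate(expert_idx.tolist()):
        per_node = {}
        for e in experts:
            # One pilot copy per destination node; the expert list travels as
            # metadata so the receiving node can fan the token out locally.
            per_node.setdefault(e // experts_per_node, []).append(e)
        send_plan.append(per_node)
    return send_plan

# Hypothetical layout: 64 experts, 8 per node; one token routed to experts on 3 nodes.
expert_idx = torch.tensor([[3, 5, 7, 12, 14, 40, 41, 47]])
print(plan_rbd_sends(expert_idx, experts_per_node=8)[0])
# {0: [3, 5, 7], 1: [12, 14], 5: [40, 41, 47]}  -> 3 inter-node sends instead of 8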

Sequence-Sharded MoE Blocks (SSMB)

A hybrid parallelism strategy that combines tensor-slicing with sequence-sharded execution for MoE blocks, reducing activation memory by a factor of the TP group size while maintaining compatibility with standard MoE routing.
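
The memory effect can be illustrated with a rough per-GPU estimate (hypothetical sizes and a hypothetical helper, not X-MoE's measured numbers):

def moe_block_activation_gib(seq_len, d_model, top_k, tp_size, sequence_sharded, bytes_per_elem=2):
    """Rough per-GPU estimate of one MoE block's dispatch/combine activations (bf16 by default).

    With sequence-sharded MoE blocks, each rank in a tensor-parallel group of
    size tp_size keeps only seq_len / tp_size tokens through the MoE block, so
    these activations shrink by a factor of tp_size.
    """
    local_tokens = seq_len // tp_size if sequence_sharded else seq_len
    return local_tokens * top_k * d_model * bytes_per_elem / 2**30

# Hypothetical config: 16K-token sequence, 4096 hidden size, top-8 routing, TP group of 4.
for sharded in (False, True):
    gib = moe_block_activation_gib(16384, 4096, 8, 4, sequence_sharded=sharded)
    print(f"sequence_sharded={sharded}: {gib:.2f} GiB per MoE layer")
# sequence_sharded=False: 1.00 GiB  ->  sequence_sharded=True: 0.25 GiB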

Evaluation Results

X-MoE Main Results

We evaluate X-MoE on the Frontier supercomputer using up to 1024 AMD MI250X GPUs, on DeepSeek-style MoE models (the Small, Medium, Large, and Super configurations are created with reference to DeepSeek-MoE, DeepSeek-V2, and DeepSeek-V3). Our results demonstrate significant improvements in both model scale and training efficiency:

πŸ“Š Scale: X-MoE enables training of models up to 545B parameters, which is 10Γ— larger than existing solutions under the same hardware budget.
⚑ Throughput: Up to 1.42Γ— higher training throughput compared to state-of-the-art MoE systems like Tutel and DeepSpeed-MoE.
πŸ’Ύ Memory Efficiency: Significant reduction in activation memory usage through padding-free pipeline and sequence sharding.

πŸ”§ Integration: X-MoE is built on top of DeepSpeed, which is compatible with the DeepSpeed-Megatron training options.

BibTeX

@inproceedings{xmoe2025,
  title     = {X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms},
  author    = {Yuan, Yueming and Gupta, Ahan and Li, Jianping and Dash, Sajal and Wang, Feiyi and Zhang, Minjia},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '25)},
  year      = {2025},
}