MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models

1University of Massachusetts, Amherst 2University of Illinois Urbana-Champaign 3Lawrence Berkeley National Laboratory 4University of Missouri §Equal contributors. *Work done while interning at UIUC

News

  • 2025-06 MegaFold paper and code are released!

Abstract

Protein structure prediction models such as AlphaFold3 (AF3) push the frontier of biomolecular modeling by incorporating science-informed changes to the transformer architecture. However, these advances come at a steep system cost: they introduce compute- and memory-intensive operators, 2D attention mechanisms, and retrieval-augmented data pipelines, which collectively hinder the scalability of AF3 training. In this work, we present MegaFold, a cross-platform system to accelerate AF3 training. MegaFold tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle time from the retrieval-augmented data pipeline, Triton-based kernels for memory-efficient EvoAttention on heterogeneous devices, and deep fusion of common and critical small operators in AF3. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold reduces the peak memory usage of AF3 training by up to 1.23x and improves per-iteration training time by up to 1.73x and 1.62x, respectively. More importantly, MegaFold enables training on 1.35x longer sequence lengths compared to PyTorch baselines without running out of memory, significantly improving the scalability of modern protein folding models.

Method

  • MegaFold introduces a series of optimizations that reduce memory consumption and improve training performance. First, we introduce an ahead-of-time cache-based dataloader in Section IV-A. Next, we present memory-efficient optimizations to the EvoAttention operation in Section IV-B. Finally, we apply deep fusion to small but common AF3 operators in Section IV-C.
  [Figure: MegaFold system overview]
  • Ahead-of-Time Cache-based Data-Loading: avoids repeatedly executing expensive, deterministic preprocessing steps by precomputing the features for each protein complex once, prior to training, and memoizing them in storage (see the first sketch after this list).
  • Memory-Efficient EvoAttention for Heterogeneous Devices: avoids materializing the large intermediate attention logits in memory and instead incrementally materializes them in fast scratchpad memory during the forward pass, recomputing them on-the-fly to calculate gradients in the backward pass (see the second sketch after this list).
  • DeepFusion for Common AlphaFold Operators: a set of fused Triton kernels for frequent operators in AF3, including LayerNorm, linear layers, and SwiGLU activations, which are commonly composed in the transition layer (see the third sketch after this list).
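
To make the data-loading idea concrete, below is a minimal PyTorch sketch of ahead-of-time feature caching. The function and class names (precompute_feature_cache, CachedComplexDataset, featurize_fn, cache_dir) are illustrative assumptions, not MegaFold's actual API.

    import os
    import torch
    from torch.utils.data import Dataset

    def precompute_feature_cache(complexes, featurize_fn, cache_dir="feature_cache"):
        # Run the expensive, deterministic featurizer exactly once, before training.
        os.makedirs(cache_dir, exist_ok=True)
        for idx, record in enumerate(complexes):
            path = os.path.join(cache_dir, f"{idx}.pt")
            if not os.path.exists(path):
                torch.save(featurize_fn(record), path)

    class CachedComplexDataset(Dataset):
        # During training, __getitem__ is a cheap disk load instead of re-featurization,
        # so the GPU no longer idles waiting on the retrieval-augmented data pipeline.
        def __init__(self, num_complexes, cache_dir="feature_cache"):
            self.num_complexes = num_complexes
            self.cache_dir = cache_dir

        def __len__(self):
            return self.num_complexes

        def __getitem__(self, idx):
            return torch.load(os.path.join(self.cache_dir, f"{idx}.pt"))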
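
The memory saving in EvoAttention comes from tiling the computation so that only a small block of attention logits is live at any time. The PyTorch sketch below illustrates the forward-pass tiling with an online softmax and a pair-bias term; MegaFold's actual implementation is a fused Triton kernel with a recomputation-based backward pass, and the tensor shapes, block size, and finite-bias assumption here are for illustration only.

    import torch

    def tiled_evo_attention(q, k, v, bias, block=128):
        # q, k, v: [batch, heads, seq, dim]; bias: [batch, heads, seq, seq] pair bias.
        scale = q.shape[-1] ** -0.5
        out = torch.zeros_like(q)
        row_max = torch.full(q.shape[:-1], float("-inf"), device=q.device, dtype=q.dtype)
        row_sum = torch.zeros(q.shape[:-1], device=q.device, dtype=q.dtype)
        for start in range(0, k.shape[-2], block):
            end = start + block
            # Only a [seq, block] slice of the logits is materialized at a time.
            logits = torch.einsum("bhqd,bhkd->bhqk", q, k[..., start:end, :]) * scale
            logits = logits + bias[..., start:end]
            new_max = torch.maximum(row_max, logits.amax(dim=-1))
            correction = torch.exp(row_max - new_max)   # rescale previously accumulated partials
            probs = torch.exp(logits - new_max[..., None])
            row_sum = row_sum * correction + probs.sum(dim=-1)
            out = out * correction[..., None] + torch.einsum(
                "bhqk,bhkd->bhqd", probs, v[..., start:end, :])
            row_max = new_max
        return out / row_sum[..., None]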
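
Finally, the operator sequence that DeepFusion targets looks roughly like the unfused reference transition layer below: in the baseline, the LayerNorm, linear, and SwiGLU steps each launch separate kernels and round-trip activations through HBM, which is the overhead the fused Triton kernels remove. The layer sizes and expansion factor are illustrative, not AF3's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Transition(nn.Module):
        # Unfused baseline: LayerNorm -> Linear -> SwiGLU -> Linear.
        def __init__(self, dim, expansion=4):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.proj_in = nn.Linear(dim, 2 * expansion * dim, bias=False)
            self.proj_out = nn.Linear(expansion * dim, dim, bias=False)

        def forward(self, x):
            x = self.norm(x)                        # separate kernel, extra HBM traffic
            a, b = self.proj_in(x).chunk(2, dim=-1)
            x = F.silu(a) * b                       # SwiGLU: two more elementwise kernels
            return self.proj_out(x)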

Evaluation

We evaluate MegaFold through extensive experiments, demonstrating its ability to significantly reduce memory consumption, accelerate training, and support longer sequence lengths, all while maintaining computational precision. Our evaluation shows that MegaFold reduces the peak memory usage of AF3 training by up to 1.23x and improves per-iteration training time by up to 1.73x on NVIDIA H200 and 1.62x on AMD MI250 GPUs, respectively. More importantly, MegaFold enables training on 1.35x longer sequence lengths compared to PyTorch baselines without running out of memory, significantly improving the scalability of modern protein folding models. Our results also show robust scalability across multi-GPU systems on both NVIDIA and AMD hardware.

[Figures: end-to-end peak memory on NVIDIA H200; per-iteration training time on NVIDIA H200; per-iteration training time on AMD MI250]

The figures above present an end-to-end comparison of a training iteration (averaged over 100 training iterations) between MegaFold and different compiler backends. The first figure shows the peak memory consumption of AF3 training on NVIDIA H200, while the second and third figures show the per-iteration end-to-end execution time on NVIDIA H200 and AMD MI250 hardware, respectively.

BibTeX

@misc{la2025megafoldsystemleveloptimizationsaccelerating,
    title={MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models}, 
    author={Hoa La and Ahan Gupta and Alex Morehead and Jianlin Cheng and Minjia Zhang},
    year={2025},
    eprint={2506.20686},
    archivePrefix={arXiv},
    primaryClass={q-bio.BM},
    url={https://arxiv.org/abs/2506.20686}, 
}