MegaFold

News

2026-03 Officially accepted to ISC High Performance 2026 🎉🥳
2025-06 MegaFold paper and code are released!

Abstract

Recent advances in biomolecular modeling have been catalyzed by models such as AlphaFold3 (AF3), which introduce science-informed changes to the transformer architecture. Unlike transformers, AF3-style models are characterized by 3D attention over 2D pairwise representations, which produces tensors whose computation and memory costs scale cubically with sequence length. As a result, despite moderate parameter counts, AF3-style models are far more expensive to train than size-equivalent transformers, and are severely constrained by GPU memory capacity. Our characterization shows 3D attention fundamentally changes the training workload, causing massive 3D attention maps, complex inter-operator dependencies, kernel fragmentation, and heavy host-side data pipelines which differ substantially from LLM training, leading to poor utilization on modern GPU systems. Moreover, existing GPU optimizations do not adequately address these challenges due to complex cross-layer inter-operator dependencies introduced by 3D attention. Motivated by these challenges, we introduce MegaFold, a novel cross-platform system for efficient training of next-generation 3D-attention protein models. MegaFold combines a memory-efficient 3D-attention kernel, a communication-efficient sharding strategy for quadratic representations, fused operator implementations for critical execution paths, and a determinism-aware host-device pipeline that eliminates preprocessing stalls. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold enables training with up to 3.36x longer sequence lengths on 32 GPUs while reducing end-to-end execution time by up to 1.73x (NVIDIA) and 1.62x (AMD).

Method

MegaFold introduces a series of optimizations that reduce memory consumption and increase performance:

(1) EvoFlash-3D, a Triton-based kernel for 3D attention over 2D pairwise representations to achieve high memory and bandwidth efficiency on cross-platform GPUs (§4.1). Instead of explicitly materializing the enormous cubic-sized attention tensors in GPU memory, EvoFlash-3D computes attention incrementally using a FlashAttention-style online softmax algorithm and GPU scratchpad/shared memory. This reduces memory complexity from O(N³) to O(N²), significantly lowering memory bandwidth demands and enabling training on much longer protein sequences.
(2) EvoSP-3D, a communication-efficient sharding strategy that supports alternating attention axes from 2D pairwise representations (§4.2). Standard sequence parallelism used in transformers cannot directly handle AF3 because attention alternates between row-wise and column-wise operations over a 2D matrix. EvoSP-3D dynamically repartitions these tensors across GPUs using efficient communication primitives (e.g., all_to_all) so that each attention layer receives the appropriate data layout while minimizing communication overhead and avoiding repeated tensor reshaping.
(3) EvoFusion, a fused operator stack that aligns attention and transition layers with GPU shared-memory hierarchies (§4.3). AF3 repeatedly executes small operations such as layer normalization, linear projections, and SwiGLU activations, resulting in thousands of inefficient GPU kernel launches. EvoFusion combines these operators into larger fused kernels and restructures their computations to better match GPU shared-memory hierarchies, reducing intermediate memory accesses, kernel launch overhead, and overall execution time.
(4) EvoPipe, a determinism-aware host-device pipeline that eliminates CPU-side preprocessing stalls (§4.4). AlphaFold3 requires expensive biological preprocessing steps, such as multiple sequence alignment (MSA) processing and feature generation, which can leave GPUs idle while waiting for the CPU. EvoPipe identifies deterministic preprocessing stages, computes and caches them offline, and only performs stochastic operations (e.g., random cropping and augmentation) during training, thereby improving GPU utilization and reducing data pipeline stalls.
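
To illustrate the online-softmax idea behind item (1), here is a minimal NumPy sketch. It is not the EvoFlash-3D Triton kernel. It assumes an AlphaFold-style row-wise attention in which the query at pair position (i, j) attends to keys at (i, l) with an additive pair bias indexed by (j, l). The function names, shapes, and block size are hypothetical and chosen only for illustration.

```python
import numpy as np


def row_attention_reference(q, k, v, bias):
    """Baseline: materializes the full (N, N, N) logits tensor."""
    d = q.shape[-1]
    # logits[i, j, l]: the query at (i, j) scores the key at (i, l) in row i.
    logits = np.einsum("ijd,ild->ijl", q, k) / np.sqrt(d) + bias[None, :, :]
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("ijl,ild->ijd", weights, v)


def row_attention_online(q, k, v, bias, block=16):
    """Same output, visiting keys block by block with a running softmax."""
    n, _, d = q.shape
    running_max = np.full((n, n), -np.inf)  # largest logit seen so far
    running_sum = np.zeros((n, n))          # softmax denominator so far
    acc = np.zeros((n, n, d))               # unnormalized output so far
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Only an (N, N, block) slice of the logits exists at any time.
        s = np.einsum("ijd,ild->ijl", q, k[:, start:stop]) / np.sqrt(d)
        s = s + bias[None, :, start:stop]
        new_max = np.maximum(running_max, s.max(axis=-1))
        # Rescale earlier statistics so they are relative to the new maximum.
        scale = np.exp(running_max - new_max)
        p = np.exp(s - new_max[..., None])
        running_sum = running_sum * scale + p.sum(axis=-1)
        acc = acc * scale[..., None] + np.einsum(
            "ijl,ild->ijd", p, v[:, start:stop]
        )
        running_max = new_max
    return acc / running_sum[..., None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 24, 8
    q, k, v = (rng.standard_normal((n, n, d)) for _ in range(3))
    bias = rng.standard_normal((n, n))
    ref = row_attention_reference(q, k, v, bias)
    out = row_attention_online(q, k, v, bias, block=5)
    print("max abs difference:", np.abs(ref - out).max())
```

The sketch tiles only over keys, so its largest intermediate has shape (N, N, block). A GPU kernel in this style would also tile over queries so that each tile fits in shared memory; that part is omitted here.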

Evaluation

Extensive experiments show that MegaFold reduces memory consumption, accelerates training, and supports longer sequence lengths while maintaining computational precision. It reduces peak memory usage by up to 1.23x and improves per-iteration training time by up to 1.73x on NVIDIA and 1.62x on AMD. It also enables training with up to 3.36x longer input sequence lengths on 32 GPUs, and it scales robustly across multi-GPU systems on both NVIDIA and AMD hardware.

First figure: Maximum trainable sequence length before running out of memory (OOM) for different systems as GPU count increases. Second figure: Per-device peak memory consumption before OOM across sequence lengths on NVIDIA GPUs. Third and fourth figures: Average per-iteration execution time of AF3 training on a single GPU across sequence lengths.

BibTeX

@INPROCEEDINGS{11520503,
  author={La, Hoa and Gupta, Ahan and Morehead, Alex and Cheng, Jianlin and Zhang, Minjia},
  booktitle={ISC High Performance 2026 Research Paper Proceedings (41st International Conference)}, 
  title={MegaFold: Efficient Training of Next-Generation 3D Attention Protein Models on Cross-Platform GPUs}, 
  year={2026},
  volume={},
  number={},
  pages={1-16},
  keywords={Modeling;Training;Kernel;Optimization;Memory;Sequences;Sequential analysis;Tensors;Timing;Graphics processing units;High performance computing;Bioinformatics;Parallel algorithms;Performance analysis},
  doi={10.23919/ISC.2026.11520503}}