SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

University of Illinois Urbana-Champaign

News

2026-01-26: SuperInfer has been accepted at MLSys 2026! 🎉

Abstract

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs.

To address these issues, we present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) that tightly couple the GPU and CPU via NVLink-C2C. SuperInfer introduces (1) RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and (2) DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C.

Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.

Background

The Memory Wall: During autoregressive generation, each request maintains a growing KV cache that quickly exhausts GPU memory under high loads, leading to SLO violations.
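To make the memory pressure concrete, the following back-of-the-envelope sketch estimates the KV cache footprint per token and per request. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumption based on the public LLaMA-3-8B configuration, and the helper names are hypothetical; none of this comes from the SuperInfer code.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (keys + values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed LLaMA-3-8B shape: 32 layers, 8 KV heads (GQA), head dim 128, FP16.
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)


def kv_bytes_for_request(num_tokens: int) -> int:
    """KV cache bytes held by a request with `num_tokens` tokens of context."""
    return num_tokens * PER_TOKEN


print(PER_TOKEN // 1024, "KiB per token")
print(kv_bytes_for_request(8192) / 2**30, "GiB for an 8192-token request")
```

Under these assumptions each token costs 128 KiB, so a single 8192-token request already holds 1 GiB of KV cache, and a few hundred concurrent long requests would exceed the HBM of one GPU.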

The Interconnect Bottleneck: Existing KV offloading systems are crippled by slow PCIe bandwidth (~32-64 GB/s), causing severe head-of-line (HOL) blocking and SLO violations.

  • Increasing swap bandwidth beyond the PCIe Gen5 x16 uni-directional limit significantly reduces both TTFT and TBT.
  • PCIe's low swap bandwidth creates two major obstacles to reducing tail latencies: request backlogging and HOL blocking.

Superchip Opportunity: Emerging tightly coupled CPU-GPU Superchips provide high-speed CPU-GPU interconnects that break the PCIe bottleneck. For example, the NVIDIA GH200 integrates a Hopper GPU and a Grace CPU via NVLink-C2C with 900 GB/s of interconnect bandwidth.
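As a rough illustration of what the bandwidth gap means for swapping, the sketch below computes the ideal time to move one request's KV cache over each link, using the peak figures quoted in this section. The 1 GiB payload is a hypothetical example, and real transfers will be slower than these peak-rate estimates.

```python
GB = 1e9  # link bandwidths are quoted in decimal gigabytes per second


def transfer_ms(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Ideal transfer time in milliseconds at the given peak bandwidth."""
    return num_bytes / (bandwidth_gb_s * GB) * 1e3


KV_BYTES = 2**30  # hypothetical 1 GiB KV cache for one long request

LINKS = {
    "PCIe (32 GB/s)": 32,
    "PCIe (64 GB/s)": 64,
    "NVLink-C2C (900 GB/s)": 900,
}

for name, bandwidth in LINKS.items():
    print(f"{name}: {transfer_ms(KV_BYTES, bandwidth):.2f} ms")
```

At peak rates the same swap that takes tens of milliseconds over PCIe takes roughly one millisecond over NVLink-C2C, which is what makes frequent, proactive rotation plausible in the first place.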


Software Bottlenecks: Existing serving stacks fall short on two fronts:

SLO-unaware

React to memory pressure, not latency urgency. Static Waiting-First / Swapped-First policies favor one SLO (TTFT or TBT) at the expense of the other.

Under-utilized C2C

Exploit < 5% of NVLink-C2C bandwidth: PagedAttention fragments the KV cache into tiny pieces.
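One way to see why tiny pieces waste a fast link is a simple latency model in which every copy pays a fixed launch overhead before any bytes move. The sketch below uses that model; the 5 µs overhead is a hypothetical value chosen for illustration, not a measurement, and the 32 KiB chunk assumes one layer's keys for a 16-token block with 8 KV heads of dimension 128 in FP16.

```python
def effective_bandwidth_gb_s(chunk_bytes: int, peak_gb_s: float,
                             per_copy_overhead_us: float) -> float:
    """Achieved bandwidth when each copy pays a fixed overhead plus wire time."""
    wire_us = chunk_bytes / (peak_gb_s * 1e9) * 1e6
    total_s = (per_copy_overhead_us + wire_us) * 1e-6
    return chunk_bytes / total_s / 1e9


PEAK_GB_S = 900.0   # NVLink-C2C figure quoted in the text
OVERHEAD_US = 5.0   # hypothetical fixed cost per copy (assumption)

for label, chunk in (("32 KiB", 32 * 1024),
                     ("1 MiB", 1024 * 1024),
                     ("64 MiB", 64 * 1024 * 1024)):
    bw = effective_bandwidth_gb_s(chunk, PEAK_GB_S, OVERHEAD_US)
    print(f"{label:>7}: {bw:8.1f} GB/s ({bw / PEAK_GB_S:.1%} of peak)")
```

Under this model small per-layer pieces are dominated by the fixed overhead, while large contiguous copies approach the peak rate, which is the motivation for coalescing transfers.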


SuperInfer Design

Overview

RotaSched: Proactive Rotation

Rotates requests between the running state (HBM) and a novel transient rotary state (DRAM) according to latency urgency, akin to OS-style time-slicing for LLM serving.

  • Virtual Lag Time (VLT): Per-request metric of deviation from TTFT/TBT SLOs, defined for running / rotary / waiting requests; it is the scheduling currency that drives RotaSched.
  • Largest-VLT-First (LVF): Prioritizes requests with the largest positive VLT (those most vulnerable to SLO violation) while preempting long-running requests with smaller VLT to the rotary state to free HBM.
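The Largest-VLT-First idea can be sketched as a small toy scheduler. The VLT formula used here (time since the request last made progress, minus its TTFT or TBT target) is an assumption for illustration only, since this page does not give the exact definition; the `Request` fields, `vlt`, and `lvf_schedule` are hypothetical names, not SuperInfer's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    state: str            # "waiting", "rotary", or "running"
    last_progress: float  # arrival time if waiting, else time of last token
    blocks: int           # HBM blocks the request needs in order to run


def vlt(req: Request, now: float, ttft_slo: float, tbt_slo: float) -> float:
    """Assumed lag metric: positive means the request is already past its SLO."""
    target = ttft_slo if req.state == "waiting" else tbt_slo
    return (now - req.last_progress) - target


def lvf_schedule(reqs, now, hbm_blocks, ttft_slo=1.0, tbt_slo=0.1):
    """Admit requests in descending VLT order until the HBM budget is spent."""
    ranked = sorted(reqs, key=lambda r: vlt(r, now, ttft_slo, tbt_slo),
                    reverse=True)
    free = hbm_blocks
    running, rotary, waiting = [], [], []
    for r in ranked:
        if r.blocks <= free:
            free -= r.blocks
            running.append(r.rid)
        elif r.state == "waiting":
            waiting.append(r.rid)   # never started, so it holds no KV cache yet
        else:
            rotary.append(r.rid)    # KV cache lives in (or rotates out to) DRAM
    return running, rotary, waiting


reqs = [
    Request("a", "running", last_progress=9.95, blocks=4),  # comfortably on time
    Request("b", "rotary", last_progress=9.7, blocks=4),    # behind its TBT target
    Request("c", "waiting", last_progress=8.5, blocks=4),   # behind its TTFT target
]
print(lvf_schedule(reqs, now=10.0, hbm_blocks=8))
```

With room for only two requests in HBM, the late waiting request and the late rotary request are admitted, and the on-time running request is rotated out. A real scheduler would also have to weigh rotation cost and avoid thrashing, which this sketch ignores.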

DuplexKV: Full-Duplex Engine

  • Eager Block Rotation: The KV cache fills its blocks incrementally; fully written blocks are eagerly copied to DRAM in the background and marked "synced". On preemption, synced HBM blocks are simply discarded, which breaks the swap-in/swap-out data race and enables full-duplex transfers.
  • Block-first Layout + Batched DMA: Targets the fragmentation problem noted above, where PagedAttention splits the KV cache into tiny pieces and leaves NVLink-C2C bandwidth under-utilized.
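The bookkeeping behind eager block rotation can be illustrated with a small simulation that counts blocks instead of moving real memory. The block size of 16 tokens and the `RequestKV` class are assumptions made for this sketch; it models only the "synced" accounting, not the actual DMA engine.

```python
from dataclasses import dataclass

BLOCK_TOKENS = 16  # tokens per KV block (assumed for illustration)


@dataclass
class RequestKV:
    tokens: int = 0          # tokens written to the KV cache so far
    synced_blocks: int = 0   # full blocks already mirrored to DRAM

    def num_blocks(self) -> int:
        """Blocks currently allocated in HBM, including a partial last block."""
        return -(-self.tokens // BLOCK_TOKENS)  # ceiling division

    def append_token(self) -> None:
        self.tokens += 1

    def background_sync(self) -> int:
        """Mirror every fully written, not-yet-synced block; return how many."""
        newly_synced = self.tokens // BLOCK_TOKENS - self.synced_blocks
        self.synced_blocks += newly_synced
        return newly_synced

    def preempt(self) -> tuple[int, int]:
        """Return (blocks discarded without a copy, blocks still needing swap-out)."""
        discard = self.synced_blocks
        return discard, self.num_blocks() - discard


kv = RequestKV()
for _ in range(50):
    kv.append_token()
kv.background_sync()
print(kv.preempt())
```

In this example a 50-token request occupies four blocks, three of which are already mirrored, so preemption only has to copy the one partially filled block. Because the outbound traffic at preemption time shrinks to the unsynced tail, the link is left free for swap-ins in the opposite direction.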

Evaluation Results

SuperInfer Main Results

The system was evaluated on a GH200 Superchip (144 GB HBM, 480 GB DRAM) using models such as LLaMA-3-8B, Qwen2.5-32B, and Mixtral-8x7B against baselines such as vLLM, TensorRT-LLM, LightLLM, LTR, and NEO.

Versatility: SuperInfer demonstrates significant performance improvements on both Dense and MoE models.
SLO Attainment: Up to 74.7% higher TTFT SLO attainment with comparable TBT SLO attainment vs. vLLM at high request rates.
Integration: SuperInfer is built on top of vLLM, a widely used open-source inference engine.
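For readers unfamiliar with the metric, SLO attainment is commonly computed as the fraction of requests whose latency stays within the target. The sketch below shows that calculation for TTFT; the sample latencies and the 1-second target are made-up values, and the exact definition used in the evaluation may differ.

```python
def slo_attainment(latencies_s: list[float], slo_s: float) -> float:
    """Fraction of requests whose latency is within the SLO target."""
    if not latencies_s:
        raise ValueError("need at least one request")
    met = sum(1 for latency in latencies_s if latency <= slo_s)
    return met / len(latencies_s)


# Hypothetical per-request TTFT samples in seconds.
ttfts = [0.2, 0.8, 1.5, 3.0]
print(f"TTFT SLO attainment: {slo_attainment(ttfts, slo_s=1.0):.0%}")
```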

BibTeX

@misc{yu2026superinfer,
title={SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips},
author={Jiahuan Yu and Mingtao Hu and Zichao Lin and Minjia Zhang},
year={2026},
eprint={2601.20309},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.20309},
}