VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving

¹University of Illinois Urbana-Champaign   ²Tsinghua University

Abstract

Modern Large Language Model (LLM) serving systems increasingly support interactive applications such as real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment.

We introduce VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, designed from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained, phase-specific control. It consists of (1) a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and (2) a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints.

We implement VoltanaLLM in SGLang and evaluate its performance across multiple state-of-the-art LLMs and real-world datasets. Our results show that VoltanaLLM achieves up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.

Relevance and Early Observation

LLMs are deployed at unprecedented scale, making inference a major driver of energy consumption and total cost of ownership (TCO). Recent studies show inference can account for 90% of AI infrastructure utilization, pushing datacenter power and thermal limits. Large datacenters today can consume electricity equivalent to millions of households.

At the same time, latency-sensitive applications like chat assistants and agent pipelines rely on strict Service Level Objectives (SLOs), such as Time-To-First-Token (TTFT) and Inter-Token Latency (ITL). Violating these SLOs degrades user experience and downstream responsiveness.

The central challenge: how can we serve LLMs under tight SLOs while reducing their energy footprint?

Key Observation

Our empirical profiling of LLM inference reveals a non-monotonic energy–frequency relationship. As shown above, reducing GPU frequency from 1410 MHz to 1005 MHz (by ~28.7%) does increase execution time, but the increase is sub-linear. Consequently, total energy follows a U-shaped curve with respect to GPU frequency. This trend indicates that at low frequencies, execution time dominates energy, whereas at high frequencies, power dominates; in between lies an energy sweet spot.
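To make the U-shape concrete, the toy model below reproduces it qualitatively: power is assumed to grow roughly cubically with frequency, while only the compute-bound fraction of the work slows down as frequency drops, so energy = power × time bottoms out at an intermediate frequency. The coefficients and the resulting energy-minimizing frequency are illustrative assumptions, not our profiled numbers.

```python
# Toy model of the energy-frequency trade-off (assumed coefficients, not measured data).
# Power grows roughly cubically with frequency, while only the compute-bound fraction
# of the work slows down as frequency drops, so energy = power * time is U-shaped.

def power_watts(freq_mhz, p_static=150.0, k=7e-8):
    # Assumed power model: static power plus a dynamic term ~ f^3 (P ~ C V^2 f with V ~ f).
    return p_static + k * freq_mhz ** 3

def exec_time_s(freq_mhz, work_s=1.0, mem_bound_frac=0.5, f_ref=1410.0):
    # Assumed latency model: the memory-bound portion is frequency-insensitive,
    # while the compute-bound portion scales with 1/f relative to the reference frequency.
    compute = (1.0 - mem_bound_frac) * work_s * (f_ref / freq_mhz)
    memory = mem_bound_frac * work_s
    return compute + memory

for f in (600, 810, 1005, 1200, 1410):
    energy = power_watts(f) * exec_time_s(f)
    print(f"{f:>4} MHz  power={power_watts(f):6.1f} W  "
          f"time={exec_time_s(f):.3f} s  energy={energy:6.1f} J")
```

With these made-up coefficients the energy minimum lands in the middle of the frequency range, mirroring the U-shaped curve described above.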

Background

Numerous systems have been proposed to improve LLM serving efficiency. These include advanced batching strategies for throughput optimization, memory management techniques like PagedAttention, CPU offloading, and GPU kernel-level optimizations (e.g., FlashAttention). Parallelism frameworks and parameter-sharing mechanisms further reduce bottlenecks, while speculative decoding and preemptive scheduling improve tail latency and job completion times.

  • Batching & Memory Optimizations – PagedAttention, CPU offloading, and GPU kernel improvements.
  • Parallelism & Sharing – model parallelism, pipelining, and parameter reuse.
  • Latency Techniques – speculative decoding and preemptive scheduling for multi-tenant settings.

Several recent efforts have also begun addressing energy-efficient LLM serving. For instance, DynamoLLM explores GPU frequency control based on request characteristics, while μ-Serve optimizes power by co-serving multiple models. EcoServe considers operational and embodied carbon emissions, TAPAS exploits datacenter thermal slack, and Heron places GPUs closer to renewable sources.

To better manage compute heterogeneity, recent systems have introduced prefill/decode (P/D) disaggregation, separating the two phases across GPU nodes. Projects like SplitWise, TetriInfer, Llumnix, and DistServe show improvements in goodput, time-to-first-token, and SLO attainment. Popular inference libraries such as vLLM and SGLang have also added runtime support.

These efforts primarily optimize latency and throughput, but overlook the energy implications. P/D disaggregation creates unique opportunities for phase-specific frequency scaling and energy-aware routing—opportunities that VoltanaLLM systematically exploits.

The energy-aware efforts above highlight early signals for sustainable AI, but they operate at coarse granularity. VoltanaLLM is the first system to explore fine-grained frequency and routing control in prefill/decode disaggregated serving, with SLO-aware feedback loops.

VoltanaLLM Design

VoltanaLLM overall architecture: prefill→decode disaggregation with the EcoFreq Governor, EcoRouter, and EcoPred predictor.

VoltanaLLM runs on a Prefill/Decode (P/D) disaggregated serving stack and couples feedback-driven frequency control with state-space routing, steered by a lightweight latency predictor for SLO-aware, energy-efficient inference.

EcoFreq Governor — Feedback-based, Phase-Specific Frequency Control

A per-iteration controller that scales GPU frequency to the lowest level meeting SLOs. It runs as a separate process and communicates with the engine; decisions are made in <1 ms and applied via pyNVML (≈3 ms), keeping the full control loop under 4 ms, well below typical prefill and decode iteration SLOs.
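A minimal sketch of such a governor is shown below. FREQ_LEVELS_MHZ, the SLO values, and predict_iteration_latency are illustrative assumptions (the predictor is a stand-in for EcoPred, described later); the only real API used is pyNVML's locked-clock call.

```python
# Minimal sketch of a feedback-driven, phase-specific frequency governor.
# Frequency table, SLO values, and the predictor are illustrative assumptions;
# nvmlDeviceSetGpuLockedClocks is the real pyNVML call used to apply a decision.
import pynvml

FREQ_LEVELS_MHZ = [810, 1005, 1200, 1410]            # assumed candidate levels, low -> high

def predict_iteration_latency(phase, load, freq_mhz):
    # Stand-in predictor: latency grows with load and shrinks with frequency.
    base_ms = 30.0 if phase == "prefill" else 2.0
    return (base_ms + 0.1 * load) * (1410.0 / freq_mhz)

class EcoFreqGovernor:
    def __init__(self, gpu_index, slo_ms):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        self.slo_ms = slo_ms                          # TTFT budget (prefill) or ITL budget (decode)
        self.current = None

    def step(self, phase, load):
        # Once per iteration: pick the lowest frequency whose predicted latency meets
        # the SLO, and lock the GPU clock to it only if it changed (~3 ms NVML call).
        chosen = FREQ_LEVELS_MHZ[-1]                  # fall back to max frequency
        for f in FREQ_LEVELS_MHZ:
            if predict_iteration_latency(phase, load, f) <= self.slo_ms:
                chosen = f
                break
        if chosen != self.current:
            pynvml.nvmlDeviceSetGpuLockedClocks(self.handle, chosen, chosen)
            self.current = chosen
        return chosen
```

In VoltanaLLM the governor runs as a separate process and receives load metrics from the serving engine each iteration; the sketch inlines that interaction for brevity.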

EcoRouter — State-Space Navigation for Decode Routing

Instead of naive round-robin, the router performs "what-if" analysis in the decode state space (requests × KV tokens, with frequency contours) to avoid "batch-size boundaries" that force higher frequencies. It selects asymmetric placement of requests when helpful, preferring assignments that keep a GPU instance below a batch-size boundary and thus at a lower frequency.
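The sketch below illustrates the what-if idea under assumed names: InstanceState, min_freq_for_slo, and the toy ITL model are hypothetical, and a real implementation would query EcoPred and also account for KV-cache capacity.

```python
# Sketch of state-space ("what-if") routing across decode instances.
# InstanceState, min_freq_for_slo(), and the toy ITL model are hypothetical.
from dataclasses import dataclass

FREQ_LEVELS_MHZ = [810, 1005, 1200, 1410]

@dataclass
class InstanceState:
    num_requests: int          # current decode batch size
    kv_tokens: int             # KV-cache tokens resident on the instance

def min_freq_for_slo(state, itl_slo_ms):
    # Lowest frequency whose predicted ITL still meets the SLO (stand-in for EcoPred).
    for f in FREQ_LEVELS_MHZ:
        predicted_itl_ms = (2.0 + 0.1 * state.num_requests) * (1410.0 / f)  # assumed model
        if predicted_itl_ms <= itl_slo_ms:
            return f
    return FREQ_LEVELS_MHZ[-1]

def route(instances, new_kv_tokens, itl_slo_ms=10.0):
    # What-if analysis: tentatively admit the request on each instance and pick the
    # placement that keeps the required frequency (then the batch size) lowest,
    # i.e. the one that stays below a batch-size boundary when possible.
    def cost(state):
        trial = InstanceState(state.num_requests + 1, state.kv_tokens + new_kv_tokens)
        return (min_freq_for_slo(trial, itl_slo_ms), trial.num_requests)
    return min(range(len(instances)), key=lambda i: cost(instances[i]))

# Example: a heavily loaded and a lightly loaded decode instance; the router picks index 1.
print(route([InstanceState(64, 40000), InstanceState(16, 9000)], new_kv_tokens=512))
```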

EcoPred — Load-Aware Latency Predictor

A lightweight model that predicts TTFT/ITL from frequency and load metrics: prefill uses batched tokens; decode uses (#requests, KV tokens). These models capture the compute vs. memory regimes and the staircase effects at batch boundaries, enabling millisecond-scale, transparent decisions without heavy online profiling.
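The sketch below shows one way such a predictor could be realized: a per-(phase, frequency) linear model fit offline with least squares from profiled samples. The feature choice follows the text; the linear form and the numbers are assumptions, and a faithful version would use piecewise fits to capture the staircase behavior at batch-size boundaries.

```python
# Sketch of a load-aware latency predictor: per-(phase, frequency) linear models fit
# offline with least squares. Features follow the text (prefill: batched tokens;
# decode: #requests and KV tokens); the linear form and all numbers are illustrative.
import numpy as np

class EcoPredModel:
    def __init__(self):
        self.coef = {}                               # (phase, freq_mhz) -> fitted weights

    def fit(self, phase, freq_mhz, features, latencies_ms):
        # features: (N, d) load metrics; latencies_ms: (N,) measured latencies.
        X = np.hstack([np.asarray(features, float), np.ones((len(features), 1))])  # bias column
        self.coef[(phase, freq_mhz)], *_ = np.linalg.lstsq(
            X, np.asarray(latencies_ms, float), rcond=None)

    def predict(self, phase, freq_mhz, feature):
        x = np.append(np.asarray(feature, float), 1.0)
        return float(x @ self.coef[(phase, freq_mhz)])

# Example: fit a decode ITL model at 1005 MHz from a few (num_requests, kv_tokens)
# samples, then predict ITL for an unseen load point.
model = EcoPredModel()
model.fit("decode", 1005, [[8, 4000], [16, 9000], [32, 20000]], [6.1, 7.4, 10.2])
print(model.predict("decode", 1005, [24, 15000]))
```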

Evaluation Results

Main Results for VoltanaLLM

We evaluate VoltanaLLM on three models (Ministral-3B, LLaMA-3.1-8B, Qwen3-32B) and two real-world datasets under a 2P2D serving setup on NVIDIA A100 GPUs. Our results show that VoltanaLLM achieves energy savings without sacrificing latency SLO attainment:

📉 Energy Savings: Reduces GPU energy consumption by up to 36.3% while preserving per-request SLOs.
⚡ Latency SLOs: Maintains TTFT/ITL SLO attainment comparable to serving at the maximum GPU frequency (1410 MHz in our setup).
🔄 Adaptivity: Operates at low frequency when load permits, scaling up dynamically at higher request rates to sustain SLOs in near real-time with negligible overhead.

✅ Key Takeaway: VoltanaLLM enables SLO-aware, energy-efficient LLM inference, improving sustainability without compromising user experience.

BibTeX


  @misc{yu2025voltanallmfeedbackdrivenfrequencycontrol,
      title={VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving}, 
      author={Jiahuan Yu and Aryan Taneja and Junfeng Lin and Minjia Zhang},
      year={2025},
      eprint={2509.04827},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2509.04827}, 
}