VecFlow: A High-Performance Vector Data Management System for Filtered-Search on GPUs

¹University of Illinois Urbana-Champaign ²NVIDIA ³Microsoft
*Both authors contributed equally to this research. Work done while interning at UIUC.

News

  • 🎉 2025-6-2 The arXiv preprint is available on arXiv.
  • 🎉 2025-5-31 VecFlow is open-sourced at GitHub.
  • 🎉 2025-5-23 VecFlow is accepted to SIGMOD 2026.

Abstract

Vector search and database systems have become a keystone component in many AI applications. While much prior research has investigated how to accelerate generic vector search, emerging AI applications require running more sophisticated vector queries efficiently, such as vector search with attribute filters. Unfortunately, recent filtered-ANNS solutions are primarily designed for CPUs, with little exploration of designs that take advantage of the massive parallelism offered by GPUs, and limited performance from those that do. In this paper, we present VecFlow, a novel high-performance vector data management system that achieves unprecedentedly high throughput and recall while maintaining low latency for filtered-ANNS on GPUs. We propose a novel label-centric indexing and search algorithm that significantly improves the selectivity of ANNS with filters. Beyond algorithm-level optimization, we provide architecture-aware optimizations for VecFlow's functional modules, effectively supporting both small-batch and large-batch queries, as well as single-label and multi-label query processing. Experimental results on an NVIDIA A100 GPU over several publicly available datasets validate that VecFlow achieves 5 million QPS at 90% recall, outperforming state-of-the-art CPU-based solutions such as Filtered-DiskANN by up to 135×. Moreover, VecFlow readily extends to the high-recall (99%) regime, where strong GPU-based baselines plateau at around 80% recall.

Method

  • VecFlow introduces the concept of "label specificity" - the number of data points associated with a particular label. Using a configurable specificity threshold T, it builds a dual-structured index: an IVF-CAGRA index for labels that appear frequently (high specificity, ≥ T points), and an IVF-BFS index with interleaved vector storage for labels with few associated points (low specificity, < T points). This dual-index approach optimizes GPU memory access patterns and achieves high performance across varying label distributions.
  • System overview.
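The specificity-based split above can be sketched in a few lines. This is a minimal Python illustration, not VecFlow's implementation; the function name and input format (a per-point list of labels) are assumptions for clarity.

```python
from collections import defaultdict

def partition_by_specificity(labels_per_point, T):
    """Build per-label posting lists, then split labels into a
    high-specificity set (>= T points, routed to the graph index)
    and a low-specificity set (< T points, routed to brute-force scan)."""
    posting_lists = defaultdict(list)
    for point_id, labels in enumerate(labels_per_point):
        for label in labels:
            posting_lists[label].append(point_id)
    high = {l: pts for l, pts in posting_lists.items() if len(pts) >= T}
    low = {l: pts for l, pts in posting_lists.items() if len(pts) < T}
    return high, low
```

Note that a point carrying both a frequent and a rare label contributes to posting lists on both sides of the split, which is what motivates the redundancy-bypassing layout described next.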
  • For high-specificity data, VecFlow employs a redundancy-bypassing IVF-Graph approach. When building separate graph indices for each label's posting list, memory consumption can increase dramatically (10.8× for the YFCC dataset) due to data points with multiple labels being replicated across different inverted lists. To solve this problem, VecFlow maintains a single global embedding table shared across all label-specific graphs rather than duplicating vector embeddings. For each posting list, it builds a local virtual graph that uses local vertex IDs and a local-global mapping table to reference vectors in the global embedding table. All virtual graphs are compacted into a single continuous memory space organized by label IDs, making the structure compatible with existing graph-based algorithms while enabling efficient filtered search.
  • Redundancy Bypassing layout.
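The redundancy-bypassing idea can be illustrated as follows: graphs store only local neighbor IDs plus a local-to-global mapping, so vector embeddings are never duplicated. This is a hedged Python sketch; `build_knn` is a hypothetical stand-in for the per-label graph builder, and the flat adjacency list stands in for the compacted, label-ordered GPU memory layout.

```python
def build_virtual_graphs(posting_lists, build_knn):
    """For each label, build a 'virtual' graph over local IDs 0..n-1,
    record a local->global table into the shared embedding table, and
    compact all graphs into one contiguous array keyed by offsets."""
    flat_graph, offsets, id_maps = [], {}, {}
    for label, global_ids in posting_lists.items():
        id_maps[label] = list(global_ids)        # local ID -> global ID
        offsets[label] = len(flat_graph)         # start of this label's graph
        flat_graph.extend(build_knn(len(global_ids)))
    return flat_graph, offsets, id_maps
```

During search, traversal runs entirely on local IDs; only the final distance computations dereference the mapping table into the single global embedding table, so a point shared by many labels is stored once.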
  • For low-specificity data, VecFlow implements an interleaved scan-based IVF-BFS method that maximizes GPU memory bandwidth utilization. Vectors are organized in groups of 32 (matching GPU warp size) with components interleaved to enable both coalesced memory access and vectorized loading with 128-bit instructions. During search, each CUDA block processes one query for its corresponding label-specific posting list, with the query vector loaded into shared memory. Each warp handles 32 interleaved vectors, with each thread computing one distance. The approach also fuses an optimized block-select-k operation for efficient nearest neighbor identification.
  • Interleaved memory layout.
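The interleaved layout can be sketched with a small reordering function. This is an illustrative Python model, not the CUDA code: within each group of 32 vectors (one warp), component j of all 32 vectors is laid out contiguously, so thread i reads vector i's components at a fixed stride and the warp's loads coalesce.

```python
def interleave(vectors, group=32):
    """Flatten `vectors` so that, within each group of `group` vectors,
    storage is component-major: all first components, then all second
    components, and so on (matching one-thread-per-vector access)."""
    dim = len(vectors[0])
    out = []
    for g in range(0, len(vectors), group):
        chunk = vectors[g:g + group]
        for j in range(dim):          # component-major within the group
            for vec in chunk:
                out.append(vec[j])
    return out
```

On the GPU, consecutive threads then touch consecutive addresses at every step of the distance loop, and four adjacent components can be fetched with a single 128-bit load.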
  • The persistent kernel design in VecFlow addresses the challenge of efficiently processing streaming small batch queries on GPUs, which is common in real-world scenarios. Unlike traditional approaches that suffer from launch overhead, VecFlow implements a continuously running kernel that remains active on the GPU. This design uses atomic ring buffers to manage job and worker queues, allowing queries to be processed immediately as they arrive in the stream. Each thread block handles individual queries independently, maximizing GPU utilization even with small batch sizes. This approach eliminates kernel launch overhead and avoids costly synchronization, resulting in significantly higher throughput and lower latency for practical vector search applications.
  • Persistent kernel design.
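The job-queue mechanics behind the persistent kernel can be modeled with a simple ring buffer. The sketch below is plain Python for readability; in the actual GPU setting the head/tail counters would be advanced with atomic operations by the host (producer) and by thread blocks inside the continuously running kernel (consumers). The class and method names are illustrative.

```python
class RingBuffer:
    """Fixed-capacity FIFO modeling the job queue of a persistent kernel:
    queries are pushed as they arrive and popped by idle workers."""
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to fill (producer side)
        self.tail = 0   # next slot to drain (worker side)

    def push(self, job):
        if self.head - self.tail == self.capacity:
            return False                      # full: producer retries
        self.slots[self.head % self.capacity] = job
        self.head += 1
        return True

    def pop(self):
        if self.tail == self.head:
            return None                       # empty: worker keeps polling
        job = self.slots[self.tail % self.capacity]
        self.tail += 1
        return job
```

Because workers poll the queue instead of being launched per batch, an arriving query is picked up by an idle thread block immediately, which is where the latency and throughput gains for small streaming batches come from.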

Evaluation

We evaluate VecFlow on several public datasets, including semi-synthetic SIFT-1M and DEEP-50M with Zipf-distributed labels, the real-world YFCC-10M, and WIKI-ANN for multi-label AND queries. VecFlow achieves million-scale QPS at 90% recall, one to two orders of magnitude higher than both CPU and GPU baselines. For multi-label search, VecFlow delivers 150K QPS at 90% recall where competing methods fail to achieve meaningful recall. For small-batch queries, VecFlow's persistent kernel achieves up to 7.08× higher throughput and 1.82× lower latency. Our IVF-BFS approach delivers 26M QPS for low-specificity labels in the YFCC dataset, thousands of times faster than baseline methods. With carefully designed redundancy-bypassing optimization, VecFlow maintains a memory footprint comparable to single-index methods while substantially outperforming them.

BibTeX

@article{vecflow2025,
  author    = {Xi, Jingyi and Mo, Chenghao and Karsin, Ben and Chirkin, Artem and Li, Mingqin and Zhang, Minjia},
  title     = {VecFlow: A High-Performance Vector Data Management System for Filtered-Search on GPUs},
  journal   = {arXiv preprint arXiv:2506.00812},
  year      = {2025},
}