Retrieval at Scale | Drop for 2025-09-13

TL;DR

Five noteworthy changes since your last drop:
- Faiss v1.12.0 lands sizable ANN infra wins (RaBitQ SIMD, Binary CAGRA, FP16 CAGRA, cuVS tooling).
- Elastic makes filtered HNSW (ACORN‑1) mainstream and switches high‑dim vectors to Better Binary Quantization by default in Elasticsearch 9.1.
- Mainstream engines add filtered‑ANN knobs and late‑interaction support (Weaviate ACORN filtering + multi‑vector/ColBERT; Vespa ACORN‑1 and adaptive beam‑search with tunable thresholds).
- LSM‑VEC proposes an LSM‑tree–backed, disk‑based dynamic vector index with better update/query trade‑offs at billion scale.
- DF‑FLOPS regularization targets production latency of learned‑sparse retrieval (SPLADE‑Doc): by shortening hot posting lists it cuts latency ~10× in a production‑grade engine while largely preserving quality.

Faiss v1.12.0: performance and GPU/quantization features that matter at scale

Key facts and current state of the topic
- Faiss remains the de‑facto ANN workhorse for CPU/GPU with IVF/HNSW/graph, PQ/BQ, and growing support for GPU graph (CAGRA) and quantization variants. (github.com)
Important context and background information
- Prior versions added RaBitQ and broader CAGRA support; many teams still struggle with filtered search and mixed‑precision throughput on GPUs. (github.com)
Recent developments or changes
- v1.12.0 (Aug 12, 2025) adds SIMD optimization for RaBitQ; Binary CAGRA and NN‑Descent; FP16 CAGRA and IndexIDMap support; cuVS interop (examples + filter conversion); and exposure of Binary IVF to the C API, among others. These target higher recall at lower latency and memory on both GPU and CPU. Wheels were published mid‑Aug (CPU/GPU). Consider A/B tests of RaBitQ against prior PQ setups, and of FP16 CAGRA, for ads‑scale recall and latency. (github.com)
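
For the A/B suggestion above, a minimal harness sketch may help frame what to measure. It is pure Python and deliberately avoids any Faiss calls: `search_fn` is a hypothetical stand-in for whatever index you wrap, and `half_scan_search` is an invented lossy searcher used only to show recall dropping below 1.0.

```python
import random
import time


def brute_force_topk(query, vectors, k):
    """Exact top-k by squared L2 distance; serves as ground truth."""
    scored = [(sum((q - x) ** 2 for q, x in zip(query, v)), i)
              for i, v in enumerate(vectors)]
    scored.sort()
    return [i for _, i in scored[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def evaluate(search_fn, queries, vectors, k):
    """Return (mean recall@k, mean latency in seconds) for one candidate."""
    recalls, latencies = [], []
    for q in queries:
        start = time.perf_counter()
        approx = search_fn(q, k)
        latencies.append(time.perf_counter() - start)
        recalls.append(recall_at_k(approx, brute_force_topk(q, vectors, k)))
    return sum(recalls) / len(recalls), sum(latencies) / len(latencies)


# Toy corpus and queries (seeded so runs are repeatable).
rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]


def exact_search(q, k):
    return brute_force_topk(q, vectors, k)


def half_scan_search(q, k):
    # Hypothetical lossy index: only looks at the first half of the corpus.
    return brute_force_topk(q, vectors[:100], k)


recall_exact, latency_exact = evaluate(exact_search, queries, vectors, k=10)
recall_half, latency_half = evaluate(half_scan_search, queries, vectors, k=10)
```

In a real comparison you would swap the two toy searchers for wrappers around the index variants under test and report tail latency (p95/p99) alongside the mean.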

Filtered ANN and vector compression go mainstream in Elasticsearch

Key facts and current state of the topic
- Filtered kNN historically degraded HNSW performance with selective filters; ACORN‑1 integrates filters in graph traversal to recover speed. Elastic also pushes binary quantization for high‑dim vectors. (elastic.co)
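
To make "filters in graph traversal" concrete, here is a minimal single-layer sketch loosely inspired by the ACORN‑1 idea of looking past filtered‑out neighbors. It is not Lucene's or Elasticsearch's implementation; the function name, the toy line graph, and the `ef` default are illustrative assumptions.

```python
import heapq


def filtered_graph_search(graph, vectors, query, entry, passes_filter, k, ef=16):
    """Best-first search that only scores nodes passing the filter.

    When a neighbor fails the filter, its own neighbors are considered
    instead (a two-hop expansion), so the walk can cross filtered-out nodes.
    """
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap of nodes still to expand
    results = []                         # max-heap (negated distance) of passing nodes
    if passes_filter(entry):
        heapq.heappush(results, (-dist(entry), entry))

    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest unexpanded node is worse than the worst result.
        if len(results) >= ef and d > -results[0][0]:
            break
        expanded = []
        for nb in graph[node]:
            if passes_filter(nb):
                expanded.append(nb)
            else:
                # Two-hop expansion through a filtered-out neighbor.
                expanded.extend(n2 for n2 in graph[nb] if passes_filter(n2))
        for nb in expanded:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)

    ordered = sorted((-neg_d, n) for neg_d, n in results)
    return [n for _, n in ordered[:k]]


# Toy example: six points on a line, each linked to its immediate neighbors.
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}


def is_even(node_id):
    return node_id % 2 == 0


top2 = filtered_graph_search(graph, vectors, [4.2], entry=0,
                             passes_filter=is_even, k=2)
```

In this toy graph every direct neighbor of a matching node fails the filter, so a walk restricted to matching one‑hop neighbors would never leave node 0; the two‑hop step is what lets the search reach nodes 2 and 4.
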
Important context and background information
- Production search often requires metadata filters; quantization can lift throughput by letting you probe more candidates under a fixed budget. (elastic.co)
Recent developments or changes
- On July 30, 2025 Elastic announced ACORN‑1‑style filtered HNSW speedups (up to ~5×) and made Better Binary Quantization (BBQ) the default for ≥384‑dim vectors in Elasticsearch 9.1, reporting lower latency at equal or better top‑10 ranking on BEIR while compressing ~32×. If you run Lucene/Elasticsearch‑based stacks (or depend on Lucene in OpenSearch pipelines), expect material filtered‑ANN wins and more cost‑efficient recall. (ir.elastic.co)
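
The ~32× figure follows from storing one bit per dimension instead of a 32‑bit float. The sketch below shows that arithmetic plus plain sign‑based binarization with an oversample‑then‑rerank step. It is a simplification under stated assumptions: Elastic's BBQ involves additional steps that are not modeled here, and all function names are illustrative.

```python
def binarize(vector):
    """Pack one sign bit per dimension into an int (bit i set if vector[i] > 0)."""
    bits = 0
    for i, x in enumerate(vector):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def compression_ratio(dims, bytes_per_float=4):
    """Raw float storage divided by 1-bit-per-dimension storage."""
    float_bytes = dims * bytes_per_float
    binary_bytes = (dims + 7) // 8
    return float_bytes / binary_bytes


def search(query, vectors, codes, k, oversample=3):
    """Shortlist by Hamming distance on the codes, then rerank with exact L2."""
    q_code = binarize(query)
    shortlist = sorted(range(len(codes)),
                       key=lambda i: hamming(q_code, codes[i]))[: k * oversample]

    def exact(i):
        return sum((a - b) ** 2 for a, b in zip(query, vectors[i]))

    return sorted(shortlist, key=exact)[:k]


# Toy data: the cheap binary pass keeps two candidates, exact rerank picks one.
vectors = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [0.9, 1.1]]
codes = [binarize(v) for v in vectors]
best = search([1.0, 1.0], vectors, codes, k=1, oversample=2)
ratio_384 = compression_ratio(384)  # 1536 bytes of float32 vs 48 bytes of bits
```

The `oversample` knob is the "probe more candidates under a fixed budget" trade: the binary pass is cheap, so a wider shortlist can be afforded before the exact rerank.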

Engines adopt filtered ANN and late interaction: Weaviate + Vespa updates

Key facts and current state of the topic
- Vendors are rolling out filtered‑ANN strategies and multi‑vector (late‑interaction) support so you can combine structured filters with token‑level matching at production QPS. (weaviate.io)
Important context and background information
- ACORN‑inspired filtered HNSW stabilizes latency under selective filters; multi‑vector support enables ColBERT‑style MaxSim inside databases. Tunables matter to balance recall vs. cost. (weaviate.io)
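
For readers new to late interaction, MaxSim scores a document by summing, over query tokens, the best similarity against any document token. A minimal pure‑Python sketch follows, with a structured filter applied before scoring; the `rank` helper and the toy vectors are illustrative assumptions, not any engine's API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any doc token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank(query_tokens, docs, allowed=None):
    """Rank doc ids by MaxSim, optionally keeping only ids accepted by `allowed`.

    docs maps a doc id to its list of token vectors.
    """
    ids = [doc_id for doc_id in docs if allowed is None or allowed(doc_id)]
    return sorted(ids, key=lambda doc_id: maxsim(query_tokens, docs[doc_id]),
                  reverse=True)


# Toy example: two query tokens, three documents with 1-2 token vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens exactly
    "b": [[1.0, 0.0]],              # matches only the first query token
    "c": [[0.6, 0.6]],              # partial match on both
}

ranking = rank(query, docs)
filtered_ranking = rank(query, docs, allowed=lambda doc_id: doc_id != "a")
```

Cost grows with (query tokens × document tokens) per candidate, which is why multi‑vector compression and tight candidate sets matter at production QPS.
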
Recent developments or changes
- Weaviate 1.27 introduced an ACORN‑inspired filtered search strategy; 1.29 added multi‑vector/ColBERT support and BlockMax‑WAND BM25 speedups, and docs now cover multi‑vector configuration and compression. Vespa (Sep 4, 2025) added ACORN‑1 and adaptive beam‑search, plus query/rank‑profile parameters (filterFirstThreshold, filterFirstExploration, explorationSlack) with a tuning guide. These make filtered vector + late‑interaction more deployable and tunable out‑of‑the‑box. (weaviate.io)

LSM‑VEC: dynamic, disk‑based vector search with LSM‑tree storage

Key facts and current state of the topic
- Disk‑based ANN (e.g., DiskANN) lowers RAM use but is typically batch‑built and costly to keep current under updates; dynamic workloads pay a recall/latency tax under frequent inserts/deletes. (techcommunity.microsoft.com)
Important context and background information
- SPFresh‑style clustering helps but can lose recall; production ads/search systems need sustained updates with predictable tail latencies. (arxiv.org)
Recent developments or changes
- LSM‑VEC (May 22, 2025) shards a proximity graph across LSM levels to support out‑of‑place updates, adds sampling‑based probabilistic search and connectivity‑aware reordering, and reports higher recall with lower query and update latency and >66% lower memory vs. disk‑based baselines at billion scale. Worth tracking for fresh‑data, on‑disk indexing scenarios. (arxiv.org)
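
"Out‑of‑place updates" here means new versions of a node's neighbor list are appended to a fresh in‑memory table and merged later, rather than rewriting the on‑disk graph in place. The sketch below shows only those generic LSM mechanics (memtable, immutable levels, tombstones, compaction) applied to adjacency lists. It is not LSM‑VEC's actual layout or search algorithm, and the class and its parameters are hypothetical.

```python
TOMBSTONE = None  # marker meaning "this node was deleted"


class LsmAdjacencyStore:
    """Toy LSM-style store mapping node id -> tuple of neighbor ids."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}            # mutable, newest writes
        self.levels = []              # flushed tables, newest first
        self.memtable_limit = memtable_limit

    def put(self, node, neighbors):
        # Out-of-place: write a new version; older versions stay untouched.
        self.memtable[node] = tuple(neighbors)
        self._maybe_flush()

    def delete(self, node):
        self.memtable[node] = TOMBSTONE
        self._maybe_flush()

    def _maybe_flush(self):
        if len(self.memtable) >= self.memtable_limit:
            self.levels.insert(0, dict(self.memtable))
            self.memtable = {}

    def get(self, node):
        # Newest table wins; returns None for deleted or unknown nodes.
        for table in [self.memtable, *self.levels]:
            if node in table:
                return table[node]
        return None

    def compact(self):
        # Merge all flushed levels, oldest first so newer versions overwrite,
        # then drop tombstones since no older level remains to shadow.
        merged = {}
        for table in reversed(self.levels):
            merged.update(table)
        self.levels = [{n: nb for n, nb in merged.items() if nb is not TOMBSTONE}]


store = LsmAdjacencyStore(memtable_limit=2)
store.put(1, [2, 3])
store.put(2, [1])      # memtable reaches its limit and is flushed
store.put(1, [3])      # newer version of node 1 shadows the flushed one
store.delete(2)        # tombstone written, second flush
levels_before = len(store.levels)
store.compact()
levels_after = len(store.levels)
```

Reads pay for the extra lookups across levels, which is the usual LSM trade: cheaper writes in exchange for read amplification that compaction keeps in check.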

DF‑FLOPS: productionizing SPLADE‑Doc by penalizing high‑DF terms

Key facts and current state of the topic
- Learned‑sparse retrieval (e.g., SPLADE) leverages inverted indexes but query latency can spike when models emit very high‑DF terms that blow up posting‑list work in engines like Solr/Lucene. (arxiv.org)
Important context and background information
- Prior FLOPS regularization targets within‑vector sparsity but not how many documents each term fires in (its document frequency); inference‑time stopword removal is a blunt fix that harms relevance. (arxiv.org)
Recent developments or changes
- DF‑FLOPS (May 21, 2025) trains with a document‑frequency‑aware penalty, shortening hot posting lists. Authors report ~10× latency reduction in a production‑grade engine with MRR@10 largely maintained in‑domain and improved cross‑domain on 12/13 tasks, which brings LSR latencies closer to BM25. For tight tail‑latency budgets, DF‑FLOPS is a practical tweak to evaluate. (arxiv.org)
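
To illustrate why a document‑frequency‑aware penalty behaves differently, the sketch below computes the standard FLOPS regularizer (sum over terms of the squared mean activation in a batch) next to a DF‑weighted variant. The weighting in `df_weighted_penalty` is an assumption made up for illustration, not the formula from the DF‑FLOPS paper; consult the paper for the actual loss.

```python
def flops_penalty(batch):
    """Standard FLOPS regularizer: sum over terms of (mean activation)^2.

    batch is a list of document vectors with non-negative term weights.
    """
    n = len(batch)
    vocab = len(batch[0])
    return sum((sum(doc[j] for doc in batch) / n) ** 2 for j in range(vocab))


def df_weighted_penalty(batch):
    """Illustrative DF-aware variant (assumption, not the paper's formula).

    Each term's squared mean activation is scaled by the fraction of batch
    documents in which the term is active, so widely firing terms cost more.
    """
    n = len(batch)
    vocab = len(batch[0])
    total = 0.0
    for j in range(vocab):
        mean_act = sum(doc[j] for doc in batch) / n
        df_frac = sum(1 for doc in batch if doc[j] > 0) / n
        total += df_frac * mean_act ** 2
    return total


# One-term vocabulary, four documents. Same mean activation in both batches.
common_term = [[1.0], [1.0], [1.0], [1.0]]  # fires in every document
rare_term = [[4.0], [0.0], [0.0], [0.0]]    # fires in one document only

flops_common = flops_penalty(common_term)
flops_rare = flops_penalty(rare_term)
df_common = df_weighted_penalty(common_term)
df_rare = df_weighted_penalty(rare_term)
```

Plain FLOPS scores the two batches identically because it only sees the mean activation, while the DF‑weighted variant charges the term that fires in every document four times as much. That is the kind of pressure that shortens the longest posting lists.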