Retrieval at Scale | Drop for 2025-10-16

TL;DR

Four fresh items since the Oct 6 drop: (1) Weaviate 1.33 makes 8‑bit Rotational Quantization the default and previews 1‑bit RQ, with extra HNSW memory cuts that are especially helpful for multi‑vector/late‑interaction setups; (2) Milvus 2.6.3 lands practical hybrid‑search knobs (BM25 stats preloading, query‑time “requery” control) plus sparse‑filter support; (3) a new arXiv paper proposes α‑Convergent proximity graphs with stronger guarantees and promising ANN speedups; (4) Elastic expands Cloud Serverless to more AWS regions, which makes Vector‑optimized profiles easier to deploy globally.

Weaviate 1.33: compression by default (+1‑bit RQ preview) and HNSW memory cuts

  • Key facts and current state of the topic
    • Weaviate v1.33 enables 8‑bit Rotational Quantization (RQ) by default when creating new collections; RQ targets ~4× vector compression while keeping ~98–99% recall in Weaviate’s internal tests. A preview of 1‑bit RQ offers near‑32× compression with moderate recall trade‑offs. (docs.weaviate.io)
  • Important context and background information
    • Quantization shrinks the vector memory footprint and raises throughput, so you can probe more candidates under a fixed latency budget. That helps large, filtered retrieval and multi‑vector (late‑interaction) pipelines. Weaviate also documents that RQ applies only to HNSW indexes and is now the default, which can be overridden via an environment variable. (docs.weaviate.io)
  • Recent developments or changes
    • 1.33 ships “compression by default” and adds a 1‑bit RQ preview; 1.32 recently introduced “optimized HNSW connections,” cutting HNSW graph memory (up to ~20% in multi‑vector collections). If you’re piloting ColBERT/ColPali‑style embeddings, expect lower RAM per shard with RQ+optimized HNSW. (weaviate.io)
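The ~4× and near‑32× figures above follow from the bit widths alone. A minimal back‑of‑envelope sketch, assuming float32 source vectors and counting only the quantized codes: it ignores any per‑vector metadata a real RQ implementation stores and the HNSW graph itself, which is why the real numbers are "about" rather than exact. The function names, the 768‑dimension size, and the 100M‑vector collection are hypothetical values for illustration, not Weaviate settings.

```python
def vector_bytes(dims: int, bits_per_dim: int) -> int:
    """Bytes for one vector's codes, rounded up to whole bytes."""
    return (dims * bits_per_dim + 7) // 8


def compression_ratio(dims: int, bits_per_dim: int, baseline_bits: int = 32) -> float:
    """Size of an uncompressed (float32) vector divided by the quantized size."""
    return vector_bytes(dims, baseline_bits) / vector_bytes(dims, bits_per_dim)


def collection_gib(num_vectors: int, dims: int, bits_per_dim: int) -> float:
    """Raw code storage for a whole collection, in GiB (codes only)."""
    return num_vectors * vector_bytes(dims, bits_per_dim) / 2**30


# Hypothetical workload: 100M vectors of 768 dimensions.
NUM_VECTORS = 100_000_000
DIMS = 768

for label, bits in [("float32", 32), ("8-bit RQ", 8), ("1-bit RQ", 1)]:
    ratio = compression_ratio(DIMS, bits)
    size = collection_gib(NUM_VECTORS, DIMS, bits)
    print(f"{label:>9}: {vector_bytes(DIMS, bits):5d} B/vector, "
          f"{ratio:4.0f}x smaller, {size:7.1f} GiB total")
```

For multi‑vector (late‑interaction) collections, multiply the vector count by the number of token or patch embeddings per document; the ratios stay the same, but the absolute savings grow accordingly.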

Milvus 2.6.3 (Oct 11): hybrid‑search knobs, sparse filters, and BM25 preloads

  • Key facts and current state of the topic
    • Milvus 2.6.3 focuses on operational and hybrid retrieval improvements: BM25 statistics preloaded for sealed segments, nullable‑field inputs for BM25, and new parameters to control hybrid search “requery” policy. (blog.milvus.io)
  • Important context and background information
    • At scale, hybrid (lexical + vector) retrieval is common in ads and search workloads. Preloading BM25 stats reduces head‑of‑line penalties, since the statistics are already loaded when the first queries arrive; tunable requery helps balance recall against latency when rescoring ANN candidates lexically. (blog.milvus.io)
  • Recent developments or changes
    • Added “sparse filters in queries,” storage/telemetry improvements, and new auto‑indexing options for int8 vectors. If you run attribute‑heavy retrieval with re‑ranking, these reduce variance and improve predictability under QPS spikes. (blog.milvus.io)
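For readers new to hybrid retrieval, the core idea of merging a lexical ranking with a vector ranking can be sketched with reciprocal rank fusion (RRF), a common fusion technique. This is a generic illustration in plain Python, not Milvus's API: the function name `rrf_fuse`, the constant `k=60`, and the document IDs are assumptions made for the example, and Milvus's own rankers and requery parameters are not shown.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Ties are broken by doc ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused][:top_n]


# Hypothetical result lists from a BM25 query and an ANN query.
bm25_hits = ["d3", "d1", "d7"]
ann_hits = ["d1", "d4", "d3"]

print(rrf_fuse([bm25_hits, ann_hits]))
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever still survive lower down. That is the behavior a hybrid pipeline relies on before any heavier re‑ranking step.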

Fast‑Convergent Proximity Graphs (α‑CNG): theory‑backed ANN with fewer steps

  • Key facts and current state of the topic
    • A new proximity‑graph family (α‑CG/α‑CNG) introduces an edge‑pruning rule with theoretical guarantees. Under bounded intrinsic dimensionality, search runs in polylogarithmic time: it finds the exact nearest neighbor when that neighbor lies within a radius τ of the query, and otherwise returns an approximate nearest neighbor. (arxiv.org)
  • Important context and background information
    • Many graph ANN indexes (e.g., HNSW) are empirically strong but lack worst‑case guarantees. The α‑CNG variant targets better scalability with practical build refinements. (arxiv.org)
  • Recent developments or changes
    • The authors report >15% fewer distance computations and >45% fewer search steps vs. strong proximity‑graph baselines on real datasets. Worth tracking for billion‑scale deployments where step count and distance evaluations dominate tail latency. (arxiv.org)
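To make the two reported metrics concrete, here is a minimal sketch of best‑first search over a proximity graph that counts search steps (node expansions) and distance computations. It does not reproduce the paper's α‑CG/α‑CNG pruning rule: the graph is a brute‑force k‑nearest‑neighbor graph used as a stand‑in, and every name and parameter (`build_knn_graph`, `beam_search`, `ef`, the point counts) is hypothetical.

```python
import heapq
import math
import random


def build_knn_graph(points, k):
    """Brute-force k-NN graph: each node links to its k closest other nodes."""
    graph = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        graph.append(sorted(others, key=lambda j: math.dist(p, points[j]))[:k])
    return graph


def beam_search(points, graph, query, entry, ef):
    """Best-first graph search keeping the ef closest nodes seen so far.

    Returns (best_id, steps, n_dist), where steps counts node expansions
    and n_dist counts distance computations.
    """
    d0 = math.dist(query, points[entry])
    n_dist, steps = 1, 0
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap of nodes still to expand
    results = [(-d0, entry)]    # max-heap (negated distances) of the best ef
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest unexpanded node is worse than the worst result.
        if len(results) >= ef and d > -results[0][0]:
            break
        steps += 1
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(query, points[nb])
            n_dist += 1
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest result
    _, best = max(results)  # largest negated distance = closest node
    return best, steps, n_dist


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
graph = build_knn_graph(points, k=8)
query, entry = (0.5, 0.5), 0

best, steps, n_dist = beam_search(points, graph, query, entry, ef=16)
print(f"best={best} steps={steps} distance_computations={n_dist}")
```

An index that prunes edges more effectively reaches a good neighbor with fewer expansions and fewer distance evaluations, which is exactly what the two counters measure. Comparing them across graph constructions on the same queries is one way to reproduce this style of evaluation on your own data.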

Elastic Cloud Serverless expands (Oct 15): easier global rollout of vector‑optimized projects

  • Key facts and current state of the topic
    • Elastic Cloud Serverless is now available on AWS eu‑west‑2 (London) and ap‑northeast‑1 (Tokyo), in addition to existing regions, with “Optimized for Vectors” as an out‑of‑the‑box Elasticsearch project profile. (elastic.co)
  • Important context and background information
    • For teams standardizing on Lucene/Elasticsearch features like ACORN‑1 filtered HNSW and BBQ quantization, Serverless reduces ops overhead and enables regional proximity for latency/SLA needs. (elastic.co)
  • Recent developments or changes
    • If you’ve held back on late‑interaction/hybrid pilots due to provisioning friction, the expanded regional coverage plus vector‑optimized projects make staged A/B tests simpler across geos. (elastic.co)