Boosting vector search performance with concurrent segment search

In OpenSearch, data is stored in shards, which are further divided into segments. When you execute a search query, it runs sequentially across all segments of each shard involved in the query. As the number of segments grows, this sequential execution can increase query latency (the time it takes to retrieve results) because the query must wait for the search of one segment to complete before moving on to the next. The delay becomes especially noticeable when some segments take longer to process queries than others.


Introduced in OpenSearch version 2.12, concurrent segment search addresses this issue by enabling parallel execution of queries across multiple segments within a shard. By using available computing resources, this feature reduces overall query latency, particularly for larger datasets with many segments. Concurrent segment search is designed to provide more consistent and predictable latencies. It achieves this consistency by reducing the impact of variations in segment performance or the number of segments on query execution time.

In this blog post, we’ll explore the impact of concurrent segment search on vector search workloads.

Enabling concurrent segment search

By default, concurrent segment search is disabled in OpenSearch. For our experiments, we enabled it for all indexes in the cluster by using the following dynamic cluster setting:

PUT _cluster/settings
{
   "persistent": {
      "search.concurrent_segment_search.enabled": true
   }
}
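
You can also scope the setting to a single index rather than the whole cluster. The following request is a minimal sketch that assumes the index-level dynamic setting index.search.concurrent_segment_search.enabled and a hypothetical index named my-vector-index:

PUT my-vector-index/_settings
{
   "index.search.concurrent_segment_search.enabled": true
}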

To achieve concurrent segment searches, OpenSearch divides the segments within each shard into multiple slices, with each slice processed in parallel on a separate thread. The number of slices determines the degree of parallelism that OpenSearch can provide. You can either use Lucene’s default slicing mechanism or set the maximum slice count manually. For detailed instructions on updating the slice count, see Slicing mechanisms.
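
For reference, the maximum slice count is itself a dynamic cluster setting. The following request is a sketch that assumes the search.concurrent.max_slice_count setting name used in OpenSearch 2.x; a value of 0 falls back to Lucene's default slicing mechanism:

PUT _cluster/settings
{
   "persistent": {
      "search.concurrent.max_slice_count": 4
   }
}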

Performance results

We performed our tests on an OpenSearch 2.15 cluster using the OpenSearch Benchmark vector search workload. We used the Cohere dataset with two different configurations to evaluate the performance improvements of vector search queries when running the workload with concurrent segment search disabled, enabled with default settings, and enabled with different max slice counts.
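
The workload issues k-NN queries against a 768-dimensional vector field. A representative query is sketched below; the index name (my-vector-index) and field name (my_vector) are placeholders rather than the workload's actual names, and the query vector is truncated for readability:

GET my-vector-index/_search
{
   "size": 10,
   "query": {
      "knn": {
         "my_vector": {
            "vector": [0.12, -0.37, ...],
            "k": 10
         }
      }
   }
}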

Cluster setup

  • 3 data nodes (r5.4xlarge: 128 GB RAM, 16 vCPUs, 250 GB disk space)
  • 3 cluster manager nodes (r5.xlarge: 32 GB RAM, 4 vCPUs, 50 GB disk space)
  • 1 OpenSearch workload client (c5.4xlarge: 32 GB RAM, 16 vCPUs)
  • 1 and 4 search clients
  • index_searcher thread pool size: 32
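
The index_searcher thread pool size is a static node setting, so it is configured in opensearch.yml before the node starts. A minimal sketch, assuming the thread_pool.index_searcher.size setting name:

# opensearch.yml (static setting; requires a node restart)
thread_pool.index_searcher.size: 32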

Index settings

| m | ef_construction | ef_search | Number of shards | Replica count | Space type |
|---|---|---|---|---|---|
| 16 | 100 | 100 | 6 | 1 | Inner product |

Configuration

| Dimension | Vector count | Search query count | Refresh interval |
|---|---|---|---|
| 768 | 10M | 10K | 1s (default) |
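
For context, an index matching these settings can be created roughly as follows. This is a sketch rather than the exact benchmark configuration: the index and field names are placeholders, the engine is set to faiss as one of the three engines tested, and innerproduct is the k-NN plugin's identifier for the inner product space type:

PUT my-vector-index
{
   "settings": {
      "index": {
         "number_of_shards": 6,
         "number_of_replicas": 1,
         "knn": true,
         "knn.algo_param.ef_search": 100
      }
   },
   "mappings": {
      "properties": {
         "my_vector": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
               "name": "hnsw",
               "space_type": "innerproduct",
               "engine": "faiss",
               "parameters": {
                  "m": 16,
                  "ef_construction": 100
               }
            }
         }
      }
   }
}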

Service time comparison

We conducted the following experiments:

  1. Concurrent search disabled
  2. Concurrent search enabled with the Lucene default slice count and with max slice counts of 2, 4, and 8

The following sections present the results of these experiments.

Experiment 1: Concurrent search disabled

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 30 | 37 | 45 | 11 | 53.48 | 0.97 |
| Lucene | 381 | 4 | 36 | 43 | 51 | 38 | 42 | 0.97 |
| NMSLIB | 383 | 1 | 28 | 35 | 41 | 10 | 47.5 | 0.97 |
| NMSLIB | 383 | 4 | 35 | 41 | 46 | 36 | 48.06 | 0.97 |
| Faiss | 381 | 1 | 29 | 37 | 42 | 10 | 47.85 | 0.97 |
| Faiss | 381 | 4 | 36 | 40 | 44 | 38 | 46.38 | 0.97 |

Experiment 2: Concurrent search enabled, max slice count = 0 (default)

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 13 | 15 | 17 | 47 | 47.99 | 0.97 |
| Lucene | 381 | 4 | 27 | 32 | 37 | 81 | 45.95 | 0.97 |
| NMSLIB | 383 | 1 | 13 | 14 | 16 | 38 | 47.28 | 0.97 |
| NMSLIB | 383 | 4 | 24 | 27 | 32 | 75 | 44.76 | 0.97 |
| Faiss | 381 | 1 | 13 | 14 | 16 | 34 | 46.04 | 0.97 |
| Faiss | 381 | 4 | 25 | 28 | 33 | 76 | 47.72 | 0.97 |

Experiment 3: Concurrent search enabled, max slice count = 2

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 14 | 16 | 19 | 41 | 52.91 | 0.97 |
| Lucene | 381 | 4 | 28 | 34 | 42 | 88 | 51.65 | 0.97 |
| NMSLIB | 383 | 1 | 20 | 23 | 25 | 16 | 44.97 | 0.97 |
| NMSLIB | 383 | 4 | 23 | 27 | 33 | 60 | 41.06 | 0.97 |
| Faiss | 381 | 1 | 20 | 22 | 24 | 19 | 46.42 | 0.97 |
| Faiss | 381 | 4 | 23 | 26 | 32 | 67 | 37.23 | 0.97 |

Experiment 4: Concurrent search enabled, max slice count = 4

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 13.6 | 15.9 | 17.6 | 49 | 53.37 | 0.97 |
| Lucene | 381 | 4 | 28 | 33 | 41 | 86 | 50.12 | 0.97 |
| NMSLIB | 383 | 1 | 14 | 15 | 16 | 29 | 51.12 | 0.97 |
| NMSLIB | 383 | 4 | 21 | 25 | 31 | 72 | 42.63 | 0.97 |
| Faiss | 381 | 1 | 14 | 15 | 17 | 30 | 41.1 | 0.97 |
| Faiss | 381 | 4 | 23 | 28 | 37 | 77 | 47.19 | 0.97 |

Experiment 5: Concurrent search enabled, max slice count = 8

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 14 | 16 | 18 | 43 | 45.37 | 0.97 |
| Lucene | 381 | 4 | 28 | 34 | 43 | 87 | 48.79 | 0.97 |
| NMSLIB | 383 | 1 | 10 | 12 | 14 | 41 | 45.21 | 0.97 |
| NMSLIB | 383 | 4 | 23 | 25 | 29 | 75 | 45.87 | 0.97 |
| Faiss | 381 | 1 | 15 | 16 | 17 | 44 | 48.68 | 0.97 |
| Faiss | 381 | 4 | 23 | 26 | 32 | 79 | 47.19 | 0.97 |

Comparing results

For simplicity, we’ll focus on the p90 metric with a single search client because this metric captures the performance of long-running vector search queries.

Service time comparison (p90)

| k-NN engine | Disabled (ms) | Enabled, default slice count (ms) | Improvement (%) | Max slice count = 2 (ms) | Improvement (%) | Max slice count = 4 (ms) | Improvement (%) | Max slice count = 8 (ms) | Improvement (%) |
|---|---|---|---|---|---|---|---|---|---|
| Lucene | 37 | 15 | 59.5 | 16 | 56.8 | 15.9 | 57 | 16 | 56.8 |
| NMSLIB | 35 | 14 | 60 | 23 | 34.3 | 15 | 57.1 | 12 | 65.7 |
| Faiss | 37 | 14 | 62.2 | 22 | 40.5 | 15 | 59.5 | 16 | 56.8 |

CPU utilization comparison

| k-NN engine | Disabled (%) | Enabled, default slice count (%) | Additional CPU (%) | Max slice count = 2 (%) | Additional CPU (%) | Max slice count = 4 (%) | Additional CPU (%) | Max slice count = 8 (%) | Additional CPU (%) |
|---|---|---|---|---|---|---|---|---|---|
| Lucene | 11 | 47 | 36 | 41 | 30 | 49 | 38 | 43 | 32 |
| NMSLIB | 10 | 38 | 28 | 16 | 6 | 29 | 19 | 41 | 31 |
| Faiss | 10 | 34 | 24 | 19 | 9 | 30 | 20 | 44 | 34 |

As demonstrated by our performance benchmarks, enabling concurrent segment search with the default slice count improves p90 vector search service time by roughly 60% while requiring only 24–36% more CPU. This increase in CPU utilization is expected because concurrent segment search runs queries on more CPU threads: by default, the index_searcher thread pool contains twice as many threads as there are CPU cores.

We observed a similar improvement in service time when using multiple concurrent search clients. However, maximum CPU utilization also doubled, as expected, because of the increased number of active search threads running concurrently.
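
To see how busy these search threads are during a run, you can inspect the index_searcher thread pool with the _cat thread pool API. The following is a quick check, assuming the size, active, queue, and rejected columns are available in your version:

GET _cat/thread_pool/index_searcher?v&h=node_name,name,size,active,queue,rejected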

Conclusion

Our experiments clearly show that enabling concurrent segment search with the default slice count improves vector search query performance, albeit at the cost of higher CPU utilization. We recommend testing your workload to determine whether the additional parallelization achieved by increasing the slice count outweighs the additional processing overhead.

Before running concurrent segment search, we recommend force-merging segments into a single segment to achieve better performance. The major disadvantage of this approach is that the time required for force-merging increases as segments grow larger. Thus, we recommend reducing the number of segments in accordance with your use case.
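
Force merging uses the existing force merge API; the following sketch assumes a hypothetical index named my-vector-index:

POST my-vector-index/_forcemerge?max_num_segments=1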

By combining vector search with concurrent segment search, you can improve query performance and optimize search operations. To get started with concurrent segment search, explore the documentation.