Boosting vector search performance with concurrent segment search

In OpenSearch, data is stored in shards, which are further divided into segments. When you execute a search query, it runs sequentially across all segments of each shard involved in the query. As the number of segments grows, this sequential execution can increase query latency (the time it takes to retrieve results) because the query must wait for the search of one segment to complete before moving on to the next. The delay becomes especially noticeable when some segments take longer to process queries than others.


Introduced in OpenSearch version 2.12, concurrent segment search addresses this issue by enabling parallel execution of queries across multiple segments within a shard. By using available computing resources, this feature reduces overall query latency, particularly for larger datasets with many segments. Concurrent segment search is designed to provide more consistent and predictable latencies. It achieves this consistency by reducing the impact of variations in segment performance or the number of segments on query execution time.

In this blog post, we’ll explore the impact of concurrent segment search on vector search workloads.

Enabling concurrent segment search

By default, concurrent segment search is disabled in OpenSearch. For our experiments, we enabled it for all indexes in the cluster by using the following dynamic cluster setting:

PUT _cluster/settings
{
   "persistent": {
      "search.concurrent_segment_search.enabled": true
   }
}
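
You can also scope the setting to a single index rather than the whole cluster. The following request is a minimal sketch that assumes the index-level dynamic setting index.search.concurrent_segment_search.enabled and a hypothetical index named my-vector-index:

PUT my-vector-index/_settings
{
   "index.search.concurrent_segment_search.enabled": true
}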

To achieve concurrent segment searches, OpenSearch divides the segments within each shard into multiple slices, with each slice processed in parallel on a separate thread. The number of slices determines the degree of parallelism that OpenSearch can provide. You can either use Lucene’s default slicing mechanism or set the maximum slice count manually. For detailed instructions on updating the slice count, see Slicing mechanisms.
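
For reference, the maximum slice count is itself a dynamic cluster setting. The following request is a sketch that assumes the search.concurrent.max_slice_count setting name used in OpenSearch 2.x; a value of 0 falls back to Lucene's default slicing mechanism:

PUT _cluster/settings
{
   "persistent": {
      "search.concurrent.max_slice_count": 4
   }
}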

Performance results

We performed our tests on an OpenSearch 2.15 cluster using the OpenSearch Benchmark vector search workload. We used the Cohere dataset with two different configurations to evaluate the performance improvements of vector search queries when running the workload with concurrent segment search disabled, enabled with default settings, and enabled with different max slice counts.
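
The workload issues k-NN queries against a 768-dimensional vector field. A representative query is sketched below; the index name (my-vector-index) and field name (my_vector) are placeholders rather than the workload's actual names, and the query vector is truncated for readability:

GET my-vector-index/_search
{
   "size": 10,
   "query": {
      "knn": {
         "my_vector": {
            "vector": [0.12, -0.37, ...],
            "k": 10
         }
      }
   }
}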

Cluster setup

  • 3 data nodes (r5.4xlarge: 128 GB RAM, 16 vCPUs, 250 GB disk space)
  • 3 cluster manager nodes (r5.xlarge: 32 GB RAM, 4 vCPUs, 50 GB disk space)
  • 1 OpenSearch workload client (c5.4xlarge: 32 GB RAM, 16 vCPUs)
  • 1 and 4 search clients
  • index_searcher thread pool size: 32
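
The index_searcher thread pool size is a static node setting, so it is configured in opensearch.yml before the node starts. A minimal sketch, assuming the thread_pool.index_searcher.size setting name:

# opensearch.yml (static setting; requires a node restart)
thread_pool.index_searcher.size: 32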

Index settings

| m | ef_construction | ef_search | Number of shards | Replica count | Space type |
|---|---|---|---|---|---|
| 16 | 100 | 100 | 6 | 1 | Inner product |

Configuration

| Dimension | Vector count | Search query count | Refresh interval |
|---|---|---|---|
| 768 | 10M | 10K | 1s (default) |
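
For context, an index matching these settings can be created roughly as follows. This is a sketch rather than the exact benchmark configuration: the index and field names are placeholders, the engine is set to faiss as one of the three engines tested, and innerproduct is the k-NN plugin's identifier for the inner product space type:

PUT my-vector-index
{
   "settings": {
      "index": {
         "number_of_shards": 6,
         "number_of_replicas": 1,
         "knn": true,
         "knn.algo_param.ef_search": 100
      }
   },
   "mappings": {
      "properties": {
         "my_vector": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
               "name": "hnsw",
               "space_type": "innerproduct",
               "engine": "faiss",
               "parameters": {
                  "m": 16,
                  "ef_construction": 100
               }
            }
         }
      }
   }
}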

Service time comparison

We conducted the following experiments:

  1. Concurrent search disabled
  2. Concurrent search enabled with the Lucene default slice count and with max slice counts of 2, 4, and 8

The following sections present the results of these experiments.

Experiment 1: Concurrent search disabled

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 30 | 37 | 45 | 11 | 53.48 | 0.97 |
| Lucene | 381 | 4 | 36 | 43 | 51 | 38 | 42 | 0.97 |
| NMSLIB | 383 | 1 | 28 | 35 | 41 | 10 | 47.5 | 0.97 |
| NMSLIB | 383 | 4 | 35 | 41 | 46 | 36 | 48.06 | 0.97 |
| Faiss | 381 | 1 | 29 | 37 | 42 | 10 | 47.85 | 0.97 |
| Faiss | 381 | 4 | 36 | 40 | 44 | 38 | 46.38 | 0.97 |

Experiment 2: Concurrent search enabled, max slice count = 0 (default)

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 13 | 15 | 17 | 47 | 47.99 | 0.97 |
| Lucene | 381 | 4 | 27 | 32 | 37 | 81 | 45.95 | 0.97 |
| NMSLIB | 383 | 1 | 13 | 14 | 16 | 38 | 47.28 | 0.97 |
| NMSLIB | 383 | 4 | 24 | 27 | 32 | 75 | 44.76 | 0.97 |
| Faiss | 381 | 1 | 13 | 14 | 16 | 34 | 46.04 | 0.97 |
| Faiss | 381 | 4 | 25 | 28 | 33 | 76 | 47.72 | 0.97 |

Experiment 3: Concurrent search enabled, max slice count = 2

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 14 | 16 | 19 | 41 | 52.91 | 0.97 |
| Lucene | 381 | 4 | 28 | 34 | 42 | 88 | 51.65 | 0.97 |
| NMSLIB | 383 | 1 | 20 | 23 | 25 | 16 | 44.97 | 0.97 |
| NMSLIB | 383 | 4 | 23 | 27 | 33 | 60 | 41.06 | 0.97 |
| Faiss | 381 | 1 | 20 | 22 | 24 | 19 | 46.42 | 0.97 |
| Faiss | 381 | 4 | 23 | 26 | 32 | 67 | 37.23 | 0.97 |

Experiment 4: Concurrent search enabled, max slice count = 4

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 13.6 | 15.9 | 17.6 | 49 | 53.37 | 0.97 |
| Lucene | 381 | 4 | 28 | 33 | 41 | 86 | 50.12 | 0.97 |
| NMSLIB | 383 | 1 | 14 | 15 | 16 | 29 | 51.12 | 0.97 |
| NMSLIB | 383 | 4 | 21 | 25 | 31 | 72 | 42.63 | 0.97 |
| Faiss | 381 | 1 | 14 | 15 | 17 | 30 | 41.1 | 0.97 |
| Faiss | 381 | 4 | 23 | 28 | 37 | 77 | 47.19 | 0.97 |

Experiment 5: Concurrent search enabled, max slice count = 8

| k-NN engine | Segment count | Search clients | Service time p50 (ms) | Service time p90 (ms) | Service time p99 (ms) | Max CPU (%) | JVM heap used (%) | Recall |
|---|---|---|---|---|---|---|---|---|
| Lucene | 381 | 1 | 14 | 16 | 18 | 43 | 45.37 | 0.97 |
| Lucene | 381 | 4 | 28 | 34 | 43 | 87 | 48.79 | 0.97 |
| NMSLIB | 383 | 1 | 10 | 12 | 14 | 41 | 45.21 | 0.97 |
| NMSLIB | 383 | 4 | 23 | 25 | 29 | 75 | 45.87 | 0.97 |
| Faiss | 381 | 1 | 15 | 16 | 17 | 44 | 48.68 | 0.97 |
| Faiss | 381 | 4 | 23 | 26 | 32 | 79 | 47.19 | 0.97 |

Comparing results

For simplicity, we’ll focus on the p90 metric with a single search client because this metric captures the performance of long-running vector search queries.

Service time comparison (p90)

| k-NN engine | Disabled (ms) | Enabled, default slice count (ms) | Improvement (%) | Max slice count = 2 (ms) | Improvement (%) | Max slice count = 4 (ms) | Improvement (%) | Max slice count = 8 (ms) | Improvement (%) |
|---|---|---|---|---|---|---|---|---|---|
| Lucene | 37 | 15 | 59.5 | 16 | 56.8 | 15.9 | 57 | 16 | 56.8 |
| NMSLIB | 35 | 14 | 60 | 23 | 34.3 | 15 | 57.1 | 12 | 65.7 |
| Faiss | 37 | 14 | 62.2 | 22 | 40.5 | 15 | 59.5 | 16 | 56.8 |

CPU utilization comparison

| k-NN engine | Disabled (%) | Enabled, default slice count (%) | Additional CPU (%) | Max slice count = 2 (%) | Additional CPU (%) | Max slice count = 4 (%) | Additional CPU (%) | Max slice count = 8 (%) | Additional CPU (%) |
|---|---|---|---|---|---|---|---|---|---|
| Lucene | 11 | 47 | 36 | 41 | 30 | 49 | 38 | 43 | 32 |
| NMSLIB | 10 | 38 | 28 | 16 | 6 | 29 | 19 | 41 | 31 |
| Faiss | 10 | 34 | 24 | 19 | 9 | 30 | 20 | 44 | 34 |

As demonstrated by our performance benchmarks, enabling concurrent segment search with the default slice count improves p90 vector search service time by roughly 60% while requiring only 24–36% more CPU. This increase in CPU utilization is expected because concurrent segment search runs queries on more CPU threads: by default, the index_searcher thread pool contains twice as many threads as there are CPU cores.

We observed a similar improvement in service time when using multiple concurrent search clients. However, maximum CPU utilization also doubled, as expected, because of the increased number of active search threads running concurrently.
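
To see how busy these search threads are during a run, you can inspect the index_searcher thread pool with the _cat thread pool API. The following is a quick check, assuming the size, active, queue, and rejected columns are available in your version:

GET _cat/thread_pool/index_searcher?v&h=node_name,name,size,active,queue,rejected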

Conclusion

Our experiments clearly show that enabling concurrent segment search with the default slice count improves vector search query performance, albeit at the cost of higher CPU utilization. We recommend testing your workload to determine whether the additional parallelization achieved by increasing the slice count outweighs the additional processing overhead.

Before running concurrent segment search, we recommend force-merging segments into a single segment to achieve better performance. The major disadvantage of this approach is that the time required for force-merging increases as segments grow larger. Thus, we recommend reducing the number of segments in accordance with your use case.
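
Force merging uses the existing force merge API; the following sketch assumes a hypothetical index named my-vector-index:

POST my-vector-index/_forcemerge?max_num_segments=1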

By combining vector search with concurrent segment search, you can improve query performance and optimize search operations. To get started with concurrent segment search, explore the documentation.