Sizing
Sizing helps you determine the right instance type, number of data nodes, and storage requirements for your target environment. We recommend that you size first by storage and then by CPUs. If you're already using Elasticsearch or OpenSearch, the sizing generally remains the same; however, you need to identify the instance type that is equivalent to your current environment. To help determine the right size, we recommend the following guidelines.
Storage
Sizing your cluster starts with defining the storage requirements. Identify the raw storage that your cluster needs by assessing the data generated by your source system (for example, servers generating logs, or the raw size of a product catalog). After you identify how much raw data you have, use the following formula to calculate storage requirements. You can then use the result as a starting point for your proof of concept (PoC).
storage needed = (daily source data in bytes × 1.45) × (number_of_replicas + 1) × number of days retained
The formula takes into consideration the following:
- The on-disk size of an index varies, but it's often 10 percent larger than the source data.
- Linux reserves 5 percent of the file system for the root user, for system recovery, and to safeguard against disk fragmentation problems.
- OpenSearch reserves 20 percent of the storage space on each instance for segment merges, logs, and other internal operations.
- We recommend keeping 10 percent of additional storage to help minimize the impact of node failures and Availability Zone outages.
Combined, these overheads and reservations require 45 percent additional space beyond the actual raw data in the source, which is why you multiply the source data by 1.45. Next, multiply this by the number of copies of the data (for example, one primary plus the number of replicas that you will use). The replica count depends on your resiliency and throughput requirements. For an average use case, start with one primary and one replica. Finally, multiply by the number of days that you want to retain data in the hot storage tier.
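The following Python sketch applies this formula. The function name and the sample values (100 GiB of daily log data, one replica, 14 days of hot retention) are illustrative assumptions, not values from this guide.

```python
GIB = 1024 ** 3

def storage_needed_bytes(daily_source_bytes: float,
                         number_of_replicas: int = 1,
                         days_retained: int = 14) -> float:
    """Estimate hot-tier storage: source data x 1.45 overhead factor,
    multiplied by the copy count (primary + replicas) and retention days."""
    return daily_source_bytes * 1.45 * (number_of_replicas + 1) * days_retained

# Example: 100 GiB of daily log data, 1 replica, 14 days of hot retention.
estimate = storage_needed_bytes(100 * GIB, number_of_replicas=1, days_retained=14)
print(f"Estimated storage: {estimate / GIB:,.0f} GiB")  # ~4,060 GiB
```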
Amazon OpenSearch Service offers hot, warm, and cold storage tiers. The warm storage tier uses UltraWarm storage. UltraWarm provides a cost-effective way to store large amounts of read-only data on Amazon OpenSearch Service. Standard data nodes use hot storage, which takes the form of instance stores or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node. Hot storage provides the fastest possible performance for indexing and searching new data. UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) as storage, together with a sophisticated caching solution to improve performance. For indexes that you aren't actively writing to, that you query less frequently, and that don't have the same performance requirements, UltraWarm offers significantly lower costs per GiB of data. For more information about UltraWarm, see the AWS documentation.
When you create an OpenSearch Service domain and use hot storage, you might need to define the EBS volume size, which depends on the instance type that you choose for the data nodes. You can use the same storage-requirement formula to determine the volume size for Amazon EBS-backed instances. We recommend using gp3 volumes for the latest-generation T3, R5, R6g, M5, M6g, C5, and C6g instance families. With Amazon EBS gp3 volumes, you can provision performance independently of storage capacity. Amazon EBS gp3 volumes also provide better baseline performance, at a 9.6 percent lower cost per GB than gp2 volumes on OpenSearch Service. With gp3, you also get denser storage on the R5, R6g, M5, and M6g instance families, which can help you to further optimize your costs. You can create EBS volumes up to the supported quota. For more information about quotas, see Amazon OpenSearch Service quotas.
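As a rough first pass on the per-node EBS volume size, you can divide the total storage estimate evenly across your planned data nodes. This is a minimal sketch under that assumption; it does not account for instance-specific EBS volume quotas, which you should verify against the service quotas documentation.

```python
import math

def ebs_volume_gib_per_node(total_storage_gib: float, data_node_count: int) -> int:
    """Split the total hot-tier storage estimate evenly across data nodes,
    rounding up so the cluster still meets the overall requirement."""
    return math.ceil(total_storage_gib / data_node_count)

# Example: ~4,060 GiB total estimate spread across 6 data nodes.
print(ebs_volume_gib_per_node(4060, 6))  # 677 GiB per node
```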
For data nodes that have NVM Express (NVMe) drives, such as the I3 and R6gd instances, the volume size is fixed, so EBS volumes are not an option.
Number of nodes and instance types
The number of nodes is based on the number of CPUs required to operate your workload. The number of CPUs is based on the shard count. An index in OpenSearch is made up of multiple shards. When you create an index, you specify the number of shards for the index. Therefore, you need to do the following:
- Calculate the total shard count that you intend to store in the domain.
- Determine the number of CPUs required for those shards.
- Find the most cost-effective node type and count that gives you the required number of CPUs and storage.
This is a starting point. Run tests to confirm that the estimated size meets your functional and nonfunctional requirements.
Determining the indexing strategy and shard count
After you know the storage requirements, you can decide how many indexes you need and identify the shard count for each. Generally, search use cases have one or a few indexes, each representing a searchable entity or a catalog. For log analytics use cases, an index can represent a daily or weekly log file. After you decide how many indexes you need, begin with the following scale guidance to determine the appropriate shard count:
- Search use cases – 10–30 GB per shard
- Log analytics use cases – 50 GB per shard
Divide the total volume of data in a single index by your target shard size for the use case. This gives you the number of shards for the index. Identifying the total number of shards helps you find the right instance types for your workload. Shards shouldn't be too large or too numerous: large shards can make it difficult for OpenSearch to recover from failure, but because each shard uses some amount of CPU and memory, having too many small shards can cause performance issues and out-of-memory errors. Moreover, imbalanced shard allocation across data nodes can lead to skew. When you have indexes with multiple shards, try to make the shard count an even multiple of the data node count. This helps to ensure that shards are distributed evenly across data nodes and prevents hot nodes. For example, if you have 12 primary shards, your data node count should be 2, 3, 4, 6, or 12. However, shard count is secondary to shard size; if you have 5 GiB of data, you should still use a single shard. Balancing the replica shard count evenly across Availability Zones also helps improve resilience.
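A minimal sketch of that arithmetic follows. The helper name and the sample values (a 550 GiB log index, a 50 GiB target shard size, 6 data nodes) are illustrative assumptions.

```python
import math

def primary_shard_count(index_size_gib: float,
                        target_shard_size_gib: float,
                        data_node_count: int) -> int:
    """Divide the index size by the target shard size, then round up to the
    next multiple of the data node count so shards spread evenly across nodes."""
    shards = max(1, math.ceil(index_size_gib / target_shard_size_gib))
    # Round up to a multiple of the data node count to help avoid hot nodes.
    return math.ceil(shards / data_node_count) * data_node_count

# Example: 550 GiB log index at ~50 GiB per shard on 6 data nodes
# -> 11 shards, rounded up to 12 so they distribute evenly.
print(primary_shard_count(550, 50, 6))  # 12
```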
CPU utilization
The next step is to identify how many CPUs you need for your workload. We recommend starting with a vCPU count that is 1.5 times the number of your active shards. An active shard is any shard of an index that is receiving substantial read or write requests; use the primary shard count to determine the active shards for those indexes. For log analytics, only the current index is generally active. For search use cases, all primary shards are considered active shards. Although we recommend 1.5 vCPUs per active shard, this is highly workload-dependent. Be sure to test, monitor CPU utilization, and scale accordingly.
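The following sketch turns that guideline into a rough node count. The 1.5 multiplier comes from this section; the vCPUs-per-instance figure (8) is an assumption that you should replace with the specs of the instance type you're evaluating.

```python
import math

def required_vcpus(active_shards: int, vcpus_per_shard: float = 1.5) -> int:
    """Apply the 1.5 vCPUs-per-active-shard starting point."""
    return math.ceil(active_shards * vcpus_per_shard)

def data_node_count(active_shards: int, vcpus_per_instance: int) -> int:
    """Estimate how many data nodes are needed to supply the required vCPUs."""
    return math.ceil(required_vcpus(active_shards) / vcpus_per_instance)

# Example: 12 active primary shards on instances with 8 vCPUs each.
print(required_vcpus(12))      # 18 vCPUs
print(data_node_count(12, 8))  # 3 data nodes
```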
A best practice for maintaining your CPU utilization is to make sure that the OpenSearch Service domain has enough resources to perform its tasks. A cluster that has consistently high CPU utilization can degrade cluster stability. When your cluster is overloaded, OpenSearch Service blocks incoming requests, which results in request rejections; this protects the domain from failing. As a general guideline, target about 60 percent average and 80 percent maximum CPU utilization. Occasional spikes to 100 percent are still acceptable and might not require scaling or reconfiguration.
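One way to check a domain against those thresholds is to pull the CPUUtilization metric from Amazon CloudWatch. This is a minimal sketch using boto3; the Region, domain name, and account ID are placeholders, and you might prefer a longer look-back window than the single day shown here.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",  # OpenSearch Service domains publish metrics under AWS/ES
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},   # placeholder domain name
        {"Name": "ClientId", "Value": "111122223333"},  # placeholder account ID
    ],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,  # hourly data points
    Statistics=["Average", "Maximum"],
)

# Flag hours that exceed the ~60% average or ~80% maximum guideline.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    flag = "HIGH" if point["Average"] > 60 or point["Maximum"] > 80 else "ok"
    print(f'{point["Timestamp"]:%Y-%m-%d %H:%M} '
          f'avg={point["Average"]:.0f}% max={point["Maximum"]:.0f}% {flag}')
```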
Instance types
Amazon OpenSearch Service provides you with a choice of several instance types, so you can choose the ones that best fit your use case. Amazon OpenSearch Service supports the R, C, M, T, and I instance families. Choose an instance family based on your workload: memory optimized, compute optimized, or general purpose. After you identify an instance family, choose the latest-generation instance type. Generally, we recommend the latest Graviton-based generations because they are built to provide improved performance at lower cost compared with previous-generation instances.
Based on various testing that was performed for log analytics and search use cases, we recommend the following:
- For log analytics use cases, a general guideline is to begin with the R family of Graviton instances for data nodes. We recommend that you run tests, establish benchmarks for your requirements, and identify the appropriate instance size for your workload.
- For search use cases, we recommend using R and C family Graviton instances for data nodes, because search use cases require more CPU than log analytics use cases. For smaller workloads, you can use M family Graviton instances for both search and logs. I family instances offer NVMe drives and are used by customers with fast-indexing and low-latency search requirements.
The cluster is composed of data nodes and dedicated cluster manager (master) nodes. Although dedicated cluster manager nodes don't process search and query requests, their required size is highly correlated with the size and number of data node instances, indexes, and shards that they can manage. The AWS documentation provides a matrix that recommends a minimum dedicated cluster manager instance type.
AWS offers general purpose (M6g), compute optimized (C6g), and memory optimized (R6g and R6gd) instances, powered by AWS Graviton2, for Amazon OpenSearch Service version 7.9 or later. The Graviton2 instance family reduces indexing latency by up to 50 percent and improves query performance by up to 30 percent compared with the previous-generation Intel-based instances available in OpenSearch Service (M5, C5, R5).