
Operational best practices for Amazon OpenSearch Service

This chapter addresses some best practices for operating Amazon OpenSearch Service domains and provides general guidelines that apply to many use cases. Each workload has its own characteristics, so no generic recommendation is exactly right for every use case. The overarching best practice is to deploy, test, and tune your domains in a continuous cycle to find the optimal configuration, stability, and cost for your workload.

Monitoring and alerting

The following best practices apply to monitoring your OpenSearch Service domains.

Configure CloudWatch alarms

OpenSearch Service emits performance metrics to Amazon CloudWatch. Regularly review your cluster and instance metrics and configure recommended CloudWatch alarms based on your workload performance.
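As an illustration, here's a minimal boto3 sketch that creates one of the recommended alarms (cluster status red). The domain name, account ID, and SNS topic ARN are placeholders to replace with your own values.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the cluster status is red for one consecutive minute.
# The domain name, account ID, and SNS topic ARN below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="my-domain-cluster-status-red",
    Namespace="AWS/ES",
    MetricName="ClusterStatus.red",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},
        {"Name": "ClientId", "Value": "123456789012"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)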

Enable log publishing

OpenSearch Service exposes error logs, search slow logs, index slow logs, and audit logs in Amazon CloudWatch Logs. Search slow logs, index slow logs, and error logs are useful for troubleshooting performance and stability issues. Audit logs, which are only available if you enable fine-grained access control, track user activity.

Search slow logs and index slow logs are an important tool for understanding and troubleshooting the performance of your search and indexing operations. Enable search and index slow log delivery for all production domains. You must also configure logging thresholds, otherwise CloudWatch won't capture the logs.
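For example, the following sketch uses the opensearch-py client to set slow log thresholds on one index. The endpoint, credentials, index name, and threshold values are assumptions; tune them to your workload.

from opensearchpy import OpenSearch

# Endpoint, credentials, and index name are placeholders.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
)

# Operations slower than these thresholds are written to the slow logs.
client.indices.put_settings(
    index="my-index",
    body={
        "index.search.slowlog.threshold.query.warn": "5s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
        "index.indexing.slowlog.threshold.index.warn": "10s",
    },
)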

Shard strategy

Shards distribute your workload across the data nodes in your OpenSearch Service domain. Properly configured indexes can help boost overall domain performance.

When you send data to OpenSearch Service, you send that data to an index. An index is analogous to a database table, with documents as the rows, and fields as the columns. When you create the index, you tell OpenSearch how many primary shards you want to create. The primary shards are independent partitions of the full data set. OpenSearch Service automatically distributes your data across the primary shards in an index. You can also configure replicas of the index. Each replica comprises a full set of copies of the primary shards for that index.

OpenSearch Service maps the shards for each index across the data nodes in your cluster. It ensures that the primary and replica shards for the index reside on different data nodes. The first replica ensures that you have two copies of the data in the index. You should always use at least one replica. Additional replicas provide additional redundancy and read capacity.

OpenSearch sends index and search requests to all of the data nodes that contain shards belonging to the index in the request. It sends indexing requests first to data nodes containing primary shards, and then to data nodes containing replica shards. For example, for an index with five primary shards and one replica, each indexing request uses 10 shards. Search queries, on the other hand, are sent to n shards, where n is the number of primary shards. For an index with five primary shards and one replica, each search query uses five shards (primary or replica) from that index.

Use the following best practices to determine shard and data node counts for your domain (a worked sizing sketch follows the list):

Shard size – The size of data on disk is a direct result of the size of your source data, and it changes as you index more data. The source to index ratio can vary wildly, from 1:10 to 10:1 or more, but usually it's around 1:1.10. You can use that ratio to predict the index size on disk, or index some data and retrieve the actual index sizes to determine the ratio for your workload. Once you have a predicted index size, set a shard count so that each shard will be between 10 and 30 GiB (for search workloads) or between 30 and 50 GiB (for logs workloads). 50 GiB should be the maximum; be sure to plan for growth.

Shard count – The distribution of shards to data nodes has a large impact on a domain’s performance. When you have indexes with multiple shards, try to make the shard count an even multiple of the data node count to ensure that shards are evenly distributed across data nodes, and to prevent hot nodes. For example, if you have 12 primary shards, your data node count should be 2, 3, 4, 6, or 12. However, shard count is secondary to shard size; if you have 5 GiB of data, you should still use a single shard.

Shards per data node – The total number of shards that a node can hold is proportional to the node’s JVM heap memory. Aim for 25 shards or fewer per GiB of heap memory. For example, a node with 32 GiB of heap memory should hold no more than 800 shards. Although shard distribution can vary based on your workload patterns, note that there's a limit of 1,000 shards per node. The cat/allocation API provides a quick view of the number of shards and total shard storage across data nodes.

Shard to CPU ratio – When a shard is involved in an indexing or search request, it uses a vCPU to process the request. As a best practice, use an initial scale point of 1.5 vCPU per shard. If your instance type has 8 vCPUs, set your data node count so that each node has no more than six shards. Note that this is an approximation. Be sure to test your workload and scale your cluster accordingly.
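The following back-of-the-envelope sketch ties these guidelines together. The source data size, source-to-index ratio, target shard size, and heap size are example assumptions, not measurements from a real workload.

import math

# Example assumptions: 900 GiB of source data for a logs workload, the typical
# 1:1.10 source-to-index ratio, and a 40 GiB target shard size.
source_gib = 900
index_gib = source_gib * 1.10                             # ~990 GiB on disk
target_shard_gib = 40                                     # within 30-50 GiB for logs
primary_shards = math.ceil(index_gib / target_shard_gib)  # 25 primary shards

# In practice, round the shard count to an even multiple of your data node
# count (for example, 24 primaries on 3 or 6 data nodes) to avoid hot nodes.

# One replica doubles the number of shards the cluster must hold.
total_shards = primary_shards * 2                         # 50 shards

# Heap-based ceiling: aim for 25 shards or fewer per GiB of JVM heap.
heap_gib_per_node = 32
max_shards_per_node = 25 * heap_gib_per_node              # 800 shards per node

print(primary_shards, total_shards, max_shards_per_node)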

For storage volume, shard size, and instance type recommendations, see the sizing guidance elsewhere in the Amazon OpenSearch Service documentation.

Avoid storage skew

Storage skew is when one or more nodes within a cluster holds a higher proportion of storage for one or more indexes than the others. Uneven CPU utilization, intermittent and uneven latency, and uneven queueing across data nodes are all indications of storage skew. To determine whether you have skew issues, see the related troubleshooting sections in this guide.

Stability

The following best practices apply to maintaining a stable and healthy OpenSearch Service domain.

Keep current with OpenSearch

Service software updates

OpenSearch Service regularly releases software updates that add features or otherwise improve your domains. Updates do not change the OpenSearch or Elasticsearch engine version. We recommend that you schedule a recurring time to run the describe domain API operation, and trigger a service software update if the UpdateStatus is ELIGIBLE. If you don't update your domain within a certain timeframe (typically two weeks), OpenSearch Service automatically performs the update.
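A minimal boto3 sketch of that recurring check; the domain name is a placeholder.

import boto3

opensearch = boto3.client("opensearch")

# "my-domain" is a placeholder.
response = opensearch.describe_domain(DomainName="my-domain")
software = response["DomainStatus"]["ServiceSoftwareOptions"]

# Trigger the service software update only when one is available.
if software["UpdateStatus"] == "ELIGIBLE":
    opensearch.start_service_software_update(DomainName="my-domain")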

OpenSearch version upgrades

OpenSearch Service regularly adds support for community-maintained versions of OpenSearch. Always upgrade to the latest OpenSearch versions when they're available. OpenSearch Service simultaneously upgrades both OpenSearch and OpenSearch Dashboards (or Elasticsearch and Kibana if your domain is running a legacy engine). If the cluster has dedicated master nodes, upgrades complete without downtime. Otherwise, the cluster might be unresponsive for several seconds post-upgrade while it elects a master node. OpenSearch Dashboards might be unavailable during some or all of the upgrade.

You can upgrade a domain in one of two ways.

Regardless of which upgrade process you use, we recommend maintaining a domain that is solely for development and testing, and upgrading it to the new version before upgrading your production domain. Choose Development and testing for the deployment type when creating the test domain. Make sure to upgrade all clients to compatible versions immediately following the domain upgrade.
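As a sketch, you can run the upgrade eligibility check against the test domain with boto3 before starting the upgrade itself. The domain name and target version are placeholders; upgrade production only after validating the test domain.

import boto3

opensearch = boto3.client("opensearch")

# Run the upgrade eligibility check first (PerformCheckOnly=True). Once the
# check passes, repeat the call without PerformCheckOnly to start the upgrade.
# The domain name and target version are placeholders.
opensearch.upgrade_domain(
    DomainName="my-test-domain",
    TargetVersion="OpenSearch_2.11",
    PerformCheckOnly=True,
)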

Back up your data

You can take manual snapshots for cluster recovery, or to move data from one cluster to another. You have to initiate or schedule manual snapshots. Snapshots are stored in your own Amazon S3 bucket. For instructions to take and restore a snapshot, see Creating index snapshots in Amazon OpenSearch Service.
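A hedged sketch of registering an S3 snapshot repository and taking a manual snapshot with opensearch-py. The bucket, IAM role, index pattern, and snapshot name are placeholders, and repository registration has additional signing and permission requirements that are covered in the linked documentation.

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
)

# Register an S3 repository, then take a manual snapshot of selected indexes.
# The bucket, role ARN, and index pattern are placeholders; the role must
# allow OpenSearch Service to write to the bucket.
client.snapshot.create_repository(
    repository="manual-snapshots",
    body={
        "type": "s3",
        "settings": {
            "bucket": "my-snapshot-bucket",
            "region": "us-east-1",
            "role_arn": "arn:aws:iam::123456789012:role/SnapshotRole",
        },
    },
)
client.snapshot.create(
    repository="manual-snapshots",
    snapshot="nightly-2025-01-01",
    body={"indices": "my-index-*"},
)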

Enable dedicated master nodes

Dedicated master nodes improve cluster stability. A dedicated master node performs cluster management tasks, but does not hold index data or respond to client requests. This offloading of cluster management tasks increases the stability of your domain and allows for some configuration changes to happen without downtime.

Enable and use three dedicated master nodes for optimal domain stability across three Availability Zones. For instance type recommendations, see Choosing instance types for dedicated master nodes.
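For example, a minimal boto3 sketch that enables three dedicated master nodes on an existing domain; the domain name and instance type are placeholders.

import boto3

# The domain name and instance type are placeholders.
boto3.client("opensearch").update_domain_config(
    DomainName="my-domain",
    ClusterConfig={
        "DedicatedMasterEnabled": True,
        "DedicatedMasterCount": 3,
        "DedicatedMasterType": "m6g.large.search",
    },
)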

Deploy across multiple Availability Zones

To prevent data loss and minimize cluster downtime in the event of a service disruption, you can distribute nodes across two or three Availability Zones in the same AWS Region. Availability Zones are isolated locations within each Region. With a two-AZ configuration, losing one Availability Zone means that you lose half of all domain capacity. Moving to three Availability Zones further reduces the impact of losing a single Availability Zone.

Deploy mission-critical domains across three Availability Zones with two replica shards per index. This configuration lets OpenSearch Service distribute replica shards to different AZs than their corresponding primary shards. There are no cross-AZ data transfer charges for cluster communications between Availability Zones.
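A sketch of both halves of that recommendation, assuming placeholder domain, endpoint, credential, and index names: spread data nodes across three Availability Zones at the domain level, and keep two replicas per index.

import boto3
from opensearchpy import OpenSearch

# Spread data nodes across three Availability Zones (domain name is a placeholder).
boto3.client("opensearch").update_domain_config(
    DomainName="my-domain",
    ClusterConfig={
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    },
)

# Keep two replicas per index so every zone can hold a full copy of the data.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
)
client.indices.put_settings(index="my-index", body={"index.number_of_replicas": 2})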

Control ingest flow and buffering

We recommend reducing the overall request count by batching documents with the _bulk API operation. It's more efficient to send one _bulk request containing 5,000 documents than it is to send 5,000 requests containing a single document.
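For instance, a minimal opensearch-py sketch that batches documents into a single _bulk call; the endpoint, credentials, index name, and documents are placeholders.

from opensearchpy import OpenSearch, helpers

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
)

# Send 5,000 documents in one _bulk request rather than 5,000 single-document requests.
actions = (
    {"_index": "my-index", "_source": {"message": f"log line {i}"}}
    for i in range(5000)
)
helpers.bulk(client, actions)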

For optimal operational stability, it's sometimes necessary to limit or even pause the upstream flow of indexing requests. Limiting the rate of index requests is an important mechanism for dealing with unexpected or occasional spikes in requests that might otherwise overwhelm the cluster. Consider building a flow control mechanism into your upstream architecture.

The following diagram shows multiple component options for a log ingest architecture. Configure the aggregation layer to allow sufficient space to buffer incoming data for sudden traffic spikes and brief domain maintenance.

Create mappings for search workloads

For search workloads, create mappings that define how OpenSearch stores and indexes documents and their fields. Set dynamic to strict in order to prevent accidental addition of new fields:

PUT my-index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" },
      "author": { "type": "integer" },
      "year": { "type": "text" }
    }
  }
}

Use index templates

An index template is a way to tell OpenSearch how to configure an index when it's created. Configure index templates prior to creating indexes. Then, when you create an index, it inherits the settings and mappings from the template. You can apply more than one template to a single index, thus you can specify settings in one template and mappings in another. This strategy allows one template for common settings across multiple indexes, and separate templates for more specific settings and mappings.

The following settings are particularly helpful to configure in templates:

  • Number of primary and replica shards

  • Refresh interval (how often to refresh and make recent changes to the index available to search)

  • Dynamic mapping control

  • Explicit field mappings

The following example template contains each of these settings:

{ "index_patterns":[ "index-*" ], "order": 0, "settings": { "index": { "number_of_shards": 3, "number_of_replicas": 1, "refresh_interval": "60s" } }, "mappings": { "dynamic": false, "properties": { "field_name1": { "type": "keyword" } } } }

Even if they rarely change, having settings and mappings defined centrally in OpenSearch is simpler to manage than updating multiple upstream clients.

Manage indexes with Index State Management

If you're managing logs or time-series data, we recommend using Index State Management (ISM). ISM lets you automate regular index lifecycle management tasks. With ISM, you can create policies that trigger index alias rollovers, take index snapshots, move indexes between storage tiers, and delete old indexes. You can even use the ISM rollover operation as an alternative data lifecycle management strategy to avoid shard skew.

First, set up an ISM policy. See Sample policies for examples. Then, attach the policy to one or more indexes. If you include an ISM template field in the policy, OpenSearch Service automatically applies the policy to any index that matches the specified pattern.
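As a sketch, the following policy rolls indexes over and deletes them after 30 days, and includes an ISM template so it attaches automatically to matching indexes. The policy name, index pattern, and thresholds are placeholders, and rollover also requires a write alias configured for the managed indexes.

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
)

# Roll over log indexes at roughly 30 GiB or one day, then delete them after 30 days.
policy = {
    "policy": {
        "description": "Roll over and expire log indexes",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [{"rollover": {"min_size": "30gb", "min_index_age": "1d"}}],
                "transitions": [{"state_name": "delete", "conditions": {"min_index_age": "30d"}}],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
        "ism_template": {"index_patterns": ["log-*"], "priority": 100},
    }
}
client.transport.perform_request("PUT", "/_plugins/_ism/policies/log-lifecycle", body=policy)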

Remove unused indexes

Regularly review the indexes in your cluster and identify any that aren't in use. Take a snapshot of those indexes so that they're stored in S3, then delete them. Removing unused indexes reduces the shard count and allows for more balanced storage distribution and resource utilization across nodes. Even when idle, indexes consume some resources during internal index maintenance activities.

Rather than manually deleting unused indexes, you can use ISM to automatically take a snapshot and delete indexes after a certain period of time.

Use multiple domains for high availability

To achieve high availability beyond 99.9% uptime across multiple Regions, consider using two domains. For small or slowly changing data sets, you can set up cross-cluster replication to maintain an active-passive model where only the leader domain is written to, but either domain can be read from. For larger data sets and quickly changing data, configure dual delivery in your ingest pipeline so that all data is written independently to both domains in an active-active model.

Architect your upstream and downstream applications with failover in mind. Make sure to test the failover process along with other disaster recovery processes.

Performance

The following best practices apply to tuning your domains for optimal performance.

Optimize bulk request size and compression

Bulk sizing depends on your data, analysis, and cluster configuration, but a good starting point is 3–5 MiB per bulk request.

Send requests to and receive responses from your OpenSearch domains using gzip compression to reduce the payload size of requests and responses. You can use gzip compression with the OpenSearch Python client, or by including the following headers from the client side:

  • 'Accept-Encoding': 'gzip'

  • 'Content-Encoding': 'gzip'

To optimize your bulk request sizes, start with a bulk request size of 3 MiB, then slowly increase the request size until indexing performance stops improving.
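With the OpenSearch Python client, a single constructor flag handles both headers; the endpoint and credentials are placeholders.

from opensearchpy import OpenSearch

# http_compress=True gzips request bodies and requests gzipped responses,
# which is equivalent to setting the Content-Encoding and Accept-Encoding headers.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
    http_compress=True,
)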

Note

To enable gzip compression on domains running Elasticsearch version 6.x, you must set http_compression.enabled at the cluster level. This setting is true by default in Elasticsearch versions 7.x and all versions of OpenSearch.

Reduce the size of bulk request responses

To reduce the size of OpenSearch responses, exclude unnecessary fields with the filter_path parameter. Make sure not to filter out any fields that are required to identify or retry failed requests. For more information and examples, see Reducing response size.
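For example, a sketch that keeps only the fields needed to detect and retry failed bulk items; the endpoint, credentials, and documents are placeholders.

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
)

# Keep only the fields needed to spot and retry failed items in the bulk response.
bulk_body = [
    {"index": {"_index": "my-index", "_id": "1"}},
    {"message": "hello"},
]
response = client.bulk(
    body=bulk_body,
    params={"filter_path": "took,errors,items.*.error,items.*._id,items.*.status"},
)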

Tune refresh intervals

OpenSearch indexes have eventual read consistency. A refresh operation makes all updates that are performed on an index available for search. The default refresh interval is one second, which means OpenSearch performs a refresh every second while an index is being written to.

The less frequently you refresh an index (higher refresh interval), the better the overall indexing performance is. The trade-off of increasing the refresh interval is that there’s a longer delay between an index update and when the new data is available for search. Set your refresh interval as high as you can tolerate to improve overall performance.

We recommend setting the refresh_interval parameter for all of your indexes to 30 seconds or more.
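For example, assuming placeholder endpoint, credentials, and index name:

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("master-user", "master-password"),
    use_ssl=True,
)

# Refresh every 30 seconds instead of the default of once per second.
client.indices.put_settings(index="my-index", body={"index.refresh_interval": "30s"})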

Enable Auto-Tune

Auto-Tune uses performance and usage metrics from your OpenSearch cluster to suggest changes to queue sizes, cache sizes, and Java virtual machine (JVM) settings on your nodes. These optional changes improve cluster speed and stability. You can revert to the default OpenSearch Service settings at any time. Auto-Tune is enabled by default on new domains unless you explicitly disable it.

We recommend enabling Auto-Tune on all domains, and either setting a recurring maintenance window or periodically reviewing its recommendations.
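A minimal boto3 sketch that enables Auto-Tune on an existing domain; the domain name is a placeholder.

import boto3

# The domain name is a placeholder.
boto3.client("opensearch").update_domain_config(
    DomainName="my-domain",
    AutoTuneOptions={"DesiredState": "ENABLED"},
)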

Security

The following best practices apply to securing your domains.

Enable fine-grained access control

Fine-grained access control lets you control who can access certain data within an OpenSearch Service domain. Compared to generalized access control, fine-grained access control gives each cluster, index, document, and field its own specified policy for access. Access criteria can be based on a number of factors, including the role of the person requesting access and the action they intend to perform on the data. For example, you might give one user access to write to an index, while giving another user access only to read the index's data without making any changes.

Fine-grained access control allows data with different access requirements to exist in the same storage space without running into security or compliance issues.

We recommend enabling fine-grained access control on your domains.

Deploy domains within a VPC

Placing your OpenSearch Service domain within a Virtual Private Cloud (VPC) enables secure communication between OpenSearch Service and other services within the VPC, without the need for an internet gateway, NAT device, or VPN connection. All traffic remains securely within the AWS Cloud. Because of their logical isolation, domains that reside within a VPC have an extra layer of security compared to domains that use public endpoints.

We recommend that you create your domains within a VPC.

Apply a restrictive access policy

Even if your domain is deployed within a VPC, it's best to implement security in layers. Make sure to check the configuration of your current access policies.

Apply a restrictive resource-based access policy to your domains and follow the principle of least privilege when granting access to the configuration API and the OpenSearch APIs. As a general rule, avoid using the anonymous user principal "Principal": {"AWS": "*" } in your access policies. There are some situations, however, where it's acceptable to use an open access policy, such as when you enable fine-grained access control. An open access policy can enable you to access the domain in cases where request signing is difficult or impossible, such as from certain clients and tools.
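As an illustration, the following sketch applies a least-privilege resource-based policy that allows a single IAM role to call the OpenSearch APIs on one domain. The account ID, role name, Region, and domain name are placeholders.

import json
import boto3

# Account ID, role name, Region, and domain name are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/ingest-role"},
            "Action": "es:ESHttp*",
            "Resource": "arn:aws:es:us-east-1:123456789012:domain/my-domain/*",
        }
    ],
}
boto3.client("opensearch").update_domain_config(
    DomainName="my-domain",
    AccessPolicies=json.dumps(policy),
)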

Enable encryption at rest

OpenSearch Service domains offer encryption of data at rest to help prevent unauthorized access to your data. Encryption at rest uses AWS Key Management Service (AWS KMS) to store and manage your encryption keys, and the Advanced Encryption Standard algorithm with 256-bit keys (AES-256) to perform the encryption.

If your domain stores sensitive data, enable encryption of data at rest.

Enable node-to-node encryption

Node-to-node encryption provides an additional layer of security on top of the default security features within OpenSearch Service. It implements Transport Layer Security (TLS) for all communications between the nodes provisioned within OpenSearch. Node-to-node encryption ensures that any data sent to your OpenSearch Service domain over HTTPS remains encrypted in transit while it is being distributed and replicated between nodes.

If your domain stores sensitive data, enable node-to-node encryption.

Cost optimization

The following best practices apply to optimizing and saving on your OpenSearch Service costs.

Use the latest generation instance types

OpenSearch Service is always adopting new Amazon EC2 instance types that deliver better performance at a lower cost. We recommend always using the latest generation instances.

Avoid using T2 or t3.small instances for production domains, as they can become unstable under sustained heavy load. t3.medium instances are an option for small production workloads (both as data nodes and as dedicated master nodes).

Use UltraWarm and cold storage for time-series log data

If you're using OpenSearch for log analytics, move your data to UltraWarm or cold storage to reduce costs. Use Index State Management (ISM) to migrate data between storage tiers and manage data retention.

UltraWarm provides a cost-effective way to store large amounts of read-only data in OpenSearch Service. UltraWarm uses Amazon S3 for storage, which means that the data is immutable and only one copy is needed. You only pay for storage equivalent to the size of the primary shards in your indexes. Latencies for UltraWarm queries grow with the amount of S3 data needed to service the query. Once the data has been cached on the nodes, queries to UltraWarm indexes perform similarly to queries to hot indexes.

Cold storage is also backed by S3. When you need to query cold data, you can selectively attach it to existing UltraWarm nodes. Cold data incurs the same managed storage cost as UltraWarm, but objects in cold storage do not consume UltraWarm node resources and thus provide a virtually limitless amount of storage capacity without impacting UltraWarm node size or count.

UltraWarm becomes cost-effective once you have roughly 2.5 TiB of data in hot storage. Monitor your fill rate and plan to move indexes to UltraWarm before you reach that volume of data.

Review recommendations for Reserved Instances

Consider purchasing Reserved Instances (RIs) after you have a good baseline on your performance and compute consumption. Discounts start around 30% for no upfront, 1-year reservations and can increase up to 50% for all upfront, 3-year commitments.

Once you observe stable operation for at least 14 days, review Reserved Instance recommendations in the Cost Explorer. The Amazon OpenSearch Service heading displays specific RI purchase recommendations and projected savings.