Amazon OpenSearch Ingestion

Amazon OpenSearch Ingestion is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and OpenSearch Serverless collections.

With OpenSearch Ingestion, you no longer need to use third-party solutions like Logstash or Jaeger to ingest data into your OpenSearch Service domains and OpenSearch Serverless collections. You configure your data producers to send data to OpenSearch Ingestion. Then, it automatically delivers the data to the domain or collection that you specify. You can also configure OpenSearch Ingestion to transform your data before delivering it.

Also, with OpenSearch Ingestion, you don't need to worry about provisioning servers, managing and patching software, or scaling your cluster of servers. You provision ingestion pipelines directly within the AWS Management Console, and OpenSearch Ingestion takes care of managing and scaling them.

OpenSearch Ingestion is part of Amazon OpenSearch Service. It's powered by Data Prepper, an open source data collector that can filter, enrich, transform, normalize, and aggregate data for downstream analysis and visualization.

Key concepts

As you get started with OpenSearch Ingestion, you can benefit from understanding the following concepts:

Pipeline

From an OpenSearch Ingestion perspective, a pipeline refers to a single provisioned data collector that you create within OpenSearch Service. You can think of it as the entire YAML configuration file, which includes one or more sub-pipelines. For steps to create an ingestion pipeline, see Creating pipelines.

Sub-pipeline

You define sub-pipelines within a YAML configuration file. Each sub-pipeline is a combination of a source, a buffer, zero or more processors, and one or more sinks. You can define multiple sub-pipelines in a single YAML file, each with unique sources, processors, and sinks. To aid in monitoring with CloudWatch and other services, we recommend that you specify a pipeline name that's distinct from all of its sub-pipelines.
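
For illustration, here's a minimal sketch of a pipeline configuration file containing a single sub-pipeline named log-pipeline. The domain endpoint, index name, IAM role ARN, and field names are placeholders, and the options you need depend on your source, processors, and sinks:

    version: "2"
    log-pipeline:
      source:
        http:                      # push-based source; clients send data to the pipeline endpoint over HTTPS
      processor:
        - grok:                    # parse raw log lines into structured fields
            match:
              message: ['%{COMMONAPACHELOG}']
      sink:
        - opensearch:              # deliver records to an OpenSearch Service domain
            hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]   # placeholder domain endpoint
            index: "application_logs"
            aws:
              sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"   # placeholder pipeline role
              region: "us-east-1"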

You can string multiple sub-pipelines together within a single YAML file, so that the source for one sub-pipeline is another sub-pipeline, and its sink is a third sub-pipeline. For an example, see Using an OpenSearch Ingestion pipeline with OpenTelemetry Collector.
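
As a sketch of that chaining pattern (again with placeholder endpoint, role, and field names), the first sub-pipeline below uses a pipeline sink to hand records to the second, which names the first as its source:

    version: "2"
    entry-pipeline:
      source:
        http:
      sink:
        - pipeline:
            name: "processing-pipeline"    # hand records off to the next sub-pipeline
    processing-pipeline:
      source:
        pipeline:
          name: "entry-pipeline"           # consume records from the first sub-pipeline
      processor:
        - grok:
            match:
              message: ['%{COMMONAPACHELOG}']
      sink:
        - opensearch:
            hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
            index: "application_logs"
            aws:
              sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
              region: "us-east-1"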

Source

The input component of a sub-pipeline. It defines the mechanism through which a pipeline consumes records. The source can consume events either by receiving them over HTTPS, or by reading from external endpoints such as Amazon S3. There are two types of sources: push-based and pull-based. Push-based sources, such as the HTTP and OTel sources, stream records to ingestion endpoints. Pull-based sources, such as Amazon S3, pull data from the source.
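
For example, a pull-based sub-pipeline source might read newline-delimited objects from Amazon S3 in response to S3 event notifications delivered through an Amazon SQS queue. The following fragment is a sketch with a placeholder queue URL and role ARN:

      source:
        s3:
          notification_type: "sqs"          # read objects named in S3 event notifications
          codec:
            newline:                        # treat each line of an object as one event
          compression: "none"
          sqs:
            queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue"
          aws:
            region: "us-east-1"
            sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"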

Processors

Intermediate processing units that can filter, transform, and enrich records into a desired format before publishing them to the sink. The processor is an optional component of a pipeline. If you don't define a processor, records are published in the format defined in the source. You can have more than one processor. A pipeline runs processors in the order that you define them.
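
For example, a sub-pipeline might first parse raw log lines with a grok processor and then set the event timestamp with a date processor; the processors run in exactly that order. The following fragment is a sketch, and the field name is a placeholder:

      processor:
        - grok:                             # runs first: extract structured fields from the raw line
            match:
              message: ['%{COMMONAPACHELOG}']
        - date:                             # runs second: set the event timestamp
            from_time_received: true
            destination: "@timestamp"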

Sink

The output component of a sub-pipeline. It defines one or more destinations that a sub-pipeline publishes records to. OpenSearch Ingestion supports OpenSearch Service domains and OpenSearch Serverless collections as sinks. It also supports sub-pipelines as sinks, which means that you can string together multiple sub-pipelines within a single OpenSearch Ingestion pipeline (YAML file). Self-managed OpenSearch clusters aren't supported as sinks.

Buffer

The part of a sub-pipeline that acts as the layer between the source and the sink. You can't manually configure a buffer within your pipeline; OpenSearch Ingestion uses a default buffer configuration.

Route

The part of a sub-pipeline that lets pipeline authors send only the events that match certain conditions to specific sinks.
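
As an illustrative sketch, the following sub-pipeline defines two named routes with conditional expressions, and each sink declares which routes it accepts. The loglevel field, conditions, endpoint, and role ARN are placeholders:

    log-pipeline:
      source:
        http:
      route:
        - error-logs: '/loglevel == "ERROR"'   # events matching this condition get the error-logs route
        - info-logs: '/loglevel == "INFO"'
      sink:
        - opensearch:
            hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
            index: "error-logs"
            routes: [error-logs]               # receives only events on the error-logs route
            aws:
              sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
              region: "us-east-1"
        - opensearch:
            hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
            index: "info-logs"
            routes: [info-logs]                # receives only events on the info-logs route
            aws:
              sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
              region: "us-east-1"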

A valid sub-pipeline definition must contain a source and a sink. For more information about each of these pipeline elements, see the configuration reference.

Benefits of OpenSearch Ingestion

OpenSearch Ingestion has the following main benefits:

  • Eliminates the need for you to manually manage a self-provisioned pipeline.

  • Automatically scales your pipelines based on capacity limits that you define.

  • Keeps your pipeline up to date with security and bug patches.

  • Provides the option to connect pipelines to your virtual private cloud (VPC) for an added layer of security.

  • Allows you to stop and start pipelines in order to control costs.

  • Provides pipeline configuration blueprints for popular use cases to help you get up and running faster.

  • Allows you to interact programmatically with your pipelines through the various AWS SDKs and the OpenSearch Ingestion API.

  • Supports performance monitoring in Amazon CloudWatch and error logging in CloudWatch Logs.

Limitations

OpenSearch Ingestion has the following limitations:

  • You can only ingest data into domains running OpenSearch 1.0 or later, or Elasticsearch 6.8 or later. If you're using the OTel trace source, we recommend using Elasticsearch 7.9 or later so that you can use the OpenSearch Dashboards plugin.

  • If a pipeline is writing to an OpenSearch Service domain that's within a VPC, the pipeline must be created in the same AWS Region as the domain.

  • You can only configure a single data source within a pipeline definition.

  • You can't specify self-managed OpenSearch clusters as sinks.

  • You can't specify a custom endpoint as a sink. You can still write to a domain that has custom endpoints enabled, but you must specify its standard endpoint.

  • You can't specify resources within opt-in Regions as sources or sinks.

  • There are some constraints on the parameters that you can include in a pipeline configuration. For more information, see Configuration requirements and constraints.

Supported Data Prepper versions

OpenSearch Ingestion currently supports the following major versions of Data Prepper:

  • 2.x

When you create a pipeline, use the required version option to specify the major version of Data Prepper to use. For example, version: "2". OpenSearch Ingestion retrieves the latest supported minor version of that major version and provisions the pipeline with that version. For more information, see Specifying the pipeline version.
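
The version option sits at the top level of the YAML configuration file, alongside your sub-pipeline definitions. For example:

    # Pin the major version of Data Prepper. OpenSearch Ingestion provisions the
    # pipeline with the latest supported minor version of this major version.
    version: "2"

    log-pipeline:
      # ... sub-pipeline definition (source, processor, sink), as in the earlier examples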

Currently, OpenSearch Ingestion pipelines are provisioned with version 2.7 of Data Prepper. For information, see the 2.7 release notes. For information about the features and bug fixes that are in each version of Data Prepper, see the Releases page. Not every minor version of a particular major version is supported by OpenSearch Ingestion.

When you update a pipeline's YAML configuration file, if there's support for a new minor version of Data Prepper, OpenSearch Ingestion automatically upgrades the pipeline to the latest supported minor version of the major version that's specified in the pipeline configuration. For example, you might have version: "2" in your pipeline configuration, and OpenSearch Ingestion initially provisioned the pipeline with version 2.6.0. When support for version 2.7.0 is added, and you make a change to the pipeline configuration, OpenSearch Ingestion upgrades the pipeline to version 2.7.0. This process keeps your pipeline up to date with the latest bug fixes and performance improvements. OpenSearch Ingestion can't update the major version of your pipeline unless you manually change the version option within the pipeline configuration. For more information, see Updating Amazon OpenSearch Ingestion pipelines.

Scaling pipelines

You don't need to provision and manage pipeline capacity yourself. OpenSearch Ingestion automatically scales your pipeline capacity according to your estimated workload, based on the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs) that you specify.

Each Ingestion OCU is a combination of approximately 8 GiB of memory and 2 vCPUs. You can specify the minimum and maximum OCU values for a pipeline, and OpenSearch Ingestion automatically scales your pipeline capacity based on these limits.

You can specify the following values:

  • Minimum capacity – The pipeline can reduce capacity to this number of Ingestion OCUs. The specified minimum capacity is also the starting capacity for a pipeline.

  • Maximum capacity – The pipeline can increase capacity up to this number of Ingestion OCUs.

Make sure that the maximum capacity for a pipeline is high enough to handle spikes in workload, and the minimum capacity is low enough to minimize costs when the pipeline isn't busy. Based on your settings, OpenSearch Ingestion automatically scales the number of Ingestion OCUs for your pipeline to process the ingest workload. At any specific time, you're charged only for the Ingestion OCUs that are being actively used by your pipeline.
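
Minimum and maximum capacity are settings on the pipeline resource itself, not part of the YAML pipeline configuration. For example, if you provision pipelines with AWS CloudFormation (one option alongside the console, SDKs, and the OpenSearch Ingestion API), a sketch might look like the following, with placeholder names, capacity values, and configuration body:

    Resources:
      LogIngestionPipeline:
        Type: AWS::OSIS::Pipeline
        Properties:
          PipelineName: my-log-pipeline
          MinUnits: 2          # starting and lowest capacity, in Ingestion OCUs
          MaxUnits: 8          # upper bound that automatic scaling can reach
          PipelineConfigurationBody: |
            version: "2"
            log-pipeline:
              source:
                http:
              sink:
                - opensearch:
                    hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
                    index: "application_logs"
                    aws:
                      sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
                      region: "us-east-1"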

The capacity allocated to your OpenSearch Ingestion pipeline scales up and down based on the processing requirements of your pipeline and the load generated by your client application. When capacity is constrained, OpenSearch Ingestion scales up by allocating more Ingestion OCUs. When your pipeline is processing smaller workloads, or not processing data at all, it can scale down to the minimum configured Ingestion OCUs.

You can specify a minimum of 1 Ingestion OCU, a maximum of 96 Ingestion OCUs for stateless pipelines, and a maximum of 48 Ingestion OCUs for stateful pipelines. We recommend a minimum of at least 2 Ingestion OCUs for push-based sources. When persistent buffering is enabled, you can specify a minimum of 2 and a maximum of 384 Ingestion OCUs.

Given a standard log pipeline with a single source, a simple grok pattern, and a sink, each compute unit can support up to 2 MiB per second. For example, a simple log pipeline that needs to sustain a peak of about 8 MiB per second would need a maximum capacity of at least 4 Ingestion OCUs. More complex log pipelines with multiple processors might support less ingest load per compute unit. OpenSearch Ingestion scales the pipeline based on its configured capacity and its resource utilization.

To ensure high availability, Ingestion OCUs are distributed across Availability Zones (AZs). The number of AZs depends on the minimum capacity that you specify.

For example, if you specify a minimum of 2 compute units, the Ingestion OCUs that are in use at any given time are evenly distributed across 2 AZs. If you specify a minimum of 3 or more compute units, the Ingestion OCUs are evenly distributed across 3 AZs. We recommend that you provision at least two Ingestion OCUs to ensure 99.9% availability for your ingest pipelines.

You're not billed for Ingestion OCUs when a pipeline is in the Create failed, Creating, Deleting, or Stopped states.

For instructions to configure and retrieve capacity settings for a pipeline, see Creating pipelines.

OpenSearch Ingestion pricing

At any specific time, you only pay for the number of Ingestion OCUs that are allocated to a pipeline, regardless of whether there's data flowing through the pipeline. OpenSearch Ingestion immediately accommodates your workloads by scaling pipeline capacity up or down based on usage.

For full pricing details, see Amazon OpenSearch Service pricing.

Supported AWS Regions

OpenSearch Ingestion is available in a subset of the AWS Regions in which OpenSearch Service is available. For a list of supported Regions, see Amazon OpenSearch Service endpoints and quotas in the AWS General Reference.

OpenSearch Ingestion quotas

For a list of default quotas for OpenSearch Ingestion resources, see Amazon OpenSearch Service quotas.