
Data modeling

Amazon Timestream is designed to collect, store, and analyze time series data from applications and devices that emit a sequence of data points with timestamps. For optimal performance, the data sent to Timestream must have temporal characteristics, and time must be a fundamental component of the data.

Timestream provides you the flexibility to model your data in different ways to suit your application's requirements. This section covers several of these patterns and provides guidelines to help you optimize your costs and performance. Familiarize yourself with key Timestream concepts such as dimensions and measures. In this section, you will learn more about:


  • Which data to put in the same table, and when to separate data across multiple tables and databases.

  • How to choose between Timestream's multi-measure and single-measure records, and the benefits of modeling with multi-measure records, especially when your application tracks multiple measurements at the same time instant.

  • Which attributes to model as dimensions and which as measures.

  • How to use the measure name attribute effectively to optimize your query latency.

Single table vs. multiple tables

As you model your data in your application, another important consideration is how to organize it into tables and databases. Databases and tables in Timestream are abstractions for access control, specifying KMS keys, retention periods, and so on. Timestream automatically partitions your data and is designed to scale resources to match the ingestion, storage, and query load and requirements of your applications.

A table in Timestream can scale to petabytes of data stored, tens of gigabytes per second of data writes, and queries that process hundreds of TBs per hour. Queries in Timestream can span multiple tables and databases, using joins and unions to provide seamless access to your data across tables and databases. Therefore, the scale of data or request volume is usually not the primary concern when deciding how to organize your data in Timestream. Below are some important considerations for deciding which data to co-locate in the same table versus in different tables, or in tables in different databases.

  • Data retention policies (memory store retention, magnetic store retention, and so on) are supported at the granularity of a table. Therefore, data that requires different retention policies needs to be in different tables.

  • The AWS KMS key used to encrypt your data is configured at the database level. Therefore, different encryption key requirements imply that the data needs to be in different databases.

  • Timestream supports resource-based access control at the granularity of tables and databases. Consider your access control requirements when deciding which data you write to the same table vs. different tables.

  • Be aware of the limits on the number of dimensions, measure names, and multi-measure attribute names when deciding which data is stored in which table.

  • Consider your query workload and access patterns when deciding how to organize your data, because query latency and the ease of writing your queries depend on it.

    • If you store data that you frequently query together in the same table, it generally simplifies your queries, so you can often avoid writing joins, unions, or common table expressions. This also usually results in lower query latency. You can use predicates on dimensions and measure names to filter the data that is relevant to your queries.

      For instance, consider a case where you store data from devices located on six continents. If your queries frequently access data from across continents to get a globally aggregated view, then storing data from these continents in the same table results in easier-to-write queries. On the other hand, if you store the data in different tables, you can still combine it in the same query, but you will need to write a query that unions the data across tables.

    • Timestream uses adaptive partitioning and indexing on your data, so queries are only charged for the data that is relevant to them. For instance, if you have a table storing data from a million devices across six continents, and your query has a predicate of the form WHERE device_id = 'abcdef' or WHERE continent = 'North America', then the query is only charged for the data for that device or that continent.

    • Wherever possible, use the measure name to separate out data in the same table that is not emitted at the same time or not frequently queried together. Then, by using predicates such as WHERE measure_name = 'cpu' in your query, you not only get the metering benefits, but Timestream can also effectively eliminate partitions that do not contain the measure name used in your query predicate. This enables you to store related data with different measure names in the same table without impacting query latency or costs, and avoids spreading the data into multiple tables. The measure name is essentially used to partition the data and prune partitions irrelevant to the query.

Multi-measure records vs. single-measure records

Timestream allows you to write data with multiple measures per record (multi-measure records) or a single measure per record (single-measure records).

Multi-measure records

In many use cases, a device or an application you are tracking may emit multiple metrics or events at the same timestamp. In such cases, you can store all the metrics emitted at the same timestamp in the same multi-measure record. That is, all the measures stored in the same multi-measure record appear as different columns in the same row of data.

Consider, for instance, an application emitting metrics such as cpu, memory, and disk_iops from a device, measured at the same time instant. Below is an example of such a table, where multiple metrics emitted at the same time instant are stored in the same row. Two hosts are emitting the metrics once every second.

Hostname     measure_name  Time                 cpu  memory  disk_iops
host-24Gju   metrics       2021-12-01 19:00:00  35   54.9    38.2
host-24Gju   metrics       2021-12-01 19:00:01  36   58      39
host-28Gju   metrics       2021-12-01 19:00:00  15   55      92
host-28Gju   metrics       2021-12-01 19:00:01  16   50      40
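Rows like the ones above map naturally onto the multi-measure request shape of the WriteRecords API. Below is a minimal sketch in Python that builds such a record as a plain dict, assuming the boto3 timestream-write request format; the host name and values come from the table above, and the database and table names in the commented-out call are hypothetical:

```python
# Build a Timestream multi-measure record in the WriteRecords request shape.
# All metrics sampled at the same instant go into one record with
# MeasureValueType MULTI; each metric becomes a named measure value.

def build_multi_measure_record(hostname, epoch_ms, cpu, memory, disk_iops):
    return {
        "Dimensions": [{"Name": "hostname", "Value": hostname}],
        "MeasureName": "metrics",        # used for partitioning and pruning
        "MeasureValueType": "MULTI",
        "MeasureValues": [
            {"Name": "cpu", "Value": str(cpu), "Type": "DOUBLE"},
            {"Name": "memory", "Value": str(memory), "Type": "DOUBLE"},
            {"Name": "disk_iops", "Value": str(disk_iops), "Type": "DOUBLE"},
        ],
        "Time": str(epoch_ms),
        "TimeUnit": "MILLISECONDS",
    }

record = build_multi_measure_record("host-24Gju", 1638385200000, 35, 54.9, 38.2)
# With boto3 (requires AWS credentials; names are hypothetical):
# boto3.client("timestream-write").write_records(
#     DatabaseName="devops_db", TableName="host_metrics", Records=[record])
```

Because all three metrics share one record, the per-record overhead (dimensions, time) is paid once rather than three times, which is the source of the ingestion and storage savings discussed later in this section.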

Single-measure records

Single-measure records are suitable when your devices emit different metrics at different times, or when you use custom processing logic that emits metrics or events at different times, for instance when a device's reading or state changes. Because each measure has its own timestamp, measures can be stored as individual records in Timestream. For instance, consider an IoT sensor tracking soil temperature and moisture that emits a record only when it detects a change from the previously reported reading. The table below shows an example of such data emitted as single-measure records.

device_id      measure_name  Time                 measure_value::double  measure_value::bigint
sensor-sea478  temperature   2021-12-01 19:22:32  35                     NULL
sensor-sea478  temperature   2021-12-01 18:07:51  36                     NULL
sensor-sea478  moisture      2021-12-01 19:05:30  NULL                   21
sensor-sea478  moisture      2021-12-01 19:00:01  NULL                   23
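For contrast with the multi-measure shape, here is a sketch of the single-measure record shape, again assuming the boto3 WriteRecords request format; the sensor ID is taken from the table above:

```python
# A single-measure record: one metric per record, each with its own timestamp.
# The measure name carries the metric name (temperature, moisture, ...), and
# the value's data type is declared in MeasureValueType.

def build_single_measure_record(device_id, epoch_ms, measure_name, value, value_type):
    return {
        "Dimensions": [{"Name": "device_id", "Value": device_id}],
        "MeasureName": measure_name,
        "MeasureValue": str(value),
        "MeasureValueType": value_type,  # DOUBLE, BIGINT, VARCHAR, or BOOLEAN
        "Time": str(epoch_ms),
        "TimeUnit": "MILLISECONDS",
    }

# Emit a record only when a reading changes, one record per measure:
records = [
    build_single_measure_record("sensor-sea478", 1638386552000, "temperature", 35.0, "DOUBLE"),
    build_single_measure_record("sensor-sea478", 1638385530000, "moisture", 21, "BIGINT"),
]
```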

Comparing single measure and multi-measure records

Timestream provides you the flexibility to model your data as single-measure or multi-measure records, depending on your application's requirements and characteristics. A single table can store both single-measure and multi-measure records if your application so requires. In general, when your application emits multiple measures or events at the same time instant, modeling the data as multi-measure records is usually recommended for performant data access and cost-effective data storage.

For instance, consider a DevOps use case tracking metrics and events from hundreds of thousands of servers, where each server periodically emits 20 metrics and 5 events, and the events and metrics are emitted at the same time instant. That data can be modeled either as single-measure records or as multi-measure records (see the open-sourced data generator for the resulting schema). For this use case, modeling the data with multi-measure records compared to single-measure records results in:

  • Ingestion metering - Multi-measure records result in about 40% lower ingestion bytes written.

  • Ingestion batching - Multi-measure records result in bigger batches of data being sent, which means clients need fewer threads and less CPU to process the ingestion.

  • Storage metering - Multi-measure records result in about 8X lower storage, producing significant storage savings for both the memory and magnetic stores.

  • Query latency - Multi-measure records result in lower query latency for most query types when compared to single-measure records.

  • Query metered bytes - For queries scanning less than 10 MB of data, single-measure and multi-measure records are comparable. For queries accessing a single measure and scanning more than 10 MB of data, single-measure records usually result in lower metered bytes. For queries referencing three or more measures, multi-measure records result in lower metered bytes.

  • Ease of expressing multi-measure queries - When your queries reference multiple measures, modeling your data with multi-measure records results in easier-to-write, more compact queries.

The preceding factors will vary depending on how many metrics you track, how many dimensions your data has, and so on. While the preceding example provides concrete data for one scenario, across many application scenarios and use cases we see that if your application emits multiple measures at the same time instant, storing the data as multi-measure records is more effective. Moreover, multi-measure records give you flexibility in data types and let you store multiple other values as context (for example, request IDs and additional timestamps, discussed below).

Note that a multi-measure record can also model sparse measures, such as the earlier single-measure example: you can use measure_name to store the name of the measure, and use generic multi-measure attribute names, such as value_double for DOUBLE measures, value_bigint for BIGINT measures, value_timestamp for additional TIMESTAMP values, and so on.
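This sparse layout can be sketched as a small helper that routes each value to a generic attribute name based on its data type. The helper and the attribute names value_double and value_bigint follow the naming convention just described; the request shape assumes the boto3 WriteRecords format:

```python
# Route a sparse measure into a multi-measure record: the metric name stays
# in measure_name, and the value lands in a generic per-type attribute.
GENERIC_ATTR = {"DOUBLE": "value_double",
                "BIGINT": "value_bigint",
                "TIMESTAMP": "value_timestamp"}

def build_sparse_record(device_id, epoch_ms, measure_name, value, value_type):
    return {
        "Dimensions": [{"Name": "device_id", "Value": device_id}],
        "MeasureName": measure_name,     # e.g. temperature, moisture
        "MeasureValueType": "MULTI",
        "MeasureValues": [{"Name": GENERIC_ATTR[value_type],
                           "Value": str(value), "Type": value_type}],
        "Time": str(epoch_ms),
        "TimeUnit": "MILLISECONDS",
    }

rec = build_sparse_record("sensor-sea478", 1638386552000, "temperature", 35.0, "DOUBLE")
```

With this layout, all DOUBLE measures share the value_double column regardless of their measure name, which keeps the column count bounded even when the set of metric names is large.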

Dimensions and measures

A table in Timestream allows you to store dimensions (identifying attributes of the device or data you are storing) and measures (the metrics or values you are tracking); see Timestream concepts for more details. As you model your application on Timestream, how you map your data into dimensions and measures impacts your ingestion and query latency. Below are guidelines for modeling your data as dimensions and measures that you can apply to your use case.

Choosing dimensions

Data that identifies the source sending the time series data is a natural fit for dimensions, that is, attributes that do not change over time. For instance, if you have a server emitting metrics, the attributes identifying the server, such as host name, region, rack, and availability zone, are candidates for dimensions. Similarly, for an IoT device with multiple sensors reporting time series data, the device ID, sensor ID, and so on are candidates for dimensions.

If you are writing data as multi-measure records, dimensions and multi-measure attributes appear as columns in the table when you run a DESCRIBE or SELECT statement on the table. Therefore, when writing your queries, you can freely use the dimensions and measures in the same query. However, as you construct your write record to ingest data, keep the following in mind when choosing which attributes to specify as dimensions and which as measure values:

  • The dimension names and values, the measure name, and the timestamp uniquely identify the time series data. Timestream uses this unique identifier to automatically de-duplicate data. That is, if Timestream receives two data points with the same dimension names, dimension values, measure name, and timestamp, and the values have the same version number, Timestream de-duplicates them. If a new write request has a lower version than data already in Timestream, the write request is rejected. If the new write request has a higher version, the new value overwrites the old value. Therefore, how you choose your dimension values impacts this de-duplication behavior.

  • Dimension names and values cannot be updated, but measure values can. So any data that might need updates is better modeled as a measure value. For instance, if you have a machine on the factory floor whose color can change, you can model the color as a measure value, unless you also want to use the color as an identifying attribute needed for de-duplication. That is, measure values can be used to store attributes that change slowly over time.
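The version-based overwrite behavior described above can be sketched with the Version field of the WriteRecords request shape (field names follow the boto3 format; the host name is illustrative): two records with identical dimensions, measure name, and timestamp differ only in their version, and the higher version wins.

```python
# Two records with the same dimensions, measure name, and time: Timestream
# keeps the value carried by the record with the higher Version number.

def build_versioned_record(hostname, epoch_ms, cpu, version):
    return {
        "Dimensions": [{"Name": "hostname", "Value": hostname}],
        "MeasureName": "cpu",
        "MeasureValue": str(cpu),
        "MeasureValueType": "DOUBLE",
        "Time": str(epoch_ms),
        "TimeUnit": "MILLISECONDS",
        "Version": version,
    }

first = build_versioned_record("host-24Gju", 1638385200000, 35.0, version=1)
update = build_versioned_record("host-24Gju", 1638385200000, 36.5, version=2)
# Writing `update` after `first` overwrites the cpu value for this series
# and timestamp; a later write with Version <= 1 would be rejected.
```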

Note that a table in Timestream does not limit the number of unique combinations of dimension names and values. For instance, you can have billions of such unique combinations stored in a table. However, as you will see in the examples below, a careful choice of dimensions and measures can significantly optimize your request latency, especially for queries.

Unique IDs in dimensions

If your application scenario requires you to store a unique identifier for every data point (for example, a request ID, a transaction ID, or a correlation ID), modeling the ID attribute as a measure value results in significantly better query latency. When you model your data with multi-measure records, the ID appears in the same row, in context with your other dimensions and time series data, so your queries can continue to use it effectively. For instance, in a DevOps use case where every data point emitted by a server has a unique request ID attribute, modeling the request ID as a measure value results in up to 4X lower query latency across different query types compared to modeling it as a dimension.

Similar reasoning applies to attributes that are not entirely unique for every data point but have hundreds of thousands or millions of unique values. You can model those attributes as either dimensions or measure values. You would want to model such an attribute as a dimension if its values are necessary for de-duplication on the write path, as discussed earlier, or if you often use it as a predicate in your queries (that is, in the WHERE clause with an equality predicate on a value of that attribute, such as device_id = 'abcde' where your application tracks millions of devices).

Richness of data types with multi-measure records

Multi-measure records provide you the flexibility to model your data effectively. Data that you store in a multi-measure record appears as columns in the table, similar to dimensions, providing the same ease of querying for dimensions and measure values. You saw some of these patterns in the examples discussed earlier. Below are additional patterns for using multi-measure records effectively to meet your application's use cases.

Multi-measure records support attributes of the data types DOUBLE, BIGINT, VARCHAR, BOOLEAN, and TIMESTAMP. Therefore, they naturally fit different types of attributes:

  • Location information: For instance, if you want to track a location (expressed as latitude and longitude), then modeling it as multi-measure attributes results in lower query latency compared to storing it as VARCHAR dimensions, especially when you have predicates on the latitude and longitude.

  • Multiple timestamps in a record: If your application scenario requires you to track multiple timestamps for a time series record, you can model them as additional attributes in the multi-measure record. This pattern can be used to store data with future or past timestamps. Note that every record still uses the timestamp in the time column to partition, index, and uniquely identify the record.

In particular, if you have numeric data or timestamps with predicates in the query, modeling those attributes as multi-measure attributes rather than dimensions results in lower query latency. This is because with the rich data types supported in multi-measure records, you can express the predicates using native data types, instead of casting values from VARCHAR to another data type, as you would if you had modeled such data as dimensions.
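As an illustration of both patterns, the sketch below stores latitude and longitude as DOUBLE attributes and an additional processing timestamp as a TIMESTAMP attribute in one multi-measure record. The attribute names and the epoch-milliseconds encoding of the extra timestamp are assumptions for this example; the shape follows the boto3 WriteRecords format:

```python
# One multi-measure record carrying native numeric location attributes plus
# a second timestamp stored as data. The record is still partitioned,
# indexed, and identified by the top-level Time field below.

def build_location_record(device_id, epoch_ms, lat, lon, processed_at_ms):
    return {
        "Dimensions": [{"Name": "device_id", "Value": device_id}],
        "MeasureName": "location",
        "MeasureValueType": "MULTI",
        "MeasureValues": [
            {"Name": "latitude", "Value": str(lat), "Type": "DOUBLE"},
            {"Name": "longitude", "Value": str(lon), "Type": "DOUBLE"},
            {"Name": "processed_at", "Value": str(processed_at_ms), "Type": "TIMESTAMP"},
        ],
        "Time": str(epoch_ms),
        "TimeUnit": "MILLISECONDS",
    }

rec = build_location_record("truck-0042", 1638385200000, 47.61, -122.33, 1638385205000)
```

Because latitude and longitude are DOUBLEs rather than VARCHAR dimensions, a query can filter them directly (for example, a range predicate on latitude) without casting.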

Using measure name with multi-measure records

Tables in Timestream support a special attribute (or column) called measure name. You specify a value for this attribute for every record you write to Timestream. For single-measure records, it is natural to use the name of your metric (such as cpu or memory for server metrics, or temperature or pressure for sensor metrics). With multi-measure records, the attributes in a record are named, and those names become column names in the table; cpu and memory, or temperature and pressure, can therefore become multi-measure attribute names instead. So a natural question is how to use the measure name effectively.

Timestream uses the values in the measure name attribute to partition and index the data. Therefore, if a table has multiple different measure names, and queries use those values as predicates, Timestream can use its custom partitioning and indexing to prune out data that is not relevant to the queries. For instance, if your table has cpu and memory measure names, and your query has a predicate WHERE measure_name = 'cpu', Timestream can effectively prune the data for measure names not relevant to the query, that is, the rows with measure name memory in this example. This pruning applies even when measure names are used with multi-measure records, so you can use the measure name attribute effectively as a partitioning attribute for a table. Measure name, along with dimension names and values and time, is used to partition the data in a Timestream table. Be aware of the limits on the number of unique measure names allowed in a Timestream table.

Also note that a measure name is associated with a measure value data type; that is, a single measure name can only be associated with one type of measure value. That type can be one of DOUBLE, BIGINT, BOOLEAN, VARCHAR, and MULTI. Multi-measure records stored with a measure name have the data type MULTI. Since a single multi-measure record can store multiple metrics with different data types (DOUBLE, BIGINT, VARCHAR, BOOLEAN, and TIMESTAMP), you can associate data of different types with a single multi-measure record.

Below are a few examples of how the measure name attribute can be used effectively to group different types of data in the same table.

IoT sensors reporting quality and value

Consider an application monitoring data from IoT sensors. Each sensor tracks different measures, such as temperature and pressure. In addition to the actual values, the sensors also report the quality of the measurements, which is a measure of how accurate the reading is, and a unit for the measurement. Since quality, unit, and value are emitted together, they can be modeled as multi-measure records, as shown in the example data below, where device_id is a dimension and quality, value, and unit are multi-measure attributes:

device_id      measure_name  Time                 quality  value  unit
sensor-sea478  temperature   2021-12-01 19:22:32  92       35     c
sensor-sea478  temperature   2021-12-01 18:07:51  93       34     c
sensor-sea478  pressure      2021-12-01 19:05:30  98       31     psi
sensor-sea478  pressure      2021-12-01 19:00:01  24       132    psi

This approach lets you combine the benefits of multi-measure records with partitioning and pruning data by the values of measure name. If queries reference a single measure, for example temperature, you can include a measure name predicate in the query. Below is an example of such a query, which also projects the unit for measurements whose quality is above 90.

SELECT device_id, time, value AS temperature, unit
FROM db.table
WHERE time > ago(1h) AND measure_name = 'temperature' AND quality > 90

Using the measure_name predicate on the query enables Timestream to effectively prune partitions and data that is not relevant to the query, thus improving your query latency.

It is also possible to store all of the metrics in the same multi-measure record if all the metrics are emitted at the same timestamp, or if multiple metrics are queried together in the same query. For instance, you can construct a multi-measure record with attributes temperature_quality, temperature_value, temperature_unit, pressure_quality, pressure_value, pressure_unit, and so on. Many of the points discussed earlier about modeling data with single-measure vs. multi-measure records apply to this decision. Consider your query access patterns and how your data is generated to choose a model that optimizes your cost, ingestion and query latency, and ease of writing your queries.

Different types of metrics in the same table

Another use case where you can combine multi-measure records with measure name values is modeling different types of data that are independently emitted from the same device. Consider a DevOps monitoring use case where servers emit two types of data: regularly emitted metrics and irregular events. An example of this approach is the schema discussed in the data generator modeling a DevOps use case. In this case, you can store the different types of data emitted from the same server in the same table by using different measure names. For instance, all the metrics emitted at the same time instant are stored with the measure name metrics. All the events, which are emitted at a different time instant from the metrics, are stored with the measure name events. The measure schema for the table (that is, the output of a SHOW MEASURES query) is:

measure_name data_type Dimensions
events multi [{"data_type":"varchar","dimension_name":"availability_zone"},{"data_type":"varchar","dimension_name":"microservice_name"},{"data_type":"varchar","dimension_name":"instance_name"},{"data_type":"varchar","dimension_name":"process_name"},{"data_type":"varchar","dimension_name":"jdk_version"},{"data_type":"varchar","dimension_name":"cell"},{"data_type":"varchar","dimension_name":"region"},{"data_type":"varchar","dimension_name":"silo"}]
metrics multi [{"data_type":"varchar","dimension_name":"availability_zone"},{"data_type":"varchar","dimension_name":"microservice_name"},{"data_type":"varchar","dimension_name":"instance_name"},{"data_type":"varchar","dimension_name":"os_version"},{"data_type":"varchar","dimension_name":"cell"},{"data_type":"varchar","dimension_name":"region"},{"data_type":"varchar","dimension_name":"silo"},{"data_type":"varchar","dimension_name":"instance_type"}]

In this case, you can see that events and metrics also have different sets of dimensions: events have the dimensions jdk_version and process_name, while metrics have the dimensions instance_type and os_version.

Using different measure names allows you to write queries with predicates such as WHERE measure_name = 'metrics' to get only the metrics. Also, having all the data emitted from the same instance in the same table means you can write a simpler query with an instance_name predicate to get all the data for that instance. For instance, a predicate of the form WHERE instance_name = 'instance-1234' without a measure_name predicate returns all data for that specific server instance.
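A hedged sketch of issuing such a query from Python follows. The query string construction is plain Python and the database, table, and instance names are hypothetical; the boto3 timestream-query call is shown as a comment because it requires AWS credentials:

```python
# Build a query that relies on measure_name pruning: only partitions holding
# rows with measure_name = 'metrics' for the given instance are scanned.

def build_metrics_query(database, table, instance_name):
    return (
        f'SELECT * FROM "{database}"."{table}" '
        f"WHERE measure_name = 'metrics' "
        f"AND instance_name = '{instance_name}' "
        "AND time > ago(1h)"
    )

query = build_metrics_query("devops_db", "host_metrics", "instance-1234")
# With boto3 (requires AWS credentials), paginate through the result rows:
# client = boto3.client("timestream-query")
# for page in client.get_paginator("query").paginate(QueryString=query):
#     for row in page["Rows"]:
#         print(row)
```

Dropping the measure_name predicate from the generated string would return both metrics and events for the instance, at the cost of scanning the additional partitions.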