Gaming Analytics Pipeline

Design Considerations

Shard Count

The number of shards you need for a new Amazon Kinesis stream depends on the amount of streaming data you plan to produce. Each shard supports up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). For example, an application that produces 100 records per second, with a size of 35 KB per record, has a total data input rate of about 3.4 MB per second; dividing that rate by the 1 MB per second limit of a single shard and rounding up gives 4 shards.

By default, the telemetry stream is deployed with four shards, and the file stream is deployed with two shards.

While there is no upper limit to the number of shards in a stream or account, each region has a default shard limit. For information on shard limits, please visit Amazon Kinesis Data Streams Limits. To request an increase in your shard limit, please use the Stream Limits form.

Batch Configuration

By default, the solution’s S3Connector application writes a batch to Amazon Simple Storage Service (Amazon S3) every 100 MB, 10 minutes, or 500,000 records per shard, whichever occurs first. The RedshiftConnector application refreshes the pointers to the batched telemetry files every 10 minutes or every 36 files. As a result, it can take up to 20 minutes from when event data enters the pipeline to when it is available for analysis.

Solution Mode

You can deploy the solution in one of two modes: Demo to test the solution or Prod for use in a production environment. Demo mode uses smaller batch sizes so that you can test the solution quickly; Prod mode uses larger batch sizes to reduce cost.

Amazon Redshift Database Schema

The Gaming Analytics Pipeline uses a time-series table strategy to improve maintenance operations and query times. By default, the solution retains data in Amazon Redshift for six months. The data is stored in monthly tables (game.events_YYYY_MM), and the solution also includes a view (game.events) that combines these tables to make querying easier.
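For illustration only, the combined view can be thought of as a UNION ALL over the monthly tables. The solution creates and maintains the view for you; the month suffixes in the following sketch are placeholders for whichever months are currently retained.

-- Illustrative sketch only; the solution manages game.events automatically.
-- The monthly table names below are placeholders.
CREATE OR REPLACE VIEW game.events AS
SELECT * FROM game.events_2017_01
UNION ALL
SELECT * FROM game.events_2017_02;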

The solution uses the following schema for game.events_YYYY_MM tables.

CREATE TABLE IF NOT EXISTS game.events_YYYY_MM (
    app_name         VARCHAR(64) ENCODE ZSTD,
    app_version      VARCHAR(64) ENCODE ZSTD,
    event_version    VARCHAR(64) ENCODE ZSTD,
    event_id         VARCHAR(36) NOT NULL ENCODE ZSTD,
    event_type       VARCHAR(256) NOT NULL ENCODE ZSTD,
    event_timestamp  TIMESTAMP NOT NULL ENCODE RAW,
    server_timestamp TIMESTAMP NOT NULL ENCODE RAW,
    client_id        VARCHAR(36) ENCODE ZSTD,
    level_id         VARCHAR(64) ENCODE ZSTD,
    position_x       FLOAT ENCODE RAW,
    position_y       FLOAT ENCODE RAW,
    PRIMARY KEY(event_timestamp, event_id)
)
DISTKEY(client_id)
COMPOUND SORTKEY(event_timestamp, event_type);

Filtering on event_timestamp in queries improves query times because event_timestamp is the leading column of the COMPOUND SORTKEY. Note that the server_timestamp column records when the data was first received on the server. Additional fields can be found in the event definitions file.
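For example, a query along the following lines restricts the scan to a range of event_timestamp values, allowing Amazon Redshift to skip blocks outside that range. The date range and grouping shown here are placeholders chosen for illustration.

SELECT event_type, COUNT(*) AS event_count
FROM game.events
WHERE event_timestamp BETWEEN '2017-01-01' AND '2017-01-31'
GROUP BY event_type
ORDER BY event_count DESC;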

Amazon Redshift Users

By default, the solution creates three Amazon Redshift users: a root user, an analytics_worker user that the RedshiftConnector and CronConnector use to perform database operations, and an analytics_ro (read-only) user that you can use to query data.
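The solution configures permissions for these users automatically; the statements below are only a sketch of what read-only access to the game schema typically looks like in Amazon Redshift and do not need to be run.

-- Illustrative sketch only; the solution sets up analytics_ro for you.
GRANT USAGE ON SCHEMA game TO analytics_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA game TO analytics_ro;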