Creating an Amazon Chime SDK data lake

The Amazon Chime SDK call analytics data lake allows you to stream your machine learning powered insights and any metadata from your Amazon Kinesis data stream to your Amazon S3 bucket. For example, you can use the data lake to access URLs to recordings. To create the data lake, you deploy a set of AWS CloudFormation templates, either from the Amazon Chime SDK console or programmatically with the AWS CLI. The data lake enables you to query your call metadata and voice analytics data by referencing AWS Glue data tables in Amazon Athena.

Prerequisites

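To create an Amazon Chime SDK data lake, you must have an Amazon Kinesis data stream that carries your call analytics data and an Amazon S3 bucket to store the output. The setup steps later in this topic ask for both.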

Data lake terminology and concepts

Use the following terms and concepts to understand how the data lake works.

Amazon Kinesis Data Firehose

An extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services. For more information, see What Is Amazon Kinesis Data Firehose?

Amazon Athena

Amazon Athena is an interactive query service that enables you to analyze data in Amazon S3 using standard SQL. Athena is serverless, so you have no infrastructure to manage, and you pay only for the queries that you run. To use Athena, point to your data in Amazon S3, define the schema, and use standard SQL queries. You can also use workgroups to group users and control the resources they have access to when running queries. Workgroups enable you to manage query concurrency and prioritize query execution across different groups of users and workloads.

Glue Data Catalog

In Amazon Athena, tables and databases contain the metadata that details a schema for underlying source data. For each dataset, a table must exist in Athena. The metadata in the table tells Athena the location of your Amazon S3 bucket. It also specifies the data structure, such as column names, data types, and the table's name. Databases only hold the metadata and schema information for a dataset.

Creating multiple data lakes

You can create multiple data lakes by providing a unique Glue database name for each one. The name specifies where to store call insights. A given AWS account can have several call analytics configurations, each with a corresponding data lake. This lets you separate data for certain use cases, such as applying a custom retention policy or access policy to the stored data. You can also apply different security policies for access to insights, recordings, and metadata.

Data lake regional availability

The Amazon Chime SDK data lake is available in the following Regions.

| Region | Glue table | Amazon QuickSight |
|---|---|---|
| us-east-1 | Available | Available |
| us-west-2 | Available | Available |
| eu-central-1 | Available | Available |

Data lake architecture

The following diagram shows the data lake architecture. Numbers in the drawing correspond to the numbered text below.

Figure: The program flow through a data lake.

In the diagram, after you use the AWS console to deploy the CloudFormation template from the media insights pipeline configuration setup workflow, data flows to the Amazon S3 bucket as follows:

  1. Amazon Chime SDK call analytics starts streaming real-time data to your Kinesis data stream.

  2. Amazon Kinesis Data Firehose buffers this real-time data until it accumulates 128 MB or 60 seconds elapse, whichever comes first. Firehose then uses the amazon_chime_sdk_call_analytics_firehose_schema in the Glue Data Catalog to compress the data and transform the JSON records into a Parquet file.

  3. The Parquet file resides in your Amazon S3 bucket, in a partitioned format.

  4. In addition to real-time data, post-call Amazon Transcribe Call Analytics summary .wav files (redacted and non-redacted, if specified in the configuration) and call recording .wav files are also sent to your Amazon S3 bucket.

  5. You can use Amazon Athena and standard SQL to query the data in the Amazon S3 bucket.

  6. The CloudFormation template also creates a Glue Data Catalog so that you can query this post-call summary data through Athena.

  7. You can also visualize all the data in the Amazon S3 bucket using Amazon QuickSight. QuickSight connects to the Amazon S3 bucket through Amazon Athena.
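The buffering rule in step 2 can be sketched as a size-or-time flush condition. This is only an illustration of the rule, not how Firehose is implemented. The class name `BufferSketch` and its methods are hypothetical, and only the 128 MB and 60 second thresholds come from the text above (the sketch assumes binary megabytes).

```python
from dataclasses import dataclass, field

# Thresholds taken from step 2 above; everything else is illustrative.
MAX_BUFFER_BYTES = 128 * 1024 * 1024  # 128 MB, assuming binary megabytes
MAX_BUFFER_SECONDS = 60


@dataclass
class BufferSketch:
    """Hypothetical size-or-time buffer; not the Firehose implementation."""

    started_at: float
    records: list = field(default_factory=list)
    size_bytes: int = 0

    def add(self, record: bytes) -> None:
        """Append a record and track the total buffered size."""
        self.records.append(record)
        self.size_bytes += len(record)

    def should_flush(self, now: float) -> bool:
        """Return True when either limit is reached, whichever comes first."""
        too_big = self.size_bytes >= MAX_BUFFER_BYTES
        too_old = (now - self.started_at) >= MAX_BUFFER_SECONDS
        return too_big or too_old
```

A small buffer that is 30 seconds old would not flush yet, while the same buffer at 60 seconds would, even though it is far below the size limit.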

The Amazon Athena table uses the following features to optimize query performance:

Data partitioning

Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, and region. Partitions act as virtual columns. In this case, the CloudFormation template defines partitions at table creation, which helps reduce the amount of data scanned per query and improves performance. You can also filter by partition to restrict the amount of data scanned by a query. For more information, refer to Partitioning data in Athena in the Amazon Athena User Guide.

This example shows partitioning structure with a date of January 1, 2023:

    s3://example-bucket/amazon_chime_sdk_data_lake/serviceType=CallAnalytics/detailType={DETAIL_TYPE}/year=2023/month=01/day=01/example-file.parquet

where DETAIL_TYPE is one of the following:

  1. CallAnalyticsMetadata

  2. TranscribeCallAnalytics

  3. TranscribeCallAnalyticsCategoryEvents

  4. Transcribe

  5. Recording

  6. VoiceAnalyticsStatus

  7. SpeakerSearchStatus

  8. VoiceToneAnalysisStatus
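As a rough sketch, the partition layout above can be expressed as a small helper that builds the S3 prefix for a given detail type and date. The function name `partition_prefix` is hypothetical. The path segments and detail type names are taken from the example above.

```python
from datetime import date

# Detail types listed in this section.
DETAIL_TYPES = {
    "CallAnalyticsMetadata",
    "TranscribeCallAnalytics",
    "TranscribeCallAnalyticsCategoryEvents",
    "Transcribe",
    "Recording",
    "VoiceAnalyticsStatus",
    "SpeakerSearchStatus",
    "VoiceToneAnalysisStatus",
}


def partition_prefix(bucket: str, detail_type: str, day: date) -> str:
    """Build the S3 prefix that holds one day's files for one detail type."""
    if detail_type not in DETAIL_TYPES:
        raise ValueError(f"Unknown detail type: {detail_type}")
    return (
        f"s3://{bucket}/amazon_chime_sdk_data_lake"
        f"/serviceType=CallAnalytics"
        f"/detailType={detail_type}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )
```

For example, `partition_prefix("example-bucket", "Recording", date(2023, 1, 1))` produces the prefix for recording metadata written on January 1, 2023.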

Optimize columnar data store generation

Apache Parquet uses column-wise compression, compression based on data type, and predicate pushdown to store data. Better compression ratios and the ability to skip blocks of data mean that queries read fewer bytes from your Amazon S3 bucket, which improves query performance and reduces cost. For this optimization, data conversion from JSON to Parquet is enabled in Amazon Kinesis Data Firehose.
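To see why a columnar layout reduces the bytes read, consider the following toy sketch. It does not use Parquet. It only pivots a few row-oriented records into per-column lists and compares rough serialized sizes. The records, field names, and helper functions are all hypothetical.

```python
import json

# Hypothetical call records; the field names are illustrative only.
ROWS = [
    {"callId": "call-1", "detailType": "Recording", "durationSeconds": 42},
    {"callId": "call-2", "detailType": "Transcribe", "durationSeconds": 17},
    {"callId": "call-3", "detailType": "Recording", "durationSeconds": 63},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column.

    Assumes every row has the same keys, as a fixed schema would ensure.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def bytes_to_read(columns, wanted):
    """Rough size of only the requested columns, serialized as JSON."""
    return sum(len(json.dumps(columns[name])) for name in wanted)


COLUMNS = to_columns(ROWS)
```

A query that needs only `durationSeconds` reads one short list in the columnar layout, while a row-oriented layout forces it to read every field of every record.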

Partition Projection

This Athena feature automatically creates partitions for each day to improve date-based query performance.
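A query that filters on the partition columns might be built as in the following sketch. It assumes that the partition columns are named `year`, `month`, and `day` and hold zero-padded strings, which is inferred from the S3 path layout rather than confirmed by this topic. The function name and the example database name are hypothetical.

```python
import re
from datetime import date

# Accept only simple identifiers so names can be embedded in the SQL safely.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_partition_query(database: str, table: str, day: date, limit: int = 10) -> str:
    """Return an Athena SQL string restricted to a single day's partition."""
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Invalid identifier: {name!r}")
    return (
        f'SELECT * FROM "{database}"."{table}" '
        f"WHERE year = '{day.year:04d}' "
        f"AND month = '{day.month:02d}' "
        f"AND day = '{day.day:02d}' "
        f"LIMIT {int(limit)}"
    )
```

Because the filter names only partition columns, Athena can limit the scan to the files under that day's prefix instead of reading the whole table.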

Data lake setup

Use the Amazon Chime SDK console to complete the following steps.

  1. Open the Amazon Chime SDK console (https://console.aws.amazon.com/chime-sdk/home) and in the navigation pane, under Call Analytics, choose Configurations.

  2. Complete Step 1 and choose Next. On the Step 2 page, select the Voice Analytics checkbox.

  3. Under Output details, select the Data warehouse to perform historical analysis checkbox, then choose the Deploy CloudFormation stack link.

    The system sends you to the Quick create stack page in the CloudFormation console.

  4. Enter a name for the stack, then enter the following parameters:

    1. DataLakeType – Choose Create Call Analytics DataLake.

    2. KinesisDataStreamName – Choose your stream. It should be the stream used for call analytics streaming.

    3. S3BucketURI – Choose your Amazon S3 bucket. The URI must have the prefix s3://bucket-name.

    4. GlueDatabaseName – Choose a unique AWS Glue database name. You cannot reuse an existing database in the AWS account.

  5. Choose the acknowledgment checkbox, then choose Create data lake. Allow 10 minutes for the system to create the lake.
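Before deploying the stack, you could check the parameter values locally. The following is a minimal sketch of such a check, based only on the constraints stated in step 4. The function name `validate_parameters` and the `existing_databases` argument are hypothetical, and the check does not call any AWS service.

```python
import re

REQUIRED_KEYS = (
    "DataLakeType",
    "KinesisDataStreamName",
    "S3BucketURI",
    "GlueDatabaseName",
)

# The URI must start with s3:// followed by a bucket name.
S3_URI_PATTERN = re.compile(r"s3://[^/\s]+(/\S*)?")


def validate_parameters(params: dict, existing_databases: set) -> list:
    """Return a list of problems found in the stack parameters."""
    errors = []
    for key in REQUIRED_KEYS:
        if not params.get(key):
            errors.append(f"{key} is required")
    uri = params.get("S3BucketURI")
    if uri and not S3_URI_PATTERN.fullmatch(uri):
        errors.append("S3BucketURI must have the prefix s3://bucket-name")
    name = params.get("GlueDatabaseName")
    if name and name in existing_databases:
        errors.append("GlueDatabaseName must not reuse an existing database")
    return errors
```

An empty list means that the values satisfy the constraints listed above. It does not guarantee that the stream or bucket actually exists.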

Data lake setup using AWS CLI

Use the AWS CLI to create a role with permission to call CloudFormation's create-stack operation. Follow the procedure below to create and set up the IAM role and policy. For more information, see Creating a stack in the AWS CloudFormation User Guide.

  1. Create a role called AmazonChimeSdkCallAnalytics-Datalake-Provisioning-Role and attach a trust policy to the role allowing CloudFormation to assume the role.

    1. Create an IAM trust policy using the following template and save the file in .json format, for example as role-trust-policy.json, the file name used in the next command.

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": "cloudformation.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
          }
        ]
      }

    2. Run the aws iam create-role command and pass the trust policy as a parameter.

      aws iam create-role \
          --role-name AmazonChimeSdkCallAnalytics-Datalake-Provisioning-Role \
          --assume-role-policy-document file://role-trust-policy.json

    3. Note the role ARN returned in the response. You need it when you create the stack.

  2. Create a policy with permission to create a CloudFormation stack.

    1. Create an IAM policy using the following template and save the file in .json format, for example as create-cloudformation-stack-policy.json. This file is required when calling create-policy.

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "DeployCloudFormationStack",
            "Effect": "Allow",
            "Action": [
              "cloudformation:CreateStack"
            ],
            "Resource": "*"
          }
        ]
      }

    2. Run aws iam create-policy and pass the create stack policy as a parameter.

      aws iam create-policy \
          --policy-name testCreateStackPolicy \
          --policy-document file://create-cloudformation-stack-policy.json

    3. Note the policy ARN returned in the response. You need it in the next step.

  3. Run aws iam attach-role-policy to attach the policy to the role.

    aws iam attach-role-policy \
        --role-name {Role name created above} \
        --policy-arn {Policy ARN created above}

  4. Run aws cloudformation create-stack to create the CloudFormation stack, and enter the required parameters.

    Provide parameter values for each ParameterKey using ParameterValue.

    aws cloudformation create-stack \
        --capabilities CAPABILITY_NAMED_IAM \
        --stack-name testDeploymentStack \
        --template-url https://chime-sdk-assets.s3.amazonaws.com/public_templates/AmazonChimeSDKDataLake.yaml \
        --parameters \
            ParameterKey=S3BucketURI,ParameterValue={S3 URI} \
            ParameterKey=DataLakeType,ParameterValue="Create call analytics datalake" \
            ParameterKey=KinesisDataStreamName,ParameterValue={Name of Kinesis Data Stream} \
        --role-arn {Role ARN created above}
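If you script this procedure, a helper like the following sketch can render the two policy documents and assemble the create-stack command line. It only builds strings and argument lists and does not call AWS. The function names are hypothetical, and the example values in the usage note (stack name, bucket, stream, account ID) are placeholders. The policy contents, template URL, and parameter keys are copied from the steps above.

```python
import json
import shlex

# Policy documents copied from steps 1 and 2 above.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "cloudformation.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {},
        }
    ],
}

CREATE_STACK_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DeployCloudFormationStack",
            "Effect": "Allow",
            "Action": ["cloudformation:CreateStack"],
            "Resource": "*",
        }
    ],
}

TEMPLATE_URL = (
    "https://chime-sdk-assets.s3.amazonaws.com/"
    "public_templates/AmazonChimeSDKDataLake.yaml"
)


def render_policy(policy: dict) -> str:
    """Serialize a policy document as indented JSON, ready to save to a file."""
    return json.dumps(policy, indent=2)


def create_stack_argv(stack_name: str, s3_uri: str, stream_name: str, role_arn: str) -> list:
    """Build the argument list for the create-stack command shown in step 4."""
    parameters = {
        "S3BucketURI": s3_uri,
        "DataLakeType": "Create call analytics datalake",
        "KinesisDataStreamName": stream_name,
    }
    argv = [
        "aws", "cloudformation", "create-stack",
        "--capabilities", "CAPABILITY_NAMED_IAM",
        "--stack-name", stack_name,
        "--template-url", TEMPLATE_URL,
        "--parameters",
    ]
    argv += [f"ParameterKey={k},ParameterValue={v}" for k, v in parameters.items()]
    argv += ["--role-arn", role_arn]
    return argv
```

For example, `shlex.join(create_stack_argv("testDeploymentStack", "s3://example-bucket", "example-stream", "arn:aws:iam::111122223333:role/AmazonChimeSdkCallAnalytics-Datalake-Provisioning-Role"))` returns a shell-quoted command string that you can review before running it.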

Resources created by data lake setup

The following table lists the resources created when you create a data lake.

| Resource type | Resource name and description | Service name |
|---|---|---|
| AWS Glue Data Catalog Database | GlueDatabaseName – Logically groups all AWS Glue Data tables belonging to call insights and voice analytics. | Call analytics, voice analytics |
| AWS Glue Data Catalog Tables | amazon_chime_sdk_call_analytics_firehose_schema – Combined schema for call analytics and voice analytics that is fed to the Kinesis Firehose. | Call analytics, voice analytics |
| | call_analytics_metadata – Schema for call analytics metadata. Contains SIPmetadata and OneTimeMetadata. | Call analytics |
| | call_analytics_recording_metadata – Schema for Recording and Voice Enhancement metadata | Call analytics, voice analytics |
| | transcribe_call_analytics – Schema for TranscribeCallAnalytics Payload "utteranceEvent" | Call analytics |
| | transcribe_call_analytics_category_events – Schema for TranscribeCallAnalytics Payload "categoryEvent" | Call analytics |
| | transcribe_call_analytics_post_call – Schema for Post Call Transcribe Call Analytics summary payload | Call analytics |
| | transcribe – Schema for Transcribe Payload | Call analytics |
| | voice_analytics_status – Schema for voice analytics ready events | Voice analytics |
| | speaker_search_status – Schema for identification matches | Voice analytics |
| | voice_tone_analysis_status – Schema for voice tone analysis events | Voice analytics |
| Amazon Kinesis Data Firehose | AmazonChimeSDK-call-analytics-UUID – Kinesis Data Firehose piping data for call analytics | Call analytics, voice analytics |
| Amazon Athena Workgroup | GlueDatabaseName-AmazonChimeSDKDataAnalytics – Logical group of users to control the resources they have access to when running queries. | Call analytics, voice analytics |
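When you query these resources from a script, it can help to keep their names in one place. The following sketch makes two assumptions that this topic does not state outright: that the workgroup name is your Glue database name followed by -AmazonChimeSDKDataAnalytics, and that each detail type corresponds to the table with the matching name. The Recording entry in particular is a guess based on the table descriptions. The post-call summary table has no detail type in the partition list, so it is omitted.

```python
# ASSUMPTION: mapping inferred from name similarity, not confirmed by the source.
DETAIL_TYPE_TO_TABLE = {
    "CallAnalyticsMetadata": "call_analytics_metadata",
    "TranscribeCallAnalytics": "transcribe_call_analytics",
    "TranscribeCallAnalyticsCategoryEvents": "transcribe_call_analytics_category_events",
    "Transcribe": "transcribe",
    "Recording": "call_analytics_recording_metadata",  # guess from the description
    "VoiceAnalyticsStatus": "voice_analytics_status",
    "SpeakerSearchStatus": "speaker_search_status",
    "VoiceToneAnalysisStatus": "voice_tone_analysis_status",
}

WORKGROUP_SUFFIX = "AmazonChimeSDKDataAnalytics"


def workgroup_name(glue_database_name: str) -> str:
    """Derive the Athena workgroup name, assuming the pattern in the table."""
    return f"{glue_database_name}-{WORKGROUP_SUFFIX}"
```

Verify the actual names in the AWS Glue and Athena consoles after the stack finishes deploying, and adjust the mapping if they differ.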