Customizing Amazon S3 ingestion

One of the key features of the solution is the ability to ingest data uploaded to an Amazon S3 bucket. The source can be data exported from internal or external services, provided in XLSX or JSON file format.

Examples of custom data that can be analyzed include:

  • Product, movie, or other content reviews.

  • Internal or external chat forums, such as Twitch and Discord.

  • Transcriptions from call center calls as generated by Amazon Transcribe Call Analytics.

When the solution is deployed, the default implementation is configured to process transcriptions from Amazon Transcribe Call Analytics.

Key entities that the solution requires to process data regardless of source type include:

  • ID – A unique identifier for each record. If not known, set it to GENERATE and the solution will generate a UUID for each file.

  • CREATED_DATE – The date associated with the record. If not known, set it to NOW and the solution will use the system's processing timestamp.

  • LANG – The language of the text. If not known, do not set it; the solution will use the Amazon Comprehend Detecting the Dominant Language operation to detect the language before subjecting the text to any NLP analysis.

  • TEXT – The text that should be subjected to NLP processing.

Note

The files can have additional columns with data elements, which the solution can store if they are defined in the schema of the AWS Glue customingestion table. Edit the schema for the customingestion table in the AWS Glue socialmediadb database in the AWS account and Region where the solution is deployed. Add the Column Name and Data Type for each additional element that needs to be stored. For more information on working with AWS Glue tables, refer to Working with Tables on the AWS Glue Console in the AWS Glue Developer Guide.

Custom ingestion table schema definition in AWS Glue
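If you prefer to add the extra columns programmatically rather than through the AWS Glue console, a minimal sketch with the AWS SDK for Python (Boto3) could look like the following. The database and table names come from this guide; the extra column name rating is a hypothetical example of an additional data element.

import boto3

glue = boto3.client("glue")

DATABASE = "socialmediadb"       # database created by the solution
TABLE = "customingestion"        # custom ingestion table
NEW_COLUMN = {"Name": "rating", "Type": "string"}   # hypothetical extra element

# Read the current table definition and append the new column to its schema.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
table["StorageDescriptor"]["Columns"].append(NEW_COLUMN)

# update_table accepts only the mutable TableInput fields, so rebuild the
# input from the fields that carry the schema.
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": table["StorageDescriptor"],
    "PartitionKeys": table.get("PartitionKeys", []),
    "TableType": table.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": table.get("Parameters", {}),
}
glue.update_table(DatabaseName=DATABASE, TableInput=table_input)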

The solution provides three types of file processor implementations, which can be configured with the environment variables for an AWS Lambda function:

  • Microsoft Excel files

  • JSON files

  • Transcribe Call Analytics

Microsoft Excel files

With this processor, each individual record in an Excel file is analyzed. The following figure shows sample data.

Sample data in XLSX format for ingestion

The Excel processor implementation requires the column numbers of certain key elements. To configure the environment variables for the CustomIngestion Lambda function, provide the column numbers and remove any extra keys. Using the Excel file in the previous figure as an example, ID becomes column '0', CREATED_DATE becomes '1', TEXT becomes '2', and LANG becomes '3'. Delete the PROCESSOR_TYPE and LIST_SELECTOR keys, if present. The following figure displays the environment variables based on the data from the previous figure.

AWS Lambda environment variables set up for Excel-based data ingestion
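If you prefer to set these values programmatically rather than in the Lambda console, a minimal sketch with the AWS SDK for Python (Boto3) could look like the following. The function name is a placeholder; the column numbers match the sample data above.

import boto3

lambda_client = boto3.client("lambda")

# Placeholder name; use the CustomIngestion Lambda function deployed by the solution.
FUNCTION_NAME = "<CustomIngestion-function-name>"

# Note: the Environment parameter replaces the function's full variable map,
# so include any other existing variables the function needs.
lambda_client.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    Environment={
        "Variables": {
            "ID": "0",            # column numbers from the sample XLSX data
            "CREATED_DATE": "1",
            "TEXT": "2",
            "LANG": "3",
        }
    },
)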

JSON files

When processing JSON documents, the solution requires keys within the JSON to query the information for analysis (set through environment variables). The solution uses jmespath to query JSON documents. The values provided through the environment variables are jmespath selector expressions, which the solution processes.

Example JSON document containing a list of records

{ "list_contents": [ { "content": "Lorem ipsum dolor sit amet, ", "id": "id1", "lang": "en", "created_date": "11-19-2021 03:59:07" }, { "content": "consectetur adipiscing elit, ", "id": "id2", "lang": "en", "created_date": "11-19-2021 03:59:07" }, { "content": "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua", "id": "id3", "lang": "en", "created_date": "11-19-2021 03:59:07" } ] }

In this code sample, the JSON document contains a list under the "list_contents" key. In addition to setting the ID, CREATED_DATE, LANG, and TEXT environment variables, this example requires setting the LIST_SELECTOR expression as well.

If the JSON document has a list of records, the solution provides a mechanism to specify the key that contains the list using the LIST_SELECTOR environment variable. For this example, the environment variables would need to be set according to the values in the following figure.

AWS Lambda environment variables set up for JSON-based data ingestion
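As a minimal sketch of how these selector expressions resolve, the following Python snippet evaluates them against the example document using the jmespath library. The exact way the solution combines LIST_SELECTOR with the per-record selectors is an assumption for illustration.

import jmespath

doc = {
    "list_contents": [
        {"content": "Lorem ipsum dolor sit amet, ", "id": "id1",
         "lang": "en", "created_date": "11-19-2021 03:59:07"},
        {"content": "consectetur adipiscing elit, ", "id": "id2",
         "lang": "en", "created_date": "11-19-2021 03:59:07"},
    ]
}

# LIST_SELECTOR identifies the key that holds the list of records.
records = jmespath.search("list_contents[]", doc)

# The ID, CREATED_DATE, LANG, and TEXT selectors are then evaluated per record.
for record in records:
    print({
        "id": jmespath.search("id", record),
        "created_date": jmespath.search("created_date", record),
        "lang": jmespath.search("lang", record),
        "text": jmespath.search("content", record),
    })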

Example JSON document containing a single record

{ "content": "Lorem ipsum dolor sit amet, ", "id": "id1", "lang": "en", "created_date": "11-19-2021 03:59:07" }

In this code sample, the JSON file contains a single record. Delete the LIST_SELECTOR environment variable and leave the rest of the variables the same as for the JSON document containing multiple records.

Transcribe Call Analytics

This is a special case of JSON document processing. In addition to the environment variables defined in Example JSON document containing a list of records, set the following two environment variables (a sketch of the resulting variable set follows the list):

  • SENTIMENT – set value to sentiment

  • PROCESSOR_TYPE – set value to TRANSCRIBE_CALL_ANALYTICS
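The following is a sketch of the resulting environment variable set, shown as a Python dict for readability. Only the SENTIMENT and PROCESSOR_TYPE values come from this guide; the remaining selector values depend on the structure of your Transcribe Call Analytics output and are shown as placeholders.

# Sketch only: values marked <...> are placeholders, not part of this guide.
transcribe_call_analytics_variables = {
    "ID": "<jmespath selector for the record ID>",
    "CREATED_DATE": "<jmespath selector for the record date>",
    "LANG": "<jmespath selector for the language>",
    "TEXT": "<jmespath selector for the text>",
    "LIST_SELECTOR": "<jmespath selector for the list of records>",
    "SENTIMENT": "sentiment",
    "PROCESSOR_TYPE": "TRANSCRIBE_CALL_ANALYTICS",
}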