Data Ingestion and Preparation
Data ingestion and preparation covers the processes of collecting, curating, and preparing data for ML. Data ingestion involves collecting batch or streaming data in structured or unstructured format. Data preparation takes the ingested data and processes it into a format that can be used with ML.
Identifying, collecting, and transforming data is the foundation for ML. There is widespread consensus among ML practitioners that data preparation accounts for approximately 80% of the time spent developing a viable ML model. (Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=2fb540636f63)
The AWS Cloud provides services that enable public sector customers to overcome challenges in data ingestion, data preparation, and data quality. These are described in the following sections.
Data Ingestion
The AWS Cloud enables public sector customers to overcome the challenge of connecting to and extracting data from both streaming and batch sources, as described in the following:
- Streaming Data. For streaming data, Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) enable the collection, processing, and analysis of data in real time. Amazon Kinesis provides a suite of capabilities to collect, process, and analyze real-time, streaming data. Amazon Kinesis Data Streams (KDS) is a service that enables ingestion of streaming data: producers push data directly into a stream, which consists of a group of stored data units called records, and the stored data is available for further processing or storage as part of the data pipeline (see the first sketch following this list). Ingestion of streaming video can be done using Amazon Kinesis Video Streams. This service can capture streams from millions of devices, and durably store, encrypt, and index video data for use in ML models. If data does not need to be processed in real time, Amazon Data Firehose can be used to capture streaming data and load it directly into data stores such as Amazon S3.
- Batch Data. There are a number of mechanisms available for ingesting data in batch format. With AWS Database Migration Service (AWS DMS), you can replicate and ingest existing databases while the source databases remain fully operational; the service supports multiple database sources and targets, including writing data directly to Amazon S3 (see the second sketch following this list). AWS DataSync is a data transfer service that simplifies, automates, and accelerates moving and replicating data between on-premises storage systems, such as network file system (NFS) shares, and AWS storage services, such as Amazon Elastic File System (Amazon EFS) and Amazon S3. You can use the AWS Transfer Family to ingest data from flat files using secure protocols such as Secure File Transfer Protocol (SFTP), File Transfer Protocol over SSL (FTPS), and File Transfer Protocol (FTP). For large amounts of data, you can use the AWS Snow Family to transfer data in bulk using secure physical appliances.
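As a concrete illustration of the producer model described above, the following is a minimal sketch of pushing a record into a Kinesis data stream with the AWS SDK for Python (Boto3). The stream name and record fields are hypothetical, and the stream is assumed to already exist.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical record; a real producer would typically emit these continuously.
record = {"device_id": "sensor-42", "temperature": 21.7}

# Each record requires a partition key, which Kinesis Data Streams uses
# to assign the record to a shard within the stream.
response = kinesis.put_record(
    StreamName="sensor-stream",  # hypothetical stream, assumed to exist
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["device_id"],
)
print(response["ShardId"], response["SequenceNumber"])
```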
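Similarly, once an AWS DMS replication task has been created (with its source endpoint, target endpoint, and replication instance configured), it can be started programmatically. This is a minimal sketch; the task ARN below is a hypothetical placeholder.

```python
import boto3

dms = boto3.client("dms")

response = dms.start_replication_task(
    # Hypothetical ARN of a pre-created replication task (for example,
    # a relational database source with an Amazon S3 target).
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK",
    # "start-replication" performs the initial full load; the source
    # database remains fully operational while data is migrated.
    StartReplicationTaskType="start-replication",
)
print(response["ReplicationTask"]["Status"])
```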
Data Preparation
Once the data is extracted, it needs to be transformed and loaded into a data store for feeding into an ML model. It also needs to be cataloged and organized so that it is available for consumption, and it needs to support data lineage for compliance with federal government guidelines. The AWS Cloud provides three services that offer these mechanisms:
- AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize, clean, enrich, and migrate data from a source system to a data store for ML. The AWS Glue Data Catalog provides the location and schema of ETL jobs as well as metadata tables (where each table specifies a single source data store). A crawler can be set up to automatically take inventory of the data in your data stores. ETL jobs in AWS Glue consist of scripts that contain the programming logic that performs the transformation, and triggers are used to initiate jobs either on a schedule or as the result of a specified event (see the first sketch following this list). AWS Glue Studio provides a graphical interface that enables visual composition of data transformation workflows on AWS Glue's Apache Spark-based serverless ETL engine. AWS Glue generates the code required to transform the data from source to target based on the source and target information provided; custom scripts can also be supplied in the AWS Glue console or API to transform and process the data.

  In addition, AWS Glue DataBrew, a visual data preparation tool, can be used to simplify the process of cleaning and normalizing the data. It comes with hundreds of data transformations that can be applied quickly to prepare data for ML without having to write your own transformation scripts. AWS Glue also integrates with Amazon SageMaker AI, a comprehensive service that provides purpose-built tools for every step of ML development and implementation. In AWS Glue, you can create a development endpoint and then create a SageMaker notebook to help develop your ETL and ML scripts; a development endpoint allows you to iteratively develop and test your ETL scripts using the AWS Glue console or API.
- Amazon SageMaker AI Data Wrangler is a service that enables the aggregation and preparation of data for ML and is directly integrated into Amazon SageMaker AI Studio; both Data Wrangler and SageMaker AI Studio are features of the Amazon SageMaker AI service. Data Wrangler contains hundreds of built-in transformations to quickly normalize, transform, and combine features without having to write any code. Using the Data Wrangler user interface, you can view table summaries, histograms, and scatter plots.
- Amazon EMR. Many organizations use Spark for data processing and other purposes, such as powering a data warehouse. These organizations often have a complete end-to-end pipeline in Spark, along with the skillset and inclination to run a persistent Spark cluster for the long term. In these situations, Amazon EMR, a managed service for Hadoop-ecosystem clusters, can be used to process data; it reduces the need to set up, tune, and maintain clusters. Amazon EMR also features other integrations with Amazon SageMaker AI, for example, to start a SageMaker model training job from a Spark pipeline in Amazon EMR (see the second sketch following this list).
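The following is a minimal sketch of driving the AWS Glue pieces described in the first bullet with Boto3: running a crawler to populate the Data Catalog, then starting an ETL job. The crawler name, job name, and job argument are hypothetical, and both the crawler and job are assumed to have been created beforehand (for example, in the AWS Glue console).

```python
import boto3

glue = boto3.client("glue")

# Run a crawler to take inventory of a data store and populate the
# AWS Glue Data Catalog with table metadata.
glue.start_crawler(Name="raw-data-crawler")  # hypothetical crawler

# Start an ETL job whose script holds the transformation logic;
# arguments are passed to the script at run time.
run = glue.start_job_run(
    JobName="prepare-training-data",  # hypothetical job
    Arguments={"--target_path": "s3://example-bucket/prepared/"},
)
print(run["JobRunId"])
```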
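And as a sketch of the Amazon EMR pattern from the last bullet, the following submits a Spark step to an existing long-running cluster. The cluster ID and script location are hypothetical placeholders.

```python
import boto3

emr = boto3.client("emr")

# Add a Spark step to an existing cluster; command-runner.jar invokes
# spark-submit on the cluster's primary node.
response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "prepare-features",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://example-bucket/scripts/prepare_features.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])
```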
Data Quality
Data that is obsolete or inaccurate not only causes issues in developing accurate ML models, but can significantly erode stakeholder and public trust. Public sector organizations need to ensure that data ingested and prepared for ML is of the highest quality by establishing a well-defined data quality framework. See How to Architect Data Quality on the AWS Cloud for more information.