Data Ingestion and Preparation
Data ingestion and preparation covers the processes of collecting, curating, and preparing data for ML. Data ingestion involves collecting batch or streaming data in structured or unstructured format. Data preparation takes the ingested data and processes it into a format that can be used with ML.
Identifying, collecting, and transforming data is the foundation for ML. There is widespread consensus among ML practitioners that data preparation accounts for approximately 80% of the time spent developing a viable ML model. (Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=2fb540636f63)
The AWS Cloud provides services that enable public sector customers to overcome challenges in data ingestion, data preparation, and data quality. These are described in the following sections.
Data Ingestion
The AWS Cloud enables public sector customers to overcome the challenge of connecting to and extracting data from both streaming and batch sources, as described in the following:
- Streaming Data. For streaming data, Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) enable the collection, processing, and analysis of data in real time. Amazon Kinesis provides a suite of capabilities to collect, process, and analyze real-time, streaming data. Amazon Kinesis Data Streams (KDS) is a service that enables ingestion of streaming data: producers push data directly into a stream, which consists of a group of stored data units called records, and the stored data is available for further processing or storage as part of the data pipeline (see the first sketch following this list). Ingestion of streaming video can be done using Amazon Kinesis Video Streams. This service can capture streams from millions of devices, and durably store, encrypt, and index video data for use in ML models. If data does not need to be processed in real time, Amazon Data Firehose can be used to capture streaming data and load it directly into data stores such as Amazon S3.
- Batch Data. There are a number of mechanisms available for ingesting data in batch format. With AWS Database Migration Service (AWS DMS), you can replicate and ingest existing databases while the source databases remain fully operational; the service supports multiple database sources and targets, including writing data directly to Amazon S3 (see the second sketch following this list). AWS DataSync is a data transfer service that simplifies, automates, and accelerates moving and replicating data between on-premises storage systems, such as network file system (NFS) shares, and AWS storage services, such as Amazon Elastic File System (Amazon EFS) and Amazon S3. You can use the AWS Transfer Family to ingest data from flat files using secure protocols such as Secure File Transfer Protocol (SFTP), File Transfer Protocol over SSL (FTPS), and File Transfer Protocol (FTP). For large amounts of data, you can use the AWS Snow Family to transfer data in bulk using secure physical appliances.
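As a concrete illustration of the producer model described above, the following is a minimal sketch of pushing a record into a Kinesis data stream with the AWS SDK for Python (Boto3). The stream name and record fields are hypothetical, and the stream is assumed to already exist.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical record; a real producer would typically emit these continuously.
record = {"device_id": "sensor-42", "temperature": 21.7}

# Each record requires a partition key, which Kinesis Data Streams uses
# to assign the record to a shard within the stream.
response = kinesis.put_record(
    StreamName="sensor-stream",  # hypothetical stream, assumed to exist
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["device_id"],
)
print(response["ShardId"], response["SequenceNumber"])
```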
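Similarly, once an AWS DMS replication task has been created (with its source endpoint, target endpoint, and replication instance configured), it can be started programmatically. This is a minimal sketch; the task ARN below is a hypothetical placeholder.

```python
import boto3

dms = boto3.client("dms")

response = dms.start_replication_task(
    # Hypothetical ARN of a pre-created replication task (for example,
    # a relational database source with an Amazon S3 target).
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK",
    # "start-replication" performs the initial full load; the source
    # database remains fully operational while data is migrated.
    StartReplicationTaskType="start-replication",
)
print(response["ReplicationTask"]["Status"])
```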
Data Preparation
Once the data is extracted, it needs to be transformed and loaded into a data store for feeding into an ML model. It also needs to be cataloged and organized so that it is available for consumption, and it needs to support data lineage for compliance with federal government guidelines. The AWS Cloud provides three services that offer these mechanisms:
- AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize, clean, enrich, and migrate data from a source system to a data store for ML. The AWS Glue Data Catalog provides the location and schema of ETL jobs as well as metadata tables (where each table specifies a single source data store). A crawler can be set up to automatically take inventory of the data in your data stores. ETL jobs in AWS Glue consist of scripts that contain the programming logic that performs the transformation, and triggers are used to initiate jobs either on a schedule or as the result of a specified event (see the first sketch following this list). AWS Glue Studio provides a graphical interface that enables visual composition of data transformation workflows on AWS Glue's Apache Spark-based serverless ETL engine. AWS Glue generates the code required to transform the data from source to target based on the source and target information provided; custom scripts can also be supplied in the AWS Glue console or API to transform and process the data.

  In addition, AWS Glue DataBrew, a visual data preparation tool, can be used to simplify the process of cleaning and normalizing the data. It comes with hundreds of data transformations that can be applied quickly to prepare data for ML without having to write your own transformation scripts. AWS Glue also integrates with Amazon SageMaker AI, a comprehensive service that provides purpose-built tools for every step of ML development and implementation. In AWS Glue, you can create a development endpoint and then create a SageMaker notebook to help develop your ETL and ML scripts; a development endpoint allows you to iteratively develop and test your ETL scripts using the AWS Glue console or API.
- Amazon SageMaker AI Data Wrangler is a service that enables the aggregation and preparation of data for ML and is directly integrated into Amazon SageMaker AI Studio; both Data Wrangler and SageMaker AI Studio are features of the Amazon SageMaker AI service. Data Wrangler contains hundreds of built-in transformations to quickly normalize, transform, and combine features without having to write any code. Using the Data Wrangler user interface, you can view table summaries, histograms, and scatter plots.
- Amazon EMR. Many organizations use Spark for data processing and other purposes, such as powering a data warehouse. These organizations often have a complete end-to-end pipeline in Spark, along with the skillset and inclination to run a persistent Spark cluster for the long term. In these situations, Amazon EMR, a managed service for Hadoop-ecosystem clusters, can be used to process data; it reduces the need to set up, tune, and maintain clusters. Amazon EMR also features other integrations with Amazon SageMaker AI, for example, to start a SageMaker model training job from a Spark pipeline in Amazon EMR (see the second sketch following this list).
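The following is a minimal sketch of driving the AWS Glue pieces described in the first bullet with Boto3: running a crawler to populate the Data Catalog, then starting an ETL job. The crawler name, job name, and job argument are hypothetical, and both the crawler and job are assumed to have been created beforehand (for example, in the AWS Glue console).

```python
import boto3

glue = boto3.client("glue")

# Run a crawler to take inventory of a data store and populate the
# AWS Glue Data Catalog with table metadata.
glue.start_crawler(Name="raw-data-crawler")  # hypothetical crawler

# Start an ETL job whose script holds the transformation logic;
# arguments are passed to the script at run time.
run = glue.start_job_run(
    JobName="prepare-training-data",  # hypothetical job
    Arguments={"--target_path": "s3://example-bucket/prepared/"},
)
print(run["JobRunId"])
```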
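And as a sketch of the Amazon EMR pattern from the last bullet, the following submits a Spark step to an existing long-running cluster. The cluster ID and script location are hypothetical placeholders.

```python
import boto3

emr = boto3.client("emr")

# Add a Spark step to an existing cluster; command-runner.jar invokes
# spark-submit on the cluster's primary node.
response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "prepare-features",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://example-bucket/scripts/prepare_features.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])
```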
Data Quality
Data that is obsolete or inaccurate not only causes issues in developing accurate ML models, but can significantly erode stakeholder and public trust. Public sector organizations need to ensure that data ingested and prepared for ML is of the highest quality by establishing a well-defined data quality framework. See How to Architect Data Quality on the AWS Cloud for more information.