Data Sources and Ingestion

There are multiple ways to bring your data into Amazon SageMaker Feature Store. Feature Store offers a single API call for data ingestion called PutRecord that enables you to ingest data in batches or from streaming sources. You can use Amazon SageMaker Data Wrangler to engineer features and then ingest your features into your Feature Store. You can also use Amazon EMR for batch data ingestion through a Spark connector.
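
For single records or small batches, a minimal sketch of calling PutRecord through boto3 might look like the following. The feature group name and feature values are placeholders, and the feature group is assumed to already exist:

import boto3

# Client for the Feature Store runtime data plane, which serves PutRecord.
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Write one record; every value is passed as a string, and the record must
# include the feature group's record identifier and event time features.
featurestore_runtime.put_record(
    FeatureGroupName="customers-feature-group",  # assumed, pre-existing group
    Record=[
        {"FeatureName": "customer_id", "ValueAsString": "573291"},
        {"FeatureName": "city", "ValueAsString": "Seattle"},
        {"FeatureName": "event_time", "ValueAsString": "2021-05-06T12:00:00Z"},
    ],
)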

Stream Ingestion

You can use streaming sources such as Apache Kafka or Amazon Kinesis as a data source, extracting features from the stream and feeding them directly to the online feature store for training, inference, or feature creation. Records are pushed into the feature store by calling the synchronous PutRecord API. Because the call is synchronous, small batches of updates can be pushed in a single API call. This enables you to maintain high freshness of the feature values and publish values as soon as an update is detected. These are also called streaming features.
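
As an illustration, here is a hedged sketch of a Lambda-style handler that consumes events from a Kinesis stream and pushes each one to the online store with PutRecord. The feature group name and the event payload shape are assumptions, not part of the documentation:

import base64
import json

import boto3

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

def handler(event, context):
    # Kinesis delivers records base64-encoded inside the Lambda event.
    for kinesis_record in event["Records"]:
        payload = json.loads(base64.b64decode(kinesis_record["kinesis"]["data"]))
        # Publish the update as soon as it is detected to keep features fresh.
        featurestore_runtime.put_record(
            FeatureGroupName="click-events-feature-group",  # assumed name
            Record=[
                {"FeatureName": name, "ValueAsString": str(value)}
                for name, value in payload.items()
            ],
        )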

Data Wrangler with Feature Store

Data Wrangler is a feature of Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. Data Wrangler enables you to engineer your features and ingest them into a feature store. 

In Studio, after interacting with Data Wrangler, choose the Export tab, choose Export Step, and then choose Feature Store. This exports a Jupyter notebook containing all of the source code needed to create a Feature Store feature group and add your features from Data Wrangler to an offline or online feature store.
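
The exported notebook is generated for your specific Data Wrangler flow, so its contents will differ, but a minimal sketch of what it does with the SageMaker Python SDK looks roughly like the following. The feature group name, the schema, and the identifier columns are illustrative assumptions:

import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Stand-in for the features engineered in Data Wrangler.
df = pd.DataFrame(
    {
        "customer_id": [573291, 109382],
        "city": ["Seattle", "Boston"],
        "event_time": [1651257600.0, 1651257600.0],
    }
)
df["city"] = df["city"].astype("string")  # object dtype cannot be ingested

feature_group = FeatureGroup(
    name="customers-feature-group",  # assumed name
    sagemaker_session=session,
)
feature_group.load_feature_definitions(data_frame=df)  # infer the schema
feature_group.create(
    s3_uri=f"s3://{session.default_bucket()}/feature-store",  # offline store
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)

# Feature group creation is asynchronous; wait before ingesting.
while feature_group.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

feature_group.ingest(data_frame=df, max_workers=3, wait=True)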

After the feature group has been created, you can also select and join data across multiple feature groups to create new engineered features in Data Wrangler, and then export your dataset to an S3 bucket.
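
As an alternative to joining inside Data Wrangler, you can assemble a similar joined dataset programmatically by querying the offline store with Athena through the SageMaker Python SDK. The sketch below assumes two existing feature groups and a shared customer_id key:

import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
customers = FeatureGroup(name="customers-feature-group", sagemaker_session=session)
orders = FeatureGroup(name="orders-feature-group", sagemaker_session=session)

# Each feature group exposes its offline-store Glue table through Athena.
customers_query = customers.athena_query()
orders_query = orders.athena_query()

customers_query.run(
    query_string=f"""
        SELECT c.customer_id, c.city, o.order_total
        FROM "{customers_query.table_name}" c
        JOIN "{orders_query.table_name}" o
          ON c.customer_id = o.customer_id
    """,
    output_location=f"s3://{session.default_bucket()}/athena-results",
)
customers_query.wait()
joined_df = customers_query.as_dataframe()  # joined features as a DataFrame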

For more information on how to export to Feature Store, see Export to SageMaker Feature Store.