MLREL-03: Use a data catalog
Use data catalog technology to process data across multiple data stores. An advanced data catalog service can also integrate with ETL processes, so that pipelines work from one shared set of metadata instead of per-job definitions. This improves both reliability and efficiency.
Implementation plan
- Use AWS Glue Data Catalog: The AWS Glue Data Catalog provides a way to track the data assets that have been loaded into your ML workload. Data catalogs also describe how data is transformed as it is loaded into the data lake and data warehouse. AWS Glue is a fully managed ETL (extract, transform, and load) service. It provides a simple and cost-effective way to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
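As an illustration of tracking data assets through the catalog, the following is a minimal sketch that lists the tables registered in one Data Catalog database, with their storage locations and column types. It assumes a client object that exposes the Glue `get_tables` call, such as the one returned by `boto3.client("glue")`. The helper names `iter_catalog_tables` and `summarize_tables` and the database name `ml_feature_db` are hypothetical and chosen only for this example.

```python
from typing import Any, Dict, Iterator, List


def iter_catalog_tables(glue_client: Any, database: str) -> Iterator[Dict[str, Any]]:
    """Yield every table definition in one Data Catalog database.

    glue_client is assumed to behave like boto3.client("glue"):
    get_tables returns a page with "TableList" and, when more
    results remain, a "NextToken" to pass to the next call.
    """
    kwargs: Dict[str, Any] = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        yield from page.get("TableList", [])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def summarize_tables(glue_client: Any, database: str) -> List[Dict[str, Any]]:
    """Reduce each table definition to its name, location, and column types."""
    summary = []
    for table in iter_catalog_tables(glue_client, database):
        # StorageDescriptor can be absent, for example on some views.
        storage = table.get("StorageDescriptor", {})
        summary.append(
            {
                "name": table["Name"],
                "location": storage.get("Location"),
                "columns": {
                    col["Name"]: col.get("Type")
                    for col in storage.get("Columns", [])
                },
            }
        )
    return summary


# Usage against a real account (needs boto3 and configured AWS credentials;
# "ml_feature_db" is a hypothetical database name):
#   import boto3
#   for entry in summarize_tables(boto3.client("glue"), "ml_feature_db"):
#       print(entry["name"], entry["location"], entry["columns"])
```

Passing the client in as an argument keeps the helpers free of credentials and Region settings, so they can be exercised with a stub client before pointing them at a real catalog.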