Monitor Data Quality - Amazon SageMaker

Monitor Data Quality

Data quality monitoring automatically monitors machine learning (ML) models in production and notifies you when data quality issues arise. ML models in production must make predictions on real-world data, which is not carefully curated the way most training datasets are. If the statistical properties of the data that your model receives in production drift away from those of the baseline data it was trained on, the model's predictions begin to lose accuracy. Amazon SageMaker Model Monitor uses rules to detect data drift and alerts you when it occurs. To monitor data quality, follow these steps:

  • Enable data capture. This captures inference input and output from a real-time inference endpoint and stores the data in Amazon S3. For more information, see Capture Data.

  • Create a baseline. In this step, you run a baseline job that analyzes an input dataset that you provide. The baseline job computes schema constraints and statistics for each feature using Deequ, an open source library built on Apache Spark that measures data quality in large datasets. For more information, see Create a Baseline.

  • Define and schedule data quality monitoring jobs. For more information, see Schedule Monitoring Jobs.

  • View data quality metrics. For more information, see Schema for Statistics (statistics.json file).

  • Integrate data quality monitoring with Amazon CloudWatch. For more information, see CloudWatch Metrics.

  • Interpret the results of a monitoring job. For more information, see Interpret Results.

  • Use SageMaker Studio to enable data quality monitoring and visualize results. For more information, see Visualize Results in Amazon SageMaker Studio.
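The baseline and scheduling steps above can be sketched with the SageMaker Python SDK. This is a minimal sketch rather than a definitive implementation. It assumes version 2 of the SDK, a CSV baseline dataset with a header row, and an endpoint that already has data capture enabled. The function name, its parameters, the default schedule name, the instance settings, and the S3 prefixes are all hypothetical values chosen for illustration.

```python
def sketch_data_quality_monitoring(
    role_arn,
    endpoint_name,
    baseline_csv_s3_uri,
    output_s3_uri,
    schedule_name="data-quality-schedule",  # hypothetical default name
):
    """Create a baseline and an hourly data quality monitoring schedule."""
    # Imported inside the function so that this sketch can be loaded
    # even when the SageMaker Python SDK is not installed.
    from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
    from sagemaker.model_monitor.dataset_format import DatasetFormat

    # Instance settings are illustrative assumptions, not recommendations.
    monitor = DefaultModelMonitor(
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        volume_size_in_gb=20,
        max_runtime_in_seconds=3600,
    )

    # Create a baseline: run a job that computes statistics and
    # suggested constraints from the dataset you provide.
    monitor.suggest_baseline(
        baseline_dataset=baseline_csv_s3_uri,
        dataset_format=DatasetFormat.csv(header=True),
        output_s3_uri=f"{output_s3_uri}/baseline",
        wait=True,
    )

    # Schedule monitoring jobs that compare captured endpoint data
    # against the baseline and publish metrics to CloudWatch.
    monitor.create_monitoring_schedule(
        monitor_schedule_name=schedule_name,
        endpoint_input=endpoint_name,
        output_s3_uri=f"{output_s3_uri}/reports",
        statistics=monitor.baseline_statistics(),
        constraints=monitor.suggested_constraints(),
        schedule_cron_expression=CronExpressionGenerator.hourly(),
        enable_cloudwatch_metrics=True,
    )
    return monitor
```

Calling the function requires valid AWS credentials, an IAM role, and existing S3 locations. The returned monitor object can then be used to inspect the schedule and its executions.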

Note

Amazon SageMaker Model Monitor currently supports only tabular data.
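To make the baseline-and-compare idea concrete for tabular data, the following standalone sketch computes simple per-feature statistics from a baseline dataset and flags production data that departs from them. It illustrates the concept only and is not how Model Monitor or Deequ is implemented. The class, the function names, the check names, and the three-standard-deviation threshold are hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass
class FeatureBaseline:
    mean: float
    std_dev: float
    completeness: float  # fraction of rows with a non-missing value


def _present_values(rows, name):
    """Return the non-missing values of one feature."""
    return [row[name] for row in rows if row.get(name) is not None]


def build_baseline(rows, features):
    """Compute per-feature statistics from baseline rows (a list of dicts).

    Assumes rows is non-empty and every feature has at least one value.
    """
    baseline = {}
    for name in features:
        values = _present_values(rows, name)
        baseline[name] = FeatureBaseline(
            mean=statistics.fmean(values),
            std_dev=statistics.pstdev(values),
            completeness=len(values) / len(rows),
        )
    return baseline


def find_violations(rows, baseline, max_shift_in_std=3.0):
    """Compare production rows with the baseline and list violations."""
    if not rows:
        raise ValueError("rows must not be empty")
    violations = []
    for name, base in baseline.items():
        values = _present_values(rows, name)
        # Hypothetical rule 1: more values are missing than in the baseline.
        if len(values) / len(rows) < base.completeness:
            violations.append((name, "completeness_dropped"))
        # Hypothetical rule 2: the mean moved by more than
        # max_shift_in_std baseline standard deviations.
        if values and base.std_dev > 0:
            shift = abs(statistics.fmean(values) - base.mean) / base.std_dev
            if shift > max_shift_in_std:
                violations.append((name, "mean_shifted"))
    return violations


baseline_rows = [
    {"age": 30, "income": 100},
    {"age": 40, "income": 110},
    {"age": 50, "income": 120},
]
production_rows = [
    {"age": 90, "income": 105},
    {"age": 95, "income": 110},
    {"age": None, "income": 115},
]

baseline = build_baseline(baseline_rows, ["age", "income"])
violations = find_violations(production_rows, baseline)
print(violations)
```

In this example the age feature has a missing value and a mean far above the baseline, so both hypothetical rules flag it, while income stays within the baseline range.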