MLSEC-10: Protect against data poisoning threats

Protect against data injection and data manipulation that pollute the training dataset. Data injection adds corrupt records to the training data, which can produce an inaccurate model and incorrect outputs. Data manipulation changes existing data (for example, labels), which can result in inaccurate and weak predictive models. Identify and address corrupt data and inaccurate models using security controls and anomaly detection algorithms. Ensure the immutability of datasets by protecting against ransomware and against malicious code in installed third-party packages.

Implementation plan

  • Use only trusted data sources for training data - Verify that you have sufficient audit controls to replay activity and determine where a change occurred, who made it, and when. Before training, validate the quality of the training data by looking for strong outliers and potentially incorrect labels (see the data-validation sketch after this list).

  • Look for underlying shifts in the patterns and distributions of training data - Monitor for data drift and measure its impact on prediction variance. These skews can indicate underlying data drift and can provide an early warning of unauthorized access targeting the training data (see the drift-check sketch after this list).

  • Identify model updates that negatively impact results before moving them to production - Determine whether the retrained model's results differ from those of the previous model iteration. Use past test data and previous model iterations as a baseline (see the promotion-gate sketch after this list).

  • Have a rollback plan - With versioned training data and versioned models, make sure that you can revert to a known good working model in a failure scenario. Use a fully managed service, such as Amazon SageMaker Feature Store, to store features. For more details on Amazon SageMaker Feature Store, see the Reliability pillar section (MLREL-07). A minimal rollback sketch follows this list.

  • Use low-entropy classification cases - Look for significant, unexpected changes. Determine the bounds of your thresholds, identify classifications that you do not expect to see, and alert if the retrained model exceeds them (see the entropy-monitoring sketch after this list).

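The data-validation sketch: pre-training quality checks can be automated. The following is a minimal sketch, assuming the training data fits in a pandas DataFrame with numeric features and a label column; the file path, column names, and label set are hypothetical. It flags rows with strong outliers by z-score and rows whose labels fall outside an expected set.

```python
import numpy as np
import pandas as pd

def find_strong_outliers(df: pd.DataFrame, z_threshold: float = 4.0) -> pd.Index:
    """Flag rows whose numeric features deviate strongly from the column mean.

    A row is flagged if any numeric column has an absolute z-score above
    z_threshold; flagged rows should be reviewed before training.
    """
    numeric = df.select_dtypes(include=[np.number])
    z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)
    return df.index[(z_scores.abs() > z_threshold).any(axis=1)]

def find_suspect_labels(df: pd.DataFrame, label_col: str, expected: set) -> pd.Index:
    """Flag rows whose label is outside the set of labels we expect to see."""
    return df.index[~df[label_col].isin(expected)]

# Example usage (hypothetical file path, column name, and label set)
train = pd.read_csv("training_data.csv")
outliers = find_strong_outliers(train)
bad_labels = find_suspect_labels(train, "label", {"cat", "dog"})
if len(outliers) or len(bad_labels):
    raise ValueError(
        f"Review before training: {len(outliers)} outlier rows, "
        f"{len(bad_labels)} rows with unexpected labels"
    )
```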
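The drift-check sketch: one common approach is a two-sample statistical test per feature, comparing current data against a baseline captured at training time. This sketch uses the Kolmogorov-Smirnov test from SciPy; the feature values and p-value threshold are illustrative and should be calibrated to your data. Managed options such as Amazon SageMaker Model Monitor can also detect data drift.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(baseline: np.ndarray, current: np.ndarray,
                         p_threshold: float = 0.01) -> bool:
    """Compare a feature's baseline and current distributions.

    Uses the two-sample Kolmogorov-Smirnov test; a p-value below
    p_threshold suggests the distributions have shifted.
    """
    statistic, p_value = ks_2samp(baseline, current)
    return p_value < p_threshold

# Hypothetical example: a numeric feature at training time vs. today
baseline_ages = np.random.default_rng(0).normal(40, 10, size=5_000)
current_ages = np.random.default_rng(1).normal(48, 10, size=5_000)  # shifted
if detect_feature_drift(baseline_ages, current_ages):
    print("ALERT: distribution shift detected; investigate training data access")
```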
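The promotion-gate sketch: the comparison against the previous iteration can be expressed as a simple gate on the same held-out baseline test set. This is a sketch assuming scikit-learn-style models with a predict method; the metric and tolerance are assumptions to tune for your use case.

```python
from sklearn.metrics import accuracy_score

def safe_to_promote(new_model, previous_model, X_test, y_test,
                    max_drop: float = 0.01) -> bool:
    """Gate promotion on held-out test data from earlier iterations.

    The retrained model must not underperform the previous model by more
    than max_drop on the same baseline test set; a larger drop can signal
    poisoned or corrupted training data.
    """
    new_acc = accuracy_score(y_test, new_model.predict(X_test))
    prev_acc = accuracy_score(y_test, previous_model.predict(X_test))
    print(f"previous={prev_acc:.4f} new={new_acc:.4f}")
    return new_acc >= prev_acc - max_drop
```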
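The rollback sketch: at minimum, a rollback plan reduces to versioned model artifacts plus a pointer to the last known good version. The following local-filesystem sketch is illustrative only; the paths, manifest format, and helper names are hypothetical, and in practice a managed service (such as Amazon SageMaker Feature Store for features, as noted above) is preferable to a hand-rolled store.

```python
import json
import shutil
from pathlib import Path

MODEL_STORE = Path("model_store")          # hypothetical local artifact store
MANIFEST = MODEL_STORE / "manifest.json"   # records the current good version

def promote(version: str) -> None:
    """Record the given model version as the current known-good model."""
    MANIFEST.write_text(json.dumps({"current": version}))

def rollback(to_version: str, serving_path: Path) -> None:
    """Revert serving to a previously known-good versioned artifact."""
    artifact = MODEL_STORE / to_version / "model.joblib"
    if not artifact.exists():
        raise FileNotFoundError(f"No artifact for version {to_version}")
    shutil.copy(artifact, serving_path)
    promote(to_version)

# Example: revert to v12 after v13 shows poisoned behavior
# rollback("v12", Path("/opt/ml/model/model.joblib"))
```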
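The entropy-monitoring sketch: for low-entropy classification cases, you can compute the Shannon entropy of the model's predicted class probabilities, then alert when cases that should be classified confidently are not, or when an unexpected class appears. This assumes the model emits per-class probabilities; the entropy bound and class names are assumptions to calibrate against a known good model.

```python
import numpy as np

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (natural log) of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def check_low_entropy_cases(probs: np.ndarray, entropy_bound: float,
                            expected_classes: set, class_names: list) -> list:
    """Alert on cases that should be confidently classified but are not,
    or that fall into classes we never expect to see."""
    alerts = []
    entropies = prediction_entropy(probs)
    predicted = [class_names[i] for i in probs.argmax(axis=1)]
    for i, (h, label) in enumerate(zip(entropies, predicted)):
        if h > entropy_bound:
            alerts.append(f"case {i}: entropy {h:.3f} exceeds bound {entropy_bound}")
        if label not in expected_classes:
            alerts.append(f"case {i}: unexpected class '{label}'")
    return alerts
```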