Considerations for synthetic data generation
With AWS Clean Rooms ML, collaboration members can create a synthetic dataset from their collective datasets and use it to train a custom machine learning model. The synthetic dataset irreversibly de-identifies the subjects of the original dataset. When creating the collaboration, you must configure payment information to specify who pays for synthetic data generation. Here are the high-level steps to generate a synthetic dataset and train a custom machine learning model:
-
A collaboration member creates an analysis template that includes:
-
The SQL needed to define the dataset to be synthesized.
-
Privacy-related configurations used to ensure the synthetic data meets data providers’ compliance requirements.
-
Once all the data providers approve the analysis template, the collaboration query runner creates a machine learning (ML) input channel using the template.
-
Clean Rooms ML generates the synthetic dataset and verifies that it meets the privacy thresholds specified in the analysis template.
-
If all the thresholds are satisfied, the ML input channel is populated with the synthetic dataset.
-
Customers can then use this ML input channel to train the custom ML model associated with the collaboration.
Important considerations:
-
Synthetic data generated in Clean Rooms ML does not remove, redact, obfuscate, or sanitize any individual values, including personally identifiable information (PII) found in the original dataset. The synthetic dataset is generated by sampling values, but not whole records, from the original dataset.
-
If the original dataset contains similar rows, it's possible the synthetic data contains rows that look identical to rows in the original dataset.
Dataset preparation:
-
Avoid columns with a significantly imbalanced class distribution. This is especially important for the predicted value or “Y” column. Extreme imbalances reduce the synthetic dataset's overall privacy.
-
Clean Rooms ML doesn't support generating synthetic data from time series data where maintaining correlations across sequential records is important.
-
Clean Rooms ML doesn't support generating synthetic data from text or unstructured data.
-
The following data types are supported:
Data type name: BIGINT, BOOLEAN, CHAR, DATE, DECIMAL, FLOAT, INTEGER, LONG, REAL, SHORT, SMALLINT, TIME, TIMESTAMP_LTZ, TIMESTAMP_NTZ, TINYINT, VARCHAR
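The dataset preparation guidance above can be checked locally before an analysis template is written. The following is a minimal sketch in plain Python, not part of any Clean Rooms ML API: the function names, the schema format (a mapping of column name to type name), and the sample column names are all hypothetical. This section gives no numeric threshold for a "significantly imbalanced" column, so the sketch only reports the share of the rarest class and leaves the judgment to you.

```python
from collections import Counter

# Data types listed as supported in this section.
SUPPORTED_TYPES = {
    "BIGINT", "BOOLEAN", "CHAR", "DATE", "DECIMAL", "FLOAT", "INTEGER",
    "LONG", "REAL", "SHORT", "SMALLINT", "TIME", "TIMESTAMP_LTZ",
    "TIMESTAMP_NTZ", "TINYINT", "VARCHAR",
}


def unsupported_columns(schema):
    """Return the names of columns whose type is not in SUPPORTED_TYPES.

    `schema` is a hypothetical mapping of column name -> type name.
    Parameterized types such as DECIMAL(10,2) are reduced to the base name.
    """
    bad = []
    for name, type_name in schema.items():
        base = type_name.split("(")[0].strip().upper()
        if base not in SUPPORTED_TYPES:
            bad.append(name)
    return bad


def minority_share(values):
    """Return the fraction of non-null values held by the rarest class."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column has no non-null values")
    return min(counts.values()) / total


# Example with made-up columns: "notes" has a type that is not listed above.
schema = {"age": "INTEGER", "price": "DECIMAL(10,2)", "notes": "STRING"}
print(unsupported_columns(schema))      # ['notes']
print(minority_share([1, 0, 0, 0]))     # 0.25
```

A very small minority share in the predicted ("Y") column is the case the guidance above warns about.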
Limitations:
-
For synthetic data generation, the maximum number of predictive columns is one.
-
If the target column is categorical, the maximum number of categories in the original dataset is 100.
-
The original dataset must contain between 1,500 and 2.5 million rows and at most 1,000 columns. The target column must have at least 1,000 rows with non-null values.
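The size limits above lend themselves to a pre-flight check. The sketch below is an illustration only and does not call Clean Rooms ML: `check_limits` and its arguments are hypothetical names, the dataset is assumed to be available locally as a list of dictionaries, and the column limit is taken to be one thousand.

```python
# Limits as stated in this section (column limit read as one thousand).
MIN_ROWS = 1_500
MAX_ROWS = 2_500_000
MAX_COLUMNS = 1_000
MIN_NON_NULL_TARGET_ROWS = 1_000
MAX_TARGET_CATEGORIES = 100


def check_limits(rows, target, target_is_categorical):
    """Return a list of limit violations; an empty list means none found.

    `rows` is a list of dicts (one per record) and `target` is the name
    of the target column. Both are assumptions made for this sketch.
    """
    problems = []

    n_rows = len(rows)
    if not MIN_ROWS <= n_rows <= MAX_ROWS:
        problems.append(f"row count {n_rows} is outside {MIN_ROWS}-{MAX_ROWS}")

    # Count distinct column names across all records.
    columns = set().union(*(r.keys() for r in rows))
    if len(columns) > MAX_COLUMNS:
        problems.append(f"{len(columns)} columns exceeds {MAX_COLUMNS}")

    non_null = [r[target] for r in rows if r.get(target) is not None]
    if len(non_null) < MIN_NON_NULL_TARGET_ROWS:
        problems.append(
            f"only {len(non_null)} non-null target rows, "
            f"need {MIN_NON_NULL_TARGET_ROWS}"
        )

    if target_is_categorical and len(set(non_null)) > MAX_TARGET_CATEGORIES:
        problems.append(
            f"{len(set(non_null))} target categories exceeds "
            f"{MAX_TARGET_CATEGORIES}"
        )

    return problems


# Example: 2,000 rows with a binary target pass every check.
rows = [{"feature": i, "label": i % 2} for i in range(2000)]
print(check_limits(rows, "label", target_is_categorical=True))  # []
```

Returning a list of messages rather than raising on the first failure lets a data provider see every violation in one pass.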
Privacy metrics:
-
Clean Rooms ML provides a privacy score that measures how protected the generated synthetic data is against membership inference attacks (MIAs). The service holds out 5% of the original data from the synthesization process to calculate this score.
-
Scores near 50% are considered good; higher scores indicate less protection against MIAs. Scores significantly below 50% are rare and may indicate that the synthesized data does not represent patterns from the original data.
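The reading of the privacy score described above can be captured in a small helper. This is only a sketch: the function name is hypothetical, and the `tolerance` parameter is an assumption, because this section does not define how close to 50% counts as "near". Choose a tolerance that matches your own compliance requirements.

```python
def interpret_privacy_score(score_pct, tolerance=5.0):
    """Classify a privacy score (a percentage) relative to 50%.

    `tolerance` is a hypothetical band, in percentage points, that this
    sketch treats as "near 50%"; the service does not define it here.
    """
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score must be a percentage between 0 and 100")
    if score_pct > 50.0 + tolerance:
        return "above 50%: less protection against MIAs"
    if score_pct < 50.0 - tolerance:
        return "below 50%: rare; synthetic data may not represent original patterns"
    return "near 50%: considered good"


print(interpret_privacy_score(51.0))  # near 50%: considered good
print(interpret_privacy_score(70.0))  # above 50%: less protection against MIAs
```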
Downstream custom model:
-
Synthetic data generated in Clean Rooms ML is best suited for training binary classification models and multi-class classification models with up to five classes.
-
Training regression models using synthetic data generated in Clean Rooms ML may result in low model accuracy, as measured by Root Mean Square Error (RMSE).
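For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. The following sketch computes it in plain Python; the function name and sample values are illustrative and are not output from Clean Rooms ML.

```python
import math


def rmse(y_true, y_pred):
    """Return the Root Mean Square Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / len(y_true))


# A perfect prediction has an RMSE of zero; larger values mean lower accuracy.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

A lower RMSE is better, so a regression model trained on synthetic data may show a noticeably higher RMSE on held-out data than you would otherwise expect.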