Preview your model - Amazon SageMaker

Preview your model

Note

The following functionalities are only available for custom models built with tabular datasets. Multi-category text prediction models are also excluded.

SageMaker Canvas provides tools to preview your model and validate your data before you begin building. With these tools, you can preview the accuracy of your model, validate your dataset to prevent issues while building the model, and change the size of the random sample for your model.

Preview a model

With Amazon SageMaker Canvas, you can get insights from your data before you build a model by choosing Preview model. For example, you can see how the data in each column is distributed. For models built using categorical data, you can also choose Preview model to generate an Estimated accuracy prediction of how well the model might analyze your data. The accuracy of a Quick build or a Standard build represents how well the model can perform on real data and is generally higher than the Estimated accuracy.

Amazon SageMaker Canvas automatically handles missing values in your dataset while it builds the model. It infers the missing values by using adjacent values that are present in the dataset.


Screenshot of the Build tab for a model in Canvas.

Validate data

Before you build your model, SageMaker Canvas checks your dataset for issues that might cause your build to fail. If SageMaker Canvas finds any issues, then it warns you on the Build page before you attempt to build a model.

You can choose Validate data to see a list of the issues with your dataset. You can then use the SageMaker Canvas data preparation features, or your own tools, to fix your dataset before starting a build. If you don’t fix the issues with your dataset, then your build fails.

If you make changes to your dataset to fix the issues, you have the option to re-validate your dataset before attempting a build. We recommend that you re-validate your dataset before building.

The following table shows the issues that SageMaker Canvas checks for in your dataset and how to resolve them.

| Issue | Resolution |
| --- | --- |
| Wrong model type for your data | Try another model type or use a different dataset. |
| Missing values in your target column | Replace the missing values, drop rows with missing values, or use a different dataset. |
| Too many unique labels in your target column | Verify that you've used the correct column for your target column, or use a different dataset. |
| Too many non-numeric values in your target column | Choose a different target column, select another model type, or use a different dataset. |
| One or more column names contain double underscores | Rename the columns to remove any double underscores, and try again. |
| None of the rows in your dataset are complete | Replace the missing values, or use a different dataset. |
| Too many unique labels for the number of rows in your data | Check that you're using the right target column, increase the number of rows in your dataset, consolidate similar labels, or use a different dataset. |
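
If you prepare data with your own tools, you might want to run similar checks locally before importing the dataset. The following is a minimal sketch of some of the checks in the table above, not the validation that SageMaker Canvas performs. The function names and both thresholds (`MAX_UNIQUE_LABELS`, `MAX_LABEL_TO_ROW_RATIO`) are hypothetical, because this page doesn't state the limits that SageMaker Canvas applies.

```python
from collections import Counter

# Hypothetical thresholds: the actual limits that Canvas applies are not documented here.
MAX_UNIQUE_LABELS = 100
MAX_LABEL_TO_ROW_RATIO = 0.5


def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def find_issues(rows, target):
    """Return a list of issue descriptions for a dataset given as a list of dicts."""
    issues = []
    columns = list(rows[0].keys()) if rows else []

    # One or more column names contain double underscores.
    bad_names = [name for name in columns if "__" in name]
    if bad_names:
        issues.append(f"Column names contain double underscores: {bad_names}")

    # Missing values in the target column.
    missing_target = sum(1 for row in rows if is_missing(row.get(target)))
    if missing_target:
        issues.append(f"Missing values in target column: {missing_target} rows")

    # None of the rows in the dataset are complete.
    if rows and not any(all(not is_missing(v) for v in row.values()) for row in rows):
        issues.append("None of the rows are complete")

    # Too many unique labels, in absolute terms or relative to the row count.
    labels = Counter(row.get(target) for row in rows if not is_missing(row.get(target)))
    if len(labels) > MAX_UNIQUE_LABELS:
        issues.append("Too many unique labels in target column")
    elif rows and len(labels) / len(rows) > MAX_LABEL_TO_ROW_RATIO:
        issues.append("Too many unique labels for the number of rows")

    return issues


rows = [
    {"customer__id": "1", "plan": "basic", "churned": "yes"},
    {"customer__id": "2", "plan": "", "churned": ""},
    {"customer__id": "3", "plan": "pro", "churned": "no"},
    {"customer__id": "4", "plan": "pro", "churned": "no"},
    {"customer__id": "5", "plan": "basic", "churned": "yes"},
]
for issue in find_issues(rows, target="churned"):
    print(issue)
```

In this example, the sketch reports the `customer__id` column name and the one row with a missing target value. Passing a local check like this doesn't guarantee that Validate data reports no issues.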

Random sample

SageMaker Canvas uses random sampling to sample your dataset. With random sampling, each row has an equal chance of being picked for the sample. You can choose a column in the preview to get summary statistics for the random sample, such as the mean and the mode.
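
To illustrate the idea, the following sketch draws a uniform random sample from a small in-memory dataset and computes the mean and mode of one column. It uses only the Python standard library. The function names and the `age` column are hypothetical, and sampling without replacement is an assumption, because this page doesn't describe how SageMaker Canvas implements its sampling.

```python
import random
from statistics import mean, mode


def random_sample(rows, sample_size, seed=None):
    """Draw rows uniformly at random; every row is equally likely to be picked.

    Assumption: rows are drawn without replacement.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(sample_size, len(rows)))


def column_summary(sample, column):
    """Compute summary statistics for one numeric column of the sample."""
    values = [row[column] for row in sample if row[column] is not None]
    return {"count": len(values), "mean": mean(values), "mode": mode(values)}


# Hypothetical dataset: 1,000 rows with a single numeric column.
rows = [{"age": 20 + (i % 5)} for i in range(1000)]
sample = random_sample(rows, sample_size=200, seed=0)
print(column_summary(sample, "age"))
```

Because the statistics come from a sample, they can differ slightly from the values for the full dataset.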

By default, SageMaker Canvas uses a random sample size of 20,000 rows for datasets with more than 20,000 rows. For datasets with 20,000 rows or fewer, the default sample size is the number of rows in your dataset. You can increase or decrease the sample size by choosing Random sample in the Build tab of the SageMaker Canvas application. Use the slider to select your desired sample size, and then choose Update to change the sample size. The maximum sample size you can choose for a dataset is 40,000 rows, and the minimum sample size is 500 rows. If you choose a large sample size, the dataset preview and summary statistics might take a few moments to reload.
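
The sample size rules above can be expressed as a short calculation. The following sketch is illustrative only; the function and constant names are hypothetical. It also assumes that a sample can't be larger than the dataset itself, and that a dataset with fewer than 500 rows is used in full. This page doesn't state either of those behaviors.

```python
DEFAULT_SAMPLE_ROWS = 20_000  # default sample size for large datasets
MIN_SAMPLE_ROWS = 500         # smallest size the slider allows
MAX_SAMPLE_ROWS = 40_000      # largest size the slider allows


def default_sample_size(dataset_rows):
    """Default: 20,000 rows, or the whole dataset if it is smaller."""
    return min(dataset_rows, DEFAULT_SAMPLE_ROWS)


def clamp_sample_size(requested, dataset_rows):
    """Clamp a requested sample size to the documented slider limits.

    Assumption: the sample cannot exceed the number of rows in the dataset.
    """
    upper = min(MAX_SAMPLE_ROWS, dataset_rows)
    lower = min(MIN_SAMPLE_ROWS, upper)
    return max(lower, min(requested, upper))


print(default_sample_size(150_000))         # 20000
print(default_sample_size(8_000))           # 8000
print(clamp_sample_size(100_000, 150_000))  # 40000
print(clamp_sample_size(100, 150_000))      # 500
```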

The Build page shows a preview of 100 rows from your dataset. If the sample size is the same size as your dataset, then the preview uses the first 100 rows of your dataset. Otherwise, the preview uses the first 100 rows of the random sample.
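
As a rough illustration of the preview rule, the following sketch selects the preview rows from a dataset held in a Python list. The function name `preview_rows` is hypothetical, and the sampling call stands in for whatever method SageMaker Canvas uses internally.

```python
import random

PREVIEW_ROWS = 100  # number of rows shown on the Build page


def preview_rows(dataset, sample_size, seed=None):
    """Return the rows that a 100-row preview would show under the rule above."""
    if sample_size >= len(dataset):
        # The sample is the whole dataset, so preview its first 100 rows.
        return dataset[:PREVIEW_ROWS]
    # Otherwise, preview the first 100 rows of the random sample.
    sample = random.Random(seed).sample(dataset, sample_size)
    return sample[:PREVIEW_ROWS]


dataset = list(range(1000))  # stand-in for 1,000 dataset rows
print(preview_rows(dataset, sample_size=1000)[:5])  # first rows of the dataset
print(len(preview_rows(dataset, sample_size=500, seed=1)))
```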