With Amazon Machine Learning (Amazon ML), you can build and train predictive applications and host your applications in a scalable cloud solution. In this tutorial, we show you how to use Amazon ML to create a datasource, build a machine learning (ML) model, and use the model to generate batch predictions.
Our sample exercise in the tutorial shows how to identify potential customers for targeted marketing campaigns, but you can apply the same principles to create and use a variety of machine learning models. To complete the sample exercise, you use the publicly available banking and marketing dataset from the University of California at Irvine (UCI) repository. This dataset contains information about customers as well as descriptions of their behavior in response to previous marketing contacts. You use this data to identify which customers are most likely to subscribe to your new product. In the sample dataset, the product is a bank term deposit. A bank term deposit, also known as a certificate of deposit (CD), is a deposit made into a bank at a fixed interest rate that cannot be withdrawn for a certain period of time.
To complete the tutorial, you download sample data and upload the data to Amazon S3 to create a datasource—an Amazon ML object that contains information about your data. Next, you create an ML model from the datasource. You evaluate and adjust the ML model’s performance, and then use it to generate predictions.
You need an AWS account for this tutorial. If you don’t have an AWS account, see Setting Up Amazon Machine Learning.
Complete the following steps to get started using Amazon ML:
Step 1: Download, Edit, and Upload Data
Step 2: Create a Datasource
Step 3: Create an ML Model
Step 4: Review the ML Model’s Performance and Set a Score Threshold
Step 5: Use the ML Model to Generate Batch Predictions
Step 6: Clean Up
To start, you download the data and check to see if you need to format it before you provide it to Amazon ML. For Amazon ML formatting requirements, see Understanding the Data Format for Amazon ML. To make the download step quick for you, we downloaded the banking and marketing dataset from the UCI Machine Learning Repository, formatted it to conform to Amazon ML guidelines, shuffled the records, and made it available at the location that is shown in the following procedure.
To download and save the data
To open the datasets that we have placed in an Amazon S3 bucket for your use, click https://s3.amazonaws.com/aml-sample-data/banking.csv and https://s3.amazonaws.com/aml-sample-data/banking-batch.csv
Download the files by saving them as banking.csv and banking-batch.csv on your desktop.
If you open the banking.csv file, you should see rows and columns full of data. The header row contains the attribute names for each column. An attribute is a unique, named property. Each row represents a single observation.
The following two screenshots show the data before and after our edits.
The banking-batch.csv data does not contain the binary attribute, y. After you create an ML model, you use the model to predict y for each row in the banking-batch.csv file.
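As a quick check of the difference between the two files, the following minimal sketch compares their header rows. The two sample strings are made-up stand-ins for banking.csv and banking-batch.csv; apart from education and y, the column names and all values are assumptions chosen for illustration.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)


def missing_columns(training_header, batch_header):
    """Return columns present in the training file but absent from the batch file."""
    batch = set(batch_header)
    return [name for name in training_header if name not in batch]


# Made-up stand-ins for the first lines of banking.csv and banking-batch.csv.
training_sample = "age,job,education,y\n44,technician,university.degree,0\n"
batch_sample = "age,job,education\n53,blue-collar,basic.4y\n"

missing = missing_columns(read_header(training_sample), read_header(batch_sample))
print(missing)  # ['y']
```

To run the same check on the downloaded files, read each file's contents with open(path, newline="").read() and pass the text to read_header.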
Next, upload your banking.csv and banking-batch.csv files to an Amazon S3 bucket that you own. If you have not created a bucket, see the Amazon S3 User Guide to learn how to create one.
To upload the files to an Amazon S3 bucket
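If you prefer to script the upload rather than use the Amazon S3 console, one possible approach is sketched below. It assumes a client object that offers the upload_file(Filename, Bucket, Key) method of the boto3 S3 client. The RecordingS3Client class is a hypothetical stand-in that only records calls, so the sketch runs without AWS credentials, and example-bucket is a placeholder bucket name.

```python
class RecordingS3Client:
    """Hypothetical stand-in for boto3.client("s3"); records calls instead of contacting AWS."""

    def __init__(self):
        self.uploads = []

    def upload_file(self, Filename, Bucket, Key):
        self.uploads.append((Filename, Bucket, Key))


def upload_files(s3_client, bucket, filenames):
    """Upload each local file to the bucket, using the file name as the object key."""
    uploaded = []
    for filename in filenames:
        s3_client.upload_file(filename, bucket, filename)
        uploaded.append("s3://{}/{}".format(bucket, filename))
    return uploaded


# Dry run against the stand-in client.
client = RecordingS3Client()
locations = upload_files(client, "example-bucket", ["banking.csv", "banking-batch.csv"])
print(locations)

# With real credentials (not run here), pass a boto3 client instead:
#   import boto3
#   upload_files(boto3.client("s3"), "example-bucket", ["banking.csv", "banking-batch.csv"])
```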
The datasource does not actually store your data. The datasource only references it. If you move or change the S3 file, Amazon ML cannot access or use it to create an ML model, generate evaluations, or generate predictions.
Now you are ready to create your datasource.
After you upload banking.csv to your Amazon S3 bucket, you need to provide Amazon ML with the following information:
You provide this information to Amazon ML by creating a datasource. A datasource is an Amazon ML object that holds the location of your input data, the attribute names and types, the name of the target attribute, and descriptive statistics for each attribute. Operations like ML model training or ML model evaluations use a datasource ID to reference your data.
In the next step, you reference banking.csv as the input data of your datasource, provide the schema using the Amazon ML console to assign data types, and select a target attribute.
Open the Amazon Machine Learning console at https://console.aws.amazon.com/machinelearning/. On the Amazon ML console, you can create datasources, ML models, evaluations, and batch predictions. You can also view detail pages for these objects, which include information such as the object’s creation status.
On the Entities page, choose Create new, Datasource.
On the Input Data page, for Where is your data located?, select S3.
For S3 Location, type the location of the banking.csv file dataset: example-bucket/banking.csv
In the S3 permissions dialog box, choose Yes.
Amazon ML validates the location of your data.
If your information is correct, a property page appears with a Validation success message. Review the properties, and then choose Continue.
Next, you establish a schema. A schema is composed of attributes and their assigned data types. There are two ways to provide Amazon ML with a schema:
In this tutorial, Amazon ML infers the schema for you.
For more information about creating a separate schema file, see Creating a Data Schema for Amazon ML.
To create a schema by using Amazon ML
On the Schema page, for Does the first line in your CSV contain the column names?, choose Yes.
The data type of each attribute is inferred by Amazon ML based on a sample of each attribute’s values. It is important that attributes are assigned the most correct data type possible to help Amazon ML ingest the data correctly and to enable the correct feature processing on the attributes. This step influences the predictive performance of the ML model that is trained on this datasource.
Attributes that are numeric quantities for which the order is meaningful should be marked as numeric
Attributes that are numbers or strings that are used to denote a category should be marked as categorical
Attributes that are expected to take only values 1 or 0 should be marked as binary
Attributes that are strings that you would like to treat as words delimited by spaces should be marked as text
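The four rules above can be illustrated with a small heuristic. This is only a sketch of the guidelines as stated in this tutorial, not the inference that Amazon ML actually performs. The function name and the sample values are assumptions.

```python
def guess_attribute_type(values):
    """Guess a data type for one column from a sample of its string values."""
    if set(values) <= {"0", "1"}:
        return "BINARY"
    try:
        for value in values:
            float(value)
        return "NUMERIC"
    except ValueError:
        pass  # at least one value is not a number
    if any(" " in value for value in values):
        return "TEXT"
    return "CATEGORICAL"


print(guess_attribute_type(["0", "1", "1"]))                   # BINARY
print(guess_attribute_type(["44", "53", "28"]))                # NUMERIC
print(guess_attribute_type(["married", "single"]))             # CATEGORICAL
print(guess_attribute_type(["no answer", "left a voicemail"])) # TEXT
```

A heuristic like this cannot tell a numeric quantity from a number that denotes a category (a postal code, for example), which is one reason to review the inferred types on the Schema page.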
Next, you select a target attribute.
The target attribute is the attribute that the ML model must learn to predict. Because you are trying to send the new marketing campaign to customers who are most likely to subscribe, choose the binary attribute y as your target attribute. This binary attribute records whether an individual subscribed to a campaign in the past: 1 (yes) or 0 (no). When you select y as your target attribute, Amazon ML identifies patterns in the training datasource to create a mathematical model. The model can then generate predictions for data where you do not know the answer.
For example, if you want to predict your customers’ education levels, you would choose education as your target attribute.
Target attributes are required only if you use the datasource for training ML models and evaluating ML models.
To select y as the target attribute
On the Target page, for Do you want to use this dataset to create and/or evaluate a ML model?, choose Yes.
In the lower right of the table, choose the single arrow until the attribute y appears in the table.
In the Target column, choose the option next to y.
Amazon ML confirms that y is selected as your target.
On the Row ID page, for Do you want to select an identifier?, choose No.
On the Review page, choose Finish.
After you choose Finish, the request to create the datasource is submitted. The datasource moves into Pending status and takes a few minutes to reach Completed status. You do not need to wait for the datasource to complete, so proceed to the next step.
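The console steps above correspond roughly to a single CreateDataSourceFromS3 request. The sketch below builds such a request as a plain dictionary, using the parameter names of the boto3 machinelearning client as we understand them. The datasource ID ds-banking-1 is hypothetical, and the attribute list is deliberately partial: a real schema must list every column in banking.csv.

```python
import json


def build_schema(attribute_types, target):
    """Build an Amazon ML schema document as a JSON string."""
    return json.dumps({
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": kind}
            for name, kind in attribute_types
        ],
    })


def build_datasource_request(datasource_id, name, s3_uri, schema_json):
    """Assemble keyword arguments for create_data_source_from_s3."""
    return {
        "DataSourceId": datasource_id,
        "DataSourceName": name,
        "DataSpec": {"DataLocationS3": s3_uri, "DataSchema": schema_json},
        # Statistics are needed when the datasource is used for training.
        "ComputeStatistics": True,
    }


# Partial, illustrative attribute list; only education and y come from the tutorial text.
schema = build_schema([("education", "CATEGORICAL"), ("y", "BINARY")], target="y")
request = build_datasource_request(
    "ds-banking-1",  # hypothetical ID
    "Banking Data 1",
    "s3://example-bucket/banking.csv",
    schema,
)

# With real credentials (not run here):
#   import boto3
#   boto3.client("machinelearning").create_data_source_from_s3(**request)
```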
After the request to create the datasource has been submitted, you use it to train an ML model. The ML model generates predictions by using your training datasource to identify patterns in the historical data.
To create an ML model
Because you’ve already created a datasource, choose I already created a datasource pointing to my S3 data.
In the table, choose Banking Data 1, and then choose Continue.
On the ML model settings page, for ML model name, type Subscription propensity model.
Giving your ML model a human readable name helps you identify and manage the ML model.
Once you choose Finish, the following requests are submitted:
The split datasources, ML model, and evaluation move into Pending status and take a few minutes to reach Completed status. You need to wait for the evaluation to complete before proceeding to step 4.
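For readers who want to see what the split, training, and evaluation requests might look like outside the console, here is a hedged sketch that builds them as dictionaries with the parameter names of the boto3 machinelearning client. The 70/30 split and every ID (ds-train, ds-eval, ml-banking-1, ev-banking-1) are assumptions for illustration. This tutorial does not state the proportions that the console uses.

```python
import json


def split_rearrangement(percent_begin, percent_end):
    """Return a DataRearrangement string that selects a slice of the input data."""
    return json.dumps(
        {"splitting": {"percentBegin": percent_begin, "percentEnd": percent_end}}
    )


def build_model_request(model_id, name, training_datasource_id):
    """Assemble keyword arguments for create_ml_model (binary classification)."""
    return {
        "MLModelId": model_id,
        "MLModelName": name,
        "MLModelType": "BINARY",
        "TrainingDataSourceId": training_datasource_id,
    }


def build_evaluation_request(evaluation_id, name, model_id, evaluation_datasource_id):
    """Assemble keyword arguments for create_evaluation."""
    return {
        "EvaluationId": evaluation_id,
        "EvaluationName": name,
        "MLModelId": model_id,
        "EvaluationDataSourceId": evaluation_datasource_id,
    }


# Assumed split: first 70% of records for training, remaining 30% for evaluation.
training_slice = split_rearrangement(0, 70)
evaluation_slice = split_rearrangement(70, 100)

model_request = build_model_request("ml-banking-1", "Subscription propensity model", "ds-train")
evaluation_request = build_evaluation_request(
    "ev-banking-1", "Subscription propensity evaluation", "ml-banking-1", "ds-eval"
)
```

Each slice string would go in the DataRearrangement field of the DataSpec for a separate datasource, one for training and one for evaluation.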
Now that the ML model is successfully created and evaluated, let’s see if it is good enough to put to use. Amazon ML already computed an industry-standard quality metric called the Area Under a Curve (AUC) metric that expresses the performance quality of your ML model. Start by reviewing and interpreting it.
An evaluation describes whether or not your ML model is better than making random guesses. Amazon ML interprets the AUC metric to tell you if the quality of the ML model is adequate for most machine learning applications. Learn more about AUC in the Amazon Machine Learning Concepts.
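To make the comparison with random guessing concrete, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) record pairs in which the positive record receives the higher score, with ties counted as half. This is a standard way to define AUC, shown here for intuition only. It is not necessarily how Amazon ML computes the metric, and the labels and scores are made up.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])     # every positive outranks every negative
partial = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])     # one of four pairs is out of order
uninformed = auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])  # identical scores carry no information
print(perfect, partial, uninformed)  # 1.0 0.75 0.5
```

A value near 0.5 means the scores rank records no better than chance, and a value near 1.0 means positives are almost always scored above negatives.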
Next, let’s look at the AUC metric of your ML model.
To view the AUC metric of your ML model
Choose Amazon Machine Learning, ML models.
In the ML models table, select Subscription propensity model.
On the ML model report page, choose Evaluations, Subscription propensity evaluation.
On the Evaluation summary page, review your information. This page includes a summary of your evaluation, including the AUC performance metric of the ML model.
Next, you set a score threshold in order to change the ML model’s behavior when it makes a mistake.
The ML model works by generating numeric prediction scores, and then applying a threshold to convert these scores into binary labels of 0 or 1. By changing the score threshold, you can adjust which records the ML model predicts as 1 and which as 0.
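The conversion from score to label is a single comparison, as this minimal sketch shows. The function name is an assumption. The two scores and the 0.77 threshold are the example values used later in this tutorial.

```python
def apply_threshold(scores, threshold):
    """Label a record 1 when its score is above the threshold, otherwise 0."""
    return [1 if score > threshold else 0 for score in scores]


labels = apply_threshold([0.88682, 0.76525], threshold=0.77)
print(labels)  # [1, 0]
```

Raising the threshold turns some 1 labels into 0 labels, and lowering it does the opposite. The scores themselves do not change.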
To set a score threshold for your ML model
Amazon ML displays the ML model performance results page. This page includes a chart that shows the score distribution of your predictions. You use this page to view advanced metrics and the effect of different score thresholds on the performance of your model. You can fine-tune your ML model performance metrics by adjusting the score threshold value.
Let’s say you want to target the top 3% of the customers that are most likely to subscribe to the product. Slide the vertical selector to set the score threshold to a value that corresponds to 3% of the records predicted as “1”.
You can review the impact of this score threshold on the ML model’s performance. Now let’s say the false positive rate of 0.007 is acceptable to your application.
Choose Save Score Threshold.
The score threshold is saved for this ML model.
Each time you use this ML model to make predictions, it labels records with scores greater than 0.77 as “1” and all other records as “0”.
Remember, machine learning is an iterative process that requires you to discover what score threshold is most appropriate for you. You can adjust the predictions by adjusting your score threshold based on your use case.
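The slider on the performance page can be thought of as doing something like the following: pick the threshold that labels a chosen share of records as 1, and then report the resulting false positive rate. The sketch below is an illustration under that assumption, with made-up scores and labels. It is not the console's implementation.

```python
def threshold_for_top_percent(scores, percent):
    """Return a threshold so that roughly `percent` of scores lie strictly above it."""
    ranked = sorted(scores, reverse=True)
    keep = len(ranked) * percent // 100
    if keep >= len(ranked):
        return float("-inf")
    # Records scoring above the (keep + 1)-th highest score are labeled 1.
    return ranked[keep]


def false_positive_rate(labels, predictions):
    """Share of true 0 records that were predicted as 1."""
    false_positives = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    negatives = sum(1 for y in labels if y == 0)
    return false_positives / negatives if negatives else 0.0


# Made-up evaluation data: ten records with known labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

threshold = threshold_for_top_percent(scores, 30)
predictions = [1 if s > threshold else 0 for s in scores]
fpr = false_positive_rate(labels, predictions)
print(threshold, sum(predictions), fpr)
```

When several records share the same score, the share labeled 1 can differ slightly from the requested percentage.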
To learn more about the score threshold, see the Amazon Machine Learning Concepts.
In Amazon ML, there are two ways to get predictions—batch and online. If your application requires predictions to be generated in real-time, you first need to mount the ML model to get online predictions. When you mount an ML model, you make it available to generate predictions on demand, and at low latency. These real-time predictions are usually used in interactive web, mobile, or desktop applications.
For this tutorial, you choose the batch method, which generates predictions for a large set of input records without using the Enable for Real-time Prediction interface.
A batch prediction is useful when you want to generate predictions for a set of observations all at once, and you do not have a low latency requirement. For your targeted marketing campaign, you want a single file that contains all of the answers. In this sample problem, you score, as a batch, the customers to whom you have not yet marketed your new product, and you don’t need to predict in real time who will subscribe.
When creating batch predictions, you select your banking data ML model as well as the prediction data from which you want to generate predictions. When the request is complete, your batch predictions are sent to an Amazon S3 bucket that you define. With the predictions from Amazon ML, you can more effectively plan and execute your targeted marketing campaign.
To create batch predictions
Choose Amazon Machine Learning, Batch predictions.
Choose Create new batch prediction.
On the ML Model for batch predictions page, choose Subscription propensity model from the list.
The ML model name, ID, creation time, and the associated datasource ID appear.
To generate predictions, you need to show Amazon ML the data that you need answers to. This is called the input data.
For Locate the input data, choose My data is in S3, and I need to create a datasource.
For Datasource name, type Banking Data 2.
For S3 Location, enter the location of your banking-batch.csv.
For Does the first line in your CSV contain the column names?, choose Yes.
In the S3 permissions dialog box, choose Yes.
Amazon ML validates the location of your data.
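Outside the console, the same request might be expressed as a CreateBatchPrediction call. The sketch below only assembles the arguments, using the parameter names of the boto3 machinelearning client as we understand them. All IDs and the output location are hypothetical placeholders.

```python
def build_batch_prediction_request(prediction_id, name, model_id, datasource_id, output_uri):
    """Assemble keyword arguments for create_batch_prediction."""
    return {
        "BatchPredictionId": prediction_id,
        "BatchPredictionName": name,
        "MLModelId": model_id,
        "BatchPredictionDataSourceId": datasource_id,
        "OutputUri": output_uri,
    }


batch_request = build_batch_prediction_request(
    "bp-banking-1",                       # hypothetical ID
    "Subscription propensity predictions",
    "ml-banking-1",                       # hypothetical ID of the trained ML model
    "ds-banking-2",                       # hypothetical ID of the Banking Data 2 datasource
    "s3://example-bucket/predictions/",   # placeholder output location
)

# With real credentials (not run here):
#   import boto3
#   boto3.client("machinelearning").create_batch_prediction(**batch_request)
```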
To view the predictions
Choose Amazon Machine Learning, Batch predictions.
In the list of batch predictions, choose Subscription propensity predictions. The Batch prediction info page appears.
Navigate to the Output S3 URL in your Amazon S3 console to view the batch prediction.
The prediction is stored in a compressed .gz file.
Download the file to your desktop, and uncompress and open the prediction file.
The file includes two columns: bestAnswer and score. The bestAnswer column is based on the score threshold that you set in step 4.
The following examples show a positive and negative prediction based on the score threshold.
In the positive prediction example, the value for bestAnswer is 1, and the value of score is 0.88682. The value for bestAnswer is 1 because the score value is above the score threshold of 0.77 that you saved.
The value of bestAnswer in the negative prediction example is 0 because the score value is 0.76525, which is less than the score threshold of 0.77.
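If you would rather inspect the prediction file programmatically than open it by hand, a sketch along these lines may help. It assumes the uncompressed file is a CSV with a header row naming the bestAnswer and score columns. The two sample rows reuse the example values above and stand in for the downloaded file.

```python
import csv
import gzip
import io


def read_predictions(gz_bytes):
    """Parse gzip-compressed CSV predictions into (bestAnswer, score) tuples."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["bestAnswer"]), float(row["score"])) for row in reader]


# Stand-in for the downloaded .gz file, built in memory from the two example rows.
sample = gzip.compress(b"bestAnswer,score\n1,0.88682\n0,0.76525\n")
predictions = read_predictions(sample)

# Each label should agree with the saved score threshold of 0.77.
consistent = all(best == (1 if score > 0.77 else 0) for best, score in predictions)
print(predictions, consistent)
```

For a real file, read its bytes with open(path, "rb").read() and pass them to read_predictions.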
You have now successfully completed the tutorial. To prevent your account from accruing additional S3 charges, you should clean up the data stored in S3 for this tutorial.
To delete the input data used for training, evaluation, and batch prediction steps
To delete the predictions generated from the batch prediction step
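Both cleanup tasks come down to deleting objects from your bucket. The following sketch shows one way to script that, assuming a client with the delete_object(Bucket=..., Key=...) method of the boto3 S3 client. RecordingS3Client is a hypothetical stand-in so the sketch runs without touching AWS. The keys of your prediction output files depend on the Output S3 URL and are not shown here.

```python
class RecordingS3Client:
    """Hypothetical stand-in for boto3.client("s3"); records deletions instead of performing them."""

    def __init__(self):
        self.deleted = []

    def delete_object(self, Bucket, Key):
        self.deleted.append((Bucket, Key))


def delete_tutorial_objects(s3_client, bucket, keys):
    """Delete each listed object from the bucket and return how many were requested."""
    keys = list(keys)
    for key in keys:
        s3_client.delete_object(Bucket=bucket, Key=key)
    return len(keys)


# Dry run for the two input files used in this tutorial.
client = RecordingS3Client()
count = delete_tutorial_objects(client, "example-bucket", ["banking.csv", "banking-batch.csv"])
print(count, client.deleted)

# With real credentials (not run here), pass boto3.client("s3") and include
# the keys of the prediction output files that you want to remove.
```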
To learn how to use the API, see the Amazon Machine Learning API Reference.