Machine learning overview

By using Amazon Redshift ML, you can train machine learning models using SQL statements and invoke them in SQL queries for prediction.

How machine learning can solve your problem

A machine learning model generates predictions by finding patterns in your training data and then applying those patterns to new data. In machine learning, you train models to learn the patterns that best explain your data. You then use the models to make predictions (also called inferences) on new data. Machine learning is typically an iterative process: you can continue to improve prediction accuracy by changing parameters and improving your training data. If your data changes, you retrain the model on the new dataset.

To address various business goals, there are different fundamental machine learning approaches. Amazon Redshift supports supervised learning, which is the most common approach to advanced enterprise analytics. Supervised learning is the preferred approach when you have an established dataset and an understanding of how specific input data predicts various business outcomes. These outcomes are sometimes called labels. In particular, your dataset is a table with attributes that comprise features (inputs) and targets (outputs). For example, suppose that you have a table that provides the age and postal code for past and present customers. Suppose that you also have a field “active” that is true for present customers and false for customers who have suspended their membership. The goal of supervised learning is to spot the patterns of age and postal code that lead to customer churn, as represented by customers whose target is “false”. You can then use this model to predict which customers are likely to churn by suspending their membership, and potentially offer them retention incentives.
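The churn example above can be sketched as a single CREATE MODEL statement. The table, column, and function names below are hypothetical, and the IAM role and S3 bucket are placeholders you would replace with your own:

```sql
-- Hypothetical churn model: the table, columns, function name, IAM role,
-- and S3 bucket are placeholders, not values from the example above.
CREATE MODEL customer_churn
FROM (SELECT age, postal_code, active FROM customers)
TARGET active
FUNCTION predict_active
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftML'
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```

Amazon Redshift ML exports the query results to Amazon S3 and starts an Amazon SageMaker Autopilot training job; when the job completes, the prediction function (here, predict_active) becomes callable in SQL queries.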

Amazon Redshift supports supervised learning that includes regression, binary classification, and multiclass classification. Regression refers to the problem of predicting continuous values, such as the total spending of customers. Binary classification refers to the problem of predicting one of two outcomes, such as whether a customer churns or not. Multiclass classification refers to the problem of predicting one of many outcomes, such as the item a customer might be interested in. Data analysts and data scientists can use Amazon Redshift ML to perform supervised learning on problems ranging from forecasting and personalization to customer churn prediction. You can also use supervised learning for problems such as predicting which sales will close, revenue prediction, fraud detection, and customer lifetime value prediction.
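When you already know which of these problem types applies, you can declare it with the PROBLEM_TYPE option rather than letting training infer it. A minimal regression sketch, with hypothetical names and placeholder role and bucket:

```sql
-- Hypothetical regression model predicting a continuous target.
-- PROBLEM_TYPE accepts REGRESSION, BINARY_CLASSIFICATION,
-- or MULTICLASS_CLASSIFICATION.
CREATE MODEL customer_spend
FROM (SELECT age, postal_code, total_spend FROM customers)
TARGET total_spend
FUNCTION predict_spend
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftML'
PROBLEM_TYPE REGRESSION
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```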

Terms and concepts for Amazon Redshift ML

The following terms are used to describe some Amazon Redshift ML concepts:

  • Machine learning in Amazon Redshift trains a model with one SQL command. Amazon Redshift ML and Amazon SageMaker manage all the data conversions, permissions, resource usage, and discovery of the proper model.

  • Training is the phase in which Amazon Redshift creates a machine learning model from a specified subset of your data. Amazon Redshift automatically launches a training job in Amazon SageMaker and generates a model.

  • Prediction (also called inference) is the use of the model in Amazon Redshift SQL queries to predict outcomes. At inference time, Amazon Redshift uses a model-based prediction function as part of a larger query to produce predictions. The predictions are computed locally, on the Amazon Redshift cluster, thus providing high throughput and low latency at no additional cost.

  • With bring your own model (BYOM), you can use a model trained outside of Amazon Redshift with Amazon SageMaker for in-database inference locally in Amazon Redshift. Amazon Redshift ML supports using BYOM in local inference.

  • Local inference is used when models are pretrained in Amazon SageMaker, compiled by Amazon SageMaker Neo, and localized in Amazon Redshift ML. To import models that are supported for local inference into Amazon Redshift, use the CREATE MODEL command. Amazon Redshift imports a pretrained SageMaker model by calling Amazon SageMaker Neo, which compiles the model; Amazon Redshift then imports the compiled model. Use local inference for faster speed and lower cost.

  • Remote inference is used when Amazon Redshift invokes a model endpoint deployed in SageMaker. Remote inference provides the flexibility to invoke all types of custom models and deep learning models, such as TensorFlow models that you built and deployed in Amazon SageMaker.
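As described in the prediction bullet above, invoking a trained model is an ordinary SQL function call. Assuming a hypothetical prediction function predict_active, registered by an earlier CREATE MODEL over age and postal code:

```sql
-- Hypothetical: predict_active(age, postal_code) was registered by a
-- CREATE MODEL statement; the prediction runs locally on the cluster.
SELECT customer_id,
       predict_active(age, postal_code) AS predicted_active
FROM customers;
```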
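A bring-your-own-model import for local inference can be sketched as follows. Because the model was trained outside Amazon Redshift, the FUNCTION clause declares the input and return types explicitly; the training-job name, signature, role, and bucket here are placeholders:

```sql
-- Hypothetical BYOM import: Amazon Redshift calls Amazon SageMaker Neo
-- to compile the named training job's model for local inference.
CREATE MODEL customer_churn_byom
FROM 'my-sagemaker-training-job'
FUNCTION predict_active_byom (int, varchar)
RETURNS int
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftML'
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```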
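Remote inference instead registers a function that calls a deployed SageMaker endpoint. The endpoint name, function signature, and role below are placeholders:

```sql
-- Hypothetical remote-inference model: each call invokes the named
-- SageMaker endpoint rather than running locally on the cluster.
CREATE MODEL remote_fraud_score
FUNCTION predict_fraud_remote (int, decimal(10,2))
RETURNS decimal(8,7)
SAGEMAKER 'my-sagemaker-endpoint'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftML';
```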

Also important are the following:

  • Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can build, train, and deploy models directly into a production-ready hosted environment. For information about Amazon SageMaker, see What is Amazon SageMaker in the Amazon SageMaker Developer Guide.

  • Amazon SageMaker Autopilot is a feature set that automatically trains and tunes the best machine learning models for classification or regression, based on your data. At the same time, you maintain full control and visibility. Amazon SageMaker Autopilot supports input data in tabular format, with automatic data cleaning and preprocessing, and automatic algorithm selection for linear regression, binary classification, and multiclass classification. It also supports automatic hyperparameter optimization (HPO), distributed training, and automatic instance and cluster size selection. For information about Amazon SageMaker Autopilot, see Automate model development with Amazon SageMaker Autopilot in the Amazon SageMaker Developer Guide.