Machine learning overview

By using Amazon Redshift ML, you can train machine learning models using SQL statements and invoke them in SQL queries for prediction.
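
For example, a minimal end-to-end sketch might look like the following. The table, column, and function names (customer_activity, churn, predict_customer_churn, new_customers) and the S3 bucket are hypothetical placeholders, not objects that Amazon Redshift provides.

    -- Train a model from a table of labeled data. Amazon Redshift ML exports the
    -- training data, runs an Amazon SageMaker training job, and registers the
    -- prediction function named below in the database.
    CREATE MODEL customer_churn_model
    FROM customer_activity              -- hypothetical table of features plus label
    TARGET churn                        -- the column the model learns to predict
    FUNCTION predict_customer_churn     -- SQL function created for inference
    IAM_ROLE default
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');

    -- Invoke the trained model in an ordinary SQL query to make predictions.
    SELECT customer_id,
           predict_customer_churn(age, postal_code, monthly_spend) AS churn_prediction
    FROM new_customers;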

For information about the prerequisites for setting up your Redshift cluster, permissions, and ownership for using Amazon Redshift ML, read the following sections. These sections also describe how simple training and predictions work in Amazon Redshift ML.

How machine learning can solve a problem

A machine learning model generates predictions by finding patterns in your training data and then applying these patterns to new data. In machine learning, you train models to learn the patterns that best explain your data. You then use the models to make predictions (also called inferences) on new data. Machine learning is typically an iterative process in which you continue to improve the accuracy of the predictions by changing parameters and improving your training data. If the data changes, you retrain new models with the new dataset.

To address various business goals, there are different fundamental machine learning approaches.

Supervised learning in Amazon Redshift ML

Amazon Redshift supports supervised learning, which is the most common approach to advanced enterprise analytics. Supervised learning is the preferred machine learning approach when you have an established set of data and an understanding of how specific input data predicts various business outcomes. These outcomes are sometimes called labels. In particular, your dataset is a table with attributes that comprise features (inputs) and targets (outputs). For example, suppose that you have a table that provides the age and postal code for past and present customers. Suppose that you also have a field “active” that is true for present customers and false for customers who have suspended their membership. The goal of supervised machine learning is to spot the patterns of age and postal code that lead to customer churn, as represented by customers whose targets are “False.” You can use this model to predict which customers are likely to churn, such as by suspending their membership, and potentially offer retention incentives.
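
To make this scenario concrete, the CREATE MODEL statement for such a churn model might look like the following sketch. The table name customer_info and the function name predict_active are assumptions for illustration; active is the label (target) column described above.

    -- Hypothetical training data: age and postal_code are features, active is the label.
    CREATE MODEL customer_retention_model
    FROM (SELECT age, postal_code, active FROM customer_info)
    TARGET active
    FUNCTION predict_active
    IAM_ROLE default
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');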

Amazon Redshift supports supervised learning that includes regression, binary classification, and multiclass classification. Regression refers to the problem of predicting continuous values, such as the total spending of customers. Binary classification refers to the problem of predicting one of two outcomes, such as predicting whether a customer churns or not. Multiclass classification refers to the problem of predicting one of many outcomes, such as predicting which item a customer might be interested in. Data analysts and data scientists can use supervised learning to tackle problems ranging from forecasting and personalization to customer churn prediction. You can also use supervised learning for problems such as predicting which sales will close, revenue prediction, fraud detection, and customer lifetime value prediction.
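
When you already know which of these problem types applies, you can optionally declare it in the CREATE MODEL statement instead of letting the training process infer it, as in the following sketch. The names reuse the hypothetical churn model above, and the objective metric shown is only one of the supported values.

    -- Explicitly declare a binary classification problem and an objective metric.
    CREATE MODEL customer_churn_model
    FROM customer_activity
    TARGET churn
    FUNCTION predict_customer_churn
    IAM_ROLE default
    PROBLEM_TYPE BINARY_CLASSIFICATION   -- or REGRESSION | MULTICLASS_CLASSIFICATION
    OBJECTIVE 'F1'                       -- metric that training optimizes (assumed here)
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');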

Unsupervised learning in Amazon Redshift ML

Unsupervised learning uses machine learning algorithms to analyze and group unlabeled training data. The algorithms discover hidden patterns or groupings. The goal is to model the underlying structure or distribution in the data to learn more about the data.

Amazon Redshift supports the K-Means clustering algorithm to solve unsupervised learning problems. This algorithm solves clustering problems where you want to discover groupings in the data. The K-Means algorithm attempts to find discrete groupings within the data: unclassified data is grouped and partitioned based on its similarities and differences. The K-Means algorithm iteratively determines the best centroids and assigns each member to the closest centroid. Members nearest the same centroid belong to the same group. Members of a group are as similar as possible to other members in the same group, and as different as possible from members of other groups. For example, you can use the K-Means clustering algorithm to group cities impacted by a pandemic or to group cities based on the popularity of consumer products.

When using the K-Means algorithm, you provide an input k that specifies the number of clusters to find in the data. The output of this algorithm is a set of k centroids, and each data point is assigned to the cluster whose centroid is closest to it. Each cluster is described by its centroid, which can be thought of as the multi-dimensional average of the cluster. The K-Means algorithm compares the distances between centroids to see how different the clusters are from each other. A larger distance generally indicates a greater difference between the clusters.

Preprocessing the data is important for K-Means, as it ensures that the features of the model stay on the same scale and produce reliable results. Amazon Redshift supports some K-Means preprocessors for the CREATE MODEL statement, such as StandardScaler, MinMax, and NumericPassthrough. If you don't want to apply any preprocessing for K-Means, choose NumericPassthrough explicitly as a transformer. For more information about K-Means parameters, see CREATE MODEL with K-MEANS parameters.
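
As a sketch, a K-Means model with an explicit preprocessor and a cluster count of 5 might be created as follows. The table, columns, function name, and bucket are hypothetical, and the preprocessor structure follows the form described in CREATE MODEL with K-MEANS parameters.

    -- Unsupervised model: no TARGET clause; AUTO OFF and MODEL_TYPE KMEANS are specified.
    CREATE MODEL city_clusters
    FROM (SELECT population, median_income, avg_temperature FROM cities)
    FUNCTION assign_city_cluster
    IAM_ROLE default
    AUTO OFF
    MODEL_TYPE KMEANS
    PREPROCESSORS '[
      {
        "ColumnSet": ["population", "median_income", "avg_temperature"],
        "Transformers": ["StandardScaler"]
      }
    ]'
    HYPERPARAMETERS DEFAULT EXCEPT (K '5')
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');

The registered function (assign_city_cluster in this sketch) then returns the cluster that each input row belongs to.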

Terms and concepts for Amazon Redshift ML

The following terms are used to describe some Amazon Redshift ML concepts:

  • Machine learning in Amazon Redshift trains a model with one SQL command. Amazon Redshift ML and Amazon SageMaker manage all the data conversions, permissions, resource usage, and discovery of the proper model.

  • Training is the phase in which Amazon Redshift creates a machine learning model by running a specified subset of data through the model. Amazon Redshift automatically launches a training job in Amazon SageMaker and generates a model.

  • Prediction (also called inference) is the use of the model in Amazon Redshift SQL queries to predict outcomes. At inference time, Amazon Redshift uses a model-based prediction function as part of a larger query to produce predictions (see the example query after this list). The predictions are computed locally on the Redshift cluster, thus providing high throughput, low latency, and zero additional cost.

  • With bring your own model (BYOM), you can use a model trained outside of Amazon Redshift with Amazon SageMaker for in-database inference locally in Amazon Redshift. Amazon Redshift ML supports using BYOM in local inference.

  • Local inference is used when models are pretrained in Amazon SageMaker, compiled by Amazon SageMaker Neo, and localized in Amazon Redshift ML. To import models that are supported for local inference into Amazon Redshift, use the CREATE MODEL command. Amazon Redshift calls Amazon SageMaker Neo to compile the pretrained SageMaker model and then imports the compiled model into Amazon Redshift (see the BYOM sketch after this list). Use local inference for faster speed and lower costs.

  • Remote inference is used when Amazon Redshift invokes a model endpoint deployed in SageMaker. Remote inference provides the flexibility to invoke all types of custom models and deep learning models, such as TensorFlow models that you built and deployed in Amazon SageMaker.
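
As referenced in the prediction item above, the prediction function behaves like any other SQL function and can appear inside a larger query. The function, table, and column names below are the hypothetical ones used earlier, and the comparison assumes the churn label is stored as 0 or 1.

    -- Inference runs locally on the Redshift cluster as the query executes.
    SELECT postal_code,
           COUNT(*) AS customers_predicted_to_churn
    FROM new_customers
    WHERE predict_customer_churn(age, postal_code, monthly_spend) = 1
    GROUP BY postal_code
    ORDER BY customers_predicted_to_churn DESC;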
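
For the BYOM items above, the two forms differ mainly in how the model is referenced. The following sketches use placeholder job, endpoint, function, and bucket names; the argument types and return type must match the model that you actually bring.

    -- BYOM for local inference: reference a completed SageMaker training job
    -- (placeholder name). The model is compiled and imported into Amazon Redshift.
    CREATE MODEL local_churn_model
    FROM 'my-sagemaker-training-job'
    FUNCTION predict_churn_local (int, varchar, float)
    RETURNS int
    IAM_ROLE default
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');

    -- Remote inference: invoke a model endpoint deployed in SageMaker (placeholder name).
    CREATE MODEL remote_churn_model
    FUNCTION predict_churn_remote (int, varchar, float)
    RETURNS int
    SAGEMAKER 'my-sagemaker-endpoint'
    IAM_ROLE default;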

Also important are the following:

  • Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can easily build, train, and directly deploy models into a production-ready hosted environment. For information about Amazon SageMaker, see What is Amazon SageMaker in the Amazon SageMaker Developer Guide.

  • Amazon SageMaker Autopilot is a feature set that automatically trains and tunes the best machine learning models for classification or regression, based on your data. You maintain full control and visibility. Amazon SageMaker Autopilot supports input data in tabular format. It provides automatic data cleaning and preprocessing, and automatic algorithm selection for linear regression, binary classification, and multiclass classification. It also supports automatic hyperparameter optimization (HPO), distributed training, and automatic instance and cluster size selection. For information about Amazon SageMaker Autopilot, see Automate model development with Amazon SageMaker Autopilot in the Amazon SageMaker Developer Guide.