This is prerelease documentation for a service in preview release. It is subject to change.

The DeepAR+ Recipe

Amazon Forecast DeepAR+ is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNNs). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series, and then use that model to extrapolate the time series into the future. In many applications, however, you have many similar time series across a set of cross-sectional units, such as demand for different products, server loads, and requests for web pages. In this case, it can be beneficial to train a single model jointly over all of the time series. DeepAR+ takes this approach. When your dataset contains hundreds of related time series, the DeepAR+ recipe outperforms the standard ARIMA and ETS methods. You can also use a trained model to generate forecasts for new time series that are similar to the ones it has been trained on.

How DeepAR+ Works

During training, DeepAR+ uses a training dataset and an optional testing dataset. It uses the testing dataset to evaluate the trained model. In general, the training and testing datasets don't have to contain the same set of time series. You can use a model trained on a given training set to generate forecasts for the future of the time series in the training set, and for other time series. Both the training and the testing datasets consist of (preferably more than one) target time series. Optionally, they can be associated with a vector of feature time series and a vector of categorical features (for details, see DeepAR Input/Output Interface in the Amazon SageMaker Developer Guide). The following example shows how this works for an element of a training dataset indexed by i. The training dataset consists of a target time series, zi,t, and two associated feature time series, xi,1,t and xi,2,t.


[Image: DeepAR+ time-series data]
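To make the structure concrete, here is a minimal sketch of what one training element might look like, following the JSON Lines convention of the SageMaker DeepAR interface referenced above (the field names come from that interface; the values are purely illustrative):

    import json

    # One training element i: a target series z_{i,t}, two feature time series
    # x_{i,1,t} and x_{i,2,t} in dynamic_feat, and two categorical features.
    element = {
        "start": "2019-01-01 00:00:00",       # timestamp of the first target value
        "target": [5.2, 5.8, "NaN", 6.1],     # z_{i,t}; "NaN" marks a missing value
        "dynamic_feat": [                     # x_{i,1,t} and x_{i,2,t}
            [0.31, 0.35, 0.34, 0.36],
            [1.0, 0.0, 0.0, 1.0],
        ],
        "cat": [2, 7],                        # categorical groupings for this series
    }
    print(json.dumps(element))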

The target time series might contain missing values (denoted in the graphs by breaks in the time series). DeepAR+ supports only feature time series that are known in the future. This allows you to run counterfactual "what-if" scenarios. For example, "What happens if I change the price of a product in some way?"

Each target time series can also be associated with a number of categorical features. You can use these to encode that a time series belongs to certain groupings. Using categorical features allows the model to learn typical behavior for those groupings, which can increase accuracy. A model implements this by learning an embedding vector for each group that captures the common properties of all time series in the group.

To facilitate learning time-dependent patterns, such as spikes during weekends, DeepAR+ automatically creates feature time series based on the time-series granularity. For example, at a weekly frequency, DeepAR+ creates two feature time series: day of the month and week of the year. It uses these derived feature time series along with the custom feature time series that you provide during training and inference. The following example shows two derived time-series features: ui,1,t represents the hour of the day, and ui,2,t the day of the week.


[Image: DeepAR+ under the hood: derived features]

DeepAR+ automatically includes these feature time series based on the data frequency and the size of the training data. The following table lists the features that can be derived for each supported basic time frequency.

Frequency of the Time Series    Derived Features
Minute                          minute-of-hour, hour-of-day, day-of-week, day-of-month, day-of-year
Hour                            hour-of-day, day-of-week, day-of-month, day-of-year
Day                             day-of-week, day-of-month, day-of-year
Week                            day-of-month, week-of-year
Month                           month-of-year
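
As an illustration, the derived features in the table can be reproduced with a few lines of pandas (a sketch only; Amazon Forecast computes these features internally, so you never supply them yourself). The example below uses an hourly index:

    import pandas as pd

    # Hourly timestamps; per the table, an hourly frequency derives
    # hour-of-day, day-of-week, day-of-month, and day-of-year features.
    index = pd.date_range("2019-01-01", periods=24 * 14, freq="H")
    derived = pd.DataFrame(
        {
            "hour_of_day": index.hour,
            "day_of_week": index.dayofweek,
            "day_of_month": index.day,
            "day_of_year": index.dayofyear,
        },
        index=index,
    )
    print(derived.head())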

A DeepAR+ model is trained by randomly sampling several training examples from each of the time series in the training dataset. Each training example consists of a pair of adjacent context and prediction windows with fixed predefined lengths. The context_length hyperparameter controls how far in the past the network can see, and the prediction_length hyperparameter controls how far in the future predictions can be made. During training, Amazon Forecast ignores elements in the training dataset with time series shorter than the specified prediction length. The following example shows five samples, with a context length of 12 hours and a prediction length of 6 hours, drawn from element i. For the sake of brevity, we've excluded the feature time series xi,1,t and ui,2,t.


[Image: DeepAR+ under the hood: sampled training examples]
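
The sampling shown above can be sketched as follows (NumPy assumed; the random series and the sampling loop are illustrative stand-ins for what Amazon Forecast does internally):

    import numpy as np

    context_length = 12      # hours the network sees before predicting
    prediction_length = 6    # hours the network learns to predict
    window = context_length + prediction_length

    rng = np.random.default_rng(seed=0)
    series = rng.random(200)                 # stand-in for one target series z_i

    samples = []
    for _ in range(5):                       # five samples, as in the figure
        start = rng.integers(0, len(series) - window + 1)
        context = series[start : start + context_length]
        future = series[start + context_length : start + window]
        samples.append((context, future))    # one adjacent context/prediction pair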

To capture seasonality patterns, DeepAR+ also automatically feeds lagged values from the target time series into the model. In our example with samples taken at an hourly frequency, for each time index t = T, the model has access to the zi,t values that occurred approximately one, two, and three days in the past.


[Image: DeepAR+ under the hood: lags]
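
For an hourly series, the lags described above correspond to offsets of roughly 24, 48, and 72 time steps, as in this small sketch (NumPy assumed; the series is an illustrative stand-in):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    series = rng.random(500)                   # stand-in for one hourly series z_i

    t = 400                                    # a time index t = T within the series
    lags = {f"{lag} hours back": series[t - lag] for lag in (24, 48, 72)}
    print(lags)                                # values fed in alongside the recent context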

For inference, the trained model takes as input the target time series, which might or might not have been used during training, and forecasts a probability distribution for the next prediction_length values. Because DeepAR+ is trained on the entire dataset, the forecast takes into account learned patterns from similar time series.
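
After you generate a forecast from a trained predictor, the distribution is exposed as quantiles. The following is a hedged sketch of retrieving them with the boto3 forecastquery client; the forecast ARN and item_id are placeholders:

    import boto3

    client = boto3.client("forecastquery")
    response = client.query_forecast(
        ForecastArn="arn:aws:forecast:us-west-2:123456789012:forecast/my_forecast",
        Filters={"item_id": "item_001"},       # placeholder item identifier
    )
    # Predictions are keyed by quantile, for example "p10", "p50", and "p90".
    for quantile, points in response["Forecast"]["Predictions"].items():
        print(quantile, points[:2])            # each point has a Timestamp and a Value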

For information on the mathematics behind DeepAR+, see DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks on the Cornell University Library website.

Exclusive Features of Amazon Forecast DeepAR+

The Amazon Forecast DeepAR+ algorithm improves upon the Amazon SageMaker DeepAR algorithm with the following new features:

  • Learning rate scheduling

    During a single training run, DeepAR+ can reduce its learning rate. This often reduces loss and forecasting error (see the sketch following this list).

  • Model averaging

    With the DeepAR+ algorithm, Amazon Forecast can average the best models encountered during a training run. This can reduce forecasting error and dramatically increase model stability. Your DeepAR+ model is more likely to provide robust results every time you train it.

  • Weighted sampling

    When you use a very large training dataset, DeepAR+ applies streaming sampling to ensure convergence despite the size of the training dataset. A DeepAR+ model can be trained with millions of time series in a matter of hours.
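
The following is a conceptual sketch, in Python with a stubbed training step, of how learning rate scheduling and model averaging fit together. The hyperparameter names match the table in the next section, but the control flow here is illustrative, not the Amazon Forecast implementation:

    import random

    def train_one_epoch(learning_rate):
        # Stub standing in for one epoch of RNN training; returns (loss, weights).
        return random.random(), {"weights": random.random()}

    learning_rate = 0.001            # initial learning rate
    learning_rate_decay = 0.5        # factor applied at each reduction
    early_stopping_patience = 10     # epochs without progress before acting
    max_learning_rate_decays = 5     # reductions allowed before training stops
    num_averaged_models = 5          # best checkpoints to average at the end

    best_loss = float("inf")
    stale_epochs = 0
    num_decays = 0
    checkpoints = []                 # (loss, weights) for every epoch

    for epoch in range(100):
        loss, weights = train_one_epoch(learning_rate)
        checkpoints.append((loss, weights))
        if loss < best_loss:
            best_loss, stale_epochs = loss, 0
        else:
            stale_epochs += 1
        if stale_epochs >= early_stopping_patience:
            if num_decays >= max_learning_rate_decays:
                break                                 # early stopping
            learning_rate *= learning_rate_decay      # learning rate scheduling
            num_decays, stale_epochs = num_decays + 1, 0

    # Model averaging: combine the best checkpoints seen during training.
    best = sorted(checkpoints, key=lambda c: c[0])[:num_averaged_models]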

For information on how to use these features, see DeepAR+ Hyperparameters.

DeepAR+ Hyperparameters

DeepAR+ supports the following hyperparameters:
time_freq

The granularity of the time series in the dataset. Use time_freq to choose appropriate date features and lags. The model supports the following basic frequencies, as well as integer multiples of them. For example, 5min corresponds to every 5 minutes.

Basic frequencies:

  • M: monthly

  • W: weekly

  • D: daily

  • H: hourly

  • min: every minute

Required: Yes
Valid values: An integer followed by M, W, D, H, or min. For example, 5min.
Default value: N/A

prediction_length

The number of time steps that the model is trained to predict, also called the forecast horizon. The trained model always generates forecasts with this length. The prediction_length is fixed when a model is trained and cannot be changed later.

Required: Yes
Valid values: Positive integer
Default value: N/A

context_length

The number of time points that the model reads in before making the prediction. The value for this parameter should be about the same as the prediction_length. The model also receives lagged inputs from the target, so context_length can be much smaller than typical seasonalities. For example, a daily time series can have yearly seasonality. The model automatically includes a lag of one year, so the context length can be shorter than a year. The lag values that the model picks depend on the frequency of the time series. For example, lag values for daily frequency are: previous week, 2 weeks, 3 weeks, 4 weeks, and year.

Required: Yes
Valid values: Positive integer
Default value: N/A

likelihood

The model generates a probabilistic forecast, and can provide quantiles of the distribution and return samples. Depending on your data, choose an appropriate likelihood (noise model) to use for uncertainty estimates:

  • gaussian: Use for real-valued data.

  • beta: Use for real-valued targets between 0 and 1, inclusive.

  • student-T: An alternative for real-valued data that works well for bursty data.

  • negative-binomial: Use for count data (non-negative integers).

  • deterministic-L1: A loss function that does not estimate uncertainty and only learns a point forecast.

Required: No
Valid values: One of gaussian, beta, negative-binomial, student-T, or deterministic-L1
Default value: student-T

epochs

The number of passes over the training data. The optimal value depends on the size of your data and the learning rate. See also early_stopping_patience. Typical values range from 10 to 1000.

Required: Yes
Valid values: Positive integer
Default value: N/A

num_batches_per_epoch

The number of batches used per epoch. If you don't define this parameter, one epoch corresponds to a pass over the whole dataset. We recommend using this parameter when the training set contains a very large number of time series. Typical values range from 100 to 1000.

Required: Yes
Valid values: Positive integer
Default value: N/A

num_dynamic_feat

The number of dynamic features provided in the data. For example, if two dynamic features are provided, set this to 2. To use the dynamic_feat field if it is present in the data and extract the number of features from the dataset automatically, set this to auto. To disable the dynamic_feat field, even if it is present in the data, set this to an empty string.

Required: No
Valid values: Positive integer, auto, or an empty string
Default value: auto

cardinality

Applies only when using categorical features (cat). cardinality is a list that specifies the number of categories (groups) per categorical feature. To use the cat field if it is present in the data and extract the cardinalities automatically from the dataset, set this value to auto. To disable use of the cat field, even if it is present in the data, set it to an empty string.

Required: No
Valid values: A list of positive integers, auto, or an empty string
Default value: auto

embedding_dimension

The size of the embedding vector learned per categorical feature (the algorithm uses the same value for all categorical features). A DeepAR+ model can learn group-level time-series patterns when a categorical grouping feature is provided. The model learns an embedding vector of size embedding_dimension for each group, capturing the common properties of all time series in the group. A larger embedding_dimension allows the model to capture more complex patterns. However, because increasing the embedding_dimension value increases the number of parameters in the model, more training data is required to accurately learn these parameters. Typical values are from 10 to 100.

Required: No
Valid values: Positive integer
Default value: 10

num_cells

The number of cells to use in each hidden layer of the RNN. Typical values range from 30 to 100.

Required: No
Valid values: Positive integer
Default value: 40

num_layers

The number of hidden layers in the RNN. Typical values range from 1 to 4.

Required: No
Valid values: Positive integer
Default value: 2

mini_batch_size

The size of mini-batches used during training. Typical values range from 32 to 512.

Required: No
Valid values: Positive integer
Default value: 128

learning_rate

The learning rate used in training. Typical values range from 0.0001 to 0.1.

Required: No
Valid values: Floating-point number
Default value: 0.001

early_stopping_patience

If this parameter is set, training stops when no progress is made within the specified number of epochs and the learning rate can no longer be reduced (see learning_rate_decay and max_learning_rate_decays). The model with the lowest loss is returned as the final model.

Required: No
Valid values: Integer
Default value: 10

learning_rate_decay

During training, the learning rate is reduced by this factor every time training metrics don't improve for early_stopping_patience epochs. At most, the learning rate is reduced max_learning_rate_decays times, then training stops.

Required: No
Valid values: Floating-point number
Default value: 0.5

max_learning_rate_decays

The maximum number of learning rate reductions that should occur.

Required: No
Valid values: Integer
Default value: 5

num_averaged_models

During a single training run, DeepAR+ encounters multiple intermediate models, each with different forecasting strengths and weaknesses. DeepAR+ can average the behavior of these models to take advantage of all of their strengths. When this hyperparameter is set, DeepAR+ averages the best num_averaged_models encountered during training. Typical values range from 5 to 10. Model averaging takes effect only when both early_stopping_patience and num_averaged_models are set.

Required: Yes
Valid values: Positive integer
Default value: N/A

dropout_rate

The dropout rate to use during training. The model uses zoneout regularization. For each iteration, a random subset of hidden neurons is not updated. Typical values are less than 0.2.

Required: No
Valid values: Floating-point number
Default value: 0.1

test_quantiles

Quantiles for which to calculate quantile loss on the test channel.

Required: No
Valid values: Array of floating-point numbers
Default value: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

num_eval_samples

The number of samples per time series used to calculate metrics on the test channel. This hyperparameter doesn't have any influence on training or on the final model; in particular, the model can be queried with a different number of samples. It affects only the reported accuracy scores on the test channel after training. Smaller values result in faster evaluation, but the evaluation is less accurate. When evaluating with higher quantiles, for example 0.95, consider increasing the number of evaluation samples.

Required: No
Valid values: Integer
Default value: 100
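
To set these hyperparameters when training, you pass them with the predictor. The following is a hedged sketch using the boto3 forecast client's create_predictor operation; the ARNs are placeholders, and the parameter values are only examples:

    import boto3

    forecast = boto3.client("forecast")
    forecast.create_predictor(
        PredictorName="my_deep_ar_plus_predictor",
        AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
        ForecastHorizon=24,                       # corresponds to prediction_length
        PerformAutoML=False,
        PerformHPO=False,
        TrainingParameters={                      # values are passed as strings
            "epochs": "100",
            "context_length": "24",
            "likelihood": "student-T",
            "learning_rate": "0.001",
            "learning_rate_decay": "0.5",
            "early_stopping_patience": "10",
            "num_averaged_models": "5",
        },
        InputDataConfig={
            "DatasetGroupArn": "arn:aws:forecast:us-west-2:123456789012:dataset-group/my_group"
        },
        FeaturizationConfig={"ForecastFrequency": "H"},
    )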

Tune DeepAR+ Models

To tune Amazon Forecast DeepAR+ models, follow these recommendations for optimizing the training process and hardware configuration.

Best Practices for Process Optimization

To achieve the best results, follow these recommendations:

  • Except when splitting the training and testing datasets, always provide entire time series for training and testing, and when calling the model for inference. Regardless of how you set context_length, don't divide the time series or provide only a part of it. The model will use data points further back than context_length for the lagged values feature.

  • For model tuning, you can split the dataset into training and testing datasets. In a typical evaluation scenario, you should test the model on the same time series used in training, but on the future prediction_length time points immediately after the last time point visible during training. To create training and testing datasets that satisfy these criteria, use the entire dataset (all of the time series) as the testing dataset and remove the last prediction_length points from each time series for training, as sketched in the example after this list. This way, during training, the model doesn't see the target values for the time points on which it is evaluated during testing. In the test phase, the last prediction_length points of each time series in the testing dataset are withheld and a prediction is generated. The forecast is then compared with the actual values for the last prediction_length points. You can create more complex evaluations by repeating time series multiple times in the testing dataset, but cutting them off at different end points. This produces accuracy metrics that are averaged over multiple forecasts from different time points.

  • Avoid using very large values (> 400) for the prediction_length because this slows down the model and makes it less accurate. If you want to forecast further into the future, consider aggregating your data at a lower frequency. For example, use 5min instead of 1min.

  • Because of lags, the model can look further back than context_length. Therefore, you don't have to set this parameter to a large value. A good starting point for this parameter is the same value as the prediction_length.

  • Train DeepAR+ models with as many time series as are available. Although a DeepAR+ model trained on a single time series might already work well, standard forecasting methods such as ARIMA or ETS might be more accurate and are more tailored to this use case. DeepAR+ starts to outperform the standard methods when your dataset contains hundreds of related time series.
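
As a small illustration of the evaluation split described in the second item above, the following sketch (pandas assumed, with a hypothetical long-format dataset) keeps the full series for testing and withholds the last prediction_length points from each series for training:

    import pandas as pd

    prediction_length = 24

    # Hypothetical long-format dataset: one row per (item_id, timestamp, target).
    df = pd.DataFrame({
        "item_id": ["item_001"] * 100,
        "timestamp": pd.date_range("2019-01-01", periods=100, freq="H"),
        "target": range(100),
    })

    test_df = df                                         # entire time series
    train_df = (
        df.groupby("item_id", group_keys=False)
          .apply(lambda g: g.iloc[:-prediction_length])  # withhold the tail
    )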