DeepAR Forecasting
Amazon SageMaker DeepAR is a supervised learning algorithm for forecasting scalar time series using recurrent neural networks (RNNs). Classical forecasting methods, such as Autoregressive Integrated Moving Average (ARIMA) or Exponential Smoothing (ETS), fit one model to each individual time series, and then use that model to extrapolate the time series into the future. In many applications, however, you might have many similar time series across a set of cross-sectional units (for example, demand for different products, load of servers, requests for web pages, and so on). In this case, it can be beneficial to train a single model jointly over all of these time series. DeepAR takes this approach, training a single forecasting model over a large set of related time series.
For the training phase, the dataset consists of one or, preferably, more than one time series, plus an optional categorical variable that identifies the group each time series belongs to. The model learns entirely from these values. The DeepAR algorithm currently accepts no other external features. The model is trained by randomly selecting time points from the provided time series and using them as training examples.
For inference, the trained model takes as input an individual time series, which might or might not have been used during training, and generates a forecast for the time series. This forecast takes into account what typically happened for similar time series in the training set.
Input/Output Interface
DeepAR supports two data channels. The train channel is used for training a model and is required. The test channel is optional. If the test channel is present, the algorithm uses it to calculate accuracy metrics for the model after training. You can provide datasets as JSON or Parquet files.
By default, the model determines the input format from the file extension (either .json or .parquet). If you provide input files with different extensions, you can specify the file type by setting the ContentType parameter of the Channel data type.
If you use a JSON file, it must be in the JSON Lines format, where each record contains the following fields:

"start", whose value is a string of the format YYYY-MM-DD HH:MM:SS.

"target", whose value is an array of floats (or integers) that represent the values of the time series.

"cat" (optional), whose value is an integer that encodes the categorical group that the record's time series belongs to. The categorical feature allows the model to learn typical behavior for that group, which can increase accuracy.

The following is an example of JSON data:

{"start":"2009-11-01 00:00:00", "target": [4.3, 10.3, ...], "cat": 0}
{"start":"2012-01-30 00:00:00", "target": [1.0, 5.0, ...], "cat": 2}
{"start":"1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": 0}
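As a minimal sketch of producing records in this shape, the following Python writes one JSON object per line. The function name to_json_lines and the in-memory record layout (a datetime under "start") are illustrative assumptions, as is the hyphenated timestamp format YYYY-MM-DD HH:MM:SS; none of this is part of the DeepAR API.

```python
import json
from datetime import datetime


def to_json_lines(records):
    """Serialize time series records to JSON Lines text (one object per line).

    Each input record is a dict with a datetime "start", a list "target",
    and optionally an integer "cat". This layout is an assumption made
    for illustration.
    """
    lines = []
    for record in records:
        obj = {
            # Assumed timestamp format: YYYY-MM-DD HH:MM:SS
            "start": record["start"].strftime("%Y-%m-%d %H:%M:%S"),
            "target": [float(v) for v in record["target"]],
        }
        if "cat" in record:
            obj["cat"] = int(record["cat"])
        lines.append(json.dumps(obj))
    return "\n".join(lines) + "\n"


example = [
    {"start": datetime(2009, 11, 1), "target": [4.3, 10.3], "cat": 0},
    {"start": datetime(1999, 1, 30), "target": [2, 1], "cat": 0},
]
jsonl_text = to_json_lines(example)
```

Writing jsonl_text to a file with a .json extension would then give a file in the layout described above.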
For Parquet, you use the same three fields as columns. In addition, "start" can be the datetime type. The gzip and snappy compression types are also supported.
For training data:

All time series must have the same time unit: minutes, hours, days, weeks, or months.

To train an accurate model, the training set should contain a sufficient number of time series (typically at least a few hundred) and should cover a representative time range, for example one or more years when yearly seasonal patterns occur.

The training file should be shuffled. In other words, the time series should occur in a random order in the file.

If you use the categorical feature ("cat"), all time series must have this feature. The largest value of "cat", and all values between 0 and this largest value, must be present in the training data.
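The requirements above on the categorical feature and on shuffling can be checked before uploading data. The following is a sketch only: the helper name check_and_shuffle and the record layout (a list of dicts with an optional "cat" key) are assumptions for illustration, not part of DeepAR.

```python
import random


def check_and_shuffle(records, seed=0):
    """Validate the "cat" field across records, then return a shuffled copy."""
    with_cat = [r for r in records if "cat" in r]
    if with_cat and len(with_cat) != len(records):
        raise ValueError('either all or none of the time series must have "cat"')
    if with_cat:
        seen = {r["cat"] for r in records}
        # Every value from 0 up to the largest category must appear.
        expected = set(range(max(seen) + 1))
        missing = expected - seen
        unexpected = seen - expected  # negative or non-integer values
        if missing or unexpected:
            raise ValueError(
                f"missing categories: {sorted(missing)}, "
                f"invalid categories: {sorted(unexpected)}"
            )
    shuffled = list(records)
    # Fixed seed so the shuffle is reproducible; the input list is not modified.
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

The check raises a ValueError for a training set whose categories are, for example, 0 and 2 with no series in category 1.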
If you specify optional test channel data, the DeepAR algorithm evaluates the trained model with different accuracy metrics. The algorithm calculates the root mean square error (RMSE) over the test data as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{nT} \sum_{i,t} \left(\hat{y}_{i,t} - y_{i,t}\right)^2}$$

where y_{i,t} is the true value of time series i at time t and ŷ_{i,t} is the mean prediction. The sum is over all n time series in the test set and over the last T time points for each time series, where T corresponds to the forecast horizon. You specify the length of the forecast horizon by setting the prediction_length hyperparameter (see DeepAR Hyperparameters).
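To make the RMSE definition concrete (the square root of the mean squared difference between the mean prediction and the true value, taken over all n test series and the last T points of each), here is a small sketch in plain Python. The function name rmse and the list-of-lists inputs are illustrative assumptions; this is not the algorithm's internal implementation.

```python
import math


def rmse(actuals, mean_forecasts):
    """RMSE over n series, each holding the last T points (the forecast horizon).

    actuals[i][t] is the true value y_{i,t}; mean_forecasts[i][t] is the
    mean prediction for the same point.
    """
    squared_errors = [
        (y_hat - y) ** 2
        for series_y, series_y_hat in zip(actuals, mean_forecasts)
        for y, y_hat in zip(series_y, series_y_hat)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two series (n = 2) with a horizon of two points (T = 2).
value = rmse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 4.0], [3.0, 4.0]])  # 1.0
```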
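In addition, the accuracy of the forecast distribution is evaluated using weighted quantile loss. For a quantile τ in the range [0, 1], the weighted quantile loss is defined as follows:

$$\mathrm{wQuantileLoss}[\tau] = 2\,\frac{\sum_{i,t} Q_{i,t}^{(\tau)}}{\sum_{i,t} |y_{i,t}|}, \qquad Q_{i,t}^{(\tau)} = \begin{cases} (1-\tau)\,\left|q_{i,t}^{(\tau)} - y_{i,t}\right| & \text{if } q_{i,t}^{(\tau)} > y_{i,t} \\ \tau\,\left|q_{i,t}^{(\tau)} - y_{i,t}\right| & \text{otherwise} \end{cases}$$

Here, q_{i,t}^{(τ)} is the τ-quantile of the distribution that the model predicts. Set the test_quantiles hyperparameter to specify the quantiles for which the algorithm calculates quantile loss. In addition, the average of the specified quantile losses is reported as part of the training logs. For more information, see DeepAR Hyperparameters.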
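The weighted quantile loss can be sketched in the same way. The code below assumes the usual definition: a quantile (pinball) loss summed over all series and time points, multiplied by 2 and normalized by the sum of absolute true values. The function name and input layout are hypothetical.

```python
def weighted_quantile_loss(tau, actuals, quantile_forecasts):
    """Weighted quantile loss for quantile tau in [0, 1].

    actuals[i][t] is y_{i,t}; quantile_forecasts[i][t] is the predicted
    tau-quantile q_{i,t} for the same point.
    """
    numerator = 0.0
    denominator = 0.0
    for series_y, series_q in zip(actuals, quantile_forecasts):
        for y, q in zip(series_y, series_q):
            if q > y:
                # Forecast quantile above the true value: weight (1 - tau).
                numerator += (1.0 - tau) * abs(q - y)
            else:
                # Forecast quantile at or below the true value: weight tau.
                numerator += tau * abs(q - y)
            denominator += abs(y)
    return 2.0 * numerator / denominator


# One series, two points, evaluated at the 0.9 quantile (roughly 0.313).
loss = weighted_quantile_loss(0.9, [[10.0, 20.0]], [[12.0, 15.0]])
```

Note that the sketch divides by the sum of absolute true values, so it is undefined when every true value in the test window is zero.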
If you have a set of time series, a simple way to prepare training and test datasets is as follows:

Use the full dataset in the test channel.

In the training channel, remove the last prediction_length points from each time series.

This ensures that the model does not see the removed points during training. Those points are then used to calculate the accuracy of the model.
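The split described above can be sketched as follows. The helper name split_train_test and the in-memory record layout are assumptions for illustration. Skipping series that have no points left after truncation is a choice made in this sketch, not documented DeepAR behavior.

```python
def split_train_test(records, prediction_length):
    """Return (train, test): test keeps full series, train drops the tail."""
    if prediction_length < 1:
        raise ValueError("prediction_length must be a positive integer")
    # Test channel: every series at full length.
    test = [dict(r) for r in records]
    # Train channel: the same series without the last prediction_length points.
    train = [
        {**r, "target": r["target"][:-prediction_length]}
        for r in records
        if len(r["target"]) > prediction_length
    ]
    return train, test


records = [{"start": "2009-11-01 00:00:00", "target": [1, 2, 3, 4, 5]}]
train, test = split_train_test(records, prediction_length=2)
```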
For inference, DeepAR accepts JSON format with an "instances" field, which includes one or more time series in JSON Lines format, and a "configuration" field, which includes parameters for generating the forecast. For details, see DeepAR Request and Response Formats.
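A request body with these two fields might be assembled as in the sketch below. Only "instances" and "configuration" come from the text above. The keys inside "configuration" (num_samples, output_types, quantiles) and their values are assumptions based on the request format documentation, which is not reproduced here, so check them against DeepAR Request and Response Formats before relying on them. The function name build_request is hypothetical.

```python
import json


def build_request(series, num_samples=100, quantiles=("0.1", "0.5", "0.9")):
    """Build a JSON inference request for one or more time series."""
    payload = {
        "instances": list(series),
        # The keys below are assumed, not confirmed by this page.
        "configuration": {
            "num_samples": num_samples,
            "output_types": ["mean", "quantiles"],
            "quantiles": list(quantiles),
        },
    }
    return json.dumps(payload)


request_body = build_request(
    [{"start": "2009-11-01 00:00:00", "target": [4.3, 10.3], "cat": 0}]
)
```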
DeepAR Instance Recommendations
You can train DeepAR on both GPU and CPU instances, in both single and multi-machine settings. We recommend starting with a single CPU instance (for example, c4.xlarge or c4.2xlarge), and switching to GPU instances and multiple machines only when necessary. Using GPUs and multiple machines improves performance only when the model has more than 100 cells in each hidden layer, a mini-batch size greater than 1000, or both.
For information on the mathematics behind DeepAR, see DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.
DeepAR Common Questions
Q: Can the model handle unobserved, missing, or NaN values?
No. Unobserved, missing, and NaN values are not currently supported.
Q: Do all time series require the same length or the same starting point?
No, the time series can have arbitrary starting points and arbitrary lengths. Note, however, that time series shorter than prediction_length are ignored during training.
Q: Is there a one-to-one relation between the training set and the test set?
No. Time series in the training set are used to train the model. After that, the trained model can be used to generate forecasts for the future of the time series used in the training set, or for other time series that were not previously included.
Q: Do I need to train one model per time series?
No. DeepAR trains a single model jointly over the time series you provide, and we recommend that more than one time series be used when training a model.
Q: Can I pass time-dependent features or scalar features?
Currently, only a single categorical feature is supported. In particular, time-dependent covariates are not supported. However, time-dependent features such as day of month are generated internally. Because of this, it is important to set the start field to the correct value rather than to a dummy date.
Q: How do I use the categorical feature?
The categorical feature cat can be used to encode a grouping. If the time series belong to N different groups, you can encode each group by a number (0 to N - 1). The model can then use the categorical feature to generate better forecasts. To use this feature, set the cardinality parameter to the number of groups (that is, N) and also set the embedding_dimension parameter. If either of these two hyperparameters is not set, the cat field in the train and test time series is ignored. The embedding dimension is typically smaller than cardinality, for instance log(N). All categories from 0 to N - 1 must be present in the training data; otherwise, an exception is thrown. This is because, during inference, the model can only forecast for categories that it has seen in training.
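As an illustration of the encoding described above, the sketch below maps arbitrary group labels to integers 0 to N - 1 and derives the two related hyperparameters. The helper names are hypothetical. Using the natural logarithm, rounding up, and a floor of 1 for embedding_dimension are assumptions that turn the log(N) rule of thumb into a concrete number.

```python
import math


def encode_groups(labels):
    """Map group labels to contiguous integer codes 0..N-1."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    codes = [mapping[label] for label in labels]
    return codes, mapping


def categorical_hyperparameters(num_groups):
    """Suggest cardinality and embedding_dimension for N groups."""
    # Rule of thumb from the text: embedding dimension on the order of log(N).
    embedding_dimension = max(1, math.ceil(math.log(num_groups)))
    return {"cardinality": num_groups, "embedding_dimension": embedding_dimension}


codes, mapping = encode_groups(["shoes", "hats", "shoes", "bags"])
hyperparameters = categorical_hyperparameters(len(mapping))
```

Because the codes are derived from the set of labels actually present, every value from 0 to N - 1 occurs at least once, which satisfies the requirement that all categories appear in the training data.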
Q: Can I pass multiple files?
Yes. The training folder and the test folder can each contain multiple files. The file names can be arbitrary, but the file ending should be .json, .gz.json, or .parquet. For example: s3://mybucket/myfolder/train/data1.parquet and s3://mybucket/myfolder/train/data2.parquet.
Q: Can I use this to train on a single time series?
Models need sufficient data to learn typical behavior. A single time series, or a small number of them, is typically not sufficient for training the neural network (unless the time series are very long). While a DeepAR model trained on a single time series will usually still generate sensible forecasts, standard forecasting methods such as ARIMA or ETS may be more accurate and stable. DeepAR starts to outperform the standard methods when your dataset contains hundreds of time series, and it can become significantly more accurate with more data.
Q: Do I need to split data into train/test set for evaluation?
It can be useful. The time series in the train channel are used for training the model. The time series in the test channel are used for evaluation after the model is trained. For the evaluation, the last prediction_length points of each time series in the test set are withheld and a prediction is generated. The forecast is then compared with the actual last prediction_length points. Starting from a dataset of time series, the simplest train/test split can be created by using the entire dataset in the test channel (in other words, all time series at full length). For the train channel, you can then remove the last prediction_length points from each time series so that the model does not see these points during training. You can create more complex evaluations by repeating time series multiple times in the test set, but cutting them at different end points. This results in accuracy metrics that are averaged over multiple forecasts from different time points.
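The more complex evaluation mentioned above, where each series is repeated with different end points, might be built as in this sketch. The function name rolling_test_set, the num_windows parameter, and the choice to step the end point back by exactly prediction_length each time are illustrative assumptions.

```python
def rolling_test_set(records, prediction_length, num_windows):
    """Repeat each series num_windows times, cut at successively earlier ends."""
    test = []
    for r in records:
        n = len(r["target"])
        for k in range(num_windows):
            end = n - k * prediction_length
            # Stop when nothing would remain before the withheld window.
            if end <= prediction_length:
                break
            test.append({**r, "target": r["target"][:end]})
    return test


records = [{"start": "2009-11-01 00:00:00", "target": list(range(10))}]
test_set = rolling_test_set(records, prediction_length=2, num_windows=3)
lengths = [len(r["target"]) for r in test_set]  # [10, 8, 6]
```

With this scheme, the training series would need to drop the last num_windows * prediction_length points so that no evaluated point is seen during training.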
Q: Can the forecast horizon be changed later?
No. The forecast horizon (prediction_length) is fixed when a model is trained and cannot be changed later.
Q: Can the time series in the dataset have different frequencies?
No, all time series in the dataset have to have the same frequency (for example, hourly).
Q: Do I have to split my individual time series?
No. You should not split individual time series into pieces. Each time series should be provided as a whole unit in the dataset (see training format). It is also important to make sure the start point is accurate for each time series.
Q: What is context_length and how should I set it?
The context_length corresponds to the number of data points the algorithm gets to see before making a prediction. Typical values are of the same order of magnitude as the forecast length. Note that the algorithm also uses a set of so-called "lags" that take into account observations that are farther back in time. For instance, with a daily time series where you want to predict one week ahead, you might set the context_length to 14 days. The lags are then set automatically, depending on the frequency of the data. For daily data, in addition to current data, they take into account observations from one month earlier and from one year earlier. As a result, for the prediction, the algorithm reads the last 14 days, 14 days from one month ago, and 14 days from one year ago. Because of this, it is important to provide the entire time series both when training and when doing inference.