Automated deployment - Improving Forecast Accuracy with Machine Learning

Automated deployment

Before you launch the solution, review the architecture and configuration in this guide. Follow the step-by-step instructions in this section to configure and deploy the solution into your account.

Time to deploy: Approximately five minutes

Prerequisites

Update the stack

If you have previously deployed the solution, use this procedure to update the Improving Forecast Accuracy with Machine Learning CloudFormation stack to get the latest version of the solution.

  1. Sign in to the AWS CloudFormation Console, select the existing Improving Forecast Accuracy with Machine Learning CloudFormation stack, and select Update.

  2. Select Replace current template.

  3. Under Specify template:

    1. Select Amazon S3 URL.

    2. Copy the link of the latest template for the stack.

    3. Verify that the correct template URL shows in the Amazon S3 URL text box, and choose Next. Choose Next again.

  4. Under Parameters, review the parameters for the template and modify them as necessary. Refer to Step 2. Launch the Stack for details about the parameters.

  5. Choose Next.

  6. On the Configure stack options page, choose Next.

  7. On the Review page, review and confirm the settings. Check the box acknowledging that the template might create AWS Identity and Access Management (IAM) resources.

  8. Choose View change set and verify the changes.

  9. Choose Update stack to deploy the stack.

You can view the status of the stack in the AWS CloudFormation console in the Status column. You should see a status of UPDATE_COMPLETE in approximately 10 minutes, depending on the options chosen.
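If you prefer to monitor the update from the command line instead of the console, the stack status can be polled with boto3. The following is a minimal sketch, not part of the solution: it assumes AWS credentials are configured, and the stack name is a placeholder. The boto3 import is deferred so the status helper can be used on its own.

```python
# Sketch: poll a CloudFormation stack until it reaches a terminal state.
# Assumes boto3 and configured AWS credentials; the stack name is a placeholder.
import time

def is_terminal(status: str) -> bool:
    """CloudFormation statuses ending in IN_PROGRESS are still transitioning."""
    return not status.endswith("IN_PROGRESS")

def wait_for_stack(stack_name: str, poll_seconds: int = 30) -> str:
    import boto3  # deferred so is_terminal() works without AWS credentials
    cfn = boto3.client("cloudformation")
    while True:
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```

Statuses such as UPDATE_COMPLETE or UPDATE_ROLLBACK_COMPLETE are terminal; anything ending in IN_PROGRESS means the update is still running.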

Deployment overview

Use the following steps to deploy this solution on AWS. For detailed instructions, follow the links for each step.

Step 1. Launch the stack

  • Launch the main AWS CloudFormation template into your AWS account. Optionally, you can continue to use the demo template if you previously launched it.

  • Review the template parameters, and adjust if necessary.

Step 2. Create and upload your Forecast configuration file

  • Review the Forecast configuration file format.

  • Modify the configuration file’s default parameters for dataset group, datasets, predictor, and forecast.

Step 3. Experiment with Amazon Forecast

  • Generate and evaluate multiple predictors using different configurations.

Step 4. Visualize forecast output

  • Set up visualizations automatically using an Amazon QuickSight dashboard or a Jupyter Notebook instance.

Step 1. Launch the stack

This automated AWS CloudFormation template deploys the Improving Forecast Accuracy with Machine Learning solution in the AWS Cloud.

Note

You are responsible for the cost of the AWS services used while running this solution. For more details, view the Cost section in this guide, and refer to the pricing webpage for each AWS service used in this solution.

  1. Sign in to the AWS Management Console and select the button below to launch the improving-forecast-accuracy-with-machine-learning.template AWS CloudFormation template.

    
                Improving Forecast Accuracy with Machine Learning main stack launch button

    Alternatively, you can download the template as a starting point for your own implementation.

  2. The template launches in the US East (N. Virginia) Region by default. To launch the solution in a different AWS Region, use the Region selector in the console navigation bar.

Note

This solution uses the Amazon Forecast service, and when specified, Amazon SageMaker, which are not currently available in all AWS Regions. You must launch this solution in an AWS Region where Amazon Forecast is available. For the most current availability by Region, refer to the AWS Service Region Table.

  3. On the Create stack page, verify that the correct template URL is in the Amazon S3 URL text box and choose Next.

  4. On the Specify stack details page, assign a name to your solution stack. The stack name must be 20 characters or fewer – for example, “forecast-stack”.

  5. Under Parameters, review the parameters for this solution template and modify them as necessary. This solution uses the following default values.

  • Email (default: blank, optional) – The email that receives status notifications from the AWS Step Functions state machine. If this parameter is blank, you are not notified of your forecast results. To activate notifications, you must accept the Amazon SNS subscription through the email link sent after stack deployment.

  • Deploy QuickSight Dashboards (default: blank, optional) – To deploy Amazon QuickSight analyses automatically, enable Amazon QuickSight Enterprise Edition in your account and Region, then set this parameter to the ARN of the desired default owner of the Amazon QuickSight analyses that will be created, for example, arn:aws:quicksight:<region>:<account_id>:user/default/Admin/<username>. To find the ARN, use the following AWS CLI command:

    aws quicksight list-users --region us-east-1 --aws-account-id <your_account_id> --namespace default

  • Deploy Jupyter Notebook (default: No) – Whether to deploy an Amazon SageMaker Jupyter Notebook Instance. The notebook is not deployed by default.

  • Jupyter Notebook instance type (default: ml.t2.medium) – The Amazon SageMaker Jupyter Notebook Instance type. The default instance type might not be supported in all Regions.

  • Jupyter Notebook volume size (default: 10) – The Amazon SageMaker Notebook Instance volume size, in GB. This must be an integer between 5 (5 GB) and 16384 (16 TB). A value must be provided even if the notebook is not deployed.

  • KMS Key ARN used to encrypt Datasets and Predictors managed by Amazon Forecast (default: blank, optional) – Although Amazon Forecast encrypts your data by default, you can monitor and restrict access to your data by specifying a managed KMS key in your account that can be used by the Amazon Forecast service. Specifying an AWS KMS key ARN in this parameter allows Amazon Forecast to use that key to protect your data. Revoking, disabling, or modifying the key policy associated with this key may render your data unusable by the Amazon Forecast service. To use your own key, specify the full key ARN, for example, arn:aws:kms:<region>:<account_id>:key/f8fed2cd-14ab-4ac4-a8a3-57975cbff81b. Leave this parameter blank to have Amazon Forecast use its default encryption configuration for your data.

  • Demo/Forecast Deployment (default: No) – Whether to download and deploy the default NYC Taxi forecasting demonstration assets with the stack, or to use your own datasets and forecast configurations.

  • Forecast Name (default: nyctaxi_weather_auto) – The name of the forecast dataset group to create in the Amazon Forecast console.

  • Target Time Series URL (default: NYC taxi data URL) – The URL of the target time series to use. This field is optional.

  • Related Time Series URL (default: NYC taxi data URL) – The URL of the related time series to use. This field is optional.

  • Metadata URL (default: NYC taxi metadata URL) – The URL of the item metadata dataset to use. This field is optional.

  • Forecast default configuration file URL (default: NYC taxi forecast defaults URL) – The URL of the forecast-defaults.yaml file used to configure all forecasts generated by the stack.

  • CloudWatch Log Level (default: WARNING) – The verbosity of the logs output to CloudWatch Logs.
Note

The Amazon SageMaker Jupyter Notebook Instances deployed by this solution are internet-enabled and allow direct internet access. Though not recommended for production use, these notebooks are useful for development, testing, and collaboration.

For a more secure production configuration, deploy Amazon SageMaker Jupyter Notebook Instances using the security best practices outlined in the Connect a Notebook Instance to Resources in a VPC topic in the Amazon SageMaker Developer Guide.

  6. Choose Next.

  7. On the Configure stack options page, choose Next.

  8. On the Review page, review and confirm the settings. Check the box acknowledging that the template will create AWS Identity and Access Management (IAM) resources.

  9. Choose Create stack to deploy the stack.

    You can view the status of the stack in the AWS CloudFormation console in the Status column. You should receive a CREATE_COMPLETE status in approximately five minutes.
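The console steps above can also be scripted. The following sketch builds the CloudFormation parameter list and creates the stack with boto3. It is illustrative only: the template URL and stack name are placeholders, and any parameter keys you pass must match the logical parameter names in the template you downloaded, which are not listed in this guide.

```python
# Sketch: launch the solution stack with boto3 instead of the console.
# Template URL, stack name, and parameter keys are placeholders; check the
# downloaded template for the real logical parameter names.
def build_parameters(values: dict) -> list:
    """Convert a {key: value} dict into CloudFormation's parameter format."""
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]

def valid_stack_name(name: str) -> bool:
    """This guide requires 20 characters or fewer; CloudFormation also requires
    a leading letter and only alphanumeric characters and hyphens."""
    return (0 < len(name) <= 20 and name[0].isalpha()
            and all(c.isalnum() or c == "-" for c in name))

def launch(stack_name: str, template_url: str, values: dict):
    import boto3  # deferred so the helpers above run without AWS credentials
    assert valid_stack_name(stack_name)
    cfn = boto3.client("cloudformation")
    return cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=build_parameters(values),
        Capabilities=["CAPABILITY_IAM"],  # the template creates IAM resources
    )
```

Passing CAPABILITY_IAM corresponds to checking the IAM acknowledgment box in the console.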

Note

In addition to the primary AWS Lambda functions used by the AWS Step Functions workflow deployed by this solution, this solution includes the SolutionMetrics, RedeployLambdasFunction, BucketNameFunction, and UniqueNameFunction Lambda functions, which run only during initial configuration or when resources are updated or deleted.

When you run this solution, you will notice multiple Lambda functions in the AWS Management Console. Only the AWS Lambda functions used by the AWS Step Functions workflow are regularly active; however, do not delete the other functions, because they are necessary to manage associated resources.

Step 2. Create and upload your forecast configuration file

The forecast configuration file (forecast-defaults.yaml) must be present at the root of the forecast S3 bucket before you trigger the AWS Step Functions state machine by uploading your data. This is a manual task that must be performed before generating forecasts. Follow the procedure below to create the forecast configuration file.

In the output of your deployed CloudFormation stack from Step 1, take note of the ForecastBucketName value under the Outputs tab. This is the bucket where the configuration file must be uploaded. You can use the AWS Management Console to copy your forecast configuration file to the root of your ForecastBucket bucket.

The following is a minimal configuration file that you can use to configure the AWS Step Functions state machine:

Default:
  DatasetGroup:
    Domain: RETAIL
  Datasets:
    - Domain: RETAIL
      DatasetType: TARGET_TIME_SERIES
      DataFrequency: D
      TimestampFormat: yyyy-MM-dd
      Schema:
        Attributes:
          - AttributeName: item_id
            AttributeType: string
          - AttributeName: timestamp
            AttributeType: timestamp
          - AttributeName: demand
            AttributeType: float
  AutoPredictor:
    MaxAge: 604800
    ForecastHorizon: 72
    ForecastFrequency: D
  Forecast:
    ForecastTypes:
      - "0.10"
      - "0.50"
      - "0.90"

The following sections must be present in your configuration file:

  • Default – All files uploaded use the forecast defaults unless an override is provided. For more information about overriding defaults, refer to Overriding forecast defaults.

    • Default.DatasetGroup – Under the default key, you must specify your dataset group configuration. The only supported configuration key under DatasetGroup is Domain, which represents the supported dataset domain. This can be RETAIL, CUSTOM, INVENTORY_PLANNING, EC2_CAPACITY, WORK_FORCE, WEB_TRAFFIC, or METRICS.

  • Default.Datasets – Under the default key, you must either specify your dataset configurations, or reference datasets from another top-level configuration item. To specify your dataset configurations, configure this key as an array, where each array element must contain the following:

    • Domain (matching the domain of the Default.DatasetGroup).

    • DatasetType (TARGET_TIME_SERIES, RELATED_TIME_SERIES, or ITEM_METADATA).

    • DataFrequency (Y, M, W, D, 30min, 15min, or 1min; required only for target time series and related time series dataset types).

    • TimestampFormat (must be yyyy-MM-dd if the DataFrequency is Y, M, W, or D, and yyyy-MM-dd HH:mm:ss if the DataFrequency is 30min, 15min, or 1min; required only for target time series and related time series dataset types).

    • Schema (a schema attribute is required for every field in a dataset; the schema object contains an attributes object, which describes an array of SchemaAttribute objects, each containing AttributeName and AttributeType).

      To reference datasets from another top-level configuration item, that item must be configured with datasets (as above, containing Domain, DatasetType, DataFrequency and TimestampFormat), then this item contains the following key:

    • From (other dataset group name, for example, Default).

      If you want to use AWS managed weather data ingestion to improve your forecast accuracy, you can also specify GeoLocationFormat (as LAT_LONG) and TimeZone for your TARGET_TIME_SERIES dataset. Ensure that the schema has a geolocation dimension (with AttributeType geolocation) and values in LAT_LONG format (such as 45.4236_75.7009) for each item. If this is enabled, your predictor configuration must also specify the geolocation attribute as a forecast dimension, and any related time series dataset must include the geolocation dimension as well. For more information about this configuration, refer to Demo stack configuration.

  • Default.AutoPredictor OR Default.Predictor – Under the default key, you must specify your auto predictor or predictor configuration. If both are provided, the AutoPredictor configuration will be used. This can include all supported properties of the Amazon Forecast service for the CreatePredictor API call, and can also contain:

    • MaxAge (number of seconds between consecutive dataset import jobs where a new file upload should not generate or update a predictor). This defaults to 604800 seconds (one week), if not otherwise specified.

  • Default.Forecast – Under the default key, you must specify your forecast configuration. This can include all supported properties of the Amazon Forecast service for the CreateForecast API call, most commonly ForecastTypes.

All resources can contain Tags. To set default tags you can specify Default.Tags, and they will be inherited by all resources deployed by the solution. If you wish to override these defaults, they can be specified at a lower-level resource (for example, Default.AutoPredictor.Tags or YourDatasetGroup.Tags or YourDatasetGroup.AutoPredictor.Tags). To remove a default tag on a lower-level resource, include the tag Key and mark its state as Absent.
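A quick sanity check of the configuration before uploading it can save a failed state machine run. The following sketch validates the structural rules described above against a parsed configuration, represented here as a Python dict; in practice you would load forecast-defaults.yaml with a YAML parser such as PyYAML. The validate helper is illustrative and not part of the solution.

```python
# Sketch: check the structural rules above against a parsed forecast config.
DOMAINS = {"RETAIL", "CUSTOM", "INVENTORY_PLANNING", "EC2_CAPACITY",
           "WORK_FORCE", "WEB_TRAFFIC", "METRICS"}
DAILY_OR_COARSER = {"Y", "M", "W", "D"}  # require yyyy-MM-dd
SUB_DAILY = {"30min", "15min", "1min"}   # require yyyy-MM-dd HH:mm:ss

def validate(config: dict) -> list:
    """Return a list of problems found in the Default section (empty if OK)."""
    problems = []
    default = config.get("Default", {})
    if default.get("DatasetGroup", {}).get("Domain") not in DOMAINS:
        problems.append("Default.DatasetGroup.Domain missing or unsupported")
    for ds in default.get("Datasets", []):
        freq, fmt = ds.get("DataFrequency"), ds.get("TimestampFormat")
        if freq in DAILY_OR_COARSER and fmt != "yyyy-MM-dd":
            problems.append(f"TimestampFormat must be yyyy-MM-dd for {freq}")
        if freq in SUB_DAILY and fmt != "yyyy-MM-dd HH:mm:ss":
            problems.append(f"TimestampFormat must be yyyy-MM-dd HH:mm:ss for {freq}")
    if "AutoPredictor" not in default and "Predictor" not in default:
        problems.append("Default must contain AutoPredictor or Predictor")
    if "Forecast" not in default:
        problems.append("Default must contain Forecast")
    return problems

# The minimal configuration from this section, as a dict:
minimal = {"Default": {
    "DatasetGroup": {"Domain": "RETAIL"},
    "Datasets": [{"Domain": "RETAIL", "DatasetType": "TARGET_TIME_SERIES",
                  "DataFrequency": "D", "TimestampFormat": "yyyy-MM-dd"}],
    "AutoPredictor": {"MaxAge": 604800, "ForecastHorizon": 72,
                      "ForecastFrequency": "D"},
    "Forecast": {"ForecastTypes": ["0.10", "0.50", "0.90"]},
}}
```

Running validate on the minimal configuration returns no problems; removing the Forecast section, for example, is flagged.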

Step 3. Experiment with Amazon Forecast

The Improving Forecast Accuracy with Machine Learning solution makes it straightforward to generate and evaluate multiple predictors using different configurations by tuning a forecast configuration file and uploading the same dataset under different names. For example, during testing, you can evaluate the following different scenarios:

  • Scenario A: Try AutoML to automatically determine which algorithm generates the most accurate forecast for your time-series data.

  • Scenario B: Based on the AutoML results, select the winning algorithm and tune its hyperparameters based on your domain-specific knowledge, or by using automated hyperparameter optimization (HPO).

  • Scenario C: Based on the AutoML results, select the top-performing model and experiment with different predictor featurization configurations concurrently.

  • Scenario D: Generate an AutoPredictor, and compare its accuracy metrics to that of the predictors generated in the previous scenarios.

Refer to Forecast experimentation for a detailed walkthrough of automating Amazon Forecast experiments with the Improving Forecast Accuracy with Machine Learning solution. To improve forecast accuracy, ensure that you reference and follow the best practices listed in the Time Series Forecasting Principles with Amazon Forecast AWS technical guide.
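When comparing scenarios like those above, a small helper can rank predictors by an accuracy metric. The sketch below is illustrative only: the metric values shown are made up, and in practice they would come from the Amazon Forecast console or the GetAccuracyMetrics API. Lower values are better for wQL, RMSE, and WAPE.

```python
# Sketch: rank experiment results by an error metric (lower is better).
def best_predictor(results: dict, metric: str = "WAPE") -> str:
    """results maps a predictor name to a {metric name: value} dict."""
    return min(results, key=lambda name: results[name][metric])

# Hypothetical accuracy numbers for the four scenarios above:
experiments = {
    "A-AutoML":        {"WAPE": 0.231, "RMSE": 41.2},
    "B-HPO":           {"WAPE": 0.204, "RMSE": 38.9},
    "C-featurization": {"WAPE": 0.197, "RMSE": 37.5},
    "D-AutoPredictor": {"WAPE": 0.182, "RMSE": 35.1},
}
```

With these hypothetical numbers, the AutoPredictor scenario wins on both metrics; with real data the ranking can differ per metric, which is why evaluating more than one is worthwhile.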

Step 4. Visualize forecast output

After this solution successfully deploys, you can configure your datasets and visualizations using any visualization service. This implementation guide provides procedures for setting up visualizations using either an Amazon QuickSight analysis or a Jupyter Notebook instance.

Visualization with Amazon QuickSight

An Amazon QuickSight analysis can be created on a per-forecast basis to assist users with forecast output visualization across hierarchies and categories of forecasted items.

Consider the following scenarios:

  • You want to quickly visualize any probabilistic forecast for any item or aggregate set of items.

  • You want to look up and aggregate forecasted items across any of your forecast dimensions, for example, to view forecasts for all your sales on a per-store, per-region, or per-geography basis.

  • You want insight into the performance of certain types of items (for example, brands), or to compare the sales of two different brands, or to identify high and low performing items.

Amazon QuickSight analyses can combine forecast input and output visualizations that present data to address these scenarios, as well as other useful analyses. You can create each of the above visualizations in Amazon QuickSight by dragging and dropping data into the solution’s bundled QuickSight analysis.

As forecasts are generated, the raw dataset files, predictor backtest data, and forecast outputs are merged into a combined forecast output file containing both the forecast input and the forecast horizon. These files are output to the forecast data bucket in Amazon S3 in a format that can be ingested by Amazon QuickSight via Amazon Athena. To offer generic dashboarding capability, column names are mapped to generic names that can be dragged onto Amazon QuickSight charts. This Amazon Athena table can also be consumed by third-party visualization tools.

This solution provides the following mappings:

  • RETAIL – item_id (string), timestamp (date), demand (float or integer) map to identifier (string), timestamp (date), metric (float).

  • CUSTOM – item_id (string), timestamp (date), target_value (float or integer) map to identifier (string), timestamp (date), metric (float).

  • INVENTORY_PLANNING – item_id (string), timestamp (date), demand (integer) map to identifier (string), timestamp (date), metric (float).

  • EC2_CAPACITY – instance_type (string), timestamp (date), number_of_instances (integer) map to identifier (string), timestamp (date), metric (float).

  • WORK_FORCE – workforce_type (string), timestamp (date), workforce_demand (float) map to identifier (string), timestamp (date), metric (float).

  • WEB_TRAFFIC – item_id (string), timestamp (date), value (float) map to identifier (string), timestamp (date), metric (float).

  • METRICS – metric_name (string), timestamp (date), metric_value (float) map to identifier (string), timestamp (date), metric (float).
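Applied in code, these mappings amount to renaming the domain-specific columns to the generic names. The following sketch encodes the table above; the function and constant names are illustrative, not identifiers used by the solution.

```python
# Sketch: rename domain-specific columns to the generic QuickSight names.
# timestamp and any unmapped columns pass through unchanged.
FIELD_MAP = {
    "RETAIL":             {"item_id": "identifier", "demand": "metric"},
    "CUSTOM":             {"item_id": "identifier", "target_value": "metric"},
    "INVENTORY_PLANNING": {"item_id": "identifier", "demand": "metric"},
    "EC2_CAPACITY":       {"instance_type": "identifier",
                           "number_of_instances": "metric"},
    "WORK_FORCE":         {"workforce_type": "identifier",
                           "workforce_demand": "metric"},
    "WEB_TRAFFIC":        {"item_id": "identifier", "value": "metric"},
    "METRICS":            {"metric_name": "identifier",
                           "metric_value": "metric"},
}

def to_generic(row: dict, domain: str) -> dict:
    """Rename a record's keys using the mapping for its dataset domain."""
    mapping = FIELD_MAP[domain]
    return {mapping.get(k, k): v for k, v in row.items()}
```

Because every domain maps onto the same identifier/timestamp/metric triple, one QuickSight analysis can serve datasets from any domain.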

In addition to these mappings, arbitrary forecast dimensions and metadata can be added to augment the analysis:

  • Up to 10 forecast dimensions (mapped as forecast_dimension_1 … forecast_dimension_10)

  • Up to 10 forecast metadata attributes (mapped as forecast_metadata_1 … forecast_metadata_10)

Predictor backtest export data is also added to the analysis, including:

  • Predictor backtest quantiles (backtest_p1 … backtest_p99). Only the calculated quantiles have values; the rest are null.

Predictor backtest accuracy metric values (wQL, RMSE, and WAPE) are not currently added to the export data.

Forecast output through the forecast horizon is also added to the analysis, including:

  • Forecast quantiles (p1 … p99). Only the calculated quantiles have values; the rest are null.

Note

The Amazon Athena table representing the combined forecast output is limited in size to the maximum cumulative size of all files in your Amazon S3 bucket (currently 30 GB for Amazon Forecast datasets). The solution automatically partitions data on the month_starting string value.

This partitioning strategy effectively limits the amount of data per partition to between approximately 720 records and 43,200 records, which reduces the amount of data that needs to be queried when using direct queries.

To use the partition, filter your queries using a WHERE clause on month_starting.
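A partition-aware query can be built as follows. This sketch is illustrative: the table and column names are assumptions based on the generic mappings described above, and in production code you should bind values as query parameters rather than interpolating strings.

```python
# Sketch: build an Athena query that prunes partitions via month_starting.
# Table and column names are assumptions, not the solution's actual schema.
def monthly_query(table: str, month_starting: str, identifier: str) -> str:
    return (
        f"SELECT timestamp, identifier, metric, p10, p50, p90 "
        f"FROM {table} "
        f"WHERE month_starting = '{month_starting}' "
        f"AND identifier = '{identifier}'"
    )
```

Because the WHERE clause filters on the partition column, Athena scans only the matching month's data instead of the whole table.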

Further, the solution formats the table as Parquet to minimize the cost of direct queries from Athena. Performance can also be improved by importing the data into SPICE, which avoids direct queries entirely.

After the Amazon Athena table representing the combined forecast output file has been created, a QuickSight analysis is cloned into your account matching the name of the forecast export. This analysis can be customized using drag-and-drop controls to suit your needs, then published and shared as a dashboard within your organization.

Use the following procedure to customize your visualization.

Example 1 Instructions
I want to be able to quickly visualize any probabilistic forecast for any identifier or set of items as well as some historical demand.
  • Drag the desired probabilistic forecasts to the Forecast Horizon line chart.

  • Items can be filtered out based on the Amazon QuickSight controls for dimensions and metadata – for instance, restricting to only items sold matching a specific dimension (e.g., location).

  • In Figure 4, note that the field wells for the Forecast Horizon line chart contain the three quantiles p10, p50, and p90. Select the output quantiles of your forecast to visualize in this graph.

  • To restrict the date range, adjust the Date Filter control.


        

Figure 4: Example forecast to quickly visualize any probabilistic forecast for any identifier or set of items as well as some historical demand.

Example 2 Instructions
I want to be able to look up and aggregate items across my forecast dimensions. Specifically, I want to view forecasts for all items on a per-dimension basis.
  • To filter the data, use the multiselect dropdown menus. In this case, the data is filtered based on Forecast Dimension 1, which corresponds in this forecast to store location; the Kanata location is selected to display only data for that location.


        

Figure 5: Example forecast to look up and aggregate items across forecast dimensions.

Example 3 Instructions
I want better insight into the performance of certain items matching different metadata attributes (for example, different brands).
  • Consuming Amazon Forecast metadata fields in QuickSight allows you to filter further based on metadata attributes, such as category, brand, or genre.

  • To filter the displayed data, use the controls to filter based on the metadata fields provided to the visualization. In Figure 6, the data is further filtered to include only sales of items matching ‘brand x’.


        

Figure 6: Example forecast for better insight into the performance of certain items matching different metadata attributes.

Example 4 Instructions
I want to be able to visualize predictor accuracy on the training data and combine it with the forecast data to assess if my forecast is likely to be accurate.
  • Drag the desired backtest output to the Historical Data line chart.

  • Items can be filtered out based on the Amazon QuickSight controls for dimensions and metadata – for instance, restricting to only items sold matching a specific dimension (e.g., location).

  • In Figure 7, note that the field wells for the Historical Data line chart contain the input data and forecast backtest exports for three quantiles: p10, p50, and p90. Select the output quantiles of your forecast to visualize in this graph.

    To restrict the date range, adjust the Date Filter control.


        

Figure 7: Example forecast to visualize predictor accuracy on the training data and combine it with the forecast data.

Visualization with a Jupyter Notebook

The optional Jupyter Notebook included with the solution can be used to visualize forecast output and related time series to help you select features for your related time series datasets.

The graph in Figure 8 displays item demand for a synthetic dataset that is included with the solution’s source code. Two predictors have been generated, one using Prophet and the other using DeepAR+. Under the demand and forecast, a related time series (item price) is displayed.


        

Figure 8: Item demand vs price

Comparing the results shows similarities between the forecasts generated using the Prophet algorithm and the DeepAR+ algorithm. These results show that p1, p50, and p99 forecasts were generated with Prophet, and p10, p50, and p90 forecasts were generated with DeepAR+. The similarities in output (both across models and across probabilistic forecasts) suggest that using the p50 forecast for this dataset is sufficient if the business impact of over- or under-forecasting is negligible.

You can use the following experiments to identify features that improve forecast accuracy.

Upload several copies of your dataset(s) to the solution-managed S3 bucket with different related time series datasets (for example, millimeters of rain that fell during the time interval, or the number of ongoing sporting events that might have influenced demand during that interval). The solution generates predictors based on your configuration. Observe the predictor accuracy in the console and graph the results as shown in Figure 8.
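As a concrete example of preparing such a related time series, the sketch below writes a daily rainfall series with the standard library's csv module. The item_id, timestamp, and rainfall_mm columns and values are illustrative; match the column order to your related time series schema, and note that Amazon Forecast dataset CSVs do not use a header row.

```python
# Sketch: write a related time series CSV (item_id, timestamp, rainfall_mm)
# for upload alongside the target time series. Values are made up; Amazon
# Forecast expects no header row in dataset CSV files.
import csv
import io
from datetime import date, timedelta

def write_related_series(rows, out):
    writer = csv.writer(out)
    for item_id, day, rainfall_mm in rows:
        writer.writerow([item_id, day.isoformat(), rainfall_mm])

start = date(2020, 7, 1)
rows = [("sku-1", start + timedelta(days=i), round(1.5 * i, 1)) for i in range(3)]
buffer = io.StringIO()  # in practice, open a file and upload it to S3
write_related_series(rows, buffer)
```

Each copy of the dataset you upload under a different name can carry a different candidate feature, letting the solution build one predictor per variant for comparison.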

Use the following procedure to open the Jupyter notebook.

  1. Sign in to the AWS Management Console.

  2. Navigate to the Amazon SageMaker service.

  3. Find your notebook instance and select Open JupyterLab.

  4. A new browser tab opens. Navigate to the improving-forecast-accuracy-with-machine-learning/v1.1.0/notebooks folder, and select SampleVisualization.ipynb.

  5. In the first cell of the notebook, configure your start date, end date, and paths to your time series data, related timeseries data, and forecast exports.

    # Set the start and end dates of the forecast
    start_date = '2014-09-01'  # YYYY-MM-DD
    end_date = '2015-02-01'    # YYYY-MM-DD

    # provide the full CSV name uploaded to the /train folder in S3
    demand_dataset_name = 'RetailDemandTRMProphet.csv'
    related_dataset_name = 'RetailDemandTRMProphet.related.csv'

    # provide the exports to show from the export/ folder in S3
    # (these are created by Amazon Forecast)
    forecast_exports = [
        {
            'path': 'export_2020_07_04_17_30_13/export_2020_07_04_17_30_13_2020-07-06T19-22-29Z_part0.csv',
            'name': 'Forecast 1 - Prophet'
        },
        {
            'path': 'export_2020_07_02_15_19_45/export_2020_07_02_15_19_45_2020-07-02T17-12-57Z_part0.csv',
            'name': 'Forecast 2 - DeepAR+'
        }
    ]
  6. In the Run menu, select Run All Cells. If your data is not aggregated by date, modify the date formatter for the graph axis.

Experiment and develop a forecast model for your dataset. When new demand data is available, follow the procedure in Forecast experimentation to update your datasets and generate a new predictor, then compare predictor accuracy through similar visualizations.