AWS DeepRacer
Developer Guide

Understand Reinforcement Learning in AWS DeepRacer

AWS DeepRacer uses reinforcement learning to enable the AWS DeepRacer 1/18th scale vehicle to drive autonomously. To achieve this, you train and evaluate a reinforcement learning model in a virtual environment with a simulated track. After the training, you upload the trained model artifacts to your AWS DeepRacer vehicle. You can then set the vehicle for autonomous driving in a physical environment with a real track.

Training a reinforcement learning model can be challenging, especially if you're new to the field. AWS DeepRacer simplifies the process by integrating required components together and providing easy-to-follow wizard-like task templates. However, it's helpful to have a good understanding of the basics of reinforcement learning training implemented in AWS DeepRacer.

Overview of Reinforcement Learning

In reinforcement learning, an agent with an objective interacts with an environment to maximize the agent's total reward. The agent takes an action, guided by a strategy referred to as a policy, at a given environment state and reaches a new state. There is an immediate reward associated with any action. The reward is a measure of the desirability of the action. This immediate reward is considered to be returned by the environment.

The goal of reinforcement learning in AWS DeepRacer is to learn the optimal policy in a given environment. Learning is an iterative process of trial and error. The agent takes a random initial action to arrive at a new state, then repeats the process, moving from each new state to the next. Over time, the agent discovers actions that lead to the maximum long-term reward. The interaction of the agent from an initial state to a terminal state is called an episode.

The following sketch illustrates this learning process:


                Image: An overview of reinforcement learning.

The agent embodies a neural network that represents a function approximating the agent's policy. The image from the vehicle's front camera is the environment state, and the agent's action is defined by its speed and steering angle.

The agent receives positive rewards if it stays on-track to finish the race and negative rewards for going off-track. An episode starts with the agent somewhere on the race track and finishes when the agent either goes off-track or completes a lap.

Note

Strictly speaking, the environment state refers to everything relevant to the problem, such as the vehicle's position on the track as well as the shape of the track itself. The image fed through the camera mounted on the vehicle's front does not capture the entire environment state. Hence, the environment is deemed partially observed and the input to the agent is referred to as an observation rather than a state. For simplicity, we use state and observation interchangeably throughout this documentation.

Training the agent in a simulated environment has the following advantages:

  • The simulation can estimate how much progress the agent has made and identify when it goes off the track to compute a reward.

  • The simulation relieves the trainer of the tedious chore of resetting the vehicle each time it goes off the track, as must be done in a physical environment.

  • The simulation can speed up training.

  • The simulation provides better control over environment conditions, such as selecting different tracks, backgrounds, and vehicle conditions.

The alternative to reinforcement learning is supervised learning, also referred to as imitation learning. Here, a known dataset of [image, action] tuples collected from a given environment is used to train the agent. Models trained through imitation learning can be applied to autonomous driving, but they work well only when the images from the camera look similar to the images in the training dataset. For robust driving, the training dataset must be comprehensive. In contrast, reinforcement learning does not require such extensive labeling efforts and can be trained entirely in simulation. Because reinforcement learning starts with random actions, the agent learns under a wide variety of environment and track conditions, which makes the trained model robust.

AWS DeepRacer Action Space and Reward Function

For autonomous driving, the AWS DeepRacer vehicle receives input images streamed at 15 frames per second from the front camera. The raw input is downsized to 160x120 pixels and converted to grayscale.
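
As a rough illustration of this preprocessing, the following Python sketch downsizes a frame to 160x120 pixels and converts it to grayscale using the Pillow library; the actual on-vehicle pipeline may differ.

    from PIL import Image
    import numpy as np

    def preprocess_frame(frame_path):
        """Downsize a camera frame to 160x120 and convert it to grayscale,
        mirroring the input format described above."""
        image = Image.open(frame_path)
        image = image.resize((160, 120))   # width x height in pixels
        image = image.convert("L")         # 8-bit grayscale
        return np.asarray(image) / 255.0   # scale pixel values to [0, 1]

    # observation = preprocess_frame("camera_frame.jpg")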

Responding to an input observation, the vehicle reacts with a well-defined action of a specific speed and steering angle. The actions are converted to low-level motor controls. The possible actions a vehicle can take are defined by an action space with dimensions of speed and steering angle. An action space can be discrete or continuous. AWS DeepRacer uses a discrete action space.

For a discrete action space of finitely many actions, the range is defined by the maximum speed and the absolute value of the maximum steering angle. The granularities define the number of speeds and steering angles available to the agent.

For example, the AWS DeepRacer default action space has the following actions you can use to train an AWS DeepRacer model.

The default AWS DeepRacer action space

Action number    Steering       Speed
0                -30 degrees    0.4 m/s
1                -30 degrees    0.8 m/s
2                -15 degrees    0.4 m/s
3                -15 degrees    0.8 m/s
4                0 degrees      0.4 m/s
5                0 degrees      0.8 m/s
6                15 degrees     0.4 m/s
7                15 degrees     0.8 m/s
8                30 degrees     0.4 m/s
9                30 degrees     0.8 m/s

This default action space is characterized by the following ranges and granularities:

The default action space characteristics

Property                      Value         Description
Maximum steering angle        30 degrees    The largest absolute steering angle, in either direction
Steering angle granularity    5             The number of steering angles spread evenly between -30 and 30 degrees
Maximum speed                 0.8 m/s       The highest speed the agent can choose
Speed granularity             2             The number of speed levels up to the maximum speed
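
To make the relationship between these ranges, granularities, and the resulting discrete actions concrete, the following sketch enumerates the default action space. The function is illustrative only and is not part of the AWS DeepRacer API.

    def build_action_space(max_steering=30.0, steering_granularity=5,
                           max_speed=0.8, speed_granularity=2):
        """Enumerate a discrete action space from its ranges and granularities."""
        # Steering angles are spread evenly from -max_steering to +max_steering.
        step = 2 * max_steering / (steering_granularity - 1)
        steering_angles = [-max_steering + i * step for i in range(steering_granularity)]
        # Speeds are evenly spaced fractions of the maximum speed.
        speeds = [max_speed * (i + 1) / speed_granularity for i in range(speed_granularity)]
        return [(angle, speed) for angle in steering_angles for speed in speeds]

    # With the default values, this reproduces the 10 actions in the table above:
    # [(-30.0, 0.4), (-30.0, 0.8), (-15.0, 0.4), ..., (30.0, 0.4), (30.0, 0.8)]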

To influence behavior, we can explore a reward function that assigns immediate rewards to the actions in this action space. For example, AWS DeepRacer provides a basic reward function by default to encourage the agent to stay as close to the center line as possible, so that the agent avoids steering close to the edge of the track and going off the track with even a slight turn. For details of this default reward function in the AWS DeepRacer console, see AWS DeepRacer reward function example.
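
A reward function of this follow-the-center-line type looks roughly like the following sketch, which uses the track_width and distance_from_center parameters passed to the reward function; see the console's reward function example for the exact default implementation.

    def reward_function(params):
        '''Reward the agent for staying close to the center line (a sketch of
        the default center-line example; see the console for the exact version).'''
        track_width = params['track_width']
        distance_from_center = params['distance_from_center']

        # Three bands around the center line, from narrow to wide.
        marker_1 = 0.1 * track_width
        marker_2 = 0.25 * track_width
        marker_3 = 0.5 * track_width

        if distance_from_center <= marker_1:
            reward = 1.0      # very close to the center line
        elif distance_from_center <= marker_2:
            reward = 0.5
        elif distance_from_center <= marker_3:
            reward = 0.1
        else:
            reward = 1e-3     # likely off track or close to the edge

        return float(reward)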

In addition to the default action space, you can also explore a custom action space and a custom reward function to train your models.

You can apply a trained model to an AWS DeepRacer vehicle by mapping the maximum speed (0.8 m/s) and maximum steering angle (30 degrees) used in training to the corresponding maximum physical values. This is called vehicle calibration.

AWS DeepRacer Training Algorithm

AWS DeepRacer uses the Proximal Policy Optimization (PPO) algorithm to train the reinforcement learning model. PPO uses two neural networks during training: a policy network and a value network. The policy network (also called actor network) decides which action to take given an image as input. The value network (also called critic network) estimates the cumulative reward we are likely to get given the image as input. Only the policy network interacts with the simulator and gets deployed to the real agent, namely an AWS DeepRacer vehicle.


                Image: Graphical illustration of PPO.

Below we explain how the actor and critic work together mathematically.

PPO is a derivative of the policy gradient method. In its most basic form, the policy gradient method trains the agent to move along a track by searching for the optimal policy π*(a|s; θ*). The optimization aims at maximizing a policy score function J(θ) that can be expressed in terms of the immediate reward r(s,a) of taking action a in state s, averaged over the state probability distribution ρ(s) and the action probability distribution π(a|s; θ):


                Image: Policy score function
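
In conventional policy-gradient notation, and consistent with the definitions above, this score function can be written as follows (the notation in the image may differ slightly):

    J(\theta) = \mathbb{E}_{s \sim \rho(s),\, a \sim \pi(a \mid s; \theta)} \left[ r(s, a) \right]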

The optimal policy, as represented by the optimal policy network weights θ*, can be expressed as follows:


                Image: Optimal policy
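
That is, the optimal weights are those that maximize the score function:

    \theta^{*} = \arg\max_{\theta} J(\theta)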

The maximization can proceed by following the policy gradient ascent over episodes of training data (s, a, r):


                Image: Policy gradient ascent update

where α is known as the learning rate and ∇θJ(θτ) is the policy gradient with respect to θ evaluated at step τ.
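
A standard way to write this gradient-ascent update, matching the description above, is:

    \theta_{\tau + 1} = \theta_{\tau} + \alpha \, \nabla_{\theta} J(\theta_{\tau})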

In terms of the total future reward:


                Image: Total future reward

where γ is the discount factor ranging between 0 and 1, τ maps to an experience (sτ, aτ, rτ) at step τ, and the summation includes the experiences in an episode that starts at time t = 0 and ends at time t = H, when the agent goes off-track or reaches the finish line. With this definition, the score function becomes the expected total future reward averaged over the policy distribution π across many episodes of experiences:


                Image: Policy score function over experiences
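
A conventional way to write the total future reward and the resulting score function, consistent with the description above, is:

    R = \sum_{\tau = 0}^{H} \gamma^{\tau} r(s_{\tau}, a_{\tau}), \qquad
    J(\theta) = \mathbb{E}_{\pi} \left[ \sum_{\tau = 0}^{H} \gamma^{\tau} r(s_{\tau}, a_{\tau}) \right]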

From this definition of J(θ), the policy weight updates can be expressed as follows:


                Image: Policy gradient over experiences.

Here, averaging over π is approximated by averaging over samples from N episodes, each of which may consist of a different number H of experiences.

The update rule for a policy network weight then becomes:


                Image: policy weight update over experiences.
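
In the standard REINFORCE-style formulation, which the two preceding expressions plausibly follow, the sample-averaged gradient and the corresponding weight update are:

    \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H_{i}}
        \nabla_{\theta} \log \pi(a_{i,t} \mid s_{i,t}; \theta) \, R(s_{i,t}, a_{i,t}),
    \qquad
    \theta \leftarrow \theta + \alpha \, \nabla_{\theta} J(\theta)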

The policy gradient method outlined above is of limited utility in practice, because the score function R(si,t, ai,t) has high variance, as the agent can take many different paths from a given state. To get around this, one uses a critic network (ϕ) to estimate the score function. To illustrate this, let Vϕ(s) be the value estimated by the critic network, where s describes a state and ϕ denotes the value network weights. To train this value network, the target value yi,t of state si,t at step t in episode i is taken to be the immediate reward of taking action ai,t at state si,t plus the discounted total future value of the state at the next step t+1 in the same episode:


                Image: Estimated value network

The loss function for the value network weights is:


                Image: Value network loss function
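
Written out, the target value and a squared-error loss of this kind take the following form (the exact loss used in AWS DeepRacer may include averaging or additional terms):

    y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \, V_{\phi}(s_{i,t+1}),
    \qquad
    L(\phi) = \sum_{i,t} \left( y_{i,t} - V_{\phi}(s_{i,t}) \right)^{2}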

Using the estimated values, the policy gradient for updating the policy network weights θ becomes:


                Image: Actor-Critic policy gradient
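
A common way to write this actor-critic gradient, with the advantage yi,t − Vϕ(si,t) replacing the raw return, is:

    \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H_{i}}
        \nabla_{\theta} \log \pi(a_{i,t} \mid s_{i,t}; \theta)
        \left( y_{i,t} - V_{\phi}(s_{i,t}) \right)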

This formulation changes the policy network weights so that actions yielding higher rewards than the prior estimate are encouraged, and actions yielding lower rewards are discouraged.

Every reinforcement learning algorithm needs to balance exploration and exploitation. The agent needs to explore the state and action space to learn which actions lead to high rewards in unexplored parts of the state space. The agent should also exploit by taking the actions that lead to high rewards so that the model converges to a stable solution. In our algorithm, the policy network outputs the probability of taking each action, and during training the action is chosen by sampling from this probability distribution (for example, an action with probability 0.5 is chosen half the time). During evaluation, the agent picks the action with the highest probability.
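
As a concrete illustration of this sampling-versus-greedy behavior, the following Python sketch picks an action both ways. The probability values are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical action probabilities output by the policy network
    # for the 10 actions in the default action space.
    action_probs = np.array([0.02, 0.03, 0.05, 0.10, 0.25,
                             0.30, 0.10, 0.08, 0.04, 0.03])

    # Training: explore by sampling from the probability distribution.
    training_action = rng.choice(len(action_probs), p=action_probs)

    # Evaluation: exploit by picking the most probable action.
    evaluation_action = int(np.argmax(action_probs))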

In addition to the above actor-critic framework, PPO uses importance sampling with clipping, adds Gauss-Markov noise to encourage exploration, and uses generalized advantage estimation. To learn more, see the original paper.
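
For reference, the clipped surrogate objective from the PPO paper has the following form, where rt(θ) is the probability ratio between the new and old policies, Ât is the estimated advantage, and ε is the clipping parameter:

    L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{t} \left[
        \min \left( r_{t}(\theta) \hat{A}_{t},\;
        \mathrm{clip}\left( r_{t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_{t} \right)
    \right],
    \qquad
    r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t} \mid s_{t})}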

AWS DeepRacer Service Architecture

The AWS DeepRacer service is built upon Amazon SageMaker, AWS RoboMaker and other AWS services such as Amazon S3.

Amazon SageMaker is an AWS machine learning platform for training machine learning models in general. AWS DeepRacer uses it to train reinforcement learning models in particular. AWS RoboMaker is a cloud service for developing, testing, and deploying robotic solutions in general. AWS DeepRacer uses it to create the virtual agent and its interactive environment. Amazon S3 is an economical general-purpose cloud storage solution. AWS DeepRacer uses it to store trained model artifacts. In addition, AWS DeepRacer uses Redis, an in-memory database, as an experience buffer from which training data is selected for training the policy neural network.

Within the AWS DeepRacer architecture, AWS RoboMaker creates a simulated environment for the agent to drive along a specified track. The agent moves according to the policy network model that has been trained up to a certain time in Amazon SageMaker. Each run, known as an episode, starts from the starting line and ends when the agent either crosses the finish line or goes off the track. For each episode, the course is divided into segments of a fixed number of steps. In each segment, experiences, defined as ordered lists of (state, action, reward, new state) tuples associated with individual steps, are cached in Redis as an experience buffer. Amazon SageMaker then randomly draws training data in batches from the experience buffer and feeds it to the neural network to update the weights. It then stores the updated model in Amazon S3 for AWS RoboMaker to use to generate more experiences. The cycle continues until training stops.

Before the first model is trained, Amazon SageMaker initializes the experience buffer with random actions.
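
The toy loop below is a rough, self-contained sketch of this cycle: it rolls out the current policy to generate experiences, caches them in a buffer, and draws random batches to update a single policy parameter. The environment, policy, and update step are simple stand-ins, not the actual RoboMaker simulation, Redis buffer, or SageMaker PPO update.

    import random
    from collections import deque

    def run_episode(steer_gain, max_steps=50):
        """Roll out the current toy policy and return (state, action, reward,
        new_state) tuples, playing the role of the simulator."""
        state, experiences = 0.0, []
        for _ in range(max_steps):
            action = -steer_gain * state                    # steer toward the center
            new_state = state + random.uniform(-0.3, 0.3) + 0.5 * action
            reward = 1.0 if abs(new_state) < 1.0 else -1.0  # going off track is penalized
            experiences.append((state, action, reward, new_state))
            if reward < 0:                                  # the episode ends off track
                break
            state = new_state
        return experiences

    buffer = deque(maxlen=2000)   # stands in for the Redis experience buffer
    steer_gain = 0.1              # stands in for the policy network weights

    for iteration in range(20):
        buffer.extend(run_episode(steer_gain))                      # generate experiences
        batch = random.sample(list(buffer), min(32, len(buffer)))   # draw a random batch
        avg_reward = sum(r for _, _, r, _ in batch) / len(batch)
        # A real update would apply PPO to the network weights; here we simply
        # increase the steering gain while the average batch reward stays low.
        if avg_reward < 0.9:
            steer_gain = min(steer_gain + 0.05, 1.0)
        # The updated model would then be stored in Amazon S3 for the simulator to reuse.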

The following diagram illustrates this architecture.


                Image: AWS DeepRacer architecture.

This setup allows running multiple simulations to train a model on multiple segments of a single track at the same time or to train the model for multiple tracks simultaneously.

AWS DeepRacer Solution Workflow

Training an AWS DeepRacer model involves the following general tasks:

  1. The AWS DeepRacer service initializes the simulation with a virtual track, an agent representing the vehicle, and the background. The agent embodies a policy neural network that can be tuned with hyperparameters as defined in the PPO algorithm (for an illustrative set of such hyperparameters, see the sketch after this list).

  2. The agent acts (as specified with a steering angle and a speed) based on a given state (represented by an image from the front camera).

  3. The simulated environment updates the agent's position based on the agent action and returns a reward and an updated camera image. The experiences collected in the form of state, action, reward, and new state are used to update the neural network periodically. The updated network models are used to create more experiences.

  4. You can monitor the training in progress along the simulated track with a first-person view as seen by the agent. You can display metrics such as rewards per episode, the loss function value, and the entropy of the policy. CPU or memory utilization can also be displayed as training progresses. In addition, detailed logs are recorded for analysis and debugging.

  5. The AWS DeepRacer service periodically saves the neural network model to persistent storage.

  6. The training stops based on a time limit.

  7. You can evaluate the trained model in a simulator. To do this, submit the trained model for time trials over a selected number of runs on a selected track.
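
As referenced in step 1, the following dictionary sketches the kind of hyperparameters exposed for PPO training. The names and values are illustrative examples, not an authoritative list of the console's settings.

    # Illustrative PPO training hyperparameters (names and values are examples
    # only; check the AWS DeepRacer console for the actual settings and defaults).
    hyperparameters = {
        "batch_size": 64,                     # gradient descent batch size
        "num_epochs": 10,                     # passes over each set of experiences
        "learning_rate": 0.0003,              # step size for gradient ascent
        "entropy": 0.01,                      # entropy bonus to encourage exploration
        "discount_factor": 0.999,             # gamma applied to future rewards
        "loss_type": "huber",                 # loss used for the policy update
        "num_episodes_between_training": 20,  # experiences gathered per policy update
    }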

After the model is successfully trained and evaluated, it can be uploaded to a physical agent (an AWS DeepRacer vehicle). The process involves the following steps:

  1. Download the trained model from its persistent storage (an Amazon S3 bucket).

  2. Use the vehicle's device control console to upload the trained model to the vehicle. Use the console to calibrate the vehicle for mapping the simulated action space to the physical action space. You can also use the console to check the throttling parity, view the front camera feed, load a model into the inference engine, and watch the vehicle driving on a real track.

    The vehicle's device control console is a web server hosted on the vehicle's compute module. The console is accessible at the vehicle's IP address over a connected Wi-Fi network, using a web browser on a computer or a mobile device.

  3. Experiment with the vehicle driving under different lighting, battery levels, and surface textures and colors.

    The vehicle's performance in a physical environment may not match the performance in a simulated environment due to model limitations or insufficient training. The phenomenon is referred to as the sim2real performance gap. To reduce the gap, see Simulated-to-Real Performance Gaps.

Simulated-to-Real Performance Gaps

Because the simulation cannot capture all aspects of the real world accurately, the models trained in simulation may not work well in the real world. Such discrepancies are often referred to as simulated-to-real (sim2real) performance gaps.

Efforts have been made in AWS DeepRacer to minimize the sim2real performance gap. For example, the simulated agent is programmed to take about 10 actions per second. This matches the frequency the AWS DeepRacer vehicle runs inference with, about 10 inferences per second. As another example, at the start of each episode in training, the agent's position is randomized. This maximizes the likelihood that the agent learns all parts of the track evenly.

To help reduce the sim2real performance gap, make sure to use the same or similar color, shape, and dimensions for both the simulated and real tracks. To reduce visual distractions, use barricades around the real track. Also, carefully calibrate the ranges of the vehicle's speed and steering angles so that the action space used in training matches the real world. Evaluating model performance on a simulation track different from the one used in training can show the extent of the sim2real performance gap.

For more information about how to reduce the sim2real gap when training an AWS DeepRacer model, see Optimize Training AWS DeepRacer Models for Real Environments.