
Systematically Tune Hyperparameters for Optimal Training Performance

The following table shows the hyperparameters that you can adjust to tune training performance with a supported algorithm:

Algorithm-specific hyperparameters and their effects

Algorithm Hyperparameters Description
PPO

Batch size

The number of images used for each gradient descent update. A larger Batch size value leads to more stable updates, but slower training.

The batch size determines the number of steps, randomly sampled from the most recent experience, that will be used to train the deep neural network. It corresponds to the amount of experience used to update the network.

Required

Yes

Valid values

Positive integer of (32, 64, 128, 256, 512)

Default value

32
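
To illustrate how the batch size is used, the following minimal sketch (not DeepRacer's actual training code; the experience buffer contents are placeholders) samples a batch of steps at random from the most recent experience for one gradient descent update:

import random

def sample_batch(experience_buffer, batch_size=32):
    # experience_buffer is assumed to be a list of (state, action, reward)
    # tuples collected since the last update. A larger batch gives a smoother
    # gradient estimate, but each update takes longer to compute.
    return random.sample(experience_buffer, k=min(batch_size, len(experience_buffer)))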

PPO

Number of epochs

The number of passes through the experience buffer during gradient descent. A smaller Number of epochs value promotes more stable updates, but slower training. A larger Number of epochs value is acceptable when the batch size is large.

This is the number of times that the algorithm runs through the batches and updates the network weights each time training is performed.

Required

No

Valid values

Positive integer between [3 - 10]

Default value

3
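
As a rough sketch of how the number of epochs is applied (the update_weights callback is a placeholder, not a DeepRacer API), each epoch is one full pass over the experience buffer, split into batches:

import random

def train_on_buffer(experience_buffer, update_weights, batch_size=32, num_epochs=3):
    # Each epoch is one pass over the whole buffer; each batch triggers
    # one gradient descent update of the network weights.
    for _ in range(num_epochs):
        random.shuffle(experience_buffer)
        for start in range(0, len(experience_buffer), batch_size):
            batch = experience_buffer[start:start + batch_size]
            update_weights(batch)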

PPO

Learning rate

Controls how much each gradient descent update contributes to the network weights. A larger value can increase the training speed, but may cause the expected rewards not to converge if it is too large.

Required

No

Valid values

Real number between [0.00001 - 0.001]

Default value

0.001
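
The role of the learning rate can be seen in a generic gradient descent step (an illustration under simplified assumptions, not DeepRacer's optimizer):

import numpy as np

def gradient_descent_step(weights, gradient, learning_rate=0.001):
    # The learning rate scales how far the weights move against the gradient.
    # Too large a step can overshoot and keep the expected reward from
    # converging; too small a step makes training slow.
    return np.asarray(weights) - learning_rate * np.asarray(gradient)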

PPO

Exploration

A method used to manage the trade-off between exploitation and exploration, which determines how much the training follows the model's current predictions (exploitation) and how much it acts at random (exploration). Depending on the type of action space, you can use the categorical exploration (CategoricalParameters) for a discrete action space and the epsilon exploration (EpsilonGreedy) for a continuous action space:

  • The categorical exploration takes an action according to the probability distribution over the action space produced by the policy network.

  • The epsilon exploration takes a random action with probability epsilon. The initial epsilon value is 1 and is then linearly decreased to 0.1 over X steps, where X is typically between 10,000 and 100,000.

As the training progresses, exploration helps prevent the neural network from being trapped in parts of the action space (local maxima). As a result, the neural network learns which actions to take with increasing certainty and confidence.

Required

No

Valid values

String literal of CategoricalParameters or EpsilonGreedy

Default value

EpsilonGreedy
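
The two strategies can be sketched as follows (a conceptual illustration with placeholder function names, not the DeepRacer implementation):

import numpy as np

def categorical_action(action_probs):
    # Sample an action index according to the policy network's distribution.
    return np.random.choice(len(action_probs), p=action_probs)

def epsilon_greedy_action(best_action, num_actions, step, decay_steps=10_000):
    # Epsilon starts at 1.0 and decays linearly to 0.1 over decay_steps steps.
    epsilon = max(0.1, 1.0 - 0.9 * (step / decay_steps))
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)  # explore: random action
    return best_action                         # exploit: follow the model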

PPO

Entropy

The degree of randomness added to the probability distribution from which the agent chooses actions. The larger the entropy, the more random the actions that the agent takes for exploration.

Required

No

Valid values

Real number between [0.0001 - 0.01]

Default value

0.5
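
Conceptually (a generic formula, not DeepRacer source code), the entropy of the policy's action distribution measures how spread out the action probabilities are, so a larger entropy term encourages more exploratory behavior:

import numpy as np

def policy_entropy(action_probs):
    # A uniform distribution (maximum randomness) has the highest entropy;
    # a distribution concentrated on one action has entropy close to zero.
    p = np.asarray(action_probs, dtype=float)
    p = p[p > 0]  # ignore zero-probability actions
    return float(-np.sum(p * np.log(p)))

print(policy_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17 (nearly deterministic)
print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.39 (uniform over 4 actions)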

PPO

Discount factor

A factor that specifies how much future rewards contribute to the expected reward. The larger the Discount factor value, the farther into the future the agent looks when deciding which action to take. If it is 0, only the immediate reward is considered. If it is 1, the full range of future rewards is included. As a simplified example, if the discount factor is 0.9, the agent considers rewards on the order of 10 future steps to make a move. If the discount factor is 0.99, the agent considers on the order of 100 future steps. And if the discount factor is 0.999, the agent considers on the order of 1,000 future steps.

Required

No

Valid values

Real number between [0 - 1]

Default value

0.999
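
The expected reward is the discounted sum of future rewards, and the rule of thumb above follows from the effective horizon of roughly 1 / (1 - discount factor). A generic illustration:

def discounted_return(rewards, discount_factor=0.999):
    # Each future reward is weighted by discount_factor ** t.
    # 0     -> only the immediate reward counts.
    # 0.9   -> contributions fade after roughly 1 / (1 - 0.9) = 10 steps.
    # 0.999 -> the agent effectively looks about 1,000 steps ahead.
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (discount_factor ** t) * reward
    return total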

PPO

Loss type

Type of the objective function used to update the network weights. A good training algorithm should make incremental changes to the agent's strategy so that it gradually transitions from taking random actions to taking strategic actions that increase the reward. If it makes too big a change, the training becomes unstable and the agent ends up not learning. The Huber loss and Mean squared error loss types behave similarly for small updates. But as the updates become larger, the Huber loss takes smaller increments than the Mean squared error loss. When you have convergence problems, use the Huber loss type. When convergence is good and you want to train faster, use the Mean squared error loss type.

Required

No

Valid values

(Huber loss, Mean squared error loss)

Default value

Huber loss
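
To see why the Huber loss damps large updates while matching the Mean squared error loss for small ones, compare the two on the same error (generic formulas, with the Huber threshold assumed to be 1):

import numpy as np

def mse_loss(error):
    # Grows quadratically with the error.
    return 0.5 * error ** 2

def huber_loss(error, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it, so large errors
    # produce smaller gradients (and smaller weight updates) than MSE.
    abs_err = np.abs(error)
    return np.where(abs_err <= delta,
                    0.5 * error ** 2,
                    delta * (abs_err - 0.5 * delta))

for e in (0.5, 2.0, 10.0):
    print(e, float(mse_loss(e)), float(huber_loss(e)))
# error 0.5: both 0.125.  error 10.0: MSE 50.0 vs Huber 9.5.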

PPO

Number of episodes between each training

This parameter determines how much experience, in the form of episodes, the agent obtains between each training run. After each training run, the newly trained model is used to obtain more episodes, and the process continues iteratively.

Recommended values are (10, 20, 40).

Required

No

Valid values

Positive integer between [1 - 1000]

Default value

20
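
The overall loop alternates between collecting the configured number of episodes with the current model and training on that experience (collect_episode and train_model are placeholders for illustration, not DeepRacer APIs):

def training_loop(collect_episode, train_model, episodes_between_training=20,
                  iterations=100):
    # collect_episode is assumed to return the experience steps produced by
    # driving the current model for one episode; train_model updates the
    # model on the accumulated experience before the next round of episodes.
    for _ in range(iterations):
        experience = []
        for _ in range(episodes_between_training):
            experience.extend(collect_episode())
        train_model(experience)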