Systematically Tune Hyperparameters for Optimal Training Performance
The following table shows the hyperparameters that can be adjusted to tune training performance for a supported algorithm:
Algorithm-specific hyperparameters and their effects
Algorithm  Hyperparameter  Description

PPO  Batch size
The number of images used for each gradient descent update. A larger Batch size value leads to more stable updates, but slower training.

PPO  Number of epochs
The number of passes through the experience buffer during gradient descent. A smaller Number of epochs value speeds up training, but can make updates less stable.

PPO  Learning rate
Controls how much each gradient descent step contributes to an update. A larger value can increase the training speed, but may prevent the expected rewards from converging if it is too large.
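
To make the interplay of Batch size, Number of epochs, and Learning rate concrete, here is a minimal, hypothetical sketch of a gradient descent update loop (illustrative only, not the service's actual training code; `update_policy` and its scalar "gradient" samples are stand-ins):

```python
import random

# Hypothetical sketch: how Batch size, Number of epochs, and Learning rate
# interact in gradient descent. Each item in `experience` stands in for a
# per-sample gradient; a real trainer would backpropagate through a network.
def update_policy(experience, batch_size=64, num_epochs=3, learning_rate=0.0003):
    weight = 0.0  # stand-in for the network's parameters
    for _ in range(num_epochs):              # one epoch = one pass over the buffer
        random.shuffle(experience)
        for start in range(0, len(experience), batch_size):
            batch = experience[start:start + batch_size]
            # average over the batch: larger batches give a smoother,
            # more stable gradient estimate, at the cost of fewer updates
            grad = sum(sample for sample in batch) / len(batch)
            # the learning rate scales how far each step moves the weights
            weight -= learning_rate * grad
    return weight
```

More epochs and larger batches each trade training speed for update stability, while the learning rate scales every individual step.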

PPO  Exploration
The type of exploration used in training reinforcement learning models. Use categorical exploration for models with a discrete action space.

PPO  Entropy
The degree of randomness with which the agent takes actions. The larger the entropy, the more random the actions the agent takes for exploration.
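
As a small illustration (a sketch, not the trainer's implementation), entropy measures how spread out the agent's action distribution is: a uniform distribution over actions has maximum entropy, while a near-deterministic one has entropy close to zero.

```python
import math

# Entropy of a discrete action distribution: H = -sum(p * log p).
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

random_agent = [0.25, 0.25, 0.25, 0.25]    # picks among 4 actions uniformly
decisive_agent = [0.97, 0.01, 0.01, 0.01]  # almost always picks action 0
```

Here `entropy(random_agent)` equals log(4), the maximum for four actions, while `entropy(decisive_agent)` is much smaller, reflecting little exploration.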

PPO  Discount factor
A factor that specifies how much future rewards contribute to the expected rewards. The larger the Discount factor value, the farther out the contributions the agent considers when taking an action. If it is 0, only the immediate reward is considered. If it is 1, the full range of future rewards is included. As a simplified example, if the agent takes on the order of 100 steps to make a turn, a discount factor of 0.99 is a good value; if it takes 1,000 steps, 0.999 is a good value.
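
The rule of thumb above can be checked with a short sketch: a reward t steps ahead is weighted by gamma^t, so with a constant reward the return approaches 1 / (1 - gamma), giving gamma = 0.99 an effective horizon of about 100 steps (illustrative code, not the trainer's implementation):

```python
# Discounted return: reward t steps ahead is scaled by gamma ** t,
# so contributions shrink geometrically with distance into the future.
def discounted_return(rewards, gamma):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# With a constant reward of 1 per step, the return approaches
# 1 / (1 - gamma): about 100 for gamma = 0.99, about 1000 for gamma = 0.999.
```

With gamma = 0, only `rewards[0]` survives, matching the "only the immediate reward" case in the table.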

PPO  Loss type
The type of objective function used for optimization. A good training algorithm makes incremental changes to the agent's strategy so that it gradually transitions from taking random actions to taking strategic actions that increase reward. If it makes too big a change, training becomes unstable and the agent ends up not learning. The Huber loss and Mean squared error loss types behave similarly for small updates, but as updates become larger, Huber loss takes smaller increments than Mean squared error loss. When you have convergence problems, use the Huber loss type. When convergence is good and you want to train faster, use the Mean squared error loss type.
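
The contrast between the two loss types can be seen directly in a small sketch (the delta of 1.0 is an assumed value for illustration): near zero both losses are quadratic, but for large errors Huber grows only linearly, which is why it produces gentler updates.

```python
# Mean squared error grows quadratically with the error everywhere.
def mse_loss(error):
    return error ** 2

# Huber loss is quadratic near zero (like MSE) but linear in the tails,
# so large errors produce proportionally smaller update steps.
def huber_loss(error, delta=1.0):
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

# mse_loss(0.5)  -> 0.25    vs  huber_loss(0.5)  -> 0.125  (same order)
# mse_loss(10.0) -> 100.0   vs  huber_loss(10.0) -> 9.5    (much gentler)
```

For a small error the two are comparable, but for an error of 10 the MSE penalty is over ten times the Huber penalty, which is what stabilizes training when updates would otherwise be too large.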

PPO  Number of episodes between each training
This parameter controls how frequently the agent trains its neural network. A low value means the agent gathers only a small amount of experience between updates. For problems that are easy to solve, a small value suffices and learning is fast. For more complex problems, it is better to gather more experience so that the agent observes more variations of the effects of its actions; learning is then slower but more stable.
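
The gather-then-train cycle this parameter controls can be sketched as follows (a hypothetical loop for illustration; `run_episode` and `update_network` are assumed placeholder callbacks, not part of the actual service):

```python
# Hypothetical training loop: the agent collects a fixed number of episodes
# of experience, then pauses to update its neural network on that batch.
def train(num_updates, episodes_between_training, run_episode, update_network):
    buffer = []
    for _ in range(num_updates):
        for _ in range(episodes_between_training):
            buffer.extend(run_episode())   # gather experience
        update_network(buffer)             # train on what was gathered
        buffer.clear()                     # on-policy methods discard stale experience
```

A larger `episodes_between_training` means each update sees more varied experience (slower but more stable learning); a smaller one means frequent updates on little data.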
