Some Code-level Performance Optimization Tricks for PPO

Intro

This blog post is my summary after reading "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" by Engstrom et al.

reward clipping

  • clip the rewards within a preset range (usually [-5, 5] or [-10, 10])
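
A minimal sketch of this trick; the clip range here is assumed to be [-10, 10] and `clip_reward` is just an illustrative helper name:

```python
import numpy as np

def clip_reward(reward, low=-10.0, high=10.0):
    """Clip a raw environment reward into a preset range (assumed [-10, 10] here)."""
    return float(np.clip(reward, low, high))

# usage: r = clip_reward(raw_reward)
```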

observation clipping

  • The states are first normalized to mean-zero, variance-one vectors (using running statistics), then clipped to a fixed range (commonly [-10, 10])
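
One common way to implement this is a running mean/variance estimate followed by a clip. The sketch below assumes a per-step Welford update and a clip range of [-10, 10]; the class name is illustrative:

```python
import numpy as np

class RunningObsNormalizer:
    """Track running mean/variance of observations, normalize to ~N(0, 1), then clip."""

    def __init__(self, shape, clip_range=10.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)   # sum of squared deviations
        self.count = 0
        self.clip_range = clip_range
        self.eps = eps

    def update(self, obs):
        # Welford-style incremental update of mean and variance
        obs = np.asarray(obs, dtype=np.float64)
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        var = self.m2 / max(self.count, 1)
        z = (np.asarray(obs) - self.mean) / np.sqrt(var + self.eps)
        return np.clip(z, -self.clip_range, self.clip_range)
```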

value function clipping

Replace the plain regression loss \(L^{V} = (V_{\theta_t} - V_{\mathrm{targ}})^{2}\) with the clipped objective \(L^{V} = \min\!\left[(V_{\theta_t} - V_{\mathrm{targ}})^{2},\ \left(\mathrm{clip}\!\left(V_{\theta_t},\, V_{\theta_{t-1}} - \epsilon,\, V_{\theta_{t-1}} + \epsilon\right) - V_{\mathrm{targ}}\right)^{2}\right]\)
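
A PyTorch sketch of the objective above (function and argument names are illustrative): `v_pred_old` stands for the value prediction made before the current update, and `clip_eps` plays the role of \(\epsilon\):

```python
import torch

def clipped_value_loss(v_pred, v_pred_old, v_target, clip_eps=0.2):
    """Clipped value-function loss following the formula above."""
    unclipped = (v_pred - v_target) ** 2
    # clip(V_t, V_{t-1} - eps, V_{t-1} + eps), written as old value + clamped difference
    v_clipped = v_pred_old + torch.clamp(v_pred - v_pred_old, -clip_eps, clip_eps)
    clipped = (v_clipped - v_target) ** 2
    # The formula above takes the min of the two terms; some public
    # implementations take the max instead.
    return torch.min(unclipped, clipped).mean()
```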

orthogonal initialization and layer scaling

use orthogonal initialization with scaling that varies from layer to layer
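
A possible PyTorch sketch, assuming the commonly used gains (\(\sqrt{2}\) for hidden layers, 0.01 for the policy head, 1.0 for the value head); the layer sizes are placeholders:

```python
import torch.nn as nn

def ortho_init(layer, gain):
    """Orthogonal weight init with a per-layer gain; biases set to zero."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

# hidden layers typically use gain sqrt(2); the policy head a small gain
# (e.g. 0.01) and the value head gain 1.0 (dimensions are placeholders)
hidden = ortho_init(nn.Linear(64, 64), gain=2 ** 0.5)
policy_head = ortho_init(nn.Linear(64, 6), gain=0.01)
value_head = ortho_init(nn.Linear(64, 1), gain=1.0)
```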

adam learning rate annealing

anneal the learning rate of Adam
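
A minimal sketch of linear annealing toward zero, assuming an initial learning rate of 3e-4 and a fixed number of updates; the model is a stand-in:

```python
import torch

model = torch.nn.Linear(8, 2)                       # stand-in for the policy/value nets
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

total_updates = 1000
for update in range(total_updates):
    # linearly anneal the learning rate from its initial value down to 0
    frac = 1.0 - update / total_updates
    for group in optimizer.param_groups:
        group["lr"] = 3e-4 * frac
    # ... compute loss, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
```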

hyperbolic tan activations

use hyperbolic tangent (tanh) activations when constructing the policy network and the value network
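
A small sketch of policy and value MLPs built with tanh activations; the layer widths and input/output dimensions are placeholders:

```python
import torch.nn as nn

obs_dim, act_dim = 8, 2    # placeholder dimensions

policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)

value_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
```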

global gradient clipping

clip the gradients so that the 'global l2 norm' (the l2 norm of all parameter gradients taken together) does not exceed 0.5
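
A PyTorch sketch using `torch.nn.utils.clip_grad_norm_`, which rescales all gradients together whenever their global L2 norm exceeds the threshold; the model and loss here are dummies for illustration:

```python
import torch

model = torch.nn.Linear(8, 2)                       # stand-in for policy + value params
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

loss = model(torch.randn(4, 8)).pow(2).mean()       # dummy loss for illustration
loss.backward()
# rescale all gradients together so their global L2 norm is at most 0.5
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
optimizer.zero_grad()
```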

reward scaling

scale the rewards by dividing them by the standard deviation of a rolling discounted sum of the rewards
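
A rough sketch of one common way to do this: keep a rolling discounted sum of rewards and divide each incoming reward by its standard deviation. The class name, discount value, and bookkeeping details are illustrative:

```python
import numpy as np

class RewardScaler:
    """Divide rewards by the std of a rolling discounted return (no mean subtraction)."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma = gamma
        self.eps = eps
        self.ret = 0.0          # rolling discounted sum of rewards
        self.returns = []       # history used to estimate the std

    def scale(self, reward):
        self.ret = self.gamma * self.ret + reward
        self.returns.append(self.ret)
        if len(self.returns) < 2:
            return reward       # not enough history to estimate a std yet
        return reward / (np.std(self.returns) + self.eps)
```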

Original post: https://www.cnblogs.com/dynmi/p/14031724.html