
Reward Shaping Bitter Lesson

2 min

What I worked on

Added a new JUMP action and immediately saw the search space collapse for the non-linear policy.

What I noticed

  • Changing the activation function from ReLU to tanh smoothed the search space but didn’t stop the parameter space from shrinking (see the policy sketch after this list)

[Figures: tanh activation vs. ReLU activation]

  • Actions that maximized reward early on (NOOP, EAT) caused CMA-ES to shrink its sampling distribution around them (see the monitoring sketch after this list).

    [Figures: action space heatmap, action space entropy, logit drift, generation actions]

  • Changing the environment dynamics (NOOP >= FORWARD) encouraged exploration

  • Almost identical states lead to different policies, so is anything actually being learnt?

  • Trimming the features down from 11 to 5 didn’t have any impact
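
To make “collapse in the search space” concrete, here’s a minimal sketch of the kind of policy involved. The layer sizes, action set, and flattening scheme below are my own assumptions, not the actual project code, but it shows how a single parameter vector sampled by CMA-ES becomes a one-hidden-layer policy whose activation can be swapped between ReLU and tanh.

```python
import numpy as np

# Hypothetical stand-ins for the setup (sizes and action names are guesses):
# 5 observation features, one small hidden layer, and the actions mentioned above.
N_FEATURES, N_HIDDEN = 5, 8
ACTIONS = ["NOOP", "EAT", "FORWARD", "JUMP"]
N_ACTIONS = len(ACTIONS)
N_PARAMS = N_FEATURES * N_HIDDEN + N_HIDDEN + N_HIDDEN * N_ACTIONS + N_ACTIONS


def relu(x):
    return np.maximum(x, 0.0)


def policy_logits(params, obs, activation=np.tanh):
    """Unflatten a CMA-ES parameter vector into a one-hidden-layer MLP and
    return action logits; pass activation=relu to compare the two landscapes."""
    i = 0
    w1 = params[i:i + N_FEATURES * N_HIDDEN].reshape(N_FEATURES, N_HIDDEN)
    i += N_FEATURES * N_HIDDEN
    b1 = params[i:i + N_HIDDEN]
    i += N_HIDDEN
    w2 = params[i:i + N_HIDDEN * N_ACTIONS].reshape(N_HIDDEN, N_ACTIONS)
    i += N_HIDDEN * N_ACTIONS
    b2 = params[i:i + N_ACTIONS]
    return activation(obs @ w1 + b1) @ w2 + b2
```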

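And this is roughly how I’d instrument the collapse itself. It’s a sketch that assumes the pycma package and reuses policy_logits and the constants from the sketch above; the rollout is a throwaway placeholder for the real environment loop. The two numbers worth logging per generation are CMA-ES’s step size (es.sigma) and the entropy of the actions actually taken, which is essentially what the plots above show.

```python
import cma  # pycma; an assumption, not necessarily what the project uses
import numpy as np


def action_entropy(counts):
    """Shannon entropy (nats) of the empirical action distribution for one generation."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def rollout(params, episode_len=50):
    """Placeholder for the real environment loop: random observations, greedy
    actions, +1 reward per non-NOOP step. Purely illustrative."""
    rng = np.random.default_rng(0)
    counts = np.zeros(N_ACTIONS)
    total = 0.0
    for _ in range(episode_len):
        obs = rng.normal(size=N_FEATURES)
        a = int(np.argmax(policy_logits(params, obs)))
        counts[a] += 1
        total += 0.0 if ACTIONS[a] == "NOOP" else 1.0
    return total, counts


es = cma.CMAEvolutionStrategy(np.zeros(N_PARAMS), 0.5)
for generation in range(20):
    candidates = es.ask()
    fitnesses, gen_counts = [], np.zeros(N_ACTIONS)
    for params in candidates:
        total_reward, counts = rollout(np.asarray(params))
        fitnesses.append(-total_reward)  # CMA-ES minimizes, so negate the reward
        gen_counts += counts
    es.tell(candidates, fitnesses)
    # Step size and action entropy falling together is the "shrunk distribution" symptom.
    print(generation, round(es.sigma, 4), round(action_entropy(gen_counts), 4))
```
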
“Aha” Moment

  • I now understand why entropy/action-diversity terms get folded into the objective (a minimal sketch of that follows this list)
  • This type of engineering (i.e. reward shaping, entropy bonuses, hand-engineered features) is exactly what the bitter lesson warns about
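
For reference, this is the kind of objective term I now understand the appeal of but am trying to resist. It’s a hedged sketch (the beta weight and the counts-based entropy estimate are my own choices) of folding action diversity into the score CMA-ES sees.

```python
import numpy as np


def shaped_fitness(total_reward, action_counts, beta=0.1):
    """Episode reward plus an entropy bonus over the actions actually taken.
    beta is an arbitrary illustrative weight; the result is negated because
    CMA-ES minimizes."""
    p = np.asarray(action_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return -(total_reward + beta * entropy)
```

Swapping `fitnesses.append(-total_reward)` for `shaped_fitness(total_reward, counts)` in the loop above is all it would take, which is exactly why it feels like the start of the reward-shaping slope.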

What still feels messy

  • CMA-ES optimizes over the long haul, so exploratory runs that happen to score low get penalized
  • I want to avoid the bitter lesson but that doesn’t seem possible this early on
  • It’s extremely frustrating how hard the policy becomes to interpret once you introduce hidden layer(s)

Next step

  • I’m purposely going to stay away from more generations, a bigger population, a larger standard deviation, reward hacking, and entropy bonuses
  • Think about and play around with what a good action means. I don’t want to encourage entropy, but I do want a way to determine positive and negative “valence”