
A Learned CMA-ES Policy to Survive


What I worked on

  • Read through the CMA-ES code to understand how policies, parameters, and features interact. Examined the act() function, evaluation logic, population size calculations, and random seeding.
  • Created my own “best policy”

What I noticed

  • act() maps observations to actions using features
  • Policies flatten their parameters into a single vector (θ); see the sketch after this list
  • The horizon defines episode length
  • Random seed variations affect population initialization
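
To keep those pieces straight, here is a minimal sketch (my own placeholder code, not the repo's) of a linear policy with an act() method and the flatten/unflatten step that turns its parameters into the θ vector CMA-ES searches over; LinearPolicy, get_theta, and set_theta are names I made up.

```python
import numpy as np

class LinearPolicy:
    """Hypothetical linear policy: action = W @ obs + b."""

    def __init__(self, obs_dim, act_dim):
        self.W = np.zeros((act_dim, obs_dim))
        self.b = np.zeros(act_dim)

    def act(self, obs):
        # Map an observation (feature vector) to an action.
        return self.W @ obs + self.b

    def get_theta(self):
        # Flatten all parameters into one vector theta for the optimizer.
        return np.concatenate([self.W.ravel(), self.b])

    def set_theta(self, theta):
        # Unflatten theta back into W and b.
        n = self.W.size
        self.W = theta[:n].reshape(self.W.shape)
        self.b = theta[n:]
```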

"Aha" Moment

  • The result of the search was a linear policy that selects an action via a = Wx + b
  • This is equivalent to a single-layer perceptron (SLP), except in how it learns: there is no backprop
  • An SLP learns from supervised training; CMA-ES learns from a fitness signal
  • An SLP follows the gradient, while CMA-ES estimates a distribution over promising perturbations of θ
  • An SLP update is local, while CMA-ES is global and needs the whole episode (see the sketch after this list)
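
To make "learns from a fitness signal" concrete, here is a stripped-down sketch assuming a Gymnasium-style env and the LinearPolicy placeholder above. It is a plain truncation-selection evolution strategy, not full CMA-ES (no covariance or step-size adaptation), but it shows the core contrast with backprop: each candidate θ is scored by a whole-episode return, and the search distribution moves toward the fittest candidates.

```python
import numpy as np

def episode_return(theta, policy, env, horizon, seed=0):
    # Fitness signal: total reward over one full episode. No gradients anywhere.
    policy.set_theta(theta)
    obs, _ = env.reset(seed=seed)
    total = 0.0
    for _ in range(horizon):
        obs, reward, terminated, truncated, _ = env.step(policy.act(obs))
        total += reward
        if terminated or truncated:
            break
    return total

def evolve(policy, env, horizon, generations=50, pop_size=16, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = policy.get_theta()
    for _ in range(generations):
        # Sample a population of perturbed parameter vectors around the mean.
        noise = rng.normal(size=(pop_size, theta.size))
        candidates = theta + sigma * noise
        fitness = np.array(
            [episode_return(c, policy, env, horizon) for c in candidates]
        )
        # Move the mean toward the fittest quarter of the population.
        elite = candidates[np.argsort(fitness)[-pop_size // 4:]]
        theta = elite.mean(axis=0)
    policy.set_theta(theta)
    return theta
```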

What still feels messy

CMA-ES is searching the parameter space and learning which directions matter, but it needs a full episode for every fitness evaluation. For a multi-layered network it's shifting all of the weights at once, which doesn't sound efficient.
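
For a sense of scale, here is a back-of-the-envelope count of what a small two-layer policy would pack into a single flattened θ (the layer sizes are made up for illustration):

```python
# Rough parameter count for a hypothetical two-layer policy: this is what
# CMA-ES would be perturbing all at once in a single flattened theta vector.
obs_dim, hidden, act_dim = 8, 32, 2
n_params = obs_dim * hidden + hidden + hidden * act_dim + act_dim
print(n_params)  # 354 parameters, versus 18 for the linear policy above
```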

Next step

Create a multi-layered network and introduce JUMP to see how it learns the best policy.