
CMA-ES for Agent Training


What I worked on

Studied how CMA-ES can train an agent without backpropagation by optimizing a fitness function. Looked at how a linear policy maps features to actions and how it behaves after training.

What I noticed

  • The agent always moved forward and never ate, indicating a poor policy
  • CMA-ES optimizes a parameter vector directly, no neural net required
  • Policies are encoded as weight and bias vectors
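A policy encoded this way might look like the sketch below: a flat parameter vector is split into weights and biases, and the action is whichever linear score comes out highest. The feature and action counts are made-up values for illustration only.

```python
import random

# Hypothetical sizes, for illustration: 4 input features, 3 actions.
N_FEATURES, N_ACTIONS = 4, 3

def decode(params):
    """Split a flat parameter vector into a weight matrix and a bias vector."""
    w_end = N_FEATURES * N_ACTIONS
    weights = [params[i * N_FEATURES:(i + 1) * N_FEATURES]
               for i in range(N_ACTIONS)]
    bias = params[w_end:w_end + N_ACTIONS]
    return weights, bias

def act(params, features):
    """Linear policy: pick the action with the highest weighted score."""
    weights, bias = decode(params)
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return max(range(N_ACTIONS), key=lambda a: scores[a])

params = [random.gauss(0, 1) for _ in range(N_FEATURES * N_ACTIONS + N_ACTIONS)]
print(act(params, [0.5, -1.0, 0.2, 0.0]))  # an action index in {0, 1, 2}
```

Because the whole policy is just this flat list of numbers, an optimizer like CMA-ES can mutate and recombine it without ever computing a gradient.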

“Aha” Moment

CMA-ES treats the policy as nothing more than a parameter vector and evolves it purely by measuring fitness outcomes.
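That loop can be sketched in a few lines. This is a deliberately simplified (mu, lambda) evolution strategy, not full CMA-ES, which additionally adapts a full covariance matrix and its step size; here sigma just decays on a fixed schedule, and the fitness function is a hypothetical stand-in for running the agent in the environment.

```python
import random

def fitness(params):
    """Stand-in fitness: in the real setup this would run the agent in the
    environment and return a score such as food eaten or time alive."""
    target = [1.0, -2.0, 0.5]  # hypothetical optimum, for illustration only
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def evolve(dim=3, popsize=20, elite=5, sigma=0.5, generations=100):
    """Sample a population around the mean, rank by fitness, and recombine
    the elite into the next mean. No gradients are ever computed."""
    mean = [0.0] * dim
    for _ in range(generations):
        pop = [[m + random.gauss(0, sigma) for m in mean]
               for _ in range(popsize)]
        pop.sort(key=fitness, reverse=True)
        mean = [sum(ind[i] for ind in pop[:elite]) / elite
                for i in range(dim)]
        sigma *= 0.95  # crude stand-in for CMA-ES step-size adaptation
    return mean

print(evolve())  # drifts toward the target [1.0, -2.0, 0.5]
```

The key point survives the simplification: the optimizer only ever sees parameter vectors going in and fitness scores coming out.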

What still feels messy

How to design a reward or feature set that encourages meaningful exploration rather than simple repetitive behavior.

Next step

Modify the environment to only reward time alive and expose a “near-food” feature