CMA-ES for Agent Training
1 min read
What I worked on
Studied how CMA-ES can train an agent without backpropagation by optimizing a fitness function. Looked at how a linear policy maps features to actions and how it behaves after training.
What I noticed
- The agent always moved forward and never ate, indicating a poor policy
- CMA-ES optimizes a parameter vector directly, no neural net required
- Policies are encoded as weight and bias vectors
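The weight-and-bias encoding might look like this: a single flat vector that CMA-ES evolves, decoded into a linear policy at evaluation time. The feature and action counts here are illustrative assumptions:

```python
import numpy as np

N_FEATURES, N_ACTIONS = 4, 3  # assumed sizes, for illustration only

def decode(theta):
    """Split the flat parameter vector into a weight matrix and bias vector."""
    w = theta[: N_FEATURES * N_ACTIONS].reshape(N_ACTIONS, N_FEATURES)
    b = theta[N_FEATURES * N_ACTIONS :]
    return w, b

def act(theta, features):
    """Linear policy: score each action and pick the highest."""
    w, b = decode(theta)
    return int(np.argmax(w @ features + b))

# This flat vector is the only thing CMA-ES ever sees or mutates.
theta = np.zeros(N_FEATURES * N_ACTIONS + N_ACTIONS)
action = act(theta, np.ones(N_FEATURES))  # all-zero weights pick action 0
```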
“Aha” Moment
That CMA-ES treats the policy as a parameter vector and evolves it purely by measuring fitness outcomes.
What still feels messy
How to design a reward or feature set that encourages meaningful exploration rather than simple repetitive behavior.
Next step
Modify the environment to reward only time alive and expose a “near-food” feature
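That next step could start from something like the sketch below. The function names, the near-food radius, and the feature layout are all hypothetical, not an existing API:

```python
import math

def features(agent_pos, food_pos, near_radius=2.0):
    """Hypothetical feature vector: food offset plus a binary near-food flag."""
    dx = food_pos[0] - agent_pos[0]
    dy = food_pos[1] - agent_pos[1]
    near_food = 1.0 if math.hypot(dx, dy) <= near_radius else 0.0
    return [dx, dy, near_food]

def reward(steps_alive):
    """Reward only time alive, per the planned environment change."""
    return float(steps_alive)

obs = features((0, 0), (1, 1))  # food one step away -> near_food fires
```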