GQA vs MQA
What I worked on
Looked into why exponential functions show up so often in ML equations, how grouped query attention (GQA) differs from multi-query attention (MQA), and what rotary position embedding (RoPE) means.
What I noticed
- Negative exponents bound values: e^(-x) stays in (0, 1] for any non-negative x
- Grouped query attention shares each key/value head across a group of query heads, so it needs a smaller KV cache than full multi-head attention while keeping more capacity than multi-query attention's single KV head
- Rotary position embeddings rotate query/key vectors by position-dependent angles, so attention scores vary smoothly with relative position

(quick sketches of each point below)
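A quick check of the first point (NumPy, values are just examples): a negative exponent squashes any non-negative input into (0, 1].

```python
import numpy as np

# For any non-negative x, exp(-x) lands in (0, 1]: the negative
# exponent turns unbounded inputs into bounded, positive scores.
x = np.array([0.0, 0.5, 2.0, 10.0])
print(np.exp(-x))  # [1.0, 0.6065..., 0.1353..., 4.54e-05]
```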
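For the second point, a minimal NumPy sketch of grouped-query attention; all names and shapes here are my own, not from any particular library. One KV head is MQA, one KV head per query head is plain multi-head attention, and anything in between is GQA.

```python
import numpy as np

def grouped_attention(q, k, v):
    """Toy grouped-query attention (batch omitted).
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    n_kv_heads == 1 is MQA; n_kv_heads == n_q_heads is plain MHA."""
    n_q_heads, _, d = q.shape
    group = n_q_heads // k.shape[0]
    # Each KV head is shared by `group` query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
kv = rng.normal(size=(2, 4, 16))  # 2 KV heads -> groups of 4
print(grouped_attention(q, kv, kv).shape)  # (8, 4, 16)
```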
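For the third point, a single-vector rotary-embedding sketch (my own simplified layout, pairing adjacent dimensions): rotating q and k by position-proportional angles makes their dot product depend only on the position offset.

```python
import numpy as np

def rope(vec, pos, theta=10000.0):
    """Rotate each adjacent 2D pair of `vec` by an angle
    proportional to `pos` (minimal single-vector sketch)."""
    d = vec.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).normal(size=8)
k = np.random.default_rng(2).normal(size=8)
# The q.k score depends only on the relative offset (m - n):
s1 = rope(q, 5) @ rope(k, 3)    # positions 5 and 3, offset 2
s2 = rope(q, 12) @ rope(k, 10)  # positions 12 and 10, offset 2
print(np.isclose(s1, s2))       # True
```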
“Aha” Moment
n/a
What still feels messy
How rotary embeddings compare to sinusoidal ones.
Next step
Visualize positional embeddings to see how they transform token space.