GQA vs MQA
What I worked on
Looked into why exponential functions show up so often in ML equations, how grouped query attention (GQA) differs from multi-query attention (MQA), and what rotary position embedding (RoPE) means.
What I noticed
- Negative exponents bound values: e^(-x) stays in (0, 1] for any non-negative x
- Grouped query attention shares each key/value head across a group of query heads, so it needs a smaller KV cache than full multi-head attention while keeping more capacity than multi-query attention's single KV head
- Rotary position embeddings rotate query/key vectors by position-dependent angles, so attention scores vary smoothly with relative position

(quick sketches of each point below)
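A quick check of the first point (NumPy, values are just examples): a negative exponent squashes any non-negative input into (0, 1].

```python
import numpy as np

# For any non-negative x, exp(-x) lands in (0, 1]: the negative
# exponent turns unbounded inputs into bounded, positive scores.
x = np.array([0.0, 0.5, 2.0, 10.0])
print(np.exp(-x))  # [1.0, 0.6065..., 0.1353..., 4.54e-05]
```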
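For the second point, a minimal NumPy sketch of grouped-query attention; all names and shapes here are my own, not from any particular library. One KV head is MQA, one KV head per query head is plain multi-head attention, and anything in between is GQA.

```python
import numpy as np

def grouped_attention(q, k, v):
    """Toy grouped-query attention (batch omitted).
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    n_kv_heads == 1 is MQA; n_kv_heads == n_q_heads is plain MHA."""
    n_q_heads, _, d = q.shape
    group = n_q_heads // k.shape[0]
    # Each KV head is shared by `group` query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
kv = rng.normal(size=(2, 4, 16))  # 2 KV heads -> groups of 4
print(grouped_attention(q, kv, kv).shape)  # (8, 4, 16)
```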
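For the third point, a single-vector rotary-embedding sketch (my own simplified layout, pairing adjacent dimensions): rotating q and k by position-proportional angles makes their dot product depend only on the position offset.

```python
import numpy as np

def rope(vec, pos, theta=10000.0):
    """Rotate each adjacent 2D pair of `vec` by an angle
    proportional to `pos` (minimal single-vector sketch)."""
    d = vec.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).normal(size=8)
k = np.random.default_rng(2).normal(size=8)
# The q.k score depends only on the relative offset (m - n):
s1 = rope(q, 5) @ rope(k, 3)    # positions 5 and 3, offset 2
s2 = rope(q, 12) @ rope(k, 10)  # positions 12 and 10, offset 2
print(np.isclose(s1, s2))       # True
```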
“Aha” Moment
n/a
What still feels messy
How rotary embeddings compare to sinusoidal ones.
Next step
Visualize positional embeddings to see how they transform token space.