GQA vs MQA

What I worked on

Looked into why exponential functions show up so often in ML equations, and how grouped query attention (GQA) differs from multi-query attention (MQA). Also checked what rotary position embedding (RoPE) means.

What I noticed

  • Negative exponents constrain values to between 0 and 1: for x > 0, e^{-x} lies strictly in (0, 1) (quick check after this list)
  • Grouped query attention shares each key/value head across a group of query heads, shrinking the KV cache relative to full multi-head attention while keeping more capacity than multi-query attention's single shared KV head (sketch below)
  • Rotary position embeddings rotate query/key vectors by position-dependent angles, so attention scores depend only on relative position (sketch below)
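
A quick check of the first bullet; a minimal sketch, assuming base e:

```python
import math

# For x > 0, e^(-x) lies strictly between 0 and 1:
# it approaches 1 as x -> 0 and 0 as x grows.
for x in [0.1, 1.0, 5.0, 20.0]:
    y = math.exp(-x)
    assert 0.0 < y < 1.0
    print(f"exp(-{x}) = {y:.6f}")
```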
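
A minimal sketch of the GQA/MQA difference in plain NumPy; the head counts, shapes, and names are illustrative assumptions, not any model's real config:

```python
import numpy as np

seq, d_head = 8, 16
n_q_heads = 8     # query heads
n_kv_heads = 2    # GQA: each KV head is shared by a group of 4 query heads
                  # (MQA would be n_kv_heads = 1; full MHA would be 8)
group = n_q_heads // n_kv_heads

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, d_head))
k = rng.standard_normal((n_kv_heads, seq, d_head))
v = rng.standard_normal((n_kv_heads, seq, d_head))

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group                            # KV head for this query head
    scores = q[h] @ k[kv].T / np.sqrt(d_head)  # note the exponential below
    w = np.exp(scores - scores.max(-1, keepdims=True))
    attn = w / w.sum(-1, keepdims=True)        # softmax over keys
    out[h] = attn @ v[kv]

print(out.shape)  # (8, 8, 16): 8 query heads served by only 2 cached KV heads
```

The efficiency win is that only n_kv_heads key/value tensors need to be cached during decoding, not one per query head.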
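
And a minimal RoPE sketch, assuming the common 10000^(-2i/d) frequency schedule; the function name and setup are my own, not from a library:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive dimension pairs of x by position-dependent angles.
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = pos / base ** (2 * i / d)       # one angle per dimension pair
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin    # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The dot product of two rotated vectors depends only on their offset:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
a = rope(q, pos=5) @ rope(k, pos=3)   # relative offset 2
b = rope(q, pos=9) @ rope(k, pos=7)   # same offset, different positions
print(np.isclose(a, b))  # True: only the relative position matters
```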

“Aha” Moment

n/a

What still feels messy

How rotary embeddings compare to sinusoidal ones.

Next step

Visualize positional embeddings to see how they transform token space.
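
A possible starting point for that next step; the sinusoidal scheme and plotting choices here are assumptions, just one way to begin:

```python
import numpy as np
import matplotlib.pyplot as plt

# Heatmap of classic sinusoidal positional embeddings across
# positions (rows) and embedding dimensions (columns).
n_pos, d = 64, 128
pos = np.arange(n_pos)[:, None]
i = np.arange(d // 2)[None, :]
angle = pos / 10000.0 ** (2 * i / d)
emb = np.empty((n_pos, d))
emb[:, 0::2] = np.sin(angle)
emb[:, 1::2] = np.cos(angle)

plt.imshow(emb, aspect="auto", cmap="RdBu")
plt.xlabel("embedding dimension")
plt.ylabel("position")
plt.title("Sinusoidal positional embeddings")
plt.colorbar()
plt.show()
```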