[Figure: wav2vec 2.0 architecture and the contrastive prediction task]

Representation Learning in wav2vec Models


What I worked on

Explored how wav2vec 2.0 and vq‑wav2vec learn from raw audio and differ in how they represent speech features.

What I noticed

  • wav2vec 2.0 masks spans of the latent speech representations and learns by picking the true quantized latent out of a set of distractors, i.e. contrastive prediction (a minimal sketch follows this list).
  • vq‑wav2vec discretizes the latent features into quantized codes (via Gumbel‑softmax or k‑means) and trains with a future‑timestep contrastive objective; the resulting discrete codes can then feed a BERT‑style model.
  • wav2vec 2.0 is end‑to‑end and continuous (quantization appears only in its training targets), while vq‑wav2vec's output is discrete and symbolic.
  • Visualization and probing methods (e.g., t‑SNE, PCA) can help interpret feature embeddings.
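
To make the contrastive-prediction idea concrete, here is a minimal InfoNCE-style sketch in PyTorch. The function name, shapes, and temperature are my own assumptions; the real model also adds a Transformer context network, Gumbel-softmax quantization, and a codebook-diversity loss.

```python
# Minimal sketch of the masked contrastive objective in wav2vec 2.0.
# Shapes, function name, and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss at a single masked timestep.

    context:     (D,)   context-network output at the masked position
    target:      (D,)   true quantized latent for that position
    distractors: (K, D) negatives sampled from other masked positions
    """
    candidates = torch.cat([target.unsqueeze(0), distractors], dim=0)  # (K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(0), candidates, dim=-1)
    # the true target sits at index 0, so the "label" is 0
    return F.cross_entropy((sims / temperature).unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))

# toy usage with random tensors
D, K = 256, 10
loss = contrastive_loss(torch.randn(D), torch.randn(D), torch.randn(K, D))
```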

“Aha” moment

wav2vec 2.0 and vq‑wav2vec sit at two ends of the representation‑learning spectrum: continuous contextual embeddings versus discrete symbolic codes, both driven by self‑supervised objectives.
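
As a toy illustration of the discrete side, the sketch below snaps continuous frame embeddings to their nearest codebook entries, roughly in the spirit of vq‑wav2vec's k‑means variant. The codebook is random and the sizes are made up, so this only shows the mechanism, not a trained model.

```python
# Toy nearest-neighbor quantization: continuous vectors in, discrete codes out.
# vq-wav2vec learns its codebook (Gumbel-softmax or k-means) rather than
# sampling it randomly as done here.
import torch

def quantize(z, codebook):
    """Map each row of z (T, D) to its nearest entry in codebook (V, D)."""
    dists = torch.cdist(z, codebook)   # (T, V) pairwise L2 distances
    ids = dists.argmin(dim=-1)         # one discrete symbol per timestep
    return codebook[ids], ids

z = torch.randn(50, 256)               # 50 continuous frame embeddings
codebook = torch.randn(320, 256)       # hypothetical 320-entry codebook
z_q, ids = quantize(z, codebook)       # z_q: quantized vectors, ids: symbols
```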

What still feels messy

How exactly discrete versus continuous embeddings affect downstream generalization and interpretability.

Next step

Run visualization or probing experiments on wav2vec 2.0 embeddings to examine structure or clustering.
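
A possible starting point for that next step, assuming the HuggingFace transformers library and the facebook/wav2vec2-base checkpoint (any wav2vec 2.0 model would do). The random waveform is a stand-in for real speech, and sklearn.manifold.TSNE can be swapped in for PCA.

```python
# Sketch: pull frame-level wav2vec 2.0 embeddings and project them to 2-D.
# The random waveform below is a placeholder for real 16 kHz speech.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.decomposition import PCA

name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)  # placeholder: 1 s of "audio" at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frames = model(**inputs).last_hidden_state.squeeze(0)  # (T, 768)

coords = PCA(n_components=2).fit_transform(frames.numpy())  # (T, 2) for plotting
# swap in sklearn.manifold.TSNE for a nonlinear view of the same embeddings
```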