
Thursday, January 8, 2026

PART III — Tokenization & Representation


Chapter 5: Tokenization Explained from Scratch

Goal: Teach machines to read

Topics Covered:

  • Why tokenization is required

  • Character-level tokenization

  • Word-level tokenization

  • Subword tokenization

  • Byte Pair Encoding (BPE)

  • Trade-offs in vocabulary size

📌 Medium Post 5: Tokenization: Teaching Machines to Read
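To preview the last two topics above, here is a minimal sketch of Byte Pair Encoding: start from character-level tokens and repeatedly merge the most frequent adjacent pair. This is toy code under simplified assumptions (no word-boundary pre-splitting, ties broken by first occurrence), not the exact implementation the chapter will build:

```python
from collections import Counter

def get_pair_counts(tokens):
    """Count adjacent symbol pairs in the token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair, merged):
    """Replace every occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def bpe_train(text, num_merges):
    """Learn `num_merges` BPE merges, starting from character tokens."""
    tokens = list(text)  # character-level start
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(tokens)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(pair)
        tokens = merge_pair(tokens, pair, pair[0] + pair[1])
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", num_merges=3)
```

After a few merges, frequent substrings like "low" become single tokens while rare suffixes stay split, which is the character-level / word-level trade-off the subword approach resolves.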


Chapter 6: Vocabulary Design & Token IDs

Goal: Understand token spaces

Topics Covered:

  • Vocabulary creation

  • Special tokens (PAD, BOS, EOS)

  • Unknown tokens

  • Token frequency & pruning

  • Vocabulary size vs. model performance

📌 Medium Post 6: Designing a Vocabulary for LLMs
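The vocabulary-design topics above can be sketched in a few lines: reserve IDs for the special tokens, prune rare tokens by frequency, and map anything unseen to an unknown token. The token names and `min_freq` cutoff here are illustrative choices, not a prescription:

```python
from collections import Counter

SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]  # reserved IDs 0-3

def build_vocab(corpus, min_freq=2):
    """Map tokens to integer IDs, pruning tokens rarer than min_freq."""
    counts = Counter(tok for line in corpus for tok in line.split())
    kept = [tok for tok, c in counts.most_common() if c >= min_freq]
    return {tok: i for i, tok in enumerate(SPECIALS + kept)}

def encode(text, vocab):
    """Convert text to IDs, wrapping with BOS/EOS and falling back to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in text.split()]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]

vocab = build_vocab(["the cat sat", "the cat ran", "a dog ran"])
ids = encode("the dog sat", vocab)  # "dog" and "sat" map to <unk>
```

Raising `min_freq` shrinks the vocabulary (and the model's embedding table) but pushes more inputs onto `<unk>`, which is exactly the size-versus-performance tension listed above.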
