Tweet by Andrej Karpathy @karpathy

Exploring the Transformer Model with GPT-3


🧵 The TL;DR

This tweet thread explores the Transformer, the deep learning architecture at the heart of modern NLP. It introduces the core attention mechanism, then builds up the full Transformer from its components. A 10M-parameter model is then trained, compared against OpenAI's GPT-3 and ChatGPT, and sampled from to generate fake Shakespeare.


🔑 Key Points

  • Core 'attention' mechanism at the heart of the Transformer, viewed as communication/message passing between nodes in a directed graph (see the first sketch after this list)
  • Multi-headed self-attention, an MLP, residual connections, and layernorms combine into the full Transformer block (sketched below)
  • 10M parameter model trained for about 15 minutes on 1 GPU, then sampled from to generate fake Shakespeare (see the sampling sketch below)
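
A minimal sketch of the "attention as message passing" idea: each token is a node that emits a query and a key, and edges in the directed graph only run from earlier tokens to later ones (the causal mask). Tensor names and sizes here are illustrative assumptions, not the thread's exact code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 32        # batch, sequence length (number of nodes), channels
head_size = 16

x = torch.randn(B, T, C)  # node features
key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)             # (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5  # scaled affinities between nodes
tril = torch.tril(torch.ones(T, T))              # edges point only from the past
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)                     # normalized edge weights
out = wei @ v                                    # aggregate messages: (B, T, head_size)
```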
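And a sketch of how those components stack into one Transformer block in the pre-norm style: communication (multi-headed self-attention) followed by computation (an MLP), each wrapped in a residual connection with layernorm applied before the sub-layer. `nn.MultiheadAttention` is used here as a stand-in for hand-rolled heads.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(        # position-wise feed-forward network
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        # boolean mask: True marks future positions a token may not attend to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                        # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around MLP
        return x

x = torch.randn(4, 8, 32)
print(Block(n_embd=32, n_head=4)(x).shape)   # torch.Size([4, 8, 32])
```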
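Finally, a hedged sketch of what "sampling fake Shakespeare" means once the model is trained: feed the context back in, take the logits at the last position, sample the next token, append it, and repeat. `model` and `block_size` are assumed stand-ins for a trained character-level language model and its context window.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits = model(idx_cond)                 # (B, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        idx_next = torch.multinomial(probs, 1)   # sample one next token
        idx = torch.cat([idx, idx_next], dim=1)  # append and continue
    return idx
```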
