Tweet by Andrej Karpathy @karpathy

Exploring the Transformer Model with GPT-3


🧵 The TL;DR

This tweet thread explores the Transformer, the deep learning architecture at the heart of modern NLP. It introduces the core attention mechanism, then builds up the full Transformer from its components. A 10M-parameter model is then trained, compared against OpenAI's GPT-3 and ChatGPT, and sampled from to generate fake Shakespeare.


🔑 Key Points

  • Core 'attention' mechanism at the heart of the Transformer, viewed as communication/message passing between nodes in a directed graph (see the first sketch after this list)
  • Multi-headed self-attention, an MLP, residual connections, and layernorms combine into the full Transformer block (sketched below)
  • 10M parameter model trained for about 15 minutes on 1 GPU, then sampled from to generate fake Shakespeare (see the sampling sketch below)
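
A minimal sketch of the "attention as message passing" idea: each token is a node that emits a query and a key, and edges in the directed graph only run from earlier tokens to later ones (the causal mask). Tensor names and sizes here are illustrative assumptions, not the thread's exact code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 32        # batch, sequence length (number of nodes), channels
head_size = 16

x = torch.randn(B, T, C)  # node features
key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)             # (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5  # scaled affinities between nodes
tril = torch.tril(torch.ones(T, T))              # edges point only from the past
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)                     # normalized edge weights
out = wei @ v                                    # aggregate messages: (B, T, head_size)
```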
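And a sketch of how those components stack into one Transformer block in the pre-norm style: communication (multi-headed self-attention) followed by computation (an MLP), each wrapped in a residual connection with layernorm applied before the sub-layer. `nn.MultiheadAttention` is used here as a stand-in for hand-rolled heads.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(        # position-wise feed-forward network
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        # boolean mask: True marks future positions a token may not attend to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                        # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around MLP
        return x

x = torch.randn(4, 8, 32)
print(Block(n_embd=32, n_head=4)(x).shape)   # torch.Size([4, 8, 32])
```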
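Finally, a hedged sketch of what "sampling fake Shakespeare" means once the model is trained: feed the context back in, take the logits at the last position, sample the next token, append it, and repeat. `model` and `block_size` are assumed stand-ins for a trained character-level language model and its context window.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits = model(idx_cond)                 # (B, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        idx_next = torch.multinomial(probs, 1)   # sample one next token
        idx = torch.cat([idx, idx_next], dim=1)  # append and continue
    return idx
```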
