GPT is a decoder-only transformer architecture built on multi-head self-attention
-
the default parameters give a model much smaller than nanoGPT
- tuned for fastest convergence on a very small data set
-
this model takes as input a sequence of existing text (context)
- and produces as output the predicted next character
-
more precisely, it produces a predicted next character
-
for each initial sub-sequence (prefix) of the input
- in effect giving an extra degree of parallelism during training
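-
a minimal sketch of this idea (assuming a character-level setup; the sample
text and variable names here are illustrative, not from the model itself):
each position t in a sequence is trained to predict the character at t+1,
so one sequence of length T yields T prediction targets in a single pass

```python
# One training sequence; the model input is everything but the last
# character, and the target at each position is the following character.
text = "hello world"
context = text[:-1]   # model input:  "hello worl"
targets = text[1:]    # next-char target at each position: "ello world"

# Every prefix of the context is, in effect, its own training example,
# all evaluated in parallel in one forward pass.
pairs = [(context[:t + 1], targets[t]) for t in range(len(context))]

for prefix, nxt in pairs[:3]:
    print(repr(prefix), "->", repr(nxt))
```

the causal attention mask is what makes this valid: position t can only
attend to positions <= t, so each prefix's prediction is computed as if the
rest of the sequence were not there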