Repository: will-thompson-k/tldr-transformers
Branch: main
Commit: 583048b9b345
Files: 23
Total size: 97.7 KB
Directory structure:
gitextract_1c2fka5_/
├── .gitignore
├── LICENSE
├── README.md
└── notes/
├── TEMPLATE.md
├── adapter_bert.md
├── albert.md
├── bart.md
├── bert.md
├── bigtable.md
├── byt5.md
├── clip.md
├── codex.md
├── contrastive.md
├── dalle.md
├── dedup.md
├── distilbert.md
├── gradient-attack.md
├── human-pref.md
├── megatron.md
├── reformer.md
├── roberta.md
├── scaling-laws.md
└── t5.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# MAC STUFF
.DS_Store
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2021 Will Thompson
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# tldr-transformers
The "tl;dr" on a few notable papers on Transformers and modern NLP.
This is a ~~living~~ repo to keep tabs on different research threads.
**Last Updated**: September 20th, 2021.
<ins>Models</ins>: GPT- *, * BERT *, Adapter- *, * T5, Megatron, DALL-E, Codex, etc.
<ins>Topics</ins>: Transformer architectures + training; adversarial attacks; scaling laws; alignment; memorization; few labels; causality.
<p float="left">
<p align="middle">
<img src="assets/bert_fig1.png" width="50%" />
<img src="assets/t5_fig1_clipped.png" width="50%" />
<img src="assets/scaling-laws_fig1.png" width="50%" />
</p>
<div align="center">
<b>BERT</b>, <b>T5</b>, <b>Scaling Laws Paper</b> (art from the original papers)
</div>
<p>
 
 
 
</p>
Each set of notes includes links to the paper, the original code implementation (if available) and the Huggingface :hugs: implementation.
<ins>Here are some examples</ins> ---> [t5](notes/t5.md), [byt5](notes/byt5.md), [deduping transformer training sets](notes/dedup.md).
This repo also includes a [table](notes/bigtable.md) quantifying the differences across transformer papers <ins>all in one table</ins>.
The transformers papers are presented somewhat chronologically below. Go to the ":point_right: Notes :point_left:" column below to find the notes for each paper.
## Contents
- [Quick Note](#Quick_Note)
- [Motivation](#Motivation)
- [Papers::Transformer Papers](#Models)
- [Papers::1 Table To Rule Them All](#BigTable)
- [Papers::Adversarial Attack Papers](#Attac)
- [Papers::Fine-tuning Papers](#FineTune)
- [Papers::Alignment Papers](#Alignment)
- [Papers::Causality Papers](#Causal)
- [Papers::Scaling Law Papers](#Scaling)
- [Papers::LM Memorization Papers](#Memorization)
- [Papers::Limited Label Learning Papers](#FewLabels)
- [How To Contribute](#Contribute)
- [How To Point Out Errors](#Errata)
- [Citation](#Citation)
- [License](#License)
## Quick_Note
This is *not* an intro to deep learning in NLP. If you are looking for that, I recommend one of the following: [Fast AI's course](https://www.fast.ai/2019/07/08/fastai-nlp/), [one of the Coursera courses](https://www.coursera.org/specializations/natural-language-processing), or [maybe this old thing](https://github.com/will-thompson-k/deeplearning-nlp-models). Come here after that.
## Motivation
With the explosion in papers on all things Transformers the past few years, it seems useful to catalog the salient features/results/insights of each paper in a digestible format. Hence this repo.
## Models
| Model | Year | Institute | Paper | :point_right: Notes :point_left: | Original Code | Huggingface :hugs: | Other Repo |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
|Transformer | 2017 | Google|[Attention is All You Need](https://arxiv.org/abs/1706.03762) | Skipped, too many good write-ups: <ul><li> [Harvard NLP Group](http://nlp.seas.harvard.edu/2018/04/03/attention.html) </li><li> [Jay Alammar](http://jalammar.github.io/illustrated-transformer/) </li><li> [Lilian Weng](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) </li><li> [Something old](https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/notebooks/transformer/README.md) </li></ul> | | ? | |
|GPT-2 | 2019 | OpenAI|[Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) | To-Do | X | X | |
|GPT-J-6B | 2021 | EleutherAI | [GPT-J-6B: 6B Jax-Based Transformer (**public GPT-3**)](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) | X | [here](https://github.com/kingoflolz/mesh-transformer-jax) | x | x |
|BERT | 2018 | Google|[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | [BERT notes](notes/bert.md) | [here](https://github.com/google-research/bert) | [here](https://huggingface.co/transformers/model_doc/bert.html) | |
|DistilBERT | 2019 | Huggingface |[DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108)| [DistilBERT notes](notes/distilbert.md) | | [here](https://huggingface.co/transformers/model_doc/distilbert.html) | |
|ALBERT | 2019 | Google/Toyota |[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) | [ALBERT notes](notes/albert.md) | [here](https://github.com/google-research/albert) | [here](https://huggingface.co/transformers/model_doc/albert.html) | |
|RoBERTa | 2019 | Facebook|[RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) | [RoBERTa notes](notes/roberta.md) | [here](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md) | [here](https://huggingface.co/transformers/model_doc/roberta.html) | |
|BART | 2019 | Facebook |[BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) | [BART notes](notes/bart.md) | [here](https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md) | [here](https://huggingface.co/transformers/model_doc/bart.html) | |
|T5 | 2019 | Google|[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) | [T5 notes](notes/t5.md) | [here](https://github.com/google-research/text-to-text-transfer-transformer) | [here](https://huggingface.co/transformers/model_doc/t5.html) | |
|Adapter-BERT | 2019 | Google|[Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/abs/1902.00751) | [Adapter-BERT notes](notes/adapter_bert.md) | [here](https://github.com/google-research/adapter-bert) | - | [here](https://github.com/Adapter-Hub/adapter-transformers)|
|Megatron-LM | 2019 | NVIDIA |[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) | [Megatron notes](notes/megatron.md) | [here](https://github.com/NVIDIA/Megatron-LM) | - | [here](https://github.com/Adapter-Hub/adapter-transformers)|
|Reformer | 2020 | Google |[Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) | [Reformer notes](notes/reformer.md) | | [here](https://huggingface.co/transformers/model_doc/reformer.html) | |
|byT5 | 2021 | Google |[ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) | [ByT5 notes](notes/byt5.md) | [here](https://github.com/google-research/byt5) | [here](https://huggingface.co/transformers/model_doc/byt5.html) | |
|CLIP | 2021 | OpenAI |[Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) | [CLIP notes](notes/clip.md) | [here](https://github.com/openai/CLIP) | [here](https://huggingface.co/transformers/model_doc/clip.html) | |
|DALL-E | 2021 | OpenAI|[Zero-Shot Text-to-Image Generation](https://arxiv.org/abs/2102.12092) | [DALL-E notes](notes/dalle.md)| [here](https://github.com/openai/DALL-E) | - | |
|Codex | 2021 | OpenAI|[Evaluating Large Language Models Trained on Code](https://arxiv.org/pdf/2107.03374.pdf) | [Codex notes](notes/codex.md) | X | - | |
## BigTable
All of the table summaries found ^ collapsed into one really big table [here](notes/bigtable.md).
## Attac
| Paper | Year | Institute | :point_right: Notes :point_left: | Codes |
| :----: | :----: | :----: | :----: | :----: |
| [Gradient-based Adversarial Attacks against Text Transformers](https://arxiv.org/pdf/2104.13733.pdf)| 2021| Facebook | [Gradient-based attack notes](notes/gradient-attack.md)| None|
## FineTune
| Paper | Year | Institute | :point_right: Notes :point_left: | Codes |
| :----: | :----: | :----: | :----: | :----: |
| [Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning](https://openreview.net/pdf?id=cu7IUiOhujH)| 2021| Facebook | [SCL notes](notes/contrastive.md)| None|
## Alignment
| Paper | Year | Institute | :point_right: Notes :point_left: | Codes |
| :----: | :----: | :----: | :----: | :----: |
| [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/pdf/1909.08593.pdf)| 2019| OpenAI | [Human pref notes](notes/human-pref.md)| None|
## Scaling
| Paper | Year | Institute | :point_right: Notes :point_left: | Codes |
| :----: | :----: | :----: | :----: | :----: |
| [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361)| 2020 | OpenAI | [Scaling laws notes](notes/scaling-laws.md)| None|
## Memorization
| Paper | Year | Institute | :point_right: Notes :point_left: | Codes |
| :----: | :----: | :----: | :----: | :----: |
| [Extracting Training Data from Large Language Models](https://arxiv.org/abs/2012.07805)| 2021 | Google et al. | To-Do | None|
| [Deduplicating Training Data Makes Language Models Better](https://arxiv.org/abs/2107.06499)| 2021 | Google et al. | [Dedup notes](notes/dedup.md)| None|
## FewLabels
| Paper | Year | Institute | :point_right: Notes :point_left: | Codes |
| :----: | :----: | :----: | :----: | :----: |
| [An Empirical Survey of Data Augmentation for Limited Data Learning in NLP](https://arxiv.org/abs/2106.07499)| 2021 | GIT/UNC | To-Do | None|
| [Learning with fewer labeled examples](https://colinraffel.com/publications/probml2021learning.pdf)| 2021 | Kevin Murphy & Colin Raffel (Preprint: "Probabilistic Machine Learning", Chapter 19) | Worth a read, won't summarize here. | None|
## Contribute
If you are interested in contributing to this repo, feel free to do the following:
1. Fork the repo.
2. Create a Draft PR with the paper of interest (to prevent "in-flight" issues).
3. Use the suggested [template](notes/TEMPLATE.md) to write your "tl;dr". If it's an architecture paper, you may also want to add to the larger table [here](notes/bigtable.md).
4. Submit your PR.
## Errata
Undoubtedly there is information that is incorrect here. Please open an Issue and point it out.
## Citation
```bibtex
@misc{cliff-notes-transformers,
author = {Thompson, Will},
url = {https://github.com/will-thompson-k/cliff-notes-transformers},
year = {2021}
}
```
For the notes above, I've linked the original papers.
## License
MIT
================================================
FILE: notes/TEMPLATE.md
================================================
# <TITLE>
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
## Art
================================================
FILE: notes/adapter_bert.md
================================================
# Parameter-Efficient Transfer Learning for NLP
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| Adapter-BERT | Encoder-Only | Same as BERT (only fine-tuning is happening in this paper) | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | pre-trained BERT + Adapter Layers | Fine-tuning procedure: Adam with the LR increased during the first 10% of steps, then decayed to 0 during the last 90%. The LR is swept, as well as the adapter size. |Same as BERT | Batch size = 32 |
## TL;DR
There are 2 typical forms of **transfer learning** in NLP: feature-based (e.g., ELMo, pre-trained embeddings) and fine-tuning (e.g., most language models). It's been shown that fine-tuning often does better than feature-based transfer learning. Both require a new set of parameters for each new task and share the lower-level layers of the network.
In the authors' proposed **adapter architecture**, only the new parameters (adapter layers, top layer + layer norm params) are trained and most of the original parameters are untouched. The adapter layer is a **bottleneck** architecture: FFN -> non-linearity -> FFN, wrapped in a skip-connection.
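To make the bottleneck concrete, here is a minimal pure-Python sketch of a single adapter applied to one hidden vector. It is an illustration under stated assumptions, not the authors' code: the GELU non-linearity, the near-zero initialization, and all names (`adapter`, `init_linear`, the toy sizes) are hypothetical choices made for this example.

```python
import math
import random


def gelu(x):
    # tanh approximation of GELU (the choice of non-linearity is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    # x: d_in floats; weight: d_out rows of d_in floats; bias: d_out floats
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(d_in, d_out, scale, rng):
    # small random weights so the adapter starts close to the identity (assumption)
    weight = [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def adapter(h, down, up):
    """FFN (project down) -> non-linearity -> FFN (project up), plus a skip-connection."""
    z = [gelu(v) for v in linear(h, *down)]
    delta = linear(z, *up)
    return [a + b for a, b in zip(h, delta)]


rng = random.Random(0)
hidden_size, bottleneck = 8, 2  # toy sizes, not the paper's
down = init_linear(hidden_size, bottleneck, scale=1e-3, rng=rng)
up = init_linear(bottleneck, hidden_size, scale=1e-3, rng=rng)

h = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
out = adapter(h, down, up)

# Trainable parameters added by one adapter: two weight matrices plus two biases.
adapter_params = 2 * hidden_size * bottleneck + hidden_size + bottleneck
print(adapter_params)
```

Because the bottleneck width is much smaller than the hidden size, each adapter adds only a small number of parameters relative to the frozen Transformer layer it sits in.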
## Art
### Figure 1: Accuracy v. Fine-Tuning
This plot is a rather important depiction of the accuracy achieved on GLUE benchmarks. The major take-away is that adapters attain similar performance to fine-tuning with 2 orders of magnitude fewer parameters.

(from original paper)
### Figure 2: Adapter in Transformer
This is a depiction of how the Adapter layer is inserted into the usual Transformer layer. The Adapter is a bottleneck architecture + skip-connection. Only the green parameters are tuned during fine-tuning: adapter, layer norm + final classification layer.

(from original paper)
================================================
FILE: notes/albert.md
================================================
# ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| ALBERT | Encoder-Only | <ul><li> **Masked Language Model (MLM) loss function (refer to BERT)** </li><li> **Sentence Order Prediction (SOP) loss** </li></ul> | SentencePiece (as opposed to BERT's WordPiece) similar to XLNet | ~30k | Same as XLNet (greedy algorithm) | SentencePiece embeddings | Encoder-only self-attention, but with different prob(masking) | GeLU (same as BERT) | **Albert-base (sharing attention layers)**: 12 layers, hidden_size=768, embed_size=128 --> **64 MM parameters** | <ul><li>Fine-tuning is task specific (see table 14)</li><li> **LAMB** optimizer was used w/ LR=0.00176 @ 125k steps </li></ul> | Same as BERT | 4096 batch size |
## TL;DR
Similar to DistilBERT, the authors were concerned about the trade-off between increasing model size and the limits of GPU/TPU memory, training time, and inference time (memory limits in particular). They attack it through parameter reduction as well as a **self-supervised loss** that focuses on modeling inter-sentence coherence. With this they were able to achieve SOTA on all the usual benchmarks.
The parameter reduction takes 2 forms:
1. **factorized embedding parameterization**: The authors note that increasing hidden size in LMs has been shown to improve performance (along with attention heads and layers); however, they hypothesize people stopped there due to computational costs. In BERT, XLNet and RoBERTa, ```embedding_size == hidden_size```. The authors argue that since embeddings and hidden layers have different jobs (context-independent v. context-dependent learning), one should have ``` embedding_size << hidden_size ```. By decoupling these 2 dimensions, the embedding parameters go from ```VxH -> VxE + ExH``` where ```E``` is a much smaller embedding space (recall ```V=~30k```). This allows the model's hidden_size to be scaled up without the parameter count growing as quickly.
2. **cross-layer parameter sharing**: Instead of each layer having its own unique set of parameters (so that the parameter count grows in proportion to depth), the authors share parameters across layers. Prior work explored this idea with standard encoder-decoder tasks rather than pre-training. Instead of only sharing the FFN parameters or only sharing the attention parameters, they share **all** parameters across layers.
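The arithmetic behind both reductions can be sketched in a few lines. The values `V=30_000`, `H=768` and `E=128` come from the summary table above; the function names and the toy per-layer count are hypothetical, and biases and layer norms are ignored for simplicity.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Embedding parameters: V x H when tied to hidden size, V x E + E x H when factorized."""
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size


def encoder_params(per_layer, num_layers, shared):
    """With cross-layer sharing, depth no longer multiplies the parameter count."""
    return per_layer if shared else per_layer * num_layers


V, H, E = 30_000, 768, 128
tied = embedding_params(V, H)            # V x H
factorized = embedding_params(V, H, E)   # V x E + E x H
print(tied, factorized)

# Toy per-layer count (7) just to show the effect of sharing across 12 layers.
print(encoder_params(7, 12, shared=False), encoder_params(7, 12, shared=True))
```

Under these numbers the factorized embedding is several times smaller than the tied one, which is what leaves room to grow `H` without the embedding matrix growing with it.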
**Loss function**: First, the authors cite literature showing that the Next Sentence Prediction (NSP) loss in BERT does not ultimately add any empirical value in downstream tasks. They argue that this is because the task is too easy (in their words: "topic prediction" versus "coherence prediction") and therefore they eliminate it from their loss function. Instead, on top of MLM they use a different loss that targets *coherence* -> **Sentence Order Prediction (SOP)**. In the case of NSP, negative examples were drawn from other documents with equal probability (and target == 0), which is why the authors claim that this loss captures topic prediction. For SOP loss, negative examples are the **same 2 sentences (A and B)** but with their order swapped, emphasizing how sentences flow together. The authors show that using this loss function improves downstream performance of multi-sentence encoding tasks.
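The difference between the two objectives is easiest to see in how training pairs are built. The sketch below is a simplified illustration, not the authors' data pipeline: documents are lists of sentence strings, and the helper names `nsp_example` and `sop_example` are hypothetical.

```python
import random


def nsp_example(doc, other_doc, rng):
    """NSP: positive = two consecutive segments; negative = second segment from another document."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], 1
    return doc[i], rng.choice(other_doc), 0


def sop_example(doc, rng):
    """SOP: positive = two consecutive segments in order; negative = the same two segments swapped."""
    i = rng.randrange(len(doc) - 1)
    a, b = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return a, b, 1
    return b, a, 0


rng = random.Random(0)
doc = ["He went to the store.", "He bought milk.", "Then he walked home."]
other_doc = ["Transformers use attention.", "Attention is quadratic in length."]

print(nsp_example(doc, other_doc, rng))
print(sop_example(doc, rng))
```

An NSP negative can often be spotted from topic alone, since the second segment comes from a different document. An SOP negative shares its topic with the positive, so the model has to rely on ordering cues.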
**MLM masking**: **N-gram masking** is used, where the length of each n-gram mask is selected randomly and the maximum n-gram length was 3 words.
Overall, the results from sharing parameters seem mixed, but the SOP task does seem to improve downstream tasks. Sharing all parameters, particularly FFN parameters, seems to hurt performance the most, while the cost of sharing attention parameters seems negligible.
**Final model chosen**: ALBERT-xxlarge (still fewer parameters than BERT-large), MLM + SOP, no dropout. The authors note it's more expensive given the larger structure, so they suggest looking into sparse and block attention.
## Art
### Table 1: ALBERT v. BERT Model Size
This table shows how parameter-sharing and factorizing embedding and hidden layers impact the number of parameters (Note: these are showing the model size if one shares both FFN and Attention parameters across layers).

(from original paper)
### Table 4: ALBERT Shared Parameter Configurations

(from original paper)
### Table 5: NSP v. SOP Loss Functions
One can see that SOP, as predicted by the authors, proves to have better downstream behavior than NSP.

(from original paper)
### Figure 2: Ablation Study on Dropout
One can see that the model with dropout does consistently worse in training (although, is the sudden bifurcation explainable?)

(from original paper)
### Table 14: Hyper-parameters for Fine-tuning

(from original paper)
================================================
FILE: notes/bart.md
================================================
# BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| BART | Encoder-Decoder (Transformer) | **Re-construction loss**: Usual decoder cross-entropy between the decoder's output and the original document (encoder sees corrupted document); although, they look at several variants: <ul><li> **GPT::Language Model** </li><li> **XLNet::Permuted Language Model** </li><li> **BERT: MLM** </li><li> **Multitask MLM** </li><li> **Masked Seq-to-seq** </li></ul> They use **two-stream attention** to compute likelihoods. | Same BPE encoding as GPT-2 | Same as GPT? Or RoBERTa?| Same as GPT? Or RoBERTa?| Same as GPT? Or RoBERTa? | Same as the original Transformer | GeLU | <ul><li> BART contains roughly 10% more parameters than equivalent sized BERT model: 6 encoder layers, 6 decoder layers, embed_size==hidden_size=768.</li><li> For large-scale experiments: 12 encoder, 12 decoder, hidden_size=1024.</li></ul> | <ul><li> 500k steps (large-scale experiments). Use 30% token masking, permute sentences. </li><li>There is a different training process for NMT (2-step process)</li></ul>| 160 GB of data similar to *Liu et al 2019* | (for large scale experiments) batch_size=8K|
## TL;DR
Basically, the authors set out to combine the bi-directional encoder (BERT) and the auto-regressive decoder (GPT) in one model. Their hypothesis is that since BERT is trained to predict random tokens using bi-directional information, it cannot be used easily for text generation; conversely, GPT is designed for text generation, but lacks context for understanding other tasks. To set up the problem, a noising function is used to corrupt the original text. The authors explore a few different noising functions. BART performs as well as RoBERTa on GLUE + SOTA on some other tasks. Further, a new fine-tuning technique was developed where additional layers are stacked.
**Document corruption**: For the encoder, the following are done to corrupt the documents:
1. **Token masking**: Random tokens are sampled and replaced with MASK token.
2. **Token deletion**: Random tokens are deleted.
3. **Token infilling**: Drawing from Poisson distribution, different lengths of text are sampled and replaced with a single MASK token.
4. **Sentence permutation**: A document is split based on sentences, then the sentences are shuffled in random order.
5. **Document rotation**: A token is chosen uniformly, and the whole document is rotated such that the document starts at that token.
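The five corruptions above can be sketched on token lists with the standard library alone. This is an illustrative approximation rather than the fairseq implementation: the mask string, the per-token probabilities, the Poisson rate `lam=3`, and treating a zero-length span as a pure mask insertion are all assumptions made for this example.

```python
import math
import random

MASK = "<mask>"  # placeholder mask symbol (assumption)


def token_masking(tokens, p, rng):
    # Replace each token with MASK with probability p.
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, p, rng):
    # Drop each token with probability p; the model must infer where tokens are missing.
    return [t for t in tokens if rng.random() >= p]


def sample_poisson(lam, rng):
    # Knuth's algorithm for a Poisson-distributed integer.
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def text_infilling(tokens, p, lam, rng):
    # Start a span with probability p; its length is Poisson(lam); the whole span becomes one MASK.
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < p:
            span = sample_poisson(lam, rng)
            out.append(MASK)
            if span == 0:
                out.append(tokens[i])  # zero-length span: insert a mask, remove nothing
            i += max(span, 1)
        else:
            out.append(tokens[i])
            i += 1
    return out


def sentence_permutation(sentences, rng):
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, rng):
    # Pick a start token uniformly and rotate the document to begin there.
    k = rng.randrange(len(tokens))
    return tokens[k:] + tokens[:k]


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
print(token_masking(tokens, 0.3, rng))
print(token_deletion(tokens, 0.3, rng))
print(text_infilling(tokens, 0.3, 3.0, rng))
print(sentence_permutation(["A .", "B .", "C ."], rng))
print(document_rotation(tokens, rng))
```

Note that text infilling differs from token masking in that the output length no longer reveals how many tokens were hidden behind each mask.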
## Art
### Figure 1: BART is BERT + GPT
Picture says it all. BART is essentially a composition of BERT and GPT.

(from original paper)
### Figure 2: Encoder Noise Injections
This depiction shows the different ways the corpus is distorted for the encoding task.

(from original paper)
### Figure 3: BART Fine-tuning
BART's fine-tuning differs between classification and neural machine translation. In the latter, an additional encoder is used.

(from original paper)
================================================
FILE: notes/bert.md
================================================
# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size|
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| <span style="color:blue"> BERT </span> | <span style="color:blue"> Encoder-Only </span> | <ul><li> <span style="color:blue"> **Masked Language Modeling (MLM)** </span> : ~15% of tokens chosen -> 80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged. A shallow decoder is used to reproduce the original text. </li><li> <span style="color:blue"> **Next Sentence Prediction (NSP)** </span> : Binary classification task, predicts if 2 sequences follow each other in corpus (useful on Q&A, etc.). Labels are sampled 50/50 between 0 and 1. </li><li> Training loss is mean of MLM + NSP likelihood </li></ul> | <ul><li> <span style="color:blue"> **Wordpiece Tokenization** </span> : [original paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf), [huggingface explanation](https://huggingface.co/transformers/tokenizer_summary.html#wordpiece) </li><li> **Token break down** :<span style="color:green"> **[CLS] token** </span> (useful for many-to-one fine-tuned tasks such as classification) + <span style="color:green"> **WordPiece tokens** </span> + <span style="color:green"> **[SEP]** </span> token for each sentence. </li><li> <span style="color:blue"> **MAX 512 Tokens**. </span> </li></ul> | <span style="color:blue"> 30k tokens </span> | Greedy decomposition of token into sub-words until it finds tokens in vocabulary. | Sum of: **Token embeddings** (WordPiece) + **segment embedding (learned)** + **Absolute position embedding** | **Scaled Dot-product Self-Attention** (note: advised to pad inputs on right rather than left since positional embeddings are absolute.) | <span style="color:blue"> **GeLU** </span> : avoids the **dying ReLU problem**, where a node can be stuck @ 0 with negative inputs, stops learning, and cannot recover. | <ul><li> **BERT base**: 12 layers (transformer blocks), 12 attention heads, hidden size = 768 -> ~110 MM params </li><li> **BERT large**: 24 layers, 16 attention heads, hidden size = 1024 -> ~340 MM params </li><li> Generally the parameter space choices are ```embed_size (E) == hidden_size (H)```, the feed-forward size is ```4H``` and the number of attention heads is ```H/64```. To see the math on how the total number of parameters is calculated, check out this comment on BERT github [here](https://github.com/google-research/bert/issues/656#issuecomment-554718760) </li></ul> | <ul><li> Adam and L_2 weight decay </li><li> Learning rate is warmed up during first 10k steps to peak value of 1e-4, then linearly decayed </li><li> Models are pretrained for S= 1MM updates</li><li> No layers frozen </li><li> Same learning rate throughout </li></ul> | Book Corpus and Wikipedia (~16 GB uncompressed) | 256 batch size, maximum length 512 |
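As a rough check on the parameter counts in the table, the sketch below adds up the weights implied by `E == H` and a feed-forward size of `4H`. It is a back-of-the-envelope count, not an official formula: the function name is hypothetical, the vocabulary size of 30,522 is the released uncased WordPiece vocabulary (the table rounds it to 30k), and the MLM/NSP output heads are left out.

```python
def bert_param_count(num_layers, hidden_size, vocab_size=30_522, max_positions=512, type_vocab=2):
    """Approximate encoder parameter count, assuming E == H and feed-forward size 4H."""
    H = hidden_size
    layer_norm = 2 * H  # scale + bias
    # token + position + segment embeddings, followed by a layer norm
    embeddings = (vocab_size + max_positions + type_vocab) * H + layer_norm
    # query, key, value and output projections (weights + biases), then a layer norm
    attention = 4 * (H * H + H) + layer_norm
    # H -> 4H -> H feed-forward (weights + biases), then a layer norm
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H) + layer_norm
    pooler = H * H + H  # dense layer over the [CLS] representation
    return embeddings + num_layers * (attention + ffn) + pooler


base = bert_param_count(num_layers=12, hidden_size=768)
large = bert_param_count(num_layers=24, hidden_size=1024)
print(f"{base:,} {large:,}")
```

The number of attention heads does not appear in the count: heads split the `H`-dimensional projections rather than adding to them, so `H/64` heads changes how the weights are used, not how many there are.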
## TL;DR
Recall that GPT takes the original Transformer encoder-decoder model for neural machine translation and crafts a left-to-right decoder-only architecture. This is tantamount to only employing the "self-attention" mechanism suggested in the original paper (there were 2 types: encoder-decoder and self-attention for both the encoder and decoder...GPT uses the decoder's self-attention mechanism). The main drawback from the authors' perspective is the lack of information from GPT's uni-directional approach. Instead, they suggest an "encoder-only" approach, where information is used from all directions (i.e. "bi-directional"). This means departing from the upper-right triangle self-attention mask used in GPT and instead using a mask that is based on randomly masking tokens in the corpus that will be predicted using the surrounding context - hence the name "masked language model" (detailed in table). There are some other architecture departures such as activation functions and input/embeddings.
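The masking recipe from the summary table (choose ~15% of tokens, then 80% `[MASK]` / 10% random token / 10% unchanged) can be sketched as follows. This is a simplified per-token version for illustration: the function name and toy vocabulary are hypothetical, and the independent per-token selection is an assumption rather than a description of the released pre-training script.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mlm_mask(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted inputs, labels); labels are None where no prediction is required."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok                     # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK                # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
inputs, labels = mlm_mask(tokens, vocab, rng)
print(inputs)
print(labels)
```

Because every position can attend to every other position, the prediction at a selected position uses context from both the left and the right, which is the "bi-directional" property the authors are after.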
## Art
### Fig 1: Pre-training v. Fine-Tuning
This shows the pre-training task and the various branches of fine-tuning depending on the problem.

(from original paper)
### Fig 2: Input Embedding Construction
This shows how the input representation is formed using 3 different embeddings.

(from original paper)
### Fig 4: Different Fine-Tuning Constructions
This shows the different ways the model outputs are used as input into another "final" layer when fine-tuned for task-specific problems.

(from original paper)
### Fig 5: Left-to-Right versus MLM
This is buried in the appendix, but is a pretty important ablation study showing the difference in accuracy between MLM and Decoder style self-attention.

(from original paper)
### Table 6: Model Size v. Accuracy/Perplexity on Hold-Out Data
Here we can see the trade-off between number of layers, hidden size, attention heads and decreases in perplexity/increases in accuracy.

(from original paper)
================================================
FILE: notes/bigtable.md
================================================
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size|
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| <span style="color:blue"> BERT </span> | <span style="color:blue"> Encoder-Only </span> | <ul><li> <span style="color:blue"> **Masked Language Modeling (MLM)** </span> : ~15% of tokens chosen -> 80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged. A shallow decoder is used to reproduce the original text. </li><li> <span style="color:blue"> **Next Sentence Prediction (NSP)** </span> : Binary classification task, predicts if 2 sequences follow each other in corpus (useful on Q&A, etc.). Labels are sampled 50/50 between 0 and 1. </li><li> Training loss is mean of MLM + NSP likelihood </li></ul> | <ul><li> <span style="color:blue"> **Wordpiece Tokenization** </span> : [original paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf), [huggingface explanation](https://huggingface.co/transformers/tokenizer_summary.html#wordpiece) </li><li> **Token break down** :<span style="color:green"> **[CLS] token** </span> (useful for many-to-one fine-tuned tasks such as classification) + <span style="color:green"> **WordPiece tokens** </span> + <span style="color:green"> **[SEP]** </span> token for each sentence. </li><li> <span style="color:blue"> **MAX 512 Tokens**. </span> </li></ul> | <span style="color:blue"> 30k tokens </span> | Greedy decomposition of token into sub-words until it finds tokens in vocabulary. | Sum of: **Token embeddings** (WordPiece) + **segment embedding (learned)** + **Absolute position embedding** | **Scaled Dot-product Self-Attention** (note: advised to pad inputs on right rather than left since positional embeddings are absolute.) | <span style="color:blue"> **GeLU** </span> : avoids the **dying ReLU problem**, where a node can be stuck @ 0 with negative inputs, stops learning, and cannot recover. | <ul><li> **BERT base**: 12 layers (transformer blocks), 12 attention heads, hidden size = 768 -> ~110 MM params </li><li> **BERT large**: 24 layers, 16 attention heads, hidden size = 1024 -> ~340 MM params </li><li> Generally the parameter space choices are ```embed_size (E) == hidden_size (H)```, the feed-forward size is ```4H``` and the number of attention heads is ```H/64```. To see the math on how the total number of parameters is calculated, check out this comment on BERT github [here](https://github.com/google-research/bert/issues/656#issuecomment-554718760) </li></ul> | <ul><li> Adam and L_2 weight decay </li><li> Learning rate is warmed up during first 10k steps to peak value of 1e-4, then linearly decayed </li><li> Models are pretrained for S= 1MM updates</li><li> No layers frozen </li><li> Same learning rate throughout </li></ul> | Book Corpus and Wikipedia (~16 GB uncompressed) | 256 batch size, maximum length 512 |
| DistilBERT | Encoder-Only (Distilled version of BERT) | **Triplet loss**: (1) MLM + (2) Distillation + (3) Cosine-Distance (No NSP)| Same as BERT | Same as BERT | Same as BERT | Embeddings are similar to BERT, except the segment embeddings are removed | Same as BERT | Same as BERT | 66M parameters | Same as BERT (I think) | Same as BERT | 4096 batch size |
| ALBERT | Encoder-Only | <ul><li> **Masked Language Model (MLM) loss function (refer to BERT)** </li><li> **Sentence Order Prediction (SOP) loss** </li></ul> | SentencePiece (as opposed to BERT's WordPiece) similar to XLNet | ~30k | Same as XLNet (greedy algorithm) | SentencePiece embeddings | Encoder-only self-attention, but with different prob(masking) | GeLU (same as BERT) | **Albert-base (sharing attention layers)**: 12 layers, hidden_size=768, embed_size=128 --> **64 MM parameters** | <ul><li>Fine-tuning is task specific (see table 14)</li><li> **LAMB** optimizer was used w/ LR=0.00176 @ 125k steps </li></ul> | Same as BERT | 4096 batch size |
| RoBERTa | Encoder-Only | **Masked Language Model** objective with dynamic masking (see below) + **No NSP or SOP** (NSP removal was shown to be better)| **Byte-level BPE (like GPT)** | 50k | Same as GPT? | Same as BERT | Same as BERT | Same as BERT | Model parameters are kept fixed: <ul><li> L=12, H=768, A=12 -> 110MM parameters (+~15MM for increase in vocabulary with byte-level BPE) </li></ul> | They increase the pre-training steps from 100k (BERT) to up to 500k. They also tweak the Adam hyper-parameters. | They combine **5** datasets for **160 GB of text**: <ul><li> Book Corpus + Wikipedia </li><li> CC-news </li><li> OpenWebText </li><li> Stories </li></ul>| ~2k batch size, max sequence length ~ 512 (less sometimes due to sampling technique) |
| BART | Encoder-Decoder (Transformer) | **Re-construction loss**: Usual decoder cross-entropy between decoder and original document (encoder sees corrupted document); although, they look at several variants: <ul><li> **GPT::Language Model** </li><li> **XLNet::Permuted Language Model** </li><li> **BERT: MLM** </li><li> **Multitask MLM** </li><li> **Masked Seq-to-seq** </li></ul> They use **two-stream attention** to compute likelihoods. | Same BPE encoding as GPT-2 | Same as GPT? Or RoBERTa?| Same as GPT? Or RoBERTa?| Same as GPT? Or RoBERTa? | Same as the original Transformer | GeLU | <ul><li> BART contains roughly 10% more parameters than equivalent sized BERT model: 6 encoder layers, 6 decoder layers, embed_size==hidden_size=768.</li><li> For large-scale experiments: 12 encoder, 12 decoder, hidden_size=1024.</li></ul> | <ul><li> 5MM steps. Use 30% token masking, permute sentences. </li><li>There is a different training process for NMT (2-step process)</li></ul>| 160 GB of data similar to *Liu et al 2019* | (for large scale experiments) batch_size=8K|
| T5 | Encoder-Decoder | **BERT-style denoising objective**: Similar to MLM, model is trained to predict missing or corrupted tokens in input. 15% of tokens are randomly sampled and dropped out. (Note: They experimented with many variants) | **SentencePiece** | 32k (across many languages w/ 10:1 English-to-non-English) | Same as BERT | Just token embeddings | Self-attention + encoder-decoder attention (per layer) | ReLU | This study looks at many variants, but the base is similar to BERT_base: <ul><li> 12 blocks (encoder + decoder) </li><li> hidden_size == embed_size = 768 </li><li> FFN_dim=3072 (4*hidden) </li></ul> Ultimately, about **twice the size of BERT --> 220MM params**. | <ul><li> **Pre-training:** 2^19 steps for pre-training. </li><li> Use **AdaFactor** optimization with **inverse square root** LR scheduler </li><li> **Greedy decoding** at test time </li><li> **Fine-tuning**: 2^18 steps always with same batch_size dimensions, LR=0.001, **5k checkpoints and report results for highest validation performance**. </li></ul> | Common Crawl's C4 data (20 TB) | T=512, batch= 128 with packing such that each batch is approximately **65k tokens** (much smaller than other studies) |
| Adapter-BERT | Encoder-Only | Same as BERT (only fine-tuning is happening in this paper) | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | pre-trained BERT + Adapter Layers | Fine-tuning procedure: Adam with LR increased during first 10%, followed by decaying to 0 during last 90%. LR is swept as well as adapter size. |Same as BERT | Batch size = 32 |
| ByT5 | Encoder-Decoder | Similar to **Span Corruption** (T5's pre-training objective) | Tokenless! (uses UTF-8 bytes) | 256 byte values + 3 special tokens | There is still an OOV token, but is not used | Only 256 token embeddings, no positional embeddings | Self-Attention + Encoder-Decoder Attention | ReLU? | Model sizes were made to match mT5 (small, base, large, XL, XXL); to compensate, increased depth of encoder ("heavy encoder") and dim_model, dim_ffn | All hyper-parameters are the same as mT5, except now: <ul><li> Sequence length: 1024 tokens/bytes </li><li> 1MM steps </li><li> batch_size = 2^20 tokens </li></ul> | Same mC4 as mT5 model |T=1024 tokens, batch_size = 2^20 tokens |
| CLIP | Encoder-Only (2 Transformer-based Encoders: Text + Image) | Symmetric cross-entropy loss (maximizes cosine similarity of matched (image, text) pairs and minimizes it for mismatched pairs, a la contrastive learning) | For Text Encoder: BPE | For Text Encoder: ~50k | Same as GPT | Multi-modal embeddings combining text,image features | Used in both text and image encoders differently | linear projections to embedding | In base Text Encoder, 63M| Adam optimizer with weight decay, Cosine scheduler, learnable temperature | WebImageText dataset, 400MM (text,image) pairs | 32k |
| DALL-E | Decoder-Only (read about attention) | no pre-training/fine-tuning per se | BPE for text + pixel tokens | text=16,384; image=8,192 | greedy? | token | 3 types of attention: text-to-text (causal), text-to-image, image-to-image attention | GeLU? | Up to 12 BN! | training broken out into 2 steps: 1. dVAE (gumbel-softmax relaxation) 2. transformer (cross-entropy) with 16-bit precision | Conceptual Captions + proprietary dataset | per-gpu=8, total_batch=512 |
| Codex | Decoder-Only (GPT) | The usual "causal" GPT decoding problem is presented | BPE | ~50k + white space tokens | GPT-3 construction | GPT-3 construction| GPT-3 construction | GPT-3 construction | Large model was 12 B parameters| Training was similar to GPT: 175 step linear warm up, cosine learning rate decay. Training lasted 100B tokens using the Adam optimizer with weight decay. | 54 million open repositories on GitHub were scraped. After a number of filters, the final dataset was 159 GB.| ?|
================================================
FILE: notes/byt5.md
================================================
# ByT5: Towards a token-free future with pre-trained byte-to-byte models
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| ByT5 | Encoder-Decoder | Similar to **Span Corruption** (T5's pre-training objective) | Tokenless! (uses UTF-8 bytes) | 256 byte values + 3 special tokens | There is still an OOV token, but is not used | Only 256 token embeddings, no positional embeddings | Self-Attention + Encoder-Decoder Attention | ReLU? | Model sizes were made to match mT5 (small, base, large, XL, XXL); to compensate, increased depth of encoder ("heavy encoder") and dim_model, dim_ffn | All hyper-parameters are the same as mT5, except now: <ul><li> Sequence length: 1024 tokens/bytes </li><li> 1MM steps </li><li> batch_size = 2^20 tokens </li></ul> | Same mC4 as mT5 model |T=1024 tokens, batch_size = 2^20 tokens |
## TL;DR
Tokenizers are often an independent artifact of a model - they convert text into a sequence of tokens for downstream tasks. The value of a "token-free" model is the ability to feed the raw text (of any language!) into a model. The authors demonstrate a way of using byte sequences as a direct input to the standard Transformer architecture. Further, they show that byte-level models are significantly more robust to noise (e.g. spelling errors).
**Tokenizers: A review** Tokenizers convert a text sequence into a series of token IDs drawn from a finite _vocabulary_ (a map of token <-> id). The problem with building this vocabulary is how to handle OOV ( _out of vocabulary_) tokens. A standard approach is to map all unknown words to ```<UNK>```. Another answer, popular in **subword tokenizers**, is to greedily decompose an OOV token into a set of sub-words the tokenizer recognizes. This property has made subword tokenization the gold standard. However, subwords are still not immune to all variations found in language: (mis)spellings, capitalization, syntactical forms of a word, etc. Also, new characters from newly encountered languages still fall into the OOV token. To avoid these pitfalls, the authors suggest exploring **token-free** models, ones that do not require a vocabulary at all and make DL models truly end-to-end. Further, byte-level models can in theory read in arbitrarily long sequences of text and only require 256 embeddings (one per 8-bit byte value). By relying on dense representations of the bytes themselves, models should be able to generalize more effectively and be more immune to the actual spelling variations of words. **However**, the biggest problem is that sequences become much longer.
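The greedy sub-word decomposition mentioned above can be sketched as a longest-match-first loop. This is a minimal illustration, not the tokenizer of any particular library: the toy vocabulary, the function name and the WordPiece-style ```##``` continuation marker are all assumptions made for the example.

```python
def greedy_subword_tokenize(word, vocab, unk="<UNK>"):
    """Greedy longest-match-first split of one word into known sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # assumed continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing matched: the whole word falls back to <UNK>
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "token", "##s"}
print(greedy_subword_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(greedy_subword_tokenize("tokens", toy_vocab))     # ['token', '##s']
print(greedy_subword_tokenize("xyz", toy_vocab))        # ['<UNK>']
```

The last call shows the failure mode the paragraph describes: characters the vocabulary has never seen still collapse into the OOV token.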
**ByT5 Design**:
The basis of the ByT5 is the mT5 model (which inherits from T5). The goal is to create a text-to-text model for 100+ languages. Here are the changes:
1. **Byte-level input**: UTF-8 bytes are fed into the model directly without **any** pre-processing. Each byte is mapped directly to a vocabulary of 256 byte values + ```<PAD>```, ```<EOS>```, ```<UNK>``` (though ```<UNK>``` is never used, since every possible byte value has an embedding).
2. **Pre-training objective**: Pre-train objective is similar to span corruption from T5.
3. **Heavy encoder**: Unlike the previous T5-family models, the byT5 has a much deeper encoder ("heavy") than decoder (3:1 ratio), which makes it more similar to BERT style models.
Given the reduction in vocabulary-based parameters (which accounted for as much as 66% of mT5), the authors increased the number of layers and size of the hidden_size and ffn_size of these layers. They did this mainly to make a rough comparison of model size with mT5. This also means the model requires more FLOPs for a given sequence versus a standard vocabulary-based model.
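To make the byte-level input concrete, here is a minimal sketch of a "tokenizer" that does nothing but shift UTF-8 byte values past the special tokens. The id ordering (```<PAD>```=0, ```<EOS>```=1, ```<UNK>```=2) and the offset of 3 are assumptions for illustration; the function names are hypothetical.

```python
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2   # assumed ordering of the 3 special tokens
OFFSET = 3                          # byte value b is mapped to id b + OFFSET


def encode(text):
    """Text -> ids: raw UTF-8 bytes shifted past the special tokens, plus EOS."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids):
    """ids -> text, skipping the special tokens."""
    raw = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return raw.decode("utf-8", errors="ignore")


ids = encode("héllo")
print(len("héllo"), len(ids))  # 5 7 (5 characters become 6 bytes + EOS)
print(decode(ids))             # héllo
```

Note how a single accented character already costs two positions: this is the sequence-length penalty that motivates the architecture changes above.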
**Noise Studies**: This is one of the really interesting parts of the study where the authors attempt to see how noise affects the results of the mT5 and byT5 on different NLP tasks. They introduce synthetic noise on text using 6 different noising schemes:
1. <ins> Drop </ins>: They drop 10% of characters.
2. <ins> Add/Drop/Mutate </ins>: At each character position, there is a 10% chance of doing one of these operations.
3. <ins> Repetitions </ins>: Each character has a 20% chance of having ~2 repetitions of itself appended.
4. <ins> Antspeak </ins>: Each character is capitalized and padded with spaces (e.g. "abc def" becomes " A B C D E F ").
5. <ins> Uppercase </ins>
6. <ins> Random case </ins>
They further distinguish between **learnable noise** (noise that is applied during fine-tuning and evaluation) and **unseen noise** (only in evaluation). Reviewing Table 6, you can see on average the byT5 does better.
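A few of the noising schemes listed above can be sketched in a handful of lines. This is an illustrative approximation rather than the authors' code: the function names are hypothetical, and the "1 to 3 copies" range for repetitions is an assumption chosen to average the ~2 repetitions mentioned above.

```python
import random


def drop(text, rng, p=0.10):
    """Drop: remove each character with probability p."""
    return "".join(ch for ch in text if rng.random() >= p)


def repetitions(text, rng, p=0.20):
    """Repetitions: with probability p, append 1-3 extra copies of a character."""
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < p:
            out.append(ch * rng.randint(1, 3))  # assumed range, averages 2
    return "".join(out)


def uppercase(text):
    """Uppercase: every character is upper-cased."""
    return text.upper()


def random_case(text, rng):
    """Random case: each character is independently upper- or lower-cased."""
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in text)


rng = random.Random(0)  # seeded so runs are repeatable
sentence = "byte level models are robust to noise"
for noisy in (drop(sentence, rng), repetitions(sentence, rng),
              uppercase(sentence), random_case(sentence, rng)):
    print(noisy)
```

For "learnable noise" the same function would be applied to both the fine-tuning and evaluation inputs; for "unseen noise" only to the evaluation inputs.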
## Art
### Figure 1: mT5 v. byT5 Design
This is a great illustration comparing the 2 encoder-decoder architectures. The byT5 reads in UTF-8 bytes on a 1024 byte long sequence. To compensate for the loss of vocabulary specific parameters, the authors made a much deeper encoder than decoder (as opposed to a symmetric number of parameters and layers).

(from original paper)
### Table 1: mT5 v. byT5 Parameters
To make an apples-to-apples comparison of model size, the authors made the byT5 model have a similar number of parameters per "size". This required increasing the dim_model and dim_ffn sizes as well as increasing the depth of the encoder.

(from original paper)
### Table 2: mT5 v. byT5 Performance

(from original paper)
### Table 6: Noise Study
This table shows the degradation of signal due to noise injected into the text. Hugely important table.

(from original paper)
================================================
FILE: notes/clip.md
================================================
# Learning Transferable Visual Models From Natural Language Supervision
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| CLIP | Encoder-Only (2 Transformer-based Encoders: Text + Image) | Symmetric cross-entropy loss (maximizes cosine similarity of matched (image, text) pairs and minimizes it for mismatched pairs, a la contrastive learning) | For Text Encoder: BPE | For Text Encoder: ~50k | Same as GPT | Multi-modal embeddings combining text,image features | Used in both text and image encoders differently | linear projections to embedding | In base Text Encoder, 63M| Adam optimizer with weight decay, Cosine scheduler, learnable temperature | WebImageText dataset, 400MM (text,image) pairs | 32k |
## TL;DR
**Note**: This model may not be considered a language model or transformer, but it uses attention in an interesting way, so worth a read. Also note that this paper is on the longer side, 48 pages.
**Most interesting quote**: (from blog post) "Deep learning systems are often reported to achieve human or even superhuman performance on vision benchmarks, yet when deployed in the wild, their performance can be far below the expectation set by the benchmark. In other words, there is a gap between “benchmark performance” and “real performance.” We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. This results in its benchmark performance being much more representative of its performance in the wild. To verify the <ins>“cheating hypothesis”</ins>, we also measure how CLIP’s performance changes when it is able to “study” for ImageNet. When a linear classifier is fitted on top of CLIP’s features, it improves CLIP’s accuracy on the ImageNet test set by almost 10%. However, this classifier does no better on average across an evaluation suite of 7 other datasets measuring “robust” performance" --> **Zero-shot performance is true robust performance, everything else is probably peeking**.
The authors motivate the problem by observing that breakthroughs in NLP were due in large part to pre-training on web-sourced text (as opposed to needing very large, crowd-sourced, high-quality "gold-label" datasets). They ask if this is possible as well for CV, which is predominantly still trained on hand-labeled datasets. Note that these models are generally *discriminative models*, attempting to learn class membership for a given image. For instance, the authors point out that *Noisy Student EfficientNet-L2* requires 33 TPUv3 core-years to train, which is a lot of resources for predicting only *1000* ImageNet classes.
The authors' contribution is the study of image classifiers trained with natural language supervision at large scale through **C.L.I.P.** - Contrastive Language-Image Pre-training. They find that this model architecture is able to learn to perform a wide range of tasks during pre-training, including OCR, geo-localization, etc. Further, the model is able to achieve SOTA results at 4x the efficiency of a Transformer model.
**CLIP Motivation**: The core term in the literature is NLS (**natural language supervision**). By pairing text understanding to image recognition, it allows the flexibility of zero-shot transfer.
The authors constructed a new dataset of 400 MM (image, text) pairs derived from 500k queries on the internet which they call **WebImageText**.
The authors first attempted an approach of jointly training an image CNN + text Transformer to predict the caption, but it scaled less efficiently than a simple BOW approach on the same text. The crux of the matter is that these *generative models* are trying to learn the <ins>exact language to describe an image </ins>, which is hard. In comparison, the **contrastive learning** literature requires an order of magnitude less compute for the same predictive power. Further, the objective they pre-train on is thought to generalize to unseen classes better than conventionally learning a specific set of classes.
**Contrastive Pre-Training**: Taking a page out of the contrastive learning space, CLIP's ultimate goal is to identify which of the ```NxN``` possible (image, text) pairings in a batch are real. This is done by training the embeddings derived from the outputs of the image and text encoders to <ins> maximize the cosine similarity </ins> of image and text embeddings for the ```N``` real pairs (i.e. <ins> the diagonal in Figure 1 </ins>) while minimizing the cosine similarity for the ```N^2-N``` negative pairs (the off-diagonal elements). The cross-entropy loss is symmetric: it is computed once over images (rows) and once over texts (columns) and averaged. Instead of using non-linear projections, they use a linear projection from ```encoder:->embedding```.
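The symmetric objective can be sketched in plain Python. This is a minimal illustration in the spirit of the pseudo-code in Figure 3, not the authors' implementation: the function names are hypothetical, the inputs are assumed to be the already-projected embeddings, and the temperature is fixed at an assumed value of 0.07 here even though CLIP learns it during training.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits, targets):
    """Mean softmax cross-entropy over the rows of a matrix."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(logits)


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs."""
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    logits_t = [list(col) for col in zip(*logits)]
    labels = list(range(n))  # the real pair for row i sits on the diagonal
    loss_images = cross_entropy_rows(logits, labels)    # image -> text direction
    loss_texts = cross_entropy_rows(logits_t, labels)   # text -> image direction
    return (loss_images + loss_texts) / 2


aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
print(aligned < swapped)  # True
```

When the matching pairs line up on the diagonal the loss is near zero; swapping the captions pushes all the similarity mass off the diagonal and the loss grows.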
**Encoders**:
1. **Image**: The authors explored 2 models: (1) ResNet-50 design with an attention-based pooling layer (2) Vision Transformer.
2. **Text**: The authors explored 2 models: (1) CBOW or (2) Transformer, with base 63M-parameter, 12-layer, 76-max sequence length, 8 attention heads, BPE encoding for vocab of ~ 50k. This model uses **masked self-attention** (i.e. a causal attention mask, not BERT-style MLM).
**Ablation & Zero-Shot studies**: There is a lot more to this paper, specifically in the ablation and zero-shot studies, which you should check out. Too much to summarize.
## Art
### Figure 1: Clip Architecture (Important!)
This is a great depiction of how CLIP is trained and used in a zero-shot setting. Notice that the text-encoder and image-encoder are both producing an embedding (through linear projection of the feature representations) which then is used to calculate a pairwise cosine similarity. The goal is to minimize the distance for the N real pairs and maximize the distance for the rest. This translates to trying to maximize the cosine similarity of the diagonal elements via cross-entropy and minimizing the cosine similarity for the off-diagonal elements.

(from original paper)
### Figure 2: Clip Efficiency
This is a depiction of how the zero-shot accuracy of CLIP scales with number of data points. Further, it shows the efficiency of data relative to information learned.

(from original paper)
### Figure 3: Clip Pseudo-code

(from original paper)
### Table 1: Zero-shot performance
This shows how the prior SOTA is beaten by CLIP.

(from original paper)
================================================
FILE: notes/codex.md
================================================
# Evaluating Large Language Models Trained on Code
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| Codex | Decoder-Only (GPT) | The usual "causal" GPT decoding problem is presented | BPE | ~50k + white space tokens | GPT-3 construction | GPT-3 construction| GPT-3 construction | GPT-3 construction | Large model was 12 B parameters| Training was similar to GPT: 175 step linear warm up, cosine learning rate decay. Training lasted 100B tokens using the Adam optimizer with weight decay. | 54 million open repositories on GitHub were scraped. After a number of filters, the final dataset was 159 GB.| ?|
## TL;DR
Codex is a GPT model fine-tuned on publicly available code from GitHub with the goal of writing code based on docstrings. In comparison to other GPT-* models, it achieved SOTA results on HumanEval. Further, the sampling strategy from their generative model dramatically increased the number of problems Codex solved.
**Results**:
<ins> Single sample results </ins>: 300M param model solved 13.2%; 12B param model solved 28.8%. After further fine-tuning on a dataset of correctly implemented standalone functions, <ins> Codex-S </ins> got up to 37.7%.
<ins> N>1 (i.e. 100) sample results </ins>: Using <ins> nucleus sampling </ins>, 100 samples are drawn with top p=0.95. How to choose which sample to submit is an important question. If the authors peek and pick a sample known to pass (**the "oracle strategy"**), Codex-S produces at least one correct function 77.5% of the time. If the sample with the highest **mean log-probability** is chosen, accuracy is 44.5%. Figure 7 shows the outcomes of the different strategies.
For comparison, the 6B parameter GPT-J solved 11.4%.
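The two sampling ideas above, nucleus (top-p) filtering and ranking candidates by mean log-probability, can be sketched as follows. This is a generic illustration, not the authors' code; the function names and the toy numbers are hypothetical.

```python
def nucleus_filter(probs, top_p=0.95):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        mass += p
        if mass >= top_p:
            break
    return {token_id: p / mass for token_id, p in kept}


def pick_by_mean_logprob(samples):
    """samples: list of (text, per-token log-probabilities); returns the best text."""
    return max(samples, key=lambda s: sum(s[1]) / len(s[1]))[0]


# Toy next-token distribution: the 0.05 tail is cut off at top_p=0.9.
print(sorted(nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)))  # [0, 1, 2]

# Toy candidates: the second has the higher average token log-probability.
candidates = [("return a - b", [-1.0, -1.0]), ("return a + b", [-0.1, -0.2, -0.3])]
print(pick_by_mean_logprob(candidates))  # return a + b
```

Using the mean rather than the sum of log-probabilities avoids systematically favoring shorter samples.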
**Codex Training Set**:
54 million open repositories on GitHub were scraped. After a number of filters, the final dataset was 159 GB.
<ins> HumanEval Test Dataset </ins>: The objective for Codex is to develop standalone Python functions from docstrings and evaluate the correctness via unit testing. The authors developed a HumanEval test dataset of 164 "leetcode", comprehension and arithmetic problems. Evaluation is determined by passing the unit tests.
Unit tests are the success criteria as the authors believe that BLEU and other match-based metrics fail to account for the complexity of code solutions. In fact, they provide evidence that there are solutions that are functionally incorrect yet achieve high BLEU scores.
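The main metric used is <ins> Pass@k </ins>. As originally proposed in the literature, ```k``` samples are generated per problem, a problem counts as solved if any sample passes the unit tests, and the total fraction of problems solved is reported; e.g., ```pass@100``` is the fraction of problems for which at least one of 100 samples passes. The authors note the high variance of this metric and instead derive an unbiased estimator: generate ```n >= k``` samples per problem, count the number ```c``` that pass, and compute the probability that a random subset of ```k``` samples contains at least one passing sample.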
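A minimal sketch of the unbiased estimator, assuming ```n``` samples are generated per problem and ```c``` of them pass the unit tests: pass@k is estimated as ```1 - C(n-c, k) / C(n, k)```, i.e. one minus the chance that a random subset of ```k``` samples contains no passing sample. The function name is hypothetical, and the paper's own version is written to be more numerically careful for large ```n```.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that pass the unit tests, k <= n.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimate reduces to the plain pass rate c / n.
print(pass_at_k(n=200, c=20, k=1))
# More attempts per problem make at least one success far more likely.
print(pass_at_k(n=200, c=20, k=100))
```

The benchmark score is then the average of this quantity over all problems.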
**Training**: Codex was initialized from the usual GPT weights and then fine-tuned on the GitHub dataset. Interestingly, however, they note that they did not see significant value from the pre-training, other than faster convergence.
Training was similar to GPT: 175 step linear warm up, cosine learning rate decay. Training lasted 100B tokens using the Adam optimizer with weight decay.
The tokenizer employed was the usual GPT-3 tokenizer. *Interestingly*, the authors observe that the <ins> distribution of words in GitHub is very different than natural text </ins>, and therefore their tokenizer is not the best for this dataset. To compensate, they added in tokens for representing whitespace of different lengths.
**Problem setup**: The input presented to the decoder is a header, signature and docstring. The decoder's output is sampled until it arrives at one or more terminating sequence tokens. Figure 2 does a great job illustrating this.
## Art
### Figure 1: Codex Performance
This shows pass rates as a function of model size. Mean log(p) is an encouraging metric of performance without peeking.

(from original paper)
### Figure 6: Pass Rate v Model Size v Temperature
This shows how temperature also plays an important role in outcomes. Higher temperatures equate to more diversity; since the pass@k metric rewards any correct solution among the samples, more diversity helps.

(from original paper)
### Figure 7: Sampling Heuristics
This is an important plot related to sampling strategy as it relates to pass rate. The oracle strategy is peeking; however, ```mean log(p)``` seems to be promising.

(from original paper)
### Table 1: Results
Results of transformers on HumanEval test set.

(from original paper)
================================================
FILE: notes/contrastive.md
================================================
# Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
**Motivation**:
The current state of the art of NLP is to first pre-train a model on an auxiliary task and then fine-tune to a specific problem using cross-entropy.
<ins>Cross-entropy has a number of problems</ins>:
1. <ins>Poor generalization</ins>: Models trained using CE are shown to display poor generalization performance and poor robustness to noise. The current best practices suggested include <ins>label smoothing</ins>, <ins>knowledge distillation</ins> or <ins>self-training</ins>.
2. <ins>Unstable across runs</ins>: CE also shows instability in few-shot learning. Current empirical evidence suggests that fine-tuning for more iterations, re-initializing the top layers, etc. makes the fine-tuning stage more stable.
In the context of few-shot learning, the authors propose adding a <ins>Supervised Contrastive Learning (SCL)</ins> term to the fine-tuning objective. It's intuitive to think of different classes being separated in an embedding space - this loss function would partition the space further.
**Result**: By including this term in their loss function, the authors found they were able to beat RoBERTa(Large) on a number of GLUE benchmarks, particularly those in a few-shot learning context (<=1000 labeled examples) and in the presence of noise.
**SCL Loss**: This loss takes the encoder output for each datapoint in a batch (the ```[CLS]``` representation prior to the final softmax layer) and attempts to minimize the distance between the ```[CLS]``` representations of 2 datapoints with the same class membership. <ins>Temperature</ins> as a hyper-parameter can be thought to influence the boundary - lower temps usually mean creating harder negatives / more margin-based training, similar to triplet loss, except this objective <ins>only focuses on positive cases</ins>. <ins>Note</ins>: Recall in triplet loss that this is achieved by comparing both positive and negative cases (as in self-supervised contrastive loss).
**Setup**: GLUE benchmarks are used with the RoBERTa(Large) model as the workbench.
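A supervised contrastive term of this kind can be sketched in plain Python. This follows the general supervised-contrastive form (same-class pairs in the numerator, every other item in the batch in the denominator) and is an assumption-laden illustration rather than the paper's exact formulation: the function names are hypothetical, the inputs are assumed to be L2-normalized ```[CLS]``` vectors, and the temperature of 0.3 is a placeholder.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scl_loss(embeddings, labels, temperature=0.3):
    """Supervised contrastive loss over one batch of L2-normalized vectors."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Temperature-scaled similarity of the anchor to every other item.
        sims = {k: dot(embeddings[i], embeddings[k]) / temperature
                for k in range(n) if k != i}
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a same-class partner.
        total += -sum(sims[j] - log_denominator for j in positives) / len(positives)
    return total


vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
clustered = scl_loss(vectors, labels=[0, 0, 1, 1])  # classes sit together
mixed = scl_loss(vectors, labels=[0, 1, 0, 1])      # classes are interleaved
print(clustered < mixed)  # True
```

The loss is small when same-class representations already sit together and large when they are interleaved, which is the space-partitioning pressure described above.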
## Art
### Figure 1: SCL separating classes

(from original paper)
### Figure 2: Class embedding separation

(from original paper)
### Figure 3: Class embedding separation v. N

(from original paper)
### Table 2: Few-shot learning results
Glad to see standard deviations for once in a table!!!

(from original paper)
### Table 3: Noise injection study

(from original paper)
================================================
FILE: notes/dalle.md
================================================
# Zero-Shot Text-to-Image Generation
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| DALL-E | Decoder-Only (read about attention) | no pre-training/fine-tuning per se | BPE for text + pixel tokens | text=16,384; image=8,192 | greedy? | token | 3 types of attention: text-to-text (causal), text-to-image, image-to-image attention | GeLU? | Up to 12 BN! | training broken out into 2 steps: 1. dVAE (gumbel-softmax relaxation) 2. transformer (cross-entropy) with 16-bit precision | Conceptual Captions + proprietary dataset | per-gpu=8, total_batch=512 |
## TL;DR
Compared to prior art, the authors demonstrate that training a 12 BN parameter autoregressive transformer on 250 MM (text/caption, image) pairs results in a high fidelity ```text-> image``` generative model. Further, the model produces high quality image generations on MS-COCO <ins> zero shot </ins>, without using any of the training labels.
**Model Construction**:
First, it's worth mentioning <ins> pre-processing </ins>. Pixels and text are both converted into tokens:
1. **text tokenization**: The text goes through OpenAI's favorite BPE tokenization with a **text vocab of 16,384**. Each caption is represented by 256 tokens.
2. **image tokenization**: Images also are represented by tokens with a **pixel vocab of 8,192**. Each image is represented by 1024 tokens (32x32).
Instead of directly training a transformer on the raw pixel-tokens, which would require an inordinate amount of memory, the authors construct a 2-stage training process:
1. **discrete variational autoencoder (dVAE)**: Similar to VQVAE, the authors used <ins> gumbel-softmax relaxation </ins> to compress the 256x256 image into a 32x32 grid of tokens, or as they call it, "learning the visual codebook" (see details in appendix). In their training process, they first train this model to accurately represent the images. This reduces the dimensionality of the problem by 192x.
2. **[text,image] decoder-only transformer**: Concatenating 256 BPE-encoded text tokens with 32x32 (1024) image tokens, this input is given to an autoregressive, decoder-only transformer to model the joint distribution over the text and image tokens. In their training process, this model is trained after the dVAE is trained. The tokenization scheme is mentioned above. The image tokens are simply derived as ```argmax sampling``` from the dVAE. In terms of the transformer, there are **3 kinds of attention** used in the model: (1) <ins> text-to-text attention </ins> via auto regressive/ causal attention mask (2) <ins> image-to-image attention </ins> via row, column or conv attention and (3) <ins> image-to-text attention </ins>. The authors achieved this in a single attention mechanism versus building separate attention operations. This model is trained via cross-entropy for image and text, where most weight goes to the image loss.
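The numbers quoted in the two stages above fit together with a little arithmetic. The sketch below only reproduces those figures; the constant names are made up for illustration, and the 192x factor is computed here under the assumption that it compares 256x256 RGB pixel values against the 32x32 token grid.

```python
TEXT_TOKENS = 256             # BPE tokens per caption
IMAGE_GRID = 32               # the dVAE emits a 32x32 grid of image tokens
IMAGE_RES, CHANNELS = 256, 3  # input images are 256x256 RGB

image_tokens = IMAGE_GRID * IMAGE_GRID
sequence_length = TEXT_TOKENS + image_tokens
compression = (IMAGE_RES * IMAGE_RES * CHANNELS) / image_tokens

print(image_tokens, sequence_length, compression)  # 1024 1280 192.0
```

So the transformer models sequences of 1280 tokens instead of roughly 197k raw pixel values per image.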
**Training Objective**: They are interested in maximizing the model's log likelihood over the image and text tokens per usual. This is intractable, so they instead <ins> maximize the ELBO </ins>, which breaks out into the 2 modeling stages that are trained separately.
**Dataset**: For <=1.2 BN parameter DALL-E, they used the Conceptual Captions dataset of 3.3 MM datapoints. For >=1.2 BN, <=12 BN params, they created their own dataset similar to JFT-300M.
**Training Hacks**: To save GPU memory, they store values in <ins> 16-bit precision</ins> and re-compute activations in resblocks during the backward pass. They found that for >= 1 BN parameter models this was a challenge as the exponents of the activation gradients deeper in the model got rounded to zero a la <ins> underflow </ins>. Their solution was to implement independent "gradient scales" for each resblock. Also, their larger models were too large to keep in a single GPU's memory, so they employed <ins> parameter sharding </ins>.
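The underflow problem, and why scaling helps, can be seen with a tiny experiment. This is a generic illustration of gradient scaling in half precision, not the paper's per-resblock scheme: the gradient value and the scale factor are made-up numbers, and the helper name is hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8          # a tiny activation gradient (made-up value)
scale = 2.0 ** 13    # hypothetical gradient scale for one resblock

lost = to_fp16(grad)                       # too small for 16 bits
recovered = to_fp16(grad * scale) / scale  # scale up, store, then unscale

print(lost)       # 0.0
print(recovered)  # close to 1e-08
```

Multiplying by a power of two before storing keeps the value inside the representable range, and dividing afterwards recovers it with only a small rounding error.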
**Sample Generation**: This was an interesting idea - similar to CLIP, a <ins> pre-trained contrastive model </ins> is used to select the best (image, text) pair generated. The highest ranking pair is then submitted.
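A minimal sketch of the re-ranking step, assuming the contrastive model has already produced one embedding for the caption and one per generated image. The embeddings below are made-up 2-d vectors and `rerank` is a hypothetical helper, not the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(text_emb, image_embs, top_k=1):
    """Return indices of the top_k candidates most similar to the text."""
    order = sorted(range(len(image_embs)),
                   key=lambda i: cosine(text_emb, image_embs[i]),
                   reverse=True)
    return order[:top_k]

text_emb = [1.0, 0.0]                              # toy caption embedding
candidates = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]  # toy image embeddings
best = rerank(text_emb, candidates)
```

Generating more candidates before calling `rerank` is what Figure 6 below refers to.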
## Art
### Figure 1: Downsampling via dVAE
This shows the before/after of using dVAE to compress/downsample an image before handing it to the transformer.

(from original paper)
### Figure 2: DALL-E Generation Variance

(from original paper)
### Figure 6: Contrastive Re-ranking
Basically, by generating more samples and selecting "the best" using CLIP, the quality of the output was better.

(from original paper)
### Figure 7: DALL-E Performance

(from original paper)
================================================
FILE: notes/dedup.md
================================================
# Deduplicating Training Data Makes Language Models Better
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
**The problem**:
An important reason for the recent SOTA results achieved by transformers is the availability of very large text datasets, on the order of 100s of GB. Given their size, these datasets suffer from a litany of biases that cannot be captured by perplexity or validation loss. These biases are in turn learned by the models trained on these datasets.
One pervasive bias is **duplicated training examples**. Attacking this problem at scale is a challenge.
According to the authors, <ins>there are a few reasons to remove duplicated data</ins>:
1. >=1% of tokens output by language models in studies appear to be part of a memorized sequence. This is a problem in some studies, where <ins>they show that LMs are at risk of memorizing sensitive information</ins>. Deduping reduces this to 0.1%.
2. There is train-test overlap in many datasets, even though not intended (like in C4).
3. Obviously, reducing the size of training data is good from a computation budget perspective.
4. Deduping doesn't hurt perplexity, and in fact might reduce perplexity on other hold-out sets.
5. When language models are trained on web-crawled datasets (per usual), it's also possible that there is **train/test leakage** between a training set and the fine-tuning test sets. (<ins>Note</ins>: several authors try to remove these "contaminated" datasets from training; GPT-3 did the opposite and removed the data from the eval sets -> **over 90% of tasks were flagged as potentially contaminated**).
**Model**: They use a GPT-family transformer for their experiments in text generation.
**Datasets**: Wiki-40B, LM1B, C4, RealNews
**Solution Overview**:
The authors propose 2 techniques for combating duplication:
1. <ins>Exact substring match</ins>: Identify verbatim strings that are repeated. Done naively, this is super inefficient.
2. <ins>Approximate full document match</ins>: Using hash-based techniques, identify pairs of documents with high n-gram overlap. These 2 techniques are implemented with <ins>suffix arrays</ins> and <ins>minHash</ins>, respectively.
**Suffix Arrays**:
Obviously, the naive exact-substring match approach is quadratic ```O(N^2)``` in complexity.
Suffix arrays are an interesting alternative. If the entire dataset is treated as a giant sequence, then suffix arrays allow for sub-string queries in ```O(N)``` time. A suffix array is built over the dataset and then a search can occur for adjacent entries that share a common prefix. This can be done in an easily parallelizable way.
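To make the "adjacent entries that share a common prefix" idea concrete, here is a toy sketch over characters. It builds the suffix array by plain sorting, which is far from linear time and is only meant for illustration; the function names and the tiny corpus are assumptions, and the paper works over tokens with a much larger ```k```.

```python
def suffix_array(text: str) -> list[int]:
    """Start indices of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def repeated_substrings(text: str, k: int) -> set[str]:
    """Substrings of length k that occur at least twice in text."""
    sa = suffix_array(text)
    found = set()
    # Duplicates end up next to each other after sorting, so it is
    # enough to compare each suffix with its neighbour.
    for i, j in zip(sa, sa[1:]):
        if common_prefix_len(text[i:], text[j:]) >= k:
            found.add(text[i:i + k])
    return found

corpus = "the cat sat. the cat ran."
dups = repeated_substrings(corpus, k=8)
```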
**minHash Algorithm**:
This approach has been used in other papers on LM as well. The idea is to use a hashing function to calculate an approximate <ins>Jaccard index</ins> between 2 documents - if it is sufficiently high, those 2 documents are likely copies of each other. A probability function is derived from the Jaccard index to establish the probability that 2 documents are the same.
The authors say that 2 documents are <ins> duplicates iff minHash>=0.8 and editSimilarity>=0.8</ins>
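A small sketch of how minHash approximates the Jaccard index. The word 3-gram shingles, the 128 seeded SHA-1 hash functions, and the example sentences are all illustrative choices, not the paper's configuration.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions whose minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
exact = jaccard(shingles(doc_a), shingles(doc_b))
approx = estimated_jaccard(minhash_signature(shingles(doc_a)),
                           minhash_signature(shingles(doc_b)))
```

The point of the signature is that it has a fixed length, so documents can be compared (or bucketed) without ever materializing their full n-gram sets.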
**Results**:
The authors found a significant amount of duplication within their datasets using the methods mentioned: ~3% - 14% of examples are near duplicates.
## Art
### Figure 1: Sub-string occurrence in each dataset
Noting the probability of seeing a second copy of a string in another part of the data. They set their k>=50.

(from original paper)
### Figure 2: Near-duplicate cluster sizes

(from original paper)
### Figure 3: Dedup impact on ppl

(from original paper)
### Figure 5: Document similarities
Edit and Jaccard similarities

(from original paper)
### Table 1: Visualizing near-dups
This is a pretty nice visual

(from original paper)
================================================
FILE: notes/distilbert.md
================================================
# DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| DistilBERT | Encoder-Only (Distilled version of BERT) | **Triple loss**: (1) MLM + (2) Distillation + (3) Cosine-Distance (No NSP)| Same as BERT | Same as BERT | Same as BERT | Embeddings are similar to BERT, except the segment embeddings are removed | Same as BERT | Same as BERT | 66M parameters | Same as BERT (I think) | Same as BERT | 4096 batch size |
## TL;DR
DistilBERT is the brainchild of Huggingface and uses **knowledge distillation** to capture the pre-training phase of BERT. Typically, a classification model's objective function attempts to minimize the cross entropy between the predicted distribution and the one-hot empirical distribution. A good model in theory will predict high probabilities on the correct class and close to 0 on all others (yet, this is often not the case). In the distillation paradigm, there is a **student** (distilBERT) and **teacher** (BERT). There is a **distillation loss** in the objective where instead of the one-hot used in a typical cross entropy, there is instead the *teacher's predicted probability* multiplied by the *student's surprisal function*. On top of the normal "masked language model" cross-entropy loss, there is a **cosine embedding** loss to align hidden vectors.
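The difference between the usual one-hot cross entropy and the distillation loss described above can be sketched in a few lines. The temperature value and the logits are hypothetical; this is an illustration of the idea rather than the Huggingface implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(label, student_logits):
    """Usual cross entropy against a one-hot target."""
    return -math.log(softmax(student_logits)[label])

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Teacher probability times student surprisal, summed over classes."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(t_i * math.log(s_i) for t_i, s_i in zip(t, s))

teacher = [2.0, 0.5, -1.0]        # hypothetical teacher logits
good_student = [2.0, 0.5, -1.0]   # matches the teacher
bad_student = [-1.0, 0.5, 2.0]    # ranks the classes backwards
```

Because the target is the teacher's full distribution, the student is also rewarded for matching the relative probabilities of the wrong classes, which a one-hot target cannot express.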
Prior to this paper, most work explored distillation in the context of fine-tuning/task-specific models. To distill during pre-training instead, a triple loss was used (MLM, distillation, cosine-distance). Ultimately, this model has 40% fewer parameters than BERT base, runs 60% faster and still retains ~97% of BERT's performance on GLUE. Further, this smaller model doesn't suffer from the same inference-time costs as large language models such as BERT, RoBERTa, and XLNet; it can even be run on mobile devices.
**Other differences to note**:
- distilBERT used **huge** batch sizes ~4k with gradient accumulation where BERT used batch sizes on the order of 8-128. They took the idea from RoBERTa paper.
- Architecture modifications:
> Dynamic masking as opposed to static masking.
> Reduced number of layers by factor x2.
> Discarded the Next Sentence Prediction and segment embeddings (similar to RoBERTa).
**Note:** Google Research also wrote a paper on knowledge distillation titled [Well-Read Students Learn Better: On the Importance of Pre-Training Compact Models](https://arxiv.org/pdf/1908.08962.pdf). Also, if you look on the BERT github page, there are several contributions of much smaller language models ("tiny", etc.)
## Art
### Table 1: DistilBERT performance

(from original paper)
### Table 3: DistilBERT inference time
Note: These times are 1 batch size on a CPU

(from original paper)
================================================
FILE: notes/gradient-attack.md
================================================
# Gradient-based Adversarial Attacks Against Text Transformers
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
The study of adversarial attacks is motivated by the observation that DL models can be tricked to come to completely different conclusions based on small perturbations of an input.
Generally, these adversarial examples are derived from an optimization problem with some *adversarial loss* that maximizes prediction error subject to some constraints, such as fluency or perceptibility. Unlike pixels in images, which are drawn from a continuous distribution, the tokens in NLP are discrete in nature. To overcome this problem, adversarial attacks in NLP are generally heuristic in nature, where token-level replacement candidates are explored in an exhaustive search (which can be computationally expensive).
**Adversarial Attacks**:
Let's say that a classifier ```h``` is trained to predict a label ```y_label``` given an example ```x```, i.e.: ```y_label = h(x)```. An adversarial example, ```x'``` is defined such that it is "close" to x as defined by some function ```p(x,x')<eps``` and yet ```y_label != h(x')```. This closeness can be defined as <ins>perceptibility</ins>, i.e., the semantic meaning of ```x',x``` are similar. These examples are considered adversarial because a human would be able to label them correctly, but the model fails to make that connection.
**What is Adversarial Loss?**:
To derive said adversarial examples, there is usually an adversarial loss such as <ins>margin loss</ins> whereby a model misclassifies ```x``` by some margin ```k``` subject to the perceptibility constraint.
However, the issue here is that this objective function usually cannot be solved via GD since the space of vectors, ```x``` is *discrete*.
**How These Models Are Measured**:
When papers report how well they did, they report degradation numbers, i.e.: "BERT-Attack reduced accuracy from 94.2 --> 10.6".
**The Gist**:
While in CV adversarial attacks were able to degrade performance to ~10%, NLP SOTA was hitting only 10%.
The authors suggest <ins>Gradient-based Distributional Attack (GBDA)</ins> as a general approach to generating adversarial examples. Rather than constructing a single adversarial example, the authors are generating a <ins>distribution</ins>. This distribution is based on the <ins>Gumbel-Softmax</ins> distribution (great paper explaining this distribution [here](https://arxiv.org/pdf/1611.01144.pdf)). To enforce the constraint of perceptibility, the authors use BERTscore; to enforce fluency, they enforce language perplexity. Ultimately, all 3 of these are components of one large loss function that is <ins> fully differentiable</ins> and can be solved via gradient-descent. The outcome is a continuous, adversarial distribution that can be sampled to query different target models. (Figure 1 illustrates this well.)
**The Adversarial Distribution**:
The problem set forth is a minimization problem of the adversarial loss function with respect to the token distribution. However, the token distribution is a <ins>categorical (read: discrete)</ins> distribution and is therefore inherently <ins>non-differentiable</ins>.
To get around this, the authors drop in <ins>probability vectors instead of tokens</ins> and then use the Gumbel-softmax approximation to take the gradient. That is to say, <ins>the Transformer is taking in the average embedding of tokens from this G-S probability space</ins>. The goal is to optimize the probability distribution such that we minimize the loss, which comprises the 3 terms mentioned earlier (using 2 hyper-parameters, ```lambda```, to control the constraints).
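The two ingredients in this paragraph, a relaxed sample from a categorical distribution and the probability-weighted average embedding, can be sketched as follows. The logits, temperature, and 3-token embedding table are made up for illustration, and no gradient is computed here; in practice this would live inside an autodiff framework.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample from a categorical distribution."""
    noisy = []
    for logit in logits:
        u = rng.random() or 1e-12          # guard against log(0)
        gumbel = -math.log(-math.log(u))   # standard Gumbel noise
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(probs, embedding_table):
    """Average of the token embeddings, weighted by the probability vector."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]

rng = random.Random(0)
token_logits = [0.1, 2.0, -1.0]                   # the "knob" being optimized
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # toy embeddings, 3 tokens x 2 dims
probs = gumbel_softmax(token_logits, temperature=0.5, rng=rng)
model_input = soft_embedding(probs, table)
```

Lower temperatures push each sample closer to a one-hot vector, at the cost of noisier gradients.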
## Art
### Figure 1: Overview
A depiction of the whole setup. Notice that the inputs to the different models that make up the 3 loss function components are average embeddings weighted by the probability matrix. This probability matrix is our "knob".

(from original paper)
### Figure 2: Constraint Effect on Attack Rate

(from original paper)
### Table 5: Fluency Constraint Illustrations
Great visuals showing how nonsensical adversarial attacks can be without this fluency constraint.

(from original paper)
================================================
FILE: notes/human-pref.md
================================================
# Fine-Tuning Language Models from Human Preferences
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
The goal is to explore the idea of using reinforcement learning (RL) to learn value functions that are complex and defined by human judgment via reward learning.
In this paper, pre-trained models are fine-tuned with RL rather than the usual supervised learning objective. Interestingly, they use a <ins>KL divergence</ins> penalty to keep the model from straying too far from the pre-trained distribution.
**Fine-tuning**:
This RL, human preference task is defined 2 ways:
1. <ins>Stylistic continuation</ins>: 5k human comparisons were made where a human chose the best of 4.
The goal is to learn the reward function that weighs ```r(input,output_i)``` via a softmax loss function. This function is penalized by a KL term that considers the language model probability distribution. There is also a separate policy function ```pi``` that is trained via Proximal Policy Optimization (PPO). The policy function is initialized by the language model.
2. <ins>Summarization tasks</ins>: 60k human-curated examples where someone copies relevant sections of a larger text.
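The reward-model loss and the KL-penalized reward described under the first task can be written down compactly. This is a sketch with hypothetical numbers: `beta` is the penalty coefficient, and the log-probabilities would come from the policy and the frozen pre-trained model respectively.

```python
import math

def reward_model_loss(rewards, chosen):
    """Softmax cross entropy: the human-chosen sample should score highest."""
    m = max(rewards)
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen]

def penalized_reward(r, logp_policy, logp_pretrained, beta):
    """Learned reward minus a penalty for drifting from the pre-trained LM."""
    return r - beta * (logp_policy - logp_pretrained)

# Four candidate continuations; the human picked index 2.
loss = reward_model_loss([0.1, -0.3, 1.2, 0.0], chosen=2)

# The policy assigns this sample a higher log-probability than the
# pre-trained model does, so the reward is reduced.
r_total = penalized_reward(r=1.2, logp_policy=-3.0, logp_pretrained=-5.0, beta=0.1)
```

Averaged over samples from the policy, the `logp_policy - logp_pretrained` term is an estimate of the KL divergence between the two models.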
**Stylistic results**:
1. <ins>RL fine-tuned v. zero-shot</ins> -> humans preferred the RL fine-tuned model **86%** of the time
2. <ins>RL fine-tuned v. supervised fine-tuned</ins> -> humans preferred the RL fine-tuned model **77%** of the time
**Summarization results**:
The authors were underwhelmed by these results, believing that stylistic tasks require very little data.
## Art
### Figure 1: Reward modeling and policy training

(from original paper)
### Figure 2: Reward as a function of data

(from original paper)
### Table 1: Human evaluations

(from original paper)
================================================
FILE: notes/megatron.md
================================================
# Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
The goal is to illustrate how to train very large transformer models via intra-layer model parallelization for 1BN+ parameter models. The cool thing is that it can be implemented natively in PyTorch.
Model parallelism is an essential part of training large LMs, given the sheer size of parameter spaces, ADAM parameters, etc. Mesh-Tensorflow is an example of a framework for model parallelism, but it requires rewriting the model and using special compilers.
The authors make the following observations in this paper:
1. They show an intra-layer model-parallelism that is achievable by tweaking existing Transformer architecture without special code/compilers.
2. They show they can scale the model up to 76% scaling efficiency using 512 GPUs.
3. Re-arranging the placement of layer normalization in BERT **dramatically** increases the accuracy of the model as it scales.
4. They achieve SOTA on many benchmarks.
**Data v. Model Parallelism**:
<ins> Data Parallelism </ins>: A dataset is sharded and fixed onto a set of workers. These workers will usually send the gradient results to another server to then retrieve update weights. It is calculating gradients on N copies of the same model.
Going purely the data parallelism route, *weak scaling* can be achieved by increasing the mini-batch size proportional to the number of workers. Some memory optimizations are useful here - for instance, **activation checkpointing**, which involves not caching forward pass calculations, but instead re-computing them in backprop (trading speed for memory). The issue here is that **the model must fit onto one worker**. Some authors have created parameter sharing models (think alBERT), but the authors think this is not adequate.
<ins> Model Parallelism </ins>: Memory usage and computation of a model is distributed across multiple workers.
Here there are 2 approaches: *pipeline model parallelism* and *distributed tensor computation*. In the former, operations cascade in serial across devices. **GPipe** is a model parameter server that tackles this issue. Sometimes this approach can affect the optimizer itself, which in turn affects the accuracy. In the latter, tensor calculations are distributed. An example library is **Mesh-Tensorflow**. This requires a special graphical compilation, but the authors observe some interesting ideas, such as parallelizing the attention heads.
**Their Tweaks**:
**TL;DR**: All of this is done through **GEMM**: i.e., matrix multiplication + concatenation. Rather than having each part of the computation done serially across nodes (done basically to handle GPU cache issues), they have "model parallel regions" (here is the **model parallelism**) - these regions remember multiple sets of parameters across models and do parallel calculations. The results are then "reduced" and then propagated to **data parallel** nodes that compute the rest of the graph in parallel. The data parallel components are in sync because the **random seed** is the same across all these nodes, ensuring synchronicity (see Figure 8).
1. *GeLU*: Their first observation is to split the data column-wise and concatenate the activation (GeLU) results. They then "reduce" across GPUs before passing to dropout layer.
2. *Attention*: They then exploit the column-wise parallelism of multi-head attention.
3. *Embeddings*: Similar idea as before: they parallelize the input embedding, and fuse the output embedding with the loss computation so that scalar losses, rather than full logits, are communicated.
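The GeLU tweak can be checked numerically on a toy MLP: if the first weight matrix is split by columns and the second by the matching rows, each worker can apply GeLU to its own slice with no communication, and summing the partial outputs (the "reduce") recovers the unsplit result. This is a pure-Python sketch with hypothetical weights; the loop over `workers` stands in for separate GPUs.

```python
import math

def gelu(x):
    """Exact GeLU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mlp(x, w1, w2):
    """Two-layer MLP block: GeLU(x @ w1) @ w2."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def parallel_mlp(x, w1, w2, workers=2):
    """Split w1 by columns and w2 by rows, then sum the partial outputs."""
    h = len(w1[0]) // workers   # hidden size must be divisible by workers
    partials = []
    for k in range(workers):
        w1_shard = [row[k * h:(k + 1) * h] for row in w1]   # column block
        w2_shard = w2[k * h:(k + 1) * h]                    # matching row block
        partials.append(mlp(x, w1_shard, w2_shard))
    # This sum plays the role of the all-reduce across GPUs.
    return [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
            for i in range(len(x))]

x = [[1.0, -2.0]]
w1 = [[0.5, -1.0, 0.3, 0.8], [0.2, 0.4, -0.6, 1.0]]
w2 = [[1.0, 0.0], [0.5, -0.5], [-1.0, 2.0], [0.3, 0.7]]
serial = mlp(x, w1, w2)
sharded = parallel_mlp(x, w1, w2, workers=2)
```

Splitting the first matrix by rows instead would require a reduce before the GeLU, because GeLU is nonlinear and cannot be applied to partial sums.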
## Art
### Figure 1: FLOPs Efficiency w/ Data and Model Parallelization
Combining both practices leads to better performance.

(from original paper)
### Figure 3: Layer Parallelism
Illustration of how the layer parallelism works.

(from original paper)
### Figure 4: Communication Across Layers

(from original paper)
### Figure 6: Larger Models Are Better
Lower ppl on val sets with larger models.

(from original paper)
### Figure 8: Model v. Data Parallelization

(from original paper)
================================================
FILE: notes/reformer.md
================================================
# REFORMER: THE EFFICIENT TRANSFORMER
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
The basic premise is that the authors are exploring ways of improving the efficiency of Transformers. By replacing the usual dot-product (or scaled-dot-product) attention mechanism that is ```O(T^2)``` in complexity, they suggest a ***locality-sensitive hashing*** that reduces complexity to ```O(T*log(T))```. This, along with some changes in residuals, is the basis of the **Reformer** model.
Latest class of Transformer models are ~ 0.5B per layer, up to 64 layers, up to 11k tokens of text in a single example (i.e., really long text), as well as multi-modal models. These models can realistically only be trained in industrial labs, cannot be fine-tuned on a single GPU.
**Memory Usage**:
The authors observe that, at face value, you might expect the following math for memory usage:
1. 0.5B of floats per layer -> 2GB of memory.
2. 0.5B for embeddings per batch.
3. 17GB of text.
With this math, a Transformer layer _should_ fit onto a modern GPU/TPU. **However**, this doesn't account for:
1. **Activations**: storing ```O(N)``` activations for back-prop, especially in the FFN layers.
2. **Dense Attention**: Attention layers are ```O(L^2)``` in runtime and memory complexity.
This makes the memory explode.
Their model, the **Reformer**, implements the following ideas:
1. **Reversible layers**: Following a paper by Uber that introduces the idea of reversible layers, they reduce ```O(N) -> O(1)``` space complexity by storing 1 set of activations and performing backprop, simply by storing the activations and their derivatives for the top layer in the sequence.
2. **Splitting activations inside FFN**.
3. **Replace Dot-Product Attention Mechanism (Biggest Contribution)**: Replace the usual ```O(L^2)``` attention with a "locality-sensitive hashing" attention that is ```O(L*log(L))``` in space complexity.
Assuming Q,K,V are ```(batch_size, max_seq_length, dim_model)```, the ```Q*K_transpose``` term is ```(batch_size,length,length)```, which can be huge in the case of very long sequences.
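Going back to idea 1 in the list above, the reason reversible layers save memory is that a layer's inputs can be recomputed exactly from its outputs, so they need not be stored. A minimal sketch on scalars, where `f` and `g` are toy stand-ins for the attention and feed-forward sub-layers:

```python
def forward(x1, x2, f, g):
    """Reversible residual block on a pair of activations."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def inverse(y1, y2, f, g):
    """Recover the block's inputs from its outputs, with no stored activations."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

f = lambda v: 2.0 * v   # hypothetical stand-in for attention
g = lambda v: v + 1.0   # hypothetical stand-in for the feed-forward layer

y1, y2 = forward(1.0, 2.0, f, g)
x1, x2 = inverse(y1, y2, f, g)
```

During backprop the same trick is applied layer by layer from the top of the stack down, which is why only the top layer's activations have to be kept.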
**How Locality-Sensitive Hashing (LSH) Works**: (Btw, fantastic explanation of this in the paper)
The basic idea presented is this: ```Q*K_transpose``` is large, but we are really interested in ```softmax(Q*K_transpose)``` which converts the smallest products ```q_i*k_j-->0``` and the largest products ```q_i*k_j-->1```. Thus, finding the largest ```q_i*k_j``` is the task, which is a **nearest neighbors problem**. The math details are in the paper, but this is done with a **hashing function** where nearby vectors get the same hash with a high probability. There are issues with the un-evenness of hashes, etc. that are discussed in more detail in the paper, including creating more uniformity via multiple sequential hashes.
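One way to build such a hash, sketched here under the assumption of an angular scheme based on random projections: project the vector onto a few random directions and use the index of the largest entry of ```[xR; -xR]``` as the bucket. The dimensions, bucket count, and seed below are arbitrary.

```python
import random

def make_lsh(dim, n_buckets, seed=0):
    """Angular LSH: random projection, then argmax over [xR; -xR]."""
    rng = random.Random(seed)
    half = n_buckets // 2   # n_buckets is assumed to be even
    proj = [[rng.gauss(0.0, 1.0) for _ in range(half)] for _ in range(dim)]

    def bucket(vec):
        rotated = [sum(v * proj[d][j] for d, v in enumerate(vec))
                   for j in range(half)]
        scores = rotated + [-r for r in rotated]
        return max(range(n_buckets), key=scores.__getitem__)

    return bucket

bucket = make_lsh(dim=4, n_buckets=8)
v = [0.3, -1.2, 0.7, 2.0]
same_direction = [2.0 * a for a in v]
opposite = [-a for a in v]
```

Vectors pointing the same way always land in the same bucket, and vectors at a small angle do so with high probability; attention is then computed only among queries and keys that share a bucket.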
## Art
### Figure 2: LSH Attention Schematic

(from original paper)
================================================
FILE: notes/roberta.md
================================================
# RoBERTa: A Robustly Optimized BERT Pretraining Approach
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| RoBERTa | Encoder-Only | **Masked Language Model** objective with dynamic masking (see below) + **No NSP or SOP** (NSP removal was shown to be better)| **Byte-level BPE (like GPT)** | 50k | Same as GPT? | Same as BERT | Same as BERT | Same as BERT | Model parameters are kept fixed: <ul><li> L=12, H=768, A=12 -> 110MM parameters (+~15MM for increase in vocabulary with byte-level BPE) </li></ul> | They increase the pre-training steps from 100k (BERT) to up to 500k. They have a tweak on ADAM hyper-parameters. | They combine **5** datasets for **160 GB of text**: <ul><li> Book Corpus + Wikipedia </li><li> CC-news </li><li> OpenWebText </li><li> Stories </li></ul>| ~2k batch size, max sequence length ~ 512 (less sometimes due to sampling technique) |
## TL;DR
First, the authors identify the problem that it is hard to objectively weigh the relative contributions of ELMo, GPT, BERT, XLM, etc., impressive as they are, given that they are computationally expensive to train and often use private training data. Thus, this paper is an attempt at replicating the BERT pre-training process. The authors propose a new approach to training BERT:
1. training the model longer, with bigger batches, over more data
2. remove NSP objective (same as alBERT and others)
3. dynamically changing the masking pattern applied to the training data
4. Byte-level BPE tokenization
This led to SOTA results across a number of benchmarks. **Their models and code are released in FairSeq**.
**Masking**: Original BERT paper implemented masking during _data pre-processing_; i.e., the masks were the same in every epoch.
In this paper, they tried 2 variants:
- **Enhanced static masking**: during pre-processing phase, training data is **duplicated 10 times** so that each sequence is masked 10 different ways over the 40 epochs.
- **Dynamic masking**: the mask pattern is generated every time a sequence is seen by the model. This is helpful when increasing dataset sizes and steps.
**Notes on NSP**: The authors looked at a few different variants on this approach. They observe the original BERT paper's sampling approach is odd, where sentences are sampled from the same document, but are not necessarily contiguous. They suggest a few alternatives with and without NSP, but settle upon "DOC-SENTENCES", which is where sentences are contiguously sampled from the same document but end if the paragraph ends. This allows for a **dynamic batch size** in cases where the number of tokens < 512.
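The difference between the variants comes down to when the masking function is called. A simplified sketch follows: it always substitutes a `[MASK]` token and leaves out BERT's extra rule of sometimes keeping or randomizing the selected token, and the sentence and the 15% rate are illustrative.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with a freshly sampled subset replaced by MASK."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog again and again today".split()
rng = random.Random(0)

# Static masking: mask once during pre-processing, reuse in every epoch.
static_view = mask_tokens(tokens, rng)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: draw a new mask each time the sequence is seen.
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]
```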
**Larger batch sizes**: Looking at different batch sizes, perplexity is lower on held-out training data sets when increasing the batch size from 256 to 2K.
**Text Encoding**: The original BERT uses a character-level Byte-Pair Encoding ~ 30k tokens called **WordPiece**. This is learned after pre-processing the input with heuristic tokenization rules. However, they use a method similar to **GPT** with **byte-level BPE** vocabulary containing 50K sub-word units (increasing the number of parameters ~ 15MM). The byte-level implementation uses bytes instead of unicode characters as the base sub-word units.
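What "bytes instead of unicode characters" means in practice can be seen from the UTF-8 encoding of a short string. This sketch only shows the base units, before any BPE merges are learned; the example text is arbitrary.

```python
text = "naïve café"

# Character-level base units: one per unicode code point.
char_units = list(text)

# Byte-level base units: one per UTF-8 byte, so accented letters take two.
byte_units = list(text.encode("utf-8"))

# Every possible input maps onto at most 256 distinct base symbols.
base_vocab_size = 256
```

Since any string can be written as bytes, a byte-level vocabulary never needs an "unknown" token.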
## Art
================================================
FILE: notes/scaling-laws.md
================================================
# Scaling Laws for Neural Language Models
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
## TL;DR
Given the inspiring results achieved in language tasks by transformers (near human-level performance), the authors set out to understand the quiddity of how hyper-parameters in the transformer training process affect the results.
<ins>This is a hugely important "meta" paper exploring hyper-parameters</ins>.
Their findings are the following:
1. **Power law relationships**: Model performance (as quantified by **test loss**) is most strongly influenced by three factors: <ins> N </ins> (number of model parameters, excluding embeddings); <ins> D </ins> (size of dataset); <ins> C </ins> (amount of compute used for training).
2. **Overfitting** occurs if N (parameters) and D (training size) are not scaled in tandem. For example, if scaling the model size 8x, data size must grow roughly 5x.
3. **Training curves** appear independent of model size - this allows some predictability in terms of what to expect when a model is <ins> trained longer </ins>.
4. **Transfer penalty** appears to be a constant offset - i.e., <ins> results on train::val set correlate with hold-out::test set despite different distributions </ins>.
5. Larger models reach the same level of performance with <ins>fewer datapoints </ins> and <ins> fewer optimization steps </ins> - what is called "more sample-efficient".
6. Further, larger models actually achieve optimal performance <ins> when stopped early before convergence </ins>.
The authors fit **power laws** to all of these relationships using <ins> WebText </ins>, BPE with vocab size 50257, measuring performance over 1024-token context window and cross-entropy. They use their usual decoder-only model.
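The "8x model, roughly 5x data" rule of thumb in finding 2 falls out of a sub-linear power law. The sketch below assumes the data requirement grows as ```N^0.74``` and uses constants for the parameter-count law (```alpha_n=0.076```, ```n_c=8.8e13```) as I recall them from the paper; treat all three numbers as approximate assumptions rather than authoritative values.

```python
def data_scale_factor(model_scale, exponent=0.74):
    """How much more data is needed when the model grows by model_scale."""
    return model_scale ** exponent

def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law test loss as a function of non-embedding parameter count."""
    return (n_c / n_params) ** alpha_n

factor = data_scale_factor(8)        # about 4.7, i.e. "roughly 5x"

small = loss_from_params(1e8)
large = loss_from_params(1e9)
ratio = large / small                # the same for any 10x jump in size
```

Because the law is a straight line on a log-log plot, every 10x increase in parameters multiplies the loss by the same constant factor, which is what makes extrapolation possible.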
**Note**: They have many sections and appendices worth checking out. As always, OpenAI papers are designed for readability.
## Art
### Figure 1: Scaling Laws

(from original paper)
### Figure 2: Training Curves

(from original paper)
### Figure 4: Sample Efficiency of Large Models

(from original paper)
================================================
FILE: notes/t5.md
================================================
# Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
## Summary
| Model Name| Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training| Pre-Train Data | Batch Size |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |:----: |:----: |
| T5 | Encoder-Decoder | **BERT-style denoising objective**: Similar to MLM, model is trained to predict missing or corrupted tokens in input. 15% of tokens are randomly sampled and dropped out. (Note: They experimented with many variants) | **SentencePiece** | 32k (across many languages w/ 10:1 English-to-non-English) | Same as BERT | Just token embeddings | Self-attention + encoder-decoder attention (per layer) | ReLU | This study looks at many variants, but the base is similar to BERT_base: <ul><li> 12 blocks (encoder + decoder) </li><li> hidden_size == embed_size = 768 </li><li> FFN_dim=3072 (4*hidden) </li></ul> Ultimately, about **twice the size of BERT --> 220MM params**. | <ul><li> **Pre-training:** 2^19 steps for pre-training. </li><li> Use **adaFactor** optimization with **inverse square root** LR scheduler </li><li> **Greedy decoding** at test time </li><li> **Fine-tuning**: 2^18 steps always with same batch_size dimensions, LR=0.001, **5k checkpoints and report results for highest validation performance**. </li></ul> | Common Crawl's C4 data (20 TB) | T=512, batch= 128 with packing such that each batch is approximately **65k tokens** (much smaller than other studies) |
## TL;DR
**MOST IMPORTANT TAKE AWAY**: This problem setup of "text-to-text" allows for zero-shot transfer to other problems without the need of specialized output heads (unlike say BERT or GPT).
Raffel et al. set out to do a comparative study of different pre-training, architectures, unlabeled datasets, transfer approaches, etc. and sort out which features are most important in the transformer approach. Further, they introduce a 20 TB dataset from the Common Crawl project, C4. Encoder-decoder with their "denoising" pre-training objective did best.
<ins>Note</ins>: Worth reading the intro section - very good description of transfer learning and Transformer architecture.
**Model**: The model is very similar to the original Transformer, with a few simplifying design decisions, among them:
1. Layer normalization has no bias term.
2. Layer normalization is outside residual path.
3. **Positional information**: instead of fixed embeddings for each position (such as the original sinusoidal term), a scalar is added to each corresponding softmax calculation for attention weights. These positional embeddings are used as offsets for the key,query concept in the attention mechanism, as well as other layers. They only do this for up to 128 tokens.
**Baseline Model**: Encoder, decoder are similar to BERT_base.
To achieve model and data parallelism, they use the **Mesh TensorFlow** library.
**Pre-processing**: The authors do a few interesting things to prepare the C4 data. For instance, they remove all documents they believe are not English using *langdetect*; they dedup the data; remove code; remove placeholder data; remove "bad words".
**Training Problem Statement**: They combine all the different tasks into an encoder-decoder or "text-to-text" problem. Specifically, they add a prefix to the input such as " translate X to Y: " or "<fine-tuning-task>:", followed by output "<LABEL>" or "<translation>". In the case of classification, if the wrong label is produced, it is scored as wrong. With this setup, their encoder-decoder model is not identical in nature to a model like BERT, which produces a single output, but shares the same spirit.
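A sketch of what casting tasks as text-to-text looks like in code. The helper names are hypothetical and the prefixes and sentences are illustrative examples of the format, not the actual pre-processing pipeline.

```python
def to_text_to_text(task_prefix: str, text: str, target: str) -> dict:
    """Format any task as an (input string, target string) pair."""
    return {"input": f"{task_prefix}: {text}", "target": target}

def score_classification(prediction: str, target: str) -> bool:
    """A generated label only counts if it matches the target text exactly."""
    return prediction.strip() == target

translation = to_text_to_text(
    "translate English to German", "That is good.", "Das ist gut.")
acceptability = to_text_to_text(
    "cola sentence", "The course is jumping well.", "not acceptable")

# Anything other than the exact label string is scored as wrong.
is_right = score_classification("not acceptable", acceptability["target"])
is_wrong = score_classification("hamburger", acceptability["target"])
```

Because both the translation and the classification task end up as plain string pairs, the same model, loss, and decoding procedure can be used for each.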
They run these studies across several GLUE and SuperGLUE benchmarks, using many different architecture variants (note: not a Cartesian product). See Table 2 below for info.
**Final Note**: This is a pretty thorough study with lots of information. I strongly recommend examining their different studies in detail.
## Art
### Figure 1: Text-to-Text Encoder-Decoder Design
This is a useful illustration of how all the different NLP problems are mapped to the same input -> output problem space in the fine-tuning stage. Notice that the inputs are prefixed with task descriptions.

(from original paper)
### Figure 2: Corrupted Token Prediction (Pre-Training)
This is a schematic illustrating the unsupervised pre-training objective on unlabeled text.

(from original paper)
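
A minimal sketch of the corrupted-span objective this figure depicts, assuming whitespace tokenization and hand-picked spans (in practice the spans are sampled at random). The function name `corrupt_spans` and the sentinel strings are illustrative.

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each span with a sentinel; targets list the dropped tokens.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    if len(spans) >= len(sentinels):
        raise ValueError("need one sentinel per span plus a final one")
    inputs, targets = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])  # keep uncorrupted tokens
        inputs.append(sentinel)              # mark where a span was dropped
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the dropped tokens to predict
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinels[len(spans)])    # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week.".split()
inputs, targets = corrupt_spans(tokens, spans=[(2, 4), (8, 9)])
# inputs  -> "Thank you <X> me to your party <Y> week."
# targets -> "<X> for inviting <Y> last <Z>"
```

The target only contains the dropped spans, which keeps the output sequence short compared with reconstructing the whole input.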
### Figure 3: Self-Attention Patterns
This is a great visualization of the different types of self-attention. The fully-visible pattern is what is found in BERT; the "causal", or auto-regressive, pattern is what GPT uses.

(from original paper)
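
The patterns can be written down as boolean masks, where entry `[i][j]` says whether position `i` may attend to position `j`. This is a sketch with a hypothetical helper `attention_mask`; the third pattern, causal with a fully-visible prefix, is included on the assumption that it matches the remaining panel of the paper's figure.

```python
def attention_mask(seq_len, pattern, prefix_len=0):
    """Return a seq_len x seq_len mask; True means attention is allowed."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if pattern == "fully_visible":
                allowed = True                      # every position sees every other
            elif pattern == "causal":
                allowed = j <= i                    # only the current and earlier positions
            elif pattern == "causal_with_prefix":
                allowed = j <= i or j < prefix_len  # the prefix is visible to all
            else:
                raise ValueError(f"unknown pattern: {pattern}")
            row.append(allowed)
        mask.append(row)
    return mask


bert_like = attention_mask(3, "fully_visible")
gpt_like = attention_mask(3, "causal")
prefix_lm = attention_mask(3, "causal_with_prefix", prefix_len=2)
```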
### Figure 4: Transformer Architectures
This is a great visualization of the different language model/transformer architecture variants.

(from original paper)
### Figure 5: Pre-Training Experiments
This is a depiction of the different variants of pre-training objectives.

(from original paper)
### Table 2: Performance of Different Architectures

(from original paper)
### Table 3: Description of Different Pre-Training Objectives/Noising
Honestly, this is one of the best high-level summaries of different pre-training objectives.

(from original paper)
### Table 5: Performance of Different Pre-Training Objectives/Noising

(from original paper)