Repository: koayon/awesome-adaptive-computation
Branch: main
Commit: 1a78f5359d55
Files: 3
Total size: 64.5 KB
Directory structure:
gitextract_7f14l9l3/
├── .gitignore
├── LICENSE
└── README.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.vscode
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================
# Awesome Adaptive Computation
[Awesome](https://awesome.re)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
Awesome Adaptive Computation is a curated list of Adaptive Computation papers,
models, explainers and libraries for Machine Learning.
## Contents
- [Contents](#contents)
- [About](#about)
- [Mixture of Experts (Sparse MoE)](#mixture-of-experts-sparse-moe)
- [Other Modular Architectures](#other-modular-architectures)
- [Early Exit: End-to-End Adaptive Computation](#early-exit-end-to-end-adaptive-computation)
- [More Compute Per Output Token](#more-compute-per-output-token)
- [Adaptive Computation for Black-box models](#adaptive-computation-for-black-box-models)
- [Continual Learning](#continual-learning)
- [Tools \& Agents](#tools--agents)
- [Games](#games)
- [Pre-cursors to Adaptive Computation](#pre-cursors-to-adaptive-computation)
- [Open Source Libraries](#open-source-libraries)
- [AI Safety](#ai-safety)
- [Scaling Laws](#scaling-laws)
- [Other](#other)
## About
`Adaptive Computation` (sometimes called `Dynamic Compute`) is the ability of a machine learning system to adjust its
`function` and `compute budget` for each example.
<!-- We can think of this as giving models a [System 2](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow) mode. -->
Adaptive Computation techniques include **Mixture of Experts** (decoupling model
capacity and model compute), **Early Exiting** (saving compute on easy
inputs) and **Inference-Time Computation** (search and verification at inference time) as well as sampling techniques.
<!-- [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) states that the two general methods to utilise large amounts of compute are `Learning` and `Search`. Whilst large pre-trained models focus on `learning` at _train time_, Adaptive Computation is about spending more compute at inference time with mechanisms similar to `Search`. -->
---
In this repo, links are organised by topic and have explanations so you can
decide what you would like to read. Especially recommended links are starred 🌟
Star this repository to see the latest developments in this research field.
We accept contributions! We strongly encourage researchers & practitioners to
make pull requests with papers, approaches and explanations that they feel
others in the community would benefit from 🤗
<!-- Ordered by topic, then date published -->
## Mixture of Experts (Sparse MoE)
The Mixture of Experts (MoE) paradigm uses a routing layer to choose a limited number of parameters
to apply to a given input rather than using all the available parameters.
This conditional computation allows model capacity to increase without also
scaling the compute required for each forward pass. This is useful because
bigger models are more sample efficient and more compute efficient to train.
MoE models are also useful for compartmentalising knowledge and avoiding
negative interference from irrelevant computation.
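
As a rough illustration of the routing idea above, here is a minimal token-choice top-k sketch in plain Python. The function names, the toy experts and the hard-coded logits are all assumptions made for illustration; in a real model the router logits would come from a learned layer and the experts would be FFN blocks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, gate) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the k selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(x)
    for i, gate in top_k_route(router_logits, k):
        y = experts[i](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Hypothetical toy experts: each one scales its input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
# In a real model the logits would come from a learned linear layer applied to x.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
output = moe_forward([1.0], [0.1, 2.0, -1.0, 1.5], experts, k=2)
```

Only two of the four experts run for this input, so the compute per token stays fixed while the total parameter count grows with the number of experts.
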
[Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) is an
open-weights MoE model which is comparable to much larger models. Google DeepMind similarly show that their [Gemini 1.5 Pro](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) based on an MoE architecture is competitive with their much larger Gemini 1 Ultra. Databricks/Mosaic [DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) is another powerful MoE model and it seems that MoE is now the go-to architecture for large models.
[JetMoE](https://research.myshell.ai/jetmoe), based on the [ModuleFormer](https://arxiv.org/pdf/2306.04640.pdf) MoE conception, shows that MoEs can also be effective at smaller scales.
**D2DMoE: Dense to Dynamic-k Mixture-of-Experts Conversion**, Szatkowski et al. (2024), [pdf](https://arxiv.org/pdf/2310.04361) [code](https://github.com/bartwojcik/D2DMoE)
> While MoE models are mostly used for scaling up the parameter count, recently [MoEfication](https://arxiv.org/abs/2110.01786) has shown that static dense models can be converted to MoEs to improve execution time. D2DMoE makes further progress in improving the efficiency of these dense-to-MoE converted models by: 1) showing that the efficiency of the resulting model can be significantly enhanced by enforcement of activation sparsity in the base model; 2) proposing Expert Contribution Routing, a novel objective for the training of the gating networks, which are now tasked to predict the output norm of each expert for the given input, enabling approximation of each expert’s relative contribution; 3) introducing dynamic-k gating, which allows the model to appropriately distribute its computational budget between easy and hard inputs; 4) extending the proposed conversion scheme to any linear layers such as multi-head attention projections.
**Skywork-MoE, Skywork (2024)**
[pdf](https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf)
> An open-source MoE in the style of Switch Transformer. They detail two training tricks for getting better MoE performance. Firstly, they normalise the routing logits before they go through the softmax, which reduces the entropy of the router's output and makes the router more decisive. Secondly, they use a different auxiliary loss coefficient for each layer, tuned during training depending on how many tokens were dropped at that layer. This helps to reduce the impact of the auxiliary loss as the router becomes more balanced and confident.
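
The first trick can be sketched roughly as follows. This is an illustrative guess at the shape of the idea, not the report's exact formulation: standardising the logits and the `scale` hyperparameter are assumptions, as are all names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(ps):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in ps if p > 0)


def normalised_gate(logits, scale=1.0, eps=1e-6):
    """Standardise the logits, rescale them, then apply softmax.

    `scale` is a hypothetical hyperparameter: larger values give a
    sharper (lower entropy) routing distribution.
    """
    mean = sum(logits) / len(logits)
    var = sum((z - mean) ** 2 for z in logits) / len(logits)
    std = math.sqrt(var)
    return softmax([scale * (z - mean) / (std + eps) for z in logits])


# Nearly flat logits give an almost uniform router without normalisation.
logits = [0.02, 0.05, -0.01, 0.03]
plain = softmax(logits)
sharp = normalised_gate(logits, scale=2.0)
```

With these toy logits the normalised router keeps the same preferred expert but commits to it far more strongly.
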
**DynMoE, CUHK: Guo et al (2024)**
[pdf](https://arxiv.org/pdf/2405.14297)
[code](https://arxiv.org/pdf/2405.14297)
> One problem with the token choice MoE approach is that every token is allocated the same number of experts. Ideally for true adaptive computation this would be variable depending on how difficult the token is. The authors introduce a routing mechanism which allows for a variable number of experts per token as well as a procedure for dynamically changing the number of experts during training. This allows for better performance with less hyperparameter sweeping.
>
> Also see [Dynamic Routing in MoEs](https://arxiv.org/pdf/2403.07652)
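
One simple way to get a variable number of experts per token is to activate every expert whose score clears a threshold. The sketch below is a stand-in for that general idea and is not claimed to be the authors' gating function; the names, the sigmoid scoring and the fallback rule are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def threshold_route(scores, thresholds):
    """Activate each expert whose sigmoid score exceeds its own threshold.

    Falls back to the single best expert if none clears its threshold,
    so that every token is processed by at least one expert.
    """
    gates = [sigmoid(s) for s in scores]
    active = [i for i, (g, t) in enumerate(zip(gates, thresholds)) if g > t]
    if not active:
        active = [max(range(len(gates)), key=lambda i: gates[i])]
    return active


thresholds = [0.5, 0.5, 0.5, 0.5]  # hypothetical per-expert thresholds
easy_token = threshold_route([3.0, -2.0, -2.0, -2.0], thresholds)
hard_token = threshold_route([1.0, 0.5, 2.0, -1.0], thresholds)
```

Here the "easy" token uses one expert and the "hard" token uses three, so compute varies per token without fixing k in advance.
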
**DeepSeek-v2, DeepSeek (2024)**
[pdf](https://arxiv.org/pdf/2405.04434)
> A large (236B), performant and open MoE with lots of details about training and checkpoints available. Useful for understanding a modern MoE recipe.
**MoEs for Deep-RL, Google DeepMind: Obando-Ceron et al (2024)**
[pdf](https://arxiv.org/pdf/2402.08609.pdf)
> The authors show that MoEs can be used to improve the sample efficiency of popular RL systems such as DQN and Rainbow. The authors show that using MoEs (in particular the SoftMoE variant) improves the ultimate performance of the RL systems. Previously, scaling up the underlying models in RL systems was often wasteful in parameters, but the authors show using MoEs they can get predictable performance improvements with scale. This suggests that scaling laws for Deep RL systems could be possible.
**MoE Design Choices, EPFL: Fan et al (2024)**
[pdf](https://arxiv.org/pdf/2402.13089.pdf)
> The authors ablate some design decisions for MoEs and show their benefits compared to vanilla transformers. Unfortunately they only study very small models, so a similar analysis for larger models would likely be useful for the community.
🌟 **Routers in Vision MoEs, DeepMind: Liu et al (2024)**
[pdf](https://arxiv.org/pdf/2401.15969.pdf)
> Compares the performance of different routing mechanisms in MoEs trained on Vision tasks. They show that Language Model routers can adapt well to Vision and that for Vision tasks (where the task isn't autoregressive), Soft MoE outperforms the sparse routers. They also reframe some previous routing methods mathematically to more clearly detail the differences. Worth a read for anyone deciding which MoE approach to choose for their application.
**MoE-LLaVA, Peking University: Lin et al (2024)**
[pdf](https://arxiv.org/pdf/2401.15947.pdf)
[code](https://github.com/PKU-YuanGroup/MoE-LLaVA)
> Whilst MoEs have had much success in ViTs and LLMs, the authors also show that they can be effective in LVLMs (Large Vision Language Models). By exploiting the sparsity and increased parameter count of MoEs whilst maintaining FLOPs, we get the expected boost in both performance and hallucination avoidance.
**Mixtral of Experts, Mistral AI (2024)**
[pdf](https://arxiv.org/pdf/2401.04088.pdf)
[official code](https://github.com/mistralai/mistral-src)
> The paper describing Mixtral, Mistral AI's state-of-the-art LLM based on the MoE paradigm.
**BlackMamba (MoE-Mamba), Zyphra: Anthony et al (2024)**
[pdf](https://arxiv.org/pdf/2402.01771.pdf)
[code](https://github.com/Zyphra/BlackMamba)
[models](https://huggingface.co/Zyphra)
> The authors combine the MoE paradigm with a recent SSM-based architecture
> Mamba. Mamba provides Transformer-like performance and scaling properties
> whilst using a sub-quadratic alternative to attention, allowing for much larger
> sequence lengths. Here we see that this architecture can additionally be
> combined with MoEs to increase performance, similarly to MoEs for transformers
> or RNNs previously. For a general explainer on Mamba see [here](https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html).
🌟 **Offloading for Fast MoE Inference, Moscow: Eliseev & Mazur (2023)**
[pdf](https://arxiv.org/pdf/2312.17238.pdf)
[official code](https://github.com/dvmazur/mixtral-offloading)
> Despite some open-weights MoEs being available, they are not the most popular
> models used for inference due to the large memory footprint required at
> inference time. To address this the authors propose architecture-aware
> quantisation, an LRU cache for experts (to exploit the fact that experts are
> more likely than chance to repeat for two adjacent tokens) and a speculative
> expert loading algorithm. Since the inputs (x) to each layer only differ
> iteratively by a small amount (due to the residual stream carrying information
> from layer to layer), they note that by applying the routing function for the
> subsequent layer at the current layer, you can get a good guess for which
> experts to load. The upshot of this is that MoE models, like Mixtral, can be
> run on consumer grade hardware with much increased generation speed. This is a
> huge win for inference efficiency in the memory constrained and single batch
> regime. [MoE-Infinity](https://arxiv.org/pdf/2401.14361.pdf) is a similar offloading paradigm with code [here](https://github.com/TorchMoE/MoE-Infinity)
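
The LRU expert cache mentioned above can be sketched with the standard library. This is a minimal illustration, not the authors' implementation: the class name, `load_fn` and the string "weights" are hypothetical stand-ins for copying real expert weights from CPU or disk to the GPU.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keep at most `capacity` experts resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical slow loader (e.g. host to GPU copy)
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used expert
        return weights

    def resident(self):
        """Expert ids currently cached, from least to most recently used."""
        return list(self._cache)


cache = ExpertLRUCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 0, 1]:
    cache.get(expert_id)
```

Because adjacent tokens tend to reuse experts, even a small cache can turn a good share of loads into hits; speculative loading would additionally call `get` ahead of time for the experts predicted for the next layer.
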
🌟 **QMoE, ISTA: Frantar & Alistarh (2023)**
[pdf](https://arxiv.org/pdf/2310.16795.pdf)
[code](https://github.com/IST-DASLab/qmoe)
> Generally MoEs require larger memory footprints but fewer FLOPs compared to a dense
> model which achieves similar performance. In the quest to reduce the memory
> footprint, we might seek to perform quantisation. The authors present a
> compression method which takes advantage of the inherent sparsity to compress
> the model at a 20x compression rate whilst retaining most performance. This
> compresses each fp16 weight to the equivalent of less than one bit. For the
> first time it's possible to run a trillion parameter model on consumer
> hardware.
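
The "less than one bit" figure follows directly from the compression rate quoted above. The snippet below just works through that arithmetic; the parameter count of exactly one trillion is a round number chosen for illustration, not a figure from the paper.

```python
def bits_per_weight(original_bits, compression_ratio):
    """Average bits stored per weight after compression."""
    return original_bits / compression_ratio


def model_size_gb(num_params, bits):
    """Storage in gigabytes (10^9 bytes) for num_params weights at `bits` each."""
    return num_params * bits / 8 / 1e9


bpw = bits_per_weight(16, 20)                 # fp16 compressed 20x: about 0.8 bits
fp16_size = model_size_gb(1e12, 16)           # illustrative 1T-parameter model
compressed_size = model_size_gb(1e12, bpw)    # same model after 20x compression
```

So a model that would need on the order of two terabytes in fp16 shrinks to roughly a hundred gigabytes under these assumptions.
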
**SparseMixer - Sparse Backpropagation for MoE Training, Microsoft: Liu et al
(2023)**
[pdf](https://arxiv.org/pdf/2310.00811.pdf)
> One of the most important parts of an MoE is the router which allows the
> experts to specialise well. Unfortunately, typical MoE training gives
> suboptimal routers as suggested by Hash routers performing almost as well as
> more principled routing mechanisms. This paper suggests the reason is due to
> MoE training ignoring parts of the gradient and suggests a midpoint-rule based
> gradient approximation which substantially improves training.
**MoV/MoLoRA, Cohere For AI: Zadouri et al (2023)**
[pdf](https://arxiv.org/pdf/2309.05444.pdf),
[official Jax code](https://github.com/for-ai/parameter-efficient-moe)
> Introduces parameter efficient MoE models where instead of routing between
> entire FFN layers, we route between adapters such as LoRAs or $(IA)^3$ with
> the same base model. This allows for many of the benefits of the (Soft) MoE
> paradigm but without the huge memory footprint (particularly compared to
> previous upscaling methods). The
> [HydraMoE](https://github.com/SkunkworksAI/hydra-moe) project also takes a
> similar approach.
**Soft Merging of Experts (SMEAR), UNC: Muqeeth et al (2023)**
[pdf](https://arxiv.org/pdf/2306.03745.pdf)
> Takes the opposite approach to Soft-MoE and averages the Expert weights rather
> than the tokens. This is interesting given model merging approaches, which
> show that linear combinations of models can perform well on tasks that either
> model was trained for. Note here the averaging operation can become
> prohibitively expensive if we merge different experts for each token (similar FLOPs
> to forward passes on all experts and ensembling). Hence the method relies on a
> Task-MoE approach of picking the same expert configuration on a per-example
> rather than a per-token basis.
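
A minimal sketch of the weight-averaging idea, assuming each expert is a single weight matrix stored as nested lists; the helper names are hypothetical and the real method merges full expert parameter sets.

```python
def merge_experts(expert_weights, router_probs):
    """Router-weighted average of expert weight matrices (lists of rows)."""
    rows = len(expert_weights[0])
    cols = len(expert_weights[0][0])
    return [
        [sum(p * w[r][c] for p, w in zip(router_probs, expert_weights))
         for c in range(cols)]
        for r in range(rows)
    ]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


expert_a = [[1.0, 0.0], [0.0, 1.0]]
expert_b = [[3.0, 0.0], [0.0, 3.0]]
# One routing decision for the whole example, as in the per-example setting above.
merged = merge_experts([expert_a, expert_b], router_probs=[0.5, 0.5])
output = matvec(merged, [1.0, 2.0])
```

The merge happens once per example and is followed by a single forward pass, which is why doing it per token would cost about as much as running every expert.
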
**AutoMoE, UBC/Microsoft: Jawahar et al (2023)**
[pdf](https://arxiv.org/pdf/2210.07535.pdf),
[official PyTorch code](https://github.com/microsoft/AutoMoE)
> One of the promises of MoE is being able to apply different amounts of compute
> to each token. Generally, this has been achieved by different tokens being
> processed and dropped by different numbers of experts per layer. AutoMoE also
> uses differently sized experts to achieve more heterogeneity. They perform an
> architecture search to find optimal architectures given computational
> constraints.
🌟 **Expert Choice MoEs, Google: Zhou et al (2022)**
[pdf](https://arxiv.org/pdf/2202.09368.pdf),
[blog](https://ai.googleblog.com/2022/11/mixture-of-experts-with-expert-choice.html),
[PyTorch code](https://github.com/koayon/ml-replications/blob/main/mixture_of_experts/expert_choice_layer.py)
> Introduces a principled, truly compute-adaptive MoE model. In traditional MoE
> models the tokens select the top experts that they would most like to be
> processed by. In Expert Choice routing however, the experts choose the top
> tokens that they would like to process. Hence multiple experts can pick the
> same token and give it lots of compute, and similarly all experts can ignore a
> token so it is skipped for that layer. As well as improving training
> efficiency, this approach also has the benefits that it helps with load
> balancing and eliminates the need for auxiliary loss functions.
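
The role reversal described above can be sketched as follows, assuming a small token-by-expert affinity matrix; function names and scores are illustrative only.

```python
def expert_choice_route(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert picks its `capacity` highest-scoring tokens.
    Returns a dict mapping expert index to a sorted list of token indices.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


def experts_per_token(assignment, num_tokens):
    """Count how many experts selected each token."""
    counts = [0] * num_tokens
    for tokens in assignment.values():
        for t in tokens:
            counts[t] += 1
    return counts


scores = [[0.9, 0.8], [0.1, 0.2], [0.5, 0.7], [0.3, 0.1]]  # 4 tokens, 2 experts
assignment = expert_choice_route(scores, capacity=2)
counts = experts_per_token(assignment, num_tokens=4)
```

Every expert processes exactly `capacity` tokens, so load is balanced by construction, while the per-token expert count varies (here two tokens get both experts and two are skipped).
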
🌟 **Task Level MoEs, Various (2022)**
[DeMix pdf](https://arxiv.org/pdf/2108.05036.pdf),
[Task-MoE pdf](https://arxiv.org/pdf/2110.03742.pdf)
[c-BTM code](https://github.com/kernelmachine/cbtm)
> Instead of routing each token separately these approaches use the same Expert
> for entire documents based on the task (which is supplied to the network).
> Instead of learning the routing, we supply the routing based on what we know
> about the tasks, introducing our own inductive bias. Also note that this offers
> memory footprint benefits at inference time - if inference is for a limited
> set of tasks, we only need enough GPU memory for these experts.
> [ELMForest - Branch, Train, Merge (BTM)](https://arxiv.org/pdf/2208.03306.pdf),
> [c-BTM](https://arxiv.org/pdf/2303.14177.pdf) and
> [Branch, Train, Mix (BTX)](https://arxiv.org/pdf/2403.07816.pdf)
> are follow-ups which use
> ensembling approaches from multiple LMs trained independently in a continual
> learning approach.
<!-- There are possibly additional benefits to combining task and token level experts. We could input a task and use this to decide which routers that we want to use in our MoE layers see [Multi-gate](https://dl.acm.org/doi/pdf/10.1145/3219819.3220007).
Alternatively we could concatenate task information to the input of the router network so it can use it if it wants.
It's not clear whether having inductive bias would be better. -->
**No Language Left Behind, Meta (2022)**
[pdf](https://arxiv.org/abs/2207.04672),
[official PyTorch code](https://github.com/facebookresearch/fairseq/tree/nllb)
> Translation is a natural setting for MoEs - some but not all of the parameters
> for English to Chinese translation may be relevant in English to French
> translation as well. But using all of the English to Chinese knowledge might
> confuse the model. MoE therefore has useful inductive biases to allow this
> model to use only the relevant parts. Here the researchers show that the MoE
> approach scales even for extremely low-resource languages. Translation may be
> a natural environment for task/document-level rather than token-level routing.
**Hash Routing, Meta: Roller et al (2021)**
[pdf](https://browse.arxiv.org/pdf/2106.04426.pdf)
> Uses a fixed hash-based rather than learned routing per input token and shows similar
> results to more principled routing methods in some regimes. Suggests that
> previous routing methods may be somewhat under-optimised.
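
A fixed routing of this kind can be sketched as a lookup table built once before training. The random table below is one simple choice of hash and is an assumption for illustration; the function names are hypothetical.

```python
import random


def build_hash_table(vocab_size, num_experts, seed=0):
    """Assign every vocabulary id to an expert once, with no learned parameters."""
    rng = random.Random(seed)
    return [rng.randrange(num_experts) for _ in range(vocab_size)]


def hash_route(token_ids, table):
    """Look up the expert for each token id; the mapping never changes."""
    return [table[t] for t in token_ids]


table = build_hash_table(vocab_size=100, num_experts=4, seed=0)
routes = hash_route([5, 7, 5], table)
```

The same token id always lands on the same expert regardless of context, which is what makes it a useful baseline for judging learned routers.
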
**Switch Transformers, Google: Fedus et al (2021)**
[pdf](https://arxiv.org/pdf/2101.03961.pdf),
[review paper](https://arxiv.org/pdf/2209.01667.pdf),
[PyTorch code](https://nn.labml.ai/transformers/switch/index.html),
[model](https://huggingface.co/docs/transformers/model_doc/switch_transformers)
> Simplifies the MoE routing algorithm with top-1 routing. Shows that we can
> exploit the scaling laws with parameters as well as simply compute and
> develops a distributed systems approach to MoE.
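
Top-1 routing is usually paired with an auxiliary load-balancing loss of the form `alpha * N * sum(f_i * P_i)`, where `f_i` is the fraction of tokens sent to expert `i` and `P_i` is the mean router probability for expert `i`. The sketch below assumes that form; the function names and the default `alpha` are illustrative.

```python
def switch_route(router_probs):
    """router_probs[t][e] is the router probability; return the top-1 expert per token."""
    return [max(range(len(p)), key=lambda e: p[e]) for p in router_probs]


def load_balancing_loss(router_probs, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, which is smallest when routing is uniform."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    assigned = switch_route(router_probs)
    f = [assigned.count(e) / num_tokens for e in range(num_experts)]
    p_mean = [sum(p[e] for p in router_probs) / num_tokens for e in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p_mean))


balanced = [[0.9, 0.1], [0.1, 0.9]]   # tokens split evenly across two experts
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # every token goes to expert 0
balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
```

The collapsed router is penalised more heavily than the balanced one, which nudges training away from sending every token to the same expert.
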
**WideNet (Go Wider Instead of Deeper), NUS: Xue et al (2021)**
[pdf](https://arxiv.org/pdf/2107.11817.pdf)
> Suggests a parameter sharing approach using a single layer of multiple MoEs
> repeated multiple times as transformer blocks (similar to the Universal
> Transformer but with MoEs). This results in a deep model which has
> O(expert_num) instead of O(layer_depth) parameters. They achieve SoTA results
> with fewer parameters than previous models. More recently, Apple's
> [One Wide Feedforward paper](https://arxiv.org/pdf/2309.01826.pdf) details the
> amount of redundancy across layers. This suggests this approach is
> increasingly fruitful for on-device models.
🌟 **Outrageously Large Neural Networks (aka The Sparse MoE Layer), Google:
Shazeer et al (2017)** [pdf](https://arxiv.org/pdf/1701.06538.pdf)
> Introduces Mixture of Experts models in their modern form using a Sparsely-Gated
> MoE layer and a trainable gating network. They use RNNs as this is pre
> "Transformers Eating The World".
<!-- A Noam Shazeer, Geoff Hinton collab - two true legends of Deep Learning -->
## Other Modular Architectures
**Stylus Diffusion Adapter Selection, UC Berkeley: Luo et al (2024)**
[pdf](https://arxiv.org/pdf/2404.18928),
[repo](https://github.com/stylus-diffusion/stylus)
> Diffusion model users often use adapters rather than full finetunes to achieve models which perform well on a particular style. The authors here automatically select and compose relevant adapters for the prompt using a model routing approach. A nice application of inference-time routing which we might expect to become more commonplace in the future.
**MoDE - CLIP Data Experts via Clustering, Meta: Ma et al (2024)**
[pdf](https://arxiv.org/pdf/2404.16030),
[pytorch code](https://github.com/facebookresearch/MetaCLIP/tree/main/mode)
> The authors apply a Task-specific MoE based on training parallel CLIP models on restricted domains and ensembling these together. They show increased performance with less compute and note the ability to add in new "experts" for new domains asynchronously and after initial training as a Continual Learning play.
🌟 **Mixture of Depths, DeepMind: Raposo et al (2024)**
[pdf](https://arxiv.org/pdf/2404.02258.pdf),
> The traditional Early-Exit formulation allows "easier" tokens to not go
> through the whole network, reducing compute but it has a couple of problems.
> Firstly, it's not clear that the layers that you want to skip are necessarily
> the final ones (perhaps an easy token should skip some middle layers instead)
> and secondly, Early Exit might not be that much faster in practice on GPUs due
> to the variable compute graph. In an attempt to align Early-Exit work with the
> Hardware Lottery, the authors suggest enforcing a compute budget but fixing the
> computational graph allowing dynamic allocation of FLOPs across tokens in the
> sequence, optimising the allocation along the sequence for different layers
> across the model depth. Because they opt for an expert-choice routing mechanism,
> they also introduce novel sampling methods to ensure validity for autoregressive
> generation which seem broadly applicable to other MoE models.
> This builds on the [LayerDrop](https://arxiv.org/pdf/1909.11556.pdf) approach to structured dropout.
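
A simplified sketch of the fixed-budget idea: at each layer only the top-scoring tokens go through the block, and the rest skip it via the residual path. Tokens are plain floats here, the names are hypothetical, and scaling the block output by the router weight is left out for brevity.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity):
    """Apply `block` (with a residual connection) to the `capacity` highest-scoring tokens.

    All other tokens pass through unchanged, so the number of tokens
    processed per layer is fixed and known ahead of time.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:capacity])
    return [x + block(x) if i in selected else x for i, x in enumerate(tokens)]


tokens = [1.0, 2.0, 3.0, 4.0]
router_scores = [0.1, 0.9, 0.4, 0.8]  # illustrative per-token router outputs
output = mixture_of_depths_layer(tokens, router_scores, block=lambda x: 10 * x, capacity=2)
```

Because `capacity` is a constant, the tensor shapes stay static, which is what keeps this approach friendly to GPU execution.
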
**Fast FeedForward (FFF), ETH Zurich: Belcak et al (2023)**
[pdf1](https://arxiv.org/pdf/2308.14711.pdf),
[pdf2](https://arxiv.org/pdf/2311.10770.pdf),
[official pytorch code1](https://github.com/pbelcak/fastfeedforward),
[official pytorch code2](https://github.com/pbelcak/UltraFastBERT),
[pytorch code](https://github.com/sap-ient-ai/FFF)
> Instead of the usual FeedForward Network, the authors propose a balanced tree
> structure where depending on your path through the tree, a different function
> is applied to the input. Inputs go either left or right through the tree
> depending on the result of a dot product with a learned discriminating vector.
> This approach results in encouraging performance with somewhat limited
> inference FLOPs but high training FLOPs and unstructured inference sparsity
> which must be applied sequentially which falls foul of the
> [Hardware Lottery](https://arxiv.org/pdf/2009.06489.pdf) and doesn't
> parallelise nicely on GPUs.
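
A rough sketch of the tree routing described above, assuming a balanced binary tree stored in heap order. The names `node_vectors` and `leaf_fns` are illustrative and not taken from the official code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fff_forward(x, node_vectors, leaf_fns, depth):
    """Route x through a balanced binary tree and apply the leaf it reaches."""
    node = 0
    for _ in range(depth):
        # Go right if the dot product with this node's learned vector is positive.
        go_right = dot(x, node_vectors[node]) > 0
        node = 2 * node + (2 if go_right else 1)
    # Internal nodes occupy indices 0 .. 2**depth - 2; leaves come after them.
    leaf_index = node - (2 ** depth - 1)
    return leaf_fns[leaf_index](x)


node_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
leaf_fns = [lambda x: "LL", lambda x: "LR", lambda x: "RL", lambda x: "RR"]
path_a = fff_forward([1.0, -1.0], node_vectors, leaf_fns, depth=2)
path_b = fff_forward([-1.0, 1.0], node_vectors, leaf_fns, depth=2)
```

Only `depth` dot products are evaluated per input, but each decision depends on the previous one, which is the sequential dependency mentioned above.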
🌟 **Soft-MoE, Google DeepMind: Puigcerver et al (2023)**
[pdf](https://arxiv.org/pdf/2308.00951.pdf),
[pytorch code](https://github.com/fkodom/soft-mixture-of-experts/blob/main/soft_mixture_of_experts/soft_moe.py)
> Instead of Sparse MoE models, they allow each expert to use its router to
> select weights for a weighted average of input tokens that it wants to
> process. They show SoTA results on Image recognition tasks. Note since the
> approach relies on Expert Choice, it doesn't yet generalise to autoregressive
> generation.
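
As a hedged illustration of the weighted-average idea, the sketch below computes the input for a single expert slot. The function names are made up for this example, and the softmax over tokens is a simplification of the full method.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def slot_input(tokens, logits):
    """Weighted average of all tokens for one expert slot.

    tokens: list of equal-length vectors
    logits: one router logit per token for this slot
    """
    weights = softmax(logits)  # normalised over the tokens in the sequence
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]


mixed = slot_input([[1.0, 0.0], [0.0, 1.0]], logits=[0.0, 0.0])
```

Every slot mixes information from all tokens in the sequence, including later ones, which is why the approach does not directly carry over to autoregressive generation.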
## Early Exit: End-to-End Adaptive Computation
Early Exit approaches ask whether we can get the output of a neural network without going
through all the layers, particularly if faced with an easier example. This is
typically done by learning an exit probability at each layer.
**EE-LLM, Alibaba: Chen et al (2024)**
[pdf](https://arxiv.org/pdf/2312.04916)
[pytorch code](https://github.com/pan-x-c/EE-LLM)
> The authors extend Megatron into a library which natively supports the Early Exit paradigm taking full advantage of 3D parallelism. Other contributions include methods to efficiently facilitate backprop even when some layers may be unused and methods to handle the fact that using naive early exit would result in missing KV-caches for some tokens. They find that limiting the Early Exit layers to a few intermediate layers substantially improves performance.
**Sparse Universal Transformer (SUT), MILA: Tan et al (2023)**
[pdf](https://arxiv.org/pdf/2310.07096.pdf)
> Combines the Universal Transformer approach (RNN with transformer blocks) with
> the Mixture of Experts paradigm (multiple experts instead of a single FFN
> layer). They also use a new stick-breaking-based dynamic halting mechanism.
> This brings all the benefits of Sparse MoEs (such as less inference compute
> whilst having a lot of parameters) and the benefits of Universal Transformer
> (parameter efficiency, Turing-completeness and generalization ability)
> together.
**AdaTape, Google: Xue et al (2023)**
[pdf](https://arxiv.org/pdf/2301.13195.pdf),
[blog](https://ai.googleblog.com/2023/08/adatape-foundation-model-with-adaptive.html),
[official jax code](https://github.com/google-research/scenic/blob/main/scenic/projects/adatape/adatape_vit/adatape_vit.py)
> Extends the ACT method by giving the model a "tape" which contains some
> additional inputs which may be useful for encoding. For each token the model
> may append a variable number of tape tokens to the input, which allows it to
> regulate how much additional compute we add. The paper shows impressive
> performance on image classification tasks and the 'parity' task on long
> sequences.
**Dataset Pruning Using Early Exit Networks, Görmez et al (2023)**
[pdf](https://openreview.net/pdf?id=Kh114370zL)
> Early Exit Networks naturally learn which input examples are "easy" (can be
> exited early) or "difficult" (require all the layers of the network). The
> authors use this property to prune datasets to use for training and
> finetuning. The algorithm EEPrune achieves SOTA performance for dataset
> pruning in some regimes.
🌟 **PonderNet, DeepMind: Banino et al (2021)**
[pdf](https://arxiv.org/pdf/2107.05407.pdf),
[PyTorch code](https://github.com/koayon/ml-replications/tree/main/ponder)
> Allows the model to exit after each transformer layer if it's confident in the
> answer. It introduces a stable probabilistic policy for halting which provides
> low-variance unbiased gradient updates. This can also be combined with the
> [SkipNet](https://arxiv.org/pdf/1711.09485) paradigm where, instead of
> exiting directly, we skip to the final few layers to allow our universal
> computation (applied to all inputs) to be at the end as well as the start of
> the network.
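
To make the probabilistic halting policy concrete, here is a small sketch of how per-layer halting probabilities turn into a distribution over exit steps. Assigning the leftover mass to the final step is an assumption made here so that the toy distribution sums to one.

```python
def halting_distribution(lambdas):
    """p_n = lambda_n * prod_{j<n} (1 - lambda_j), leftover mass on the last step."""
    probs, still_running = [], 1.0
    for lam in lambdas:
        probs.append(still_running * lam)  # halt now, given no halt so far
        still_running *= 1.0 - lam
    probs[-1] += still_running  # force a halt at the final step
    return probs


probs = halting_distribution([0.5, 0.5, 0.5])
expected_steps = sum((n + 1) * p for n, p in enumerate(probs))
```

The expected number of steps is one way to read off how much compute the model plans to spend on an input.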
**PaBEE, DeepMind: Zhou et al (2020)**
[pdf](https://arxiv.org/pdf/2006.04152.pdf),
[official PyTorch code](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bert-loses-patience)
> Introduces Patient Early Stopping. Whilst ACT has a learned exit probability,
> PABEE instead looks at the output class if it _were_ to exit. We exit if the
> intermediate outputs are the same over multiple layers. Interestingly they
> suggest that the reason for this isn't just speed; they suggest that early
> stopping will _improve_ performance due to lower risk of "overthinking"
> (analogously to stopping training earlier to prevent overfitting).
> [F-PaBEE](https://arxiv.org/pdf/2305.11916.pdf) presents a slightly more
> flexible approach based on similarity scores.
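
The patience rule can be sketched in a few lines. Here `layer_predictions` is a hypothetical list holding the class each intermediate classifier would output, computed up front for simplicity (a real model would stop computing once it exits).

```python
def patience_exit(layer_predictions, patience):
    """Return (prediction, layers_used) under a patience-based exit rule."""
    streak, previous = 0, None
    for depth, pred in enumerate(layer_predictions, start=1):
        # Count consecutive layers whose prediction matches the previous layer.
        streak = streak + 1 if pred == previous else 0
        previous = pred
        if streak >= patience:
            return pred, depth
    # No stable run was found, so fall back to the final layer's prediction.
    return previous, len(layer_predictions)


preds = ["cat", "dog", "dog", "dog", "dog", "bird"]
result = patience_exit(preds, patience=2)
```

In this toy run the model exits at layer 4 and never reaches the final layer, whose prediction differs from the stable intermediate one.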
**Universal Transformer, Google: Dehghani et al (2019)**
[pdf](https://arxiv.org/pdf/1807.03819.pdf)
> Reuses the transformer block recurrently across multiple layers with an
> ACT-like halting mechanism. RNNs can be better than transformers at length
> extrapolation but here we get the best of the Transformer (training
> parallelizability) and the best of the RNN (recurrent inductive bias).
> Universal Transformers can also be shown to be Turing-complete.
**Adaptive Computation Time (ACT) for RNNs, Google: Graves (2016)**
[pdf](https://arxiv.org/pdf/1603.08983.pdf)
> Introduces the ACT approach for models to learn how many computational steps
> they should take before returning an output. This approach is built on and
> refined in many later papers such as PonderNet.
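
A minimal sketch of ACT-style halting, assuming the per-step halting probabilities have already been computed. The threshold `epsilon` and the function name are illustrative choices.

```python
def act_weights(halting_probs, epsilon=0.01):
    """Per-step weights: stop once cumulative halting mass reaches 1 - epsilon."""
    weights, total = [], 0.0
    for i, h in enumerate(halting_probs):
        is_last = i == len(halting_probs) - 1
        if is_last or total + h >= 1.0 - epsilon:
            weights.append(1.0 - total)  # the remainder goes to the final step
            break
        weights.append(h)
        total += h
    return weights


weights = act_weights([0.3, 0.4, 0.5, 0.9])
```

The output of the network is then a weighted combination of the intermediate states using these weights, and the number of weights is the number of steps taken.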
## More Compute Per Output Token
**Masked Diffusion Language Models, Cornell: Sahoo et al (2024)**
[pdf](https://arxiv.org/pdf/2406.07524),
[pytorch code](https://github.com/kuleshov-group/mdlm)
[video](https://www.youtube.com/watch?v=WjAUX23vgfg)
> Autoregressive models sample a single token at a time, regardless of how difficult this token is to predict. Diffusion models however have the benefit that the number of steps from the noised input to the final output can be varied which acts as a knob controlling the amount of compute applied. The authors introduce a simplified method for Diffusion Language models based on BERT which achieves better perplexity than previous Diffusion Language models (though not at autoregressive model levels). This avenue provides a different approach to varying compute per output token.
🌟 **Quiet-STaR, Stanford: Zelikman et al (2024)**
[pdf](https://arxiv.org/pdf/2403.09629.pdf),
> One of the core motivations of Adaptive Computation is noting that for
> difficult tokens we should spend more compute. There have been prompting
> ways to do this (e.g. Chain of Thought) and recurrent ways to do this (e.g.
> Universal Transformers) but ideally we'd want the LLM to just start writing
> down more tokens, using all faculties on difficult tokens, without being told
> when to apply this technique and in a natural next-token prediction way that
> takes advantage of its pretraining. Quiet-STaR is exactly that. The model
> generates hidden `rationale` tokens which it can use to reason but don't get
> shown to the user (or loss function). This generalises the previous `STaR` work
> by the same authors and the `Pause Token` results in a way that is much more
> generally effective.
> This is the real deal folks! I'm extremely excited about this approach.
> And it's a cracked team too, they start the paper with a quote from
> Danish philosopher Søren Kierkegaard. An excellent formulation and one of
> those papers that makes you realise why you got into this field.
## Adaptive Computation for Black-box models
For black box pre-trained models, perhaps those behind an API, there are some
techniques for using Adaptive Computation. These are promising techniques for
those with limited compute budgets.
Prompting techniques such as
[Reflexion](https://browse.arxiv.org/pdf/2303.11366.pdf),
[Debate](https://browse.arxiv.org/pdf/2305.14325.pdf),
[Chain of Thought](https://browse.arxiv.org/pdf/2201.11903.pdf),
[Tree of Thought](https://browse.arxiv.org/pdf/2305.10601.pdf) and
[Chain of Verification](https://browse.arxiv.org/pdf/2309.11495.pdf) can also be
used to improve performance for Black-box models.
**Online Speculative Decoding, Berkeley: Liu et al (2024)**
[pdf](https://arxiv.org/pdf/2310.07177),
[pytorch code](https://github.com/LiuXiaoxuanPKU/OSD)
> The authors propose an Active Learning approach to choosing draft models for speculative decoding. In downtime when the GPUs are not maxed out for inference, they use the capacity to instead finetune a draft model on the recent outputs from the large teacher model. In this way the system can respond to distribution shift and still stay performant by accepting more tokens from the draft model. Depending on the distribution shift this can result in latency reductions going from 1.22x (naive speculative decoding under distribution shift) to 3.06x (their method).
🌟 **Scaling LLM Test-Time Compute, DeepMind: Snell et al (2024)**
[pdf](https://arxiv.org/pdf/2408.03314),
> They test two strategies for using test-time compute: (1) searching against dense, process-based verifier reward models in a tree-like fashion and (2) utilising Dynamic Evaluation-style updating of the model’s distribution at test time given the prompt. They find that using these strategies they're able to achieve a 4x improvement over using best-of-N with the same compute budget. Even more strikingly, they find a 14x improvement over FLOP-matching with using a larger model. This is a huge win for Adaptive Compute-style approaches. On a different note the paper's motivating setup and styling are well executed which makes the paper a nice read.
🌟 **Jacobi Consistency Large Language Models (CLLMs), SJTU: Kou et al (2024)**
[pdf](https://arxiv.org/pdf/2403.00835),
[blog](https://hao-ai-lab.github.io/blogs/cllm/)
> An interesting approach to the multi-token prediction problem. They give a language model a prompt
> and a random k-tokens to come next. They then run a forward pass: this will give the "correct"
> next token (i.e. directly after the prompt) but there's some chance it also updates one of the following
> tokens to being correct as well. They use a consistency loss (similar to in Diffusion models) to improve the trajectory
> from [k random tokens] --> [k correct tokens] very similar to a diffusion approach. Note that this can take at most k
> forward passes but ideally can be done in fewer forward passes.
> The aim of the game is to repeatedly put the k tokens through the model until they reach a fixed point (i.e. a forward pass doesn't change them). (Also note that on this final forward pass we also have the k+1th token returned). This is an interesting approach that changes fundamentally the language modelling objective from autoregressive prediction to predicting trajectories for multi-token sequences.
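
The fixed-point iteration can be illustrated with a toy deterministic "model". In this sketch `next_token` is a hypothetical greedy next-token function, and the list comprehension stands in for what a real model would compute in a single parallel forward pass.

```python
def jacobi_decode(next_token, prompt, guess):
    """Iterate a k-token guess to a fixed point; return (tokens, num_passes)."""
    guess, passes = list(guess), 0
    while True:
        passes += 1
        # One "forward pass": every position is refreshed from the prompt plus
        # the *previous* guess for the positions before it.
        updated = [next_token(prompt + guess[:i]) for i in range(len(guess))]
        if updated == guess:
            return guess, passes
        guess = updated


# Toy model: the next token is always the previous token plus one (mod 10).
toy_model = lambda prefix: (prefix[-1] + 1) % 10
bad_start = jacobi_decode(toy_model, prompt=[3], guess=[0, 0, 0])
good_start = jacobi_decode(toy_model, prompt=[3], guess=[4, 5, 0])
```

The pass count here includes the final pass that confirms nothing changed. A guess that already has a correct prefix converges in fewer passes, which is the behaviour the consistency training tries to encourage.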
🌟 **Multi-Token Prediction, Meta: Gloeckle et al (2024)**
[pdf](https://arxiv.org/pdf/2404.19737)
> Traditionally LLMs predict one token at a time. This is somewhat inhuman and
> inefficient because often once the start of a word/phrase is predicted, the end
> is trivial. The authors here treat subsequent token prediction as an auxiliary task
> and train additional heads to predict further tokens. The real benefit of this approach
> isn't in inference though but in training. Pre-training with this auxiliary task is more
> sample efficient and forces the LM to learn better representations for medium-term dependencies.
> In generative tasks such as coding, these models substantially outperform traditional LLMs with lower latency.
**Many-Shot In-Context Learning, Google DeepMind: Agarwal et al (2024)**
[pdf](https://arxiv.org/pdf/2404.11018.pdf)
> It has been long observed that language models can learn how to do a new
> task from examples of inputs, reasoning chains and outputs. This is known
> as few-shot Chain of Thought (CoT).
> Historically, the number of examples has been limited by the context window
> though. In this work, the authors suggest that using hundreds or thousands of
> examples (typically model generated) can aid performance and out-of-distribution
> robustness via the In-Context Learning mechanism. In other words, they formalise
> another way to turn inference time compute into better performance.
> Can be added to your DSPy program for long-context models.
🌟 **Martian LLM Router, Martian: Hu et al (2024)**
[pdf](https://arxiv.org/pdf/2403.12031.pdf),
[blog](https://withmartian.com/)
> Martian provide the first LLM router, which dynamically routes queries
> to the best LLM in real-time, to achieve higher performance and lower cost
> than any individual API. They're able to choose models which might be better
> at a single task and to route away from powerful expensive models when a cheaper
> one will suffice. In order to choose which model to use, they use a new interpretability
> technique known as [model mapping](https://blog.withmartian.com/post/mission#:~:text=Understanding%20Models%20Through%20Model%20Mapping).
> Worth paying attention to.
> Chip Huyen discusses model routing approaches [here](https://huyenchip.com/2024/02/28/predictive-human-preference.html)
**EAGLE, Peking/Microsoft: Li et al (2024)**
[pdf](https://arxiv.org/pdf/2401.15077.pdf),
[PyTorch code](https://github.com/SafeAILab/EAGLE)
> An improvement to speculative decoding which uses the fact that upper
> layers in the model have good features for multiple tokens ahead to predict
> future tokens from the current one without using all the layers.
> This approach is typically 50% faster than previous single-model speculative
> decoding efforts and 3x faster than vanilla decoding.
🌟 **Contrastive Decoding, Stanford: Li et al (2023)**
[pdf](https://arxiv.org/pdf/2210.15097.pdf),
[pdf2](https://arxiv.org/pdf/2309.09117.pdf)
> A small helper model generates tokens alongside the main model. Tokens are
> up-weighted if the large model finds them proportionally much more plausible
> than the small model. This approach improves the quality of open-ended
> generations and reasoning ability. To extend this method towards additionally
> adaptive computation, smaller contrastive models could be applied
> conditionally depending on the input.
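
A small sketch of the up-weighting rule for a single decoding step, assuming both models expose next-token probabilities as dictionaries with non-zero entries. The threshold `alpha` is an illustrative value for the plausibility cut-off.

```python
import math


def contrastive_pick(p_large, p_small, alpha=0.1):
    """Pick the token maximising log p_large - log p_small among plausible tokens."""
    # Plausibility constraint: only consider tokens the large model rates at
    # least alpha times as likely as its own top choice.
    cutoff = alpha * max(p_large.values())
    candidates = [t for t, p in p_large.items() if p >= cutoff]
    return max(candidates, key=lambda t: math.log(p_large[t]) - math.log(p_small[t]))


p_large = {"the": 0.50, "Paris": 0.49, "xq": 0.01}
p_small = {"the": 0.80, "Paris": 0.1999, "xq": 0.0001}
choice = contrastive_pick(p_large, p_small)
```

Without the cut-off, the rare token `"xq"` would win on ratio alone even though the large model considers it very unlikely.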
🌟 **Speculative Sampling, DeepMind: Chen et al (2023)**
[pdf](https://arxiv.org/pdf/2302.01318.pdf),
[pdf2](https://arxiv.org/pdf/2211.17192.pdf),
[blog](https://jaykmody.com/blog/speculative-sampling/),
[PyTorch code](https://github.com/jaymody/speculative-sampling)
[PyTorch blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/)
> A smaller model generates multiple tokens autoregressively and then a larger
> model checks the smaller model against what it would have generated (all in
> one go). We accept only the tokens where the two models agree (by some
> acceptance criteria) and then the larger model's next token. This gives
> exactly the same output as the larger model would have but with significantly
> reduced sampling time. This takes advantage of the fact that we can
> parallelise evaluation whilst generation happens token by token. Additionally
> [Online Speculative Decoding](https://arxiv.org/pdf/2310.07177.pdf) suggests
> we can use any excess compute (at inference time) to retrain the small model
> online on the query distribution with teacher-student distillation.
> Note that the small model need not be a transformer:
> [Recurrent Drafter](https://arxiv.org/pdf/2403.09919.pdf) from Apple suggest
> using a fast RNN for speculative decoding and [large n-gram models](https://arxiv.org/pdf/2401.17377.pdf)
> could also be used as a non-parametric approach. Indeed [REST](https://www.semanticscholar.org/paper/REST%3A-Retrieval-Based-Speculative-Decoding-He-Zhong/532c2c7a247d9e97d20abec1b2f4612984fdab93) suggest retrieving follow-on tokens from the web for the speculative decoding head.
> See also [Accelerated Speculative Sampling (ASpS) with Tree Monte Carlo](https://openreview.net/pdf?id=stMhi1Sn2G) (or [video](https://www.youtube.com/watch?v=53VqZFmOSB8)) for further improvements to this method.
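
The acceptance step can be sketched as below, assuming the draft distribution `q` and target distribution `p` are available as per-position dictionaries. The function names are illustrative and the toy distributions are made up for the example.

```python
import random


def speculative_step(draft_tokens, q_probs, p_probs, rng):
    """Return (accepted prefix, index of first rejection or None)."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p_probs[i][tok] / q_probs[i][tok]):
            accepted.append(tok)
        else:
            return accepted, i
    return accepted, None


def residual_distribution(p, q):
    """Distribution to resample from after a rejection: normalised max(0, p - q)."""
    raw = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}


q = [{"a": 1.0}, {"b": 1.0}]
p = [{"a": 1.0}, {"b": 0.0, "c": 1.0}]
accepted, rejected_at = speculative_step(["a", "b"], q, p, random.Random(0))
resample_from = residual_distribution(p[1], q[1])
```

Resampling the rejected position from the residual distribution is what keeps the overall output distributed as if the larger model had sampled every token itself.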
<!-- The general principle here is that it's easier to evaluate than to generate. -->
**FrugalGPT, Stanford: Chen et al (2023)**
[pdf](https://arxiv.org/pdf/2305.05176.pdf)
> Details various approaches for fully black box adaptive computation (i.e. from
> an API where you don't even get logits). They use an LLM Cascade strategy
> where given a prompt they select n models to try sampling with, in order of
> increasing parameter count. The first model samples and we check the
> generation with a scoring function. If the generation is rejected then we
> generate with a more capable model. We continue this process until we accept
> the generation or are using the largest model. Interestingly this approach
> provides some shielding against
> [inverse scaling](https://arxiv.org/pdf/2306.09479.pdf) problems. They also
> use completion caching.
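
The cascade strategy might look roughly like this, where `models`, `score` and `threshold` are hypothetical placeholders for API calls, a learned scoring function and an acceptance cut-off.

```python
def cascade(prompt, models, score, threshold):
    """Try models from cheapest to most capable; stop at the first accepted answer."""
    for i, model in enumerate(models):
        answer = model(prompt)
        is_last = i == len(models) - 1
        # Accept if the scorer is satisfied, or if there is no larger model left.
        if is_last or score(prompt, answer) >= threshold:
            return answer, i


small = lambda prompt: "idk"
large = lambda prompt: "42"
scorer = lambda prompt, answer: 0.0 if answer == "idk" else 1.0
answer, model_used = cascade("What is 6 * 7?", [small, large], scorer, threshold=0.5)
```

Easy prompts stop at the cheap model, so the expensive model is only paid for when the scorer rejects the cheaper generation.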
**Beam Search, Google: Sutskever et al (2014)**
[pdf](https://browse.arxiv.org/pdf/1409.3215.pdf)
> Beam search allows LMs to see the probability of choosing a few tokens at a
> time before selecting one by building out a tree. Increasing the number of
> beams increases the number of options explored downstream and hence the amount
> of compute per token.
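
A compact sketch of beam search over a toy next-token table. Here `next_probs` is a hypothetical function returning a dictionary of non-zero next-token probabilities for a given prefix.

```python
import math


def beam_search(next_probs, start, steps, num_beams):
    """Keep the `num_beams` highest log-probability sequences at each step."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = [
            (logp + math.log(p), seq + [tok])
            for logp, seq in beams
            for tok, p in next_probs(seq).items()
        ]
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]


def toy_probs(seq):
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"z": 1.0},
    }
    return table[seq[-1]]


greedy = beam_search(toy_probs, "<s>", steps=2, num_beams=1)
wider = beam_search(toy_probs, "<s>", steps=2, num_beams=2)
```

With one beam the search commits to `"a"` and ends with a sequence of probability 0.3, whereas two beams find the sequence through `"b"` with probability 0.4, at roughly double the compute per step.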
<!-- Test-Time Compute is the System 2. We want to convert time to think into better outputs. We want to do structured reasoning over trees of possibilities. We want to convert time into better accuracy so when you think for longer, you get monotonically more accurate outputs. -->
## Continual Learning
🌟 **Lifelong-MoE, Google DeepMind: Chen et al (2023)**
[pdf](https://arxiv.org/pdf/2305.12281.pdf)
> Trains a language model for multiple tasks by training for one task, freezing
> these weights and then adding some additional layers which can help to train
> the next task (in combination with the frozen layers). This treats pretrained
> weights more like an API (which you can use but not edit) when training a
> model to do a new task. This helps to eliminate the catastrophic forgetting
> that can happen with naive finetuning.
**Sparse Upcycling, Google Research: Komatsuzaki et al (2023)**
[pdf](https://arxiv.org/pdf/2212.05055.pdf)
> Shows that you can use pre-trained dense model checkpoints as an
> initialisation for training sparse MoEs. This reduces the overall compute
> budget needed and reduces the sunk costs for already trained models. Sparse
> upcycling can be viewed as an efficient form of finetuning which converts a
> pretrained dense model to a sparse model for inference.
**MuNet, Google: Gesmundo et al (2022-23)**
[pdf](https://arxiv.org/pdf/2205.10937.pdf),
[pdf2](https://arxiv.org/pdf/2205.12755.pdf),
[pdf3](https://arxiv.org/pdf/2209.14745.pdf),
[pdf4](https://arxiv.org/pdf/2302.02721.pdf),
[official jax code](https://github.com/google-research/google-research/tree/master/muNet)
> Defines an evolutionary algorithm which adds different tasks onto an existing
> base model by (1) inserting adapter layers, (2) changing hyperparameters, (3)
> freezing layers and (4) copying layers to retrain. An interesting sketch of
> what Adaptive Computation could look like in the future.
## Tools & Agents
One way of varying compute is to call out to an external API on some tokens for
parts of completions.
**SWE-Agent: Princeton, Yang et al (2024)**
[code](https://github.com/princeton-nlp/SWE-agent)
[demo](https://swe-agent.com/)
> An AI Software Engineer (à la Devin) which takes a GitHub issue and autonomously tries to fix it. Operates fast (couple of minutes) and performs well on the SWE-bench benchmark. One of the first AI agents to actually work in the real world and it's open-source.
🌟 **LLM-Powered Autonomous Agents, OpenAI: Lilian Weng (2023)**
[blog](https://lilianweng.github.io/posts/2023-06-23-agent/)
> An overview of agents as general problem-solvers powered by LLMs such as
> [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT) and
> [GPT-Engineer](https://github.com/AntonOsika/gpt-engineer). Agents typically
> can act within the world and are augmented with the ability to do explicit
> long term planning (via decomposing goals into sub-goals and learning from its
> mistakes), long-term memory (via a vector database) and tool use (calling
> external APIs).
**ChatGPT Plugins, OpenAI (2023)**
[blog](https://openai.com/blog/chatgpt-plugins),
[demo](https://chat.openai.com/?model=gpt-4)
> GPT-4 has access to plugins for tasks where it would be better suited to call
> an API. Examples include Code Interpreter, web browser and Wolfram Alpha.
<!-- RETRO, DeepMind:
> k-Nearest Neighbour approaches
-->
**Toolformer, Meta: Schick et al (2023)**
[pdf](https://arxiv.org/pdf/2302.04761.pdf),
[pdf2](https://arxiv.org/pdf/2305.17126.pdf)
> Trained models to decide which APIs to call, when to call them, what arguments
> to pass, and how to best incorporate the results into future token prediction.
> Effectively the LMs teach themselves how to use tools. In the limit case of
> this we simply require LMs/agents to be able to ask the right questions, know
> where to ask them and possibly be able to interpret the answers they receive.
> In other words, we offload the actual computation to external APIs (which may
> themselves be ML models) and use much smaller base models.
## Games
🌟 **Libratus: heads-up no-limit poker, CMU: Brown and Sandholm (2017)**
[pdf](https://www.science.org/doi/epdf/10.1126/science.aao1733),
[pdf2](https://arxiv.org/pdf/1705.02955.pdf),
[video](https://www.youtube.com/watch?v=2dX0lwaQRX0)
> The first AI to beat humans at Texas Hold Em Poker (heads up). An important
> part of the approach was in computing real-time responses to opponent moves,
> spending more compute on less obvious moves.
**AlphaGo/AlphaZero, DeepMind: Silver et al (2016)**
[pdf](https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf),
[pdf2](https://www.nature.com/articles/nature24270.epdf),
[film](https://www.youtube.com/watch?v=WXuK6gekU1Y),
[blog](https://www.deepmind.com/research/highlighted-research/alphago)
> This result needs no introduction. In terms of Adaptive Computation, the
> depth of the Monte Carlo Tree Search (MCTS) was allowed to be variable.
## Pre-cursors to Adaptive Computation
**Dynamic Evaluation**
[blog](https://gwern.net/doc/ai/nn/dynamic-evaluation/index)
> Dynamic evaluation is an inference-time finetuning approach which allows for online learning to increase performance on a given task. This was popular for RNN approaches but has fallen out of favour due to wanting simple-to-deploy models over an API and the rise of In-Context Learning. Similar approaches have seen some success on the [ARC challenge](https://arcprize.org/arc). See also [Jack Cole interview](https://lab42.global/community-interview-jack-cole/).
**Attention and The Transformer, Vaswani et al (2017)**
[pdf](https://arxiv.org/pdf/1706.03762.pdf)
[pdf2](https://arxiv.org/pdf/1409.0473.pdf)
> Although we don't normally think of it this way, attention can be viewed as a
> conditional computation mechanism. The matrix which is applied to the input is
> dependent on the incoming data.
**Conditional Computation, Bengio et al. (2016)**
[pdf](https://arxiv.org/pdf/1511.06297.pdf)
> They use Reinforcement Learning (policy gradients) to train a policy which
> decides which parts of the network to activate, in effect learning a dropout
> policy for sparsity.
**Adaptive Mixtures of Local Experts, Jacobs et al (1991)**
[pdf](https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf)
> Collaborative, learned Mixture of Experts approaches to handle subsets of the
> training set are proposed. It's remarkable how close current approaches are to
> the original gating network. They also show intuitive expert specialisation on
> the task of vowel discrimination.
## Open Source Libraries
🌟 **DSPy, Stanford: Khattab et al (2023)**
[pdf](https://arxiv.org/pdf/2310.03714.pdf),
[code](https://github.com/stanfordnlp/dspy)
> A framework which allows AI engineers to build LLM pipelines in code. Here we can also
> algorithmically optimize LM prompts and weights using their compilation tools. Within
> this framework pipelines are written like PyTorch code and engineers can write control
> flows to allow for Adaptive Computation.
**git_theta, UNC: Kandpal et al (2023)**
[pdf](https://arxiv.org/pdf/2306.04529.pdf),
[official framework-agnostic code](https://github.com/r-three/git-theta/)
> A git extension which allows tracking and merging changes to model checkpoints
> like git does with code. With git_theta you can see diffs in parameter groups
> and merge model finetuning branches with merging approaches. It's also
> efficient with low-rank changes to parameter groups.
🌟 **MegaBlocks, Stanford/Databricks (2022)**
[pdf](https://arxiv.org/pdf/2211.15841.pdf)
[pytorch code](https://github.com/databricks/megablocks)
> A lightweight library for training MoE models which is well integrated with
> Megatron-LM. Maintained by Databricks and used by Mistral, it's becoming the
> standard in MoE training. The core of the library is implementing "dropless-MoE"
> efficiently.
🌟 **DeepSpeed-MoE, Microsoft: Rajbhandari et al (2022)**
[blog](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/),
[pdf](https://arxiv.org/pdf/2201.05596.pdf),
[official PyTorch code](https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/moe)
> Training and inference solution for distributed MoE models. They also present
> a new MoE architecture PR-MoE which has more experts in higher layers and a
> method for distilling expert models into dense 'student models'.
<!-- Sten (2022) [pdf](https://arxiv.org/pdf/2304.07613.pdf)
> PyTorch implementation of efficient, unstructured sparsity linear algebra operations with gradients.
-->
## AI Safety
With adaptive computation, models can choose to use more compute on harder
problems.
For problems where we're concerned about systems failing by not being able to do
sufficient computation then Adaptive Computation is very positive for Alignment.
We should expect fewer mistakes from a model utilising Adaptive Computation,
even on more difficult problems. Additionally,
[Adaptive Computation based systems are less susceptible to Adversarial Attacks](https://arxiv.org/pdf/2210.10253.pdf).
That is to say Adaptive Computation makes models more `robust`.
However, for problems where we're concerned about systems being deceptive or
mesa-optimising increasing the amount of inference-time compute increases their
ability to do so. Here the failure is not a "mistake" but entirely intentional
from the system's perspective. Inference-time search is one way that a model
could implement [deceptive alignment](https://arxiv.org/pdf/1906.01820.pdf) for
example.
## Scaling Laws
**Toward Inference-Optimal MoEs, UCSD: Yun et al (2024)**
[pdf](https://arxiv.org/pdf/2404.02852)
> The Chinchilla scaling laws focused on how to allocate compute to get the best model for a given amount of training compute. Since then LLama and others have focused on optimising for inference-compute as well as training compute. For MoEs there are additional considerations here - how many experts should you assign for the parameter count given that at inference time cost depends on the active parameters? The authors find that fewer experts are more efficient at inference time but more experts are more efficient at training time.
**Knowledge Capacity Scaling Laws, Meta: Allen-Zhu & Li (2024)**
[pdf](https://arxiv.org/pdf/2404.05405.pdf)
> The authors examine the Physics of Language Models and how much
> data they can store per parameter. They find that typically models can store
> around 2 bits per parameter, and this doesn't reduce too much with MoE models.
> This confirms (since MoEs are typically much larger in parameter count) that
> these models can store a lot more information than traditional models.
> It also suggests that knowledge capacity is relatively independent of
> forward pass compute giving a natural (if imprecise) cleaving of intelligence
> represented in compute applied and knowledge represented in the parameters.
**Sparse Scaling Laws, DeepMind: Frantar et al (2023)**
[pdf](https://arxiv.org/pdf/2309.08520.pdf)
> Scaling laws paper in the style of the
> [Chinchilla paper](https://arxiv.org/pdf/2203.15556.pdf). Details the optimal
> sparsity for a model given the inference FLOPs and training budget. They
> suggest that sparsity is especially important for larger models when seeing
> diminishing returns past Chinchilla optimality. See also
> [Unified Scaling Laws](https://browse.arxiv.org/pdf/2202.01169.pdf).
>
> Further [Scaling Laws For Fine-Grained MoEs](https://arxiv.org/pdf/2402.07871.pdf)
> suggest ways to optimally select the trade-off between the number and size of experts.
**Scaling Scaling Laws with Board Games, Andy Jones (2021)**
[pdf](https://arxiv.org/pdf/2104.03113.pdf)
> [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)
> suggests that there are two general techniques that work well in Machine
> Learning - search and learning. This paper suggests that these can be traded
> off against one another - that is, instead of additional learning you could
> add capable search to achieve similar performance. We can trade off train-time
> and test-time compute depending on our requirements.
## Other
**Blending Is All You Need, Cambridge: Lu et al (2024)**
[pdf](https://arxiv.org/pdf/2401.02994)
> Mark this under "bizarre". They have a multi-model setup where they completely randomly and uniformly select a model to answer each query in a conversation. The authors report higher user engagement and retention metrics using this approach over each individual model. One hypothesis is that each model can influence models that answer afterwards through the conditioning on previous tokens and there might be some implicit benefits to a model seeing tokens which are slightly off-distribution for what it would have produced. Perhaps they also have different refusal policies too. It's not clear why this should work but this provides a floor for more sophisticated model routing procedures.
**Buffer Overflow in MoEs, DeepMind: Hayes et al (2024)**
[pdf](https://arxiv.org/pdf/2402.05526.pdf)
> Typically with token choice routing methods in MoEs, there are implicitly cross-batch dependencies (i.e. the same token could be routed to different experts if in its batch most of the other tokens also wanted to go to its preferred expert). The authors show that this batch dependency can be used as an attack surface. They present a few solutions - mostly this shouldn't be a problem if batch sizes are very large (as in inference in a big AI lab) but it's an interesting one to watch out for. We might expect ML systems security to be an increasingly large field of research.
**FLOPs are all you need, Emin Orhan (2023)**
[blog](https://severelytheoretical.wordpress.com/2023/08/14/flops-are-all-you-need-a-conjecture-about-what-really-makes-deep-learning-work/)
> Short post detailing how the success of deep learning models is very
> correlated with the amount of compute that they use per parameter efficiently
> and how they share parameters.
**Review Paper: Dynamic Neural Networks Survey, Han et al (2022)**
[pdf](https://arxiv.org/pdf/2102.04906.pdf)
> A review of Adaptive Computation approaches.
<!--
Lottery Tickets: if we prune we really do get sparsity but the problem is that the sparsity is not useful to us on modern hardware. We need block sparsity to take advantage of this. In the future it might be possible to use less structured sparsity and then this will become very relevant again.
-->
<!--
## Benchmarks
All the usual benchmarks apply but ones that are especially suited to Adaptive Computation methods include:
Parity
Complex logic questions
ContextQA location dataset
ARB (DuckAI benchmark)
Agents benchmarks
Sparsity May Cry (SMC)
-->
<!-- ## Approaches We're Excited To See Explored More
- When we have early exiting we essentially have to train classifiers for each layer in addition to the main model, so we take on additional training overhead in order to save compute at inference time. Are there better, more principled ways of early exiting at train time as well, so that we don't have to learn very much from easy tokens?
- Current approaches to sparsity are mainly transformers with some sparsity added on the margin. Transformers have worked so well that people are generally leaving them alone and messing with everything else around them. We're interested in paradigm-shift approaches which are completely sparse and move further away from the transformer.
-->
<br>
---
<br>
Thanks for reading! If you have any suggestions or corrections, please submit a
pull request. And please hit the star button to show your appreciation.
## Citing This Post
If you'd like to cite this article, please use:
```
@misc{ayonrinde_2023_awesome_adaptive_computation,
author = "Kola Ayonrinde",
title = "Awesome Adaptive Computation",
year = 2023,
publisher = "GitHub",
url = "https://github.com/koayon/awesome-adaptive-computation/"
}
```