Repository: sebastianruder/NLP-progress
Branch: master
Commit: 379f03ff7568
Files: 78
Total size: 550.3 KB

Directory structure:
gitextract_k09iiau2/

├── .gitignore
├── CITATION.cff
├── CNAME
├── LICENSE
├── README.md
├── _config.yml
├── _includes/
│   ├── chart.html
│   └── table.html
├── arabic/
│   └── language_modeling.md
├── bengali/
│   ├── emotion_detection.md
│   ├── part_of_speech_tagging.md
│   ├── question_answering.md
│   └── sentiment_analysis.md
├── chinese/
│   ├── chinese.md
│   ├── chinese_word_segmentation.md
│   └── question_answering.md
├── english/
│   ├── automatic_speech_recognition.md
│   ├── ccg.md
│   ├── common_sense.md
│   ├── constituency_parsing.md
│   ├── coreference_resolution.md
│   ├── data_to_text_generation.md
│   ├── dependency_parsing.md
│   ├── dialogue.md
│   ├── domain_adaptation.md
│   ├── entity_linking.md
│   ├── grammatical_error_correction.md
│   ├── information_extraction.md
│   ├── intent_detection_slot_filling.md
│   ├── keyphrase_extraction_generation.md
│   ├── language_modeling.md
│   ├── lexical_normalization.md
│   ├── machine_translation.md
│   ├── missing_elements.md
│   ├── multi-task_learning.md
│   ├── multimodal.md
│   ├── named_entity_recognition.md
│   ├── natural_language_inference.md
│   ├── paraphrase-generation.md
│   ├── part-of-speech_tagging.md
│   ├── question_answering.md
│   ├── relation_prediction.md
│   ├── relationship_extraction.md
│   ├── semantic_parsing.md
│   ├── semantic_role_labeling.md
│   ├── semantic_textual_similarity.md
│   ├── sentiment_analysis.md
│   ├── shallow_syntax.md
│   ├── simplification.md
│   ├── stance_detection.md
│   ├── summarization.md
│   ├── taxonomy_learning.md
│   ├── temporal_processing.md
│   ├── text_classification.md
│   └── word_sense_disambiguation.md
├── french/
│   ├── question_answering.md
│   └── summarization.md
├── german/
│   ├── question_answering.md
│   └── summarization.md
├── hindi/
│   └── hindi.md
├── jekyll_instructions.md
├── korean/
│   └── question_answering.md
├── nepali/
│   └── nepali.md
├── persian/
│   ├── named_entity_recognition.md
│   ├── natural_language_inference.md
│   └── summarization.md
├── portuguese/
│   └── question_answering.md
├── russian/
│   ├── question_answering.md
│   ├── sentiment-analysis.md
│   └── summarization.md
├── spanish/
│   ├── entity_linking.md
│   ├── named_entity_recognition.md
│   └── summarization.md
├── structured/
│   ├── README.md
│   ├── export.py
│   └── requirements.txt
├── turkish/
│   └── summarization.md
└── vietnamese/
    └── vietnamese.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
_site/
Gemfile*
venv
.idea
structured.json


================================================
FILE: CITATION.cff
================================================
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Ruder"
  given-names: "Sebastian"
title: "NLP-progress"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2022-02-06
url: "https://nlpprogress.com/"


================================================
FILE: CNAME
================================================
nlpprogress.com

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Sebastian Ruder

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Tracking Progress in Natural Language Processing

## Table of contents

### English

- [Automatic speech recognition](english/automatic_speech_recognition.md)
- [CCG](english/ccg.md)
- [Common sense](english/common_sense.md)
- [Constituency parsing](english/constituency_parsing.md)
- [Coreference resolution](english/coreference_resolution.md)
- [Data-to-Text Generation](english/data_to_text_generation.md)
- [Dependency parsing](english/dependency_parsing.md)
- [Dialogue](english/dialogue.md)
- [Domain adaptation](english/domain_adaptation.md)
- [Entity linking](english/entity_linking.md)
- [Grammatical error correction](english/grammatical_error_correction.md)
- [Information extraction](english/information_extraction.md)
- [Intent Detection and Slot Filling](english/intent_detection_slot_filling.md) 
- [Keyphrase Extraction and Generation](english/keyphrase_extraction_generation.md)
- [Language modeling](english/language_modeling.md)
- [Lexical normalization](english/lexical_normalization.md)
- [Machine translation](english/machine_translation.md)
- [Missing elements](english/missing_elements.md)
- [Multi-task learning](english/multi-task_learning.md)
- [Multi-modal](english/multimodal.md)
- [Named entity recognition](english/named_entity_recognition.md)
- [Natural language inference](english/natural_language_inference.md)
- [Part-of-speech tagging](english/part-of-speech_tagging.md)
- [Paraphrase Generation](english/paraphrase-generation.md)
- [Question answering](english/question_answering.md)
- [Relation prediction](english/relation_prediction.md)
- [Relationship extraction](english/relationship_extraction.md)
- [Semantic textual similarity](english/semantic_textual_similarity.md)
- [Semantic parsing](english/semantic_parsing.md)
- [Semantic role labeling](english/semantic_role_labeling.md)
- [Sentiment analysis](english/sentiment_analysis.md)
- [Shallow syntax](english/shallow_syntax.md)
- [Simplification](english/simplification.md)
- [Stance detection](english/stance_detection.md)
- [Summarization](english/summarization.md)
- [Taxonomy learning](english/taxonomy_learning.md)
- [Temporal processing](english/temporal_processing.md)
- [Text classification](english/text_classification.md)
- [Word sense disambiguation](english/word_sense_disambiguation.md)

### Vietnamese

- [Dependency parsing](vietnamese/vietnamese.md#dependency-parsing)
- [Intent detection and Slot filling](vietnamese/vietnamese.md#intent-detection-and-slot-filling)
- [Machine translation](vietnamese/vietnamese.md#machine-translation)
- [Named entity recognition](vietnamese/vietnamese.md#named-entity-recognition)
- [Part-of-speech tagging](vietnamese/vietnamese.md#part-of-speech-tagging)
- [Semantic parsing](vietnamese/vietnamese.md#semantic-parsing)
- [Word segmentation](vietnamese/vietnamese.md#word-segmentation)

### Hindi

- [Chunking](hindi/hindi.md#chunking)
- [Part-of-speech tagging](hindi/hindi.md#part-of-speech-tagging)
- [Machine Translation](hindi/hindi.md#machine-translation)

### Chinese

- [Entity linking](chinese/chinese.md#entity-linking)
- [Chinese word segmentation](chinese/chinese_word_segmentation.md)
- [Question answering](chinese/question_answering.md)

For more tasks, datasets and results in Chinese, check out the [Chinese NLP](https://chinesenlp.xyz/#/) website.

### French

- [Question answering](french/question_answering.md)
- [Summarization](french/summarization.md)

### Russian

- [Question answering](russian/question_answering.md)
- [Sentiment Analysis](russian/sentiment-analysis.md)
- [Summarization](russian/summarization.md)

### Spanish

- [Named Entity Recognition](spanish/named_entity_recognition.md)
- [Entity linking](spanish/entity_linking.md#entity-linking)
- [Summarization](spanish/summarization.md)

### Portuguese

- [Question Answering](portuguese/question_answering.md)

### Korean

- [Question Answering](korean/question_answering.md)

### Nepali

- [Machine Translation](nepali/nepali.md#machine-translation)

### Bengali
- [Part-of-speech Tagging](bengali/part_of_speech_tagging.md)
- [Emotion Detection](bengali/emotion_detection.md)
- [Sentiment Analysis](bengali/sentiment_analysis.md)

### Persian
- [Named entity recognition](persian/named_entity_recognition.md)
- [Natural language inference](persian/natural_language_inference.md)
- [Summarization](persian/summarization.md)

### Turkish

- [Summarization](turkish/summarization.md)

### German

- [Question Answering](german/question_answering.md)
- [Summarization](german/summarization.md)

### Arabic
- [Language modeling](arabic/language_modeling.md)


This document aims to track the progress in Natural Language Processing (NLP) and give an overview
of the state-of-the-art (SOTA) across the most common NLP tasks and their corresponding datasets.

It aims to cover both traditional and core NLP tasks such as dependency parsing and part-of-speech tagging
as well as more recent ones such as reading comprehension and natural language inference. The main objective
is to provide the reader with a quick overview of benchmark datasets and the state-of-the-art for their
task of interest, which serves as a stepping stone for further research. To this end, if there is a 
place where results for a task are already published and regularly maintained, such as a public leaderboard,
the reader will be pointed there.

If you want to find this document again in the future, just go to [`nlpprogress.com`](https://nlpprogress.com/)
or [`nlpsota.com`](http://nlpsota.com/) in your browser.

### Contributing

#### Guidelines

**Results** &nbsp; Results reported in published papers are preferred; an exception may be made for influential preprints.

**Datasets** &nbsp; Datasets should have been used for evaluation in at least one published paper besides 
the one that introduced the dataset.

**Code** &nbsp; We recommend adding a link to an implementation 
if available. You can add a `Code` column (see below) to the table if it does not exist.
In the `Code` column, indicate an official implementation with [Official](http://link_to_implementation).
If an unofficial implementation is available, use [Link](http://link_to_implementation) (see below).
If no implementation is available, you can leave the cell empty.

#### Adding a new result

If you would like to add a new result, you can just click on the small edit button in the top-right
corner of the file for the respective task (see below).

![Click on the edit button to add a file](img/edit_file.png)

This allows you to edit the file in Markdown. Simply add a row to the corresponding table in the
same format. Make sure that the table stays sorted (with the best result on top). 
After you've made your change, make sure that the table still looks ok by clicking on the
"Preview changes" tab at the top of the page. If everything looks good, go to the bottom of the page,
where you see the below form. 

![Fill out the file change information](img/propose_file_change.png)

Add a name for your proposed change, an optional description, indicate that you would like to
"Create a new branch for this commit and start a pull request", and click on "Propose file change".

#### Adding a new dataset or task

To add a new dataset or task, you can also follow the steps above. Alternatively, you can fork the repository.
In both cases, follow the steps below:

1. If your task is completely new, create a new file and link to it in the table of contents above.
2. If not, add your task or dataset to the respective section of the corresponding file (in alphabetical order).
3. Briefly describe the dataset/task and include relevant references. 
4. Describe the evaluation setting and evaluation metric.
5. Show what an annotated example of the dataset/task looks like.
6. Add a download link if available.
7. Copy the table below and fill in at least two results (including the state-of-the-art)
  for your dataset/task (change Score to the metric of your dataset). If your dataset/task
  has multiple metrics, add them to the right of `Score`.
8. Submit your change as a pull request.
  
| Model           | Score  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
|  |  |  | |
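
Because result tables are expected to stay sorted with the best result on top, a quick local check before opening a pull request can catch ordering mistakes. The following is a minimal sketch, not part of this repository: the helper names `parse_markdown_table` and `is_sorted_best_first` are hypothetical, and the parser assumes a simple pipe-delimited table whose cells contain no literal `|` characters.

```python
def parse_markdown_table(text):
    """Parse a simple pipe-delimited Markdown table into a list of row dicts.

    Assumption: cells contain no literal '|' characters.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def is_sorted_best_first(rows, metric, higher_is_better=True):
    """Return True if the rows are ordered with the best score on top."""
    scores = [float(row[metric]) for row in rows]
    return scores == sorted(scores, reverse=higher_is_better)


# Illustrative table with made-up model names and scores.
table = """
| Model   | Score | Paper / Source | Code |
| ------- | :---: | --- | --- |
| Model A | 97.6  | Paper A | |
| Model B | 96.2  | Paper B | |
"""
rows = parse_markdown_table(table)
print(is_sorted_best_first(rows, "Score"))
```

For metrics where lower is better (for example perplexity or error rates), pass `higher_is_better=False`.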


### Wish list

These are tasks and datasets that are still missing:

- Bilingual dictionary induction
- Discourse parsing
- Keyphrase extraction
- Knowledge base population (KBP)
- More dialogue tasks
- Semi-supervised learning
- Frame-semantic parsing (FrameNet full-sentence analysis)

### Exporting into a structured format

You can extract all the data into a structured, machine-readable JSON format with parsed tasks, descriptions and SOTA tables. 

The instructions are in [structured/README.md](structured/README.md).

### Instructions for building the site locally

Instructions for building the website locally using Jekyll can be found [here](jekyll_instructions.md).




================================================
FILE: _config.yml
================================================
theme: jekyll-theme-slate

================================================
FILE: _includes/chart.html
================================================
<style>

.chart div {
  font: 18px sans-serif;
  background-color: steelblue;
  padding: 6px;
  margin: 2px;
  color: white;
  height: 40px;
}

.alignleft {
	float: left;
}
.alignright {
	float: right;
}

</style>

<div class="chart">
{% for result in include.results %}
{% assign score = result[include.score] %}
  <div style="width: {{ score | times: 6.0 }}px;">
    <p class="alignleft">{{ result.authors }} ({{ result.year }})</p>
    <p class="alignright">{{ score }}</p>
  </div>
{% endfor %}
</div>


================================================
FILE: _includes/table.html
================================================
{% assign scores = include.scores | split: "," %}

<table>
  <thead>
    <tr>
      <th>Model</th>
      {% for score in scores %}
      <th style="text-align: center">{{ score }}</th>
      {% endfor %}
      <th>Paper / Source</th>
      <th>Code</th>
    </tr>
  </thead>
  <tbody>
  {% for result in include.results %}
    <tr>
      <td>{% if result.model %} {{ result.model }} by {% endif %} {{ result.authors }} ({{ result.year }})</td>
      {% for score in scores %}
      <td style="text-align: center">{{ result[score] }}</td>
      {% endfor %}
      <td><a href="{{ result.url }}">{{ result.paper }}</a></td>
      <td>
      {% for el in result.code %}
        <a href="{{ el.url }}">{{ el.name }}</a>
      {% endfor %}
      </td>
    </tr>
  {% endfor %}
  </tbody>
</table>


================================================
FILE: arabic/language_modeling.md
================================================
# Language modeling

Language modeling is the task of predicting the next word or character in a document.


| Model           | Paper / Source | Code |
| ------------- | :-----:| :-----: |
| Zen 2.0: Continue training and adaption for n-gram enhanced text encoders | [ZEN](https://arxiv.org/abs/2105.01279) | [Official](https://github.com/sinovation/ZEN2) |
|hULMonA: The Universal Language Model in Arabic|[hULMonA](https://aclanthology.org/W19-4608/) | [Official](https://github.com/aub-mind/hULMonA) |
|AraBERT: Transformer-based Model for Arabic Language Understanding|[AraBERT](https://arxiv.org/abs/2003.00104) | [Official](https://github.com/aub-mind/araBERT) |


================================================
FILE: bengali/emotion_detection.md
================================================
# Fine-grained Emotion Detection

Fine-grained Emotion Detection is the task of detecting one or more emotions in a given text.

## EmoNoBa

[EmoNoBa: A Dataset for Analyzing Fine-Grained Emotions on Noisy Bangla Texts](https://aclanthology.org/2022.aacl-short.17.pdf) is a dataset of 22,698 instances, each labeled with one or more of 6 emotions. The dataset is available [here](https://www.kaggle.com/datasets/saifsust/emonoba). Models are evaluated using the macro-averaged F1 score.

| Model | F1-score | Paper / Source | Code |
| ------------ | ------------- | ------------ | ------------- |
| W1 + W2 + W3+ W4 + C1 + C2 + C3 | 42.81 | [EmoNoBa: A Dataset for Analyzing Fine-Grained Emotions on Noisy Bangla Texts](https://aclanthology.org/2022.aacl-short.17.pdf) | [Official](https://github.com/KhondokerIslam/EmoNoBa) |
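
The macro-averaged F1 score mentioned above computes an F1 score per emotion and then takes the unweighted mean, so rare emotions count as much as frequent ones. The sketch below is an illustration only, not the official EmoNoBa evaluation code: the label names and the toy data are made up, and scoring a class with no true or predicted instances as 0.0 is an assumed convention.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 for multi-label predictions.

    gold and pred are lists of label sets, one set per instance.
    """
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical labels and toy data, for illustration only.
labels = ["joy", "anger", "sadness"]
gold = [{"joy"}, {"anger", "sadness"}, {"joy"}]
pred = [{"joy"}, {"anger"}, {"sadness"}]
print(round(macro_f1(gold, pred, labels), 4))
```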


================================================
FILE: bengali/part_of_speech_tagging.md
================================================

# Part-of-speech Tagging
Part-of-speech tagging (POS tagging) is the task of tagging a word in a text with its part of speech.
A part of speech is a category of words with similar grammatical properties. Common English
parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.

## Linguistic Data Consortium: Indian Bengali
Indian Language Part-of-Speech Tagset: Bengali (Linguistic Data Consortium (LDC) catalog number LDC2010T16, ISBN 1-58563-561-8) is a corpus developed by Microsoft Research (MSR) India to support part-of-speech (POS) tagging and other data-driven linguistic research on Indian languages in general.

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Deep Learning (Fasihul et al., 2016) | 93.33 | [Deep learning based parts of speech tagger for Bengali](https://ieeexplore.ieee.org/abstract/document/7760098) | --- |



================================================
FILE: bengali/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [Reading comprehension](#reading-comprehension)
  - [Bangla-SQuAD](#bangla-squad)
  
## Reading comprehension
  
### Bangla-SQuAD

The [Bengali Question Answering Dataset (Bengali-SQuAD)](https://zenodo.org/record/4557874#.YaEUp9BBxPY) is an automatically translated (using Google Translate) and preprocessed subset of the large-scale English reading comprehension
dataset [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/). It was introduced in the paper
["Deep Learning Based Question Answering System in Bengali"](https://www.tandfonline.com/doi/full/10.1080/24751839.2020.1833136).


Example:

| Document  | Question | Answer |
| ------------- | -----:| -----: |
| চার্লসটন আমেরিকা যুক্তরাষ্ট্রের দক্ষিণ ক্যারোলাইনা রাজ্যের প্রাচীনতম এবং দ্বিতীয় বৃহত্তম শহর, চার্লসটন কাউন্টির কাউন্টি আসন এবং চার্লসটন – নর্থ চার্লসটন – সামারভিলে মেট্রোপলিটন স্ট্যাটিস্টিকাল এরিয়ার প্রধান শহর  শহরটি দক্ষিণ ক্যারোলিনার উপকূলরেখার ভৌগলিক মিডপয়েন্টের ঠিক দক্ষিণে অবস্থিত এবং অ্যাশলে এবং কুপার নদীর নদীর সংগম দ্বারা গঠিত আটলান্টিক মহাসাগরের একটি খাঁটি চার্লস্টন হারবারে অবস্থিত, অথবা স্থানীয়ভাবে প্রকাশিত হয়েছে, \"যেখানে কুপার এবং অ্যাশলে রয়েছে। নদীগুলি একত্র হয়ে আটলান্টিক মহাসাগর গঠনে আসে।|চার্লসটন হারবার কোন মহাসাগরের খাঁড়ি? |আটলান্টিক মহাসাগরের|


================================================
FILE: bengali/sentiment_analysis.md
================================================
# Sentiment analysis

Sentiment Analysis is the task of classifying the polarity of a given text.

## SentNoB

[SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts](https://aclanthology.org/2021.findings-emnlp.278.pdf) is a dataset of 15,728 instances, each labeled with one of three classes. This work also proposes <em>unique word percentage</em>, a new evaluation metric for datasets. Models are evaluated based on micro-averaged F1 score.

| Model | F1-score | Paper / Source | Code |
| ------------ | ------------- | ------------ | ------------- |
| U + B + T + C2 + C3 + C4 + C5 | 64.61 | [SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts](https://aclanthology.org/2021.findings-emnlp.278.pdf) | [Official](https://github.com/KhondokerIslam/SentNoB) |
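
Micro-averaged F1 pools true positives, false positives and false negatives across all classes before computing a single F1. When every instance has exactly one label, as in a three-class setup, this pooled score equals plain accuracy. The sketch below illustrates that relationship; it is not the official SentNoB evaluation code, and the class names and toy data are assumptions made for the example.

```python
def micro_f1(gold, pred, labels):
    """Micro-averaged F1 for single-label classification.

    Counts are pooled over all labels before F1 is computed.
    """
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label and g != label:
                fp += 1
            elif g == label and p != label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(gold, pred):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# Hypothetical class names and toy data, for illustration only.
labels = ["positive", "negative", "neutral"]
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]
print(micro_f1(gold, pred, labels), accuracy(gold, pred))
```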


================================================
FILE: chinese/chinese.md
================================================
# Chinese NLP tasks

## Entity linking

See [here](../english/entity_linking.md) for more information about the task.

### Datasets

#### AIDA CoNLL-YAGO Dataset

##### Disambiguation-Only Models

|  Model | Micro-Precision | Paper / Source | Code | 
| ------------- | :-----:| :----: | :----: |
| Sil et al. (2018) | 84.4 | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | |
| Tsai & Roth (2016) | 83.6 | [Cross-lingual wikification using multilingual embeddings](http://cogcomp.org/papers/TsaiRo16b.pdf) | |

[Go back to the README](../README.md)


================================================
FILE: chinese/chinese_word_segmentation.md
================================================
# Chinese Word Segmentation

## Task
Chinese word segmentation is the task of
splitting Chinese text (a sequence of Chinese characters)
into words.

Example:
```
'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
```

## Systems
&spades; marks systems that use character unigrams as input.
&clubs; marks systems that use character bigrams as input.

- Tian et al. (2020): ZEN + key-value memory networks &spades;
- Huang et al. (2019): BERT + model compression + multi-criteria learning &spades;
- Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings &spades;&clubs; 
- Ma et al. (2018): BiLSTM-CRF + hyperparameter search&spades;&clubs;
- Yang et al. (2017): Transition-based + Beam-search + Rich pretrain&spades;&clubs; 
- Zhou et al. (2017): Greedy Search + word context&spades;
- Chen et al. (2017): BiLSTM-CRF + adv. loss&spades;&clubs;
- Cai et al. (2017): Greedy Search+Span representation&spades;
- Kurita et al. (2017): Transition-based + Joint model&spades;
- Liu et al. (2016): neural semi-CRF&spades;
- Cai and Zhao (2016): Greedy Search&spades;
- Chen et al. (2015a): Gated Recursive NN&spades;&clubs;
- Chen et al. (2015b): BiLSTM-CRF&spades;&clubs;

## Evaluation

### Metrics

F1-score
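
Word segmentation F1 is usually computed over words rather than characters: a predicted word counts as correct only if both of its boundaries match a gold word. The following is a minimal sketch of that common convention, reusing the example sentence from the Task section. It is an illustration under that assumption, not the scoring script used by any particular benchmark, and the function names are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level F1: a predicted word is correct if its span matches a gold span."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ['上海', '浦东', '开发', '与', '建设', '同步']
# A hypothetical system output that fails to split the first two words.
pred = ['上海浦东', '开发', '与', '建设', '同步']
print(round(segmentation_f1(gold, pred), 4))
```

Here four of the five predicted words match gold words, giving a precision of 4/5 and a recall of 4/6.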

### Dataset
#### Chinese Treebank 6

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Huang et al. (2019) | 97.6 |[Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)||
| Tian et al. (2020) | 97.3 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Ma et al. (2018) | 96.7 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2018) | 96.3 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)|
| Yang et al. (2017) | 96.2 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Zhou et al. (2017) | 96.2 | [Word-Context Character Embeddings for Chinese Word Segmentation](https://www.aclweb.org/anthology/D17-1079)| |
| Chen et al. (2017) | 96.2 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |
| Liu et al. (2016) | 95.5 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) |
| Chen et al. (2015b) | 96.0 | [Long Short-Term Memory Neural Networks for Chinese Word Segmentation](http://www.aclweb.org/anthology/D15-1141) | [Github](https://github.com/FudanNLP/CWS_LSTM) |

#### Chinese Treebank 7

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Ma et al. (2018) | 96.6 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Kurita et al. (2017) | 96.2 | [Neural Joint Model for Transition-based Chinese Syntactic Analysis](http://www.aclweb.org/anthology/P17-1111) | |

#### AS

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Tian et al. (2020) | 96.6 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Huang et al. (2019) | 96.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Ma et al. (2018) | 96.2 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2017) | 95.7 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) |[Github](https://github.com/jiesutd/RichWordSegmentor) |
| Cai et al. (2017) | 95.3 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 94.8 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |

#### CityU

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Tian et al. (2020) | 97.9 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Huang et al. (2019) | 97.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Ma et al. (2018) | 97.2 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2017) | 96.9 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Cai et al. (2017) | 95.6 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 95.6 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |

#### PKU

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Huang et al. (2019) | 96.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Tian et al. (2020) | 96.5 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Yang et al. (2017) | 96.3 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Ma et al. (2018) | 96.1 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2018) | 95.9 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)|
| Cai et al. (2017) | 95.8 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 94.3 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |
| Liu et al. (2016) | 95.7 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) |
| Cai and Zhao (2016) | 95.7 | [Neural Word Segmentation Learning for Chinese](http://www.aclweb.org/anthology/P16-1039) | [Github](https://github.com/jcyk/CWS) |

#### MSR

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Tian et al. (2020) | 98.4 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Ma et al. (2018) | 98.1 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Huang et al. (2019) | 97.9 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Yang et al. (2018) | 97.8 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)|
| Yang et al. (2017) | 97.5 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Cai et al. (2017) | 97.1 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 96.0 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |
| Liu et al. (2016) | 97.6 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) |
| Cai and Zhao (2016) | 96.4 | [Neural Word Segmentation Learning for Chinese](http://www.aclweb.org/anthology/P16-1039) | [Github](https://github.com/jcyk/CWS) |

[Go back to the README](../README.md)


================================================
FILE: chinese/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [Reading comprehension](#reading-comprehension)
  - [CMRC 2018](#cmrc-2018)
  - [DRCD](#drcd)
  - [DuReader](#dureader)
  
## Reading comprehension

### CMRC 2018

The [Chinese Machine Reading Comprehension (CMRC 2018)](https://www.aclweb.org/anthology/D19-1600/) is a SQuAD-like
reading comprehension dataset that consists of 20,000 questions annotated on Wikipedia paragraphs by human experts. The
dataset can be downloaded [here](https://github.com/ymcui/cmrc2018). Below we show the F1 and EM scores both on the
test set and the challenge set. 

| Model           | Test F1 | Test EM | Challenge F1 | Challenge EM | Paper |
| ------------- | :-----:| :-----:| :-----:| :-----:| --- |
| Human performance | 97.9 | 92.4 | 95.2 | 90.4 | [A Span-Extraction Dataset for Chinese Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1600/) |
| Dual BERT (w / SQuAD; Cui et al., 2019) | 90.2 | 73.6 | 55.2 | 27.8 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |
| Dual BERT (Cui et al., 2019) | 88.1 | 70.4 | 47.9 | 23.8 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |

### DRCD

The [Delta Reading Comprehension Dataset (DRCD)](https://arxiv.org/abs/1806.00920) is a SQuAD-like reading 
comprehension dataset that contains 30,000+ questions on 10,014 paragraphs from 2,108 Wikipedia articles. The dataset
can be downloaded [here](https://github.com/DRCKnowledgeTeam/DRCD).

| Model           | F1 | EM |  Paper |
| ------------- | :-----:| :-----:| --- |
| Human performance | 93.3 | 80.4 | [DRCD: a Chinese Machine Reading Comprehension Dataset](https://arxiv.org/abs/1806.00920) |
| Dual BERT (w / SQuAD; Cui et al., 2019) | 91.6 | 85.4 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |
| Dual BERT (Cui et al., 2019) | 90.3 | 83.7 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |
  
### DuReader

[DuReader](https://www.aclweb.org/anthology/W18-2605/) is a large-scale reading comprehension dataset that is based on
the logs of Baidu Search and contains 200k questions, 420k answers, and 1M documents. For more information, see
[its website](https://ai.baidu.com/broad/introduction?dataset=dureader). You can download the
dataset [here](https://ai.baidu.com/broad/download?dataset=dureader). The best models can be viewed on the 
[public leaderboard](https://ai.baidu.com/broad/leaderboard?dataset=dureader).


================================================
FILE: english/automatic_speech_recognition.md
================================================
# Automatic speech recognition (ASR)

Automatic speech recognition is the task of automatically recognizing speech. You 
can find a repository tracking the state-of-the-art [here](https://github.com/syhw/wer_are_we).


================================================
FILE: english/ccg.md
================================================
# Combinatory Categorial Grammar

Combinatory Categorial Grammar (CCG; [Steedman, 2000](http://www.citeulike.org/group/14833/article/8971002)) is a
highly lexicalized formalism. The standard parsing model of [Clark and Curran (2007)](https://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.4.493)
uses over 400 lexical categories (or _supertags_), compared to about 50 part-of-speech tags for typical parsers.

Example:

| Vinken | , | 61 | years | old |
| --- | ---| --- | --- | --- |
| N| , | N/N | N | (S[adj]\ NP)\ NP |

## Parsing

CCG parsing is evaluated in terms of labeled dependency F-score, which "take\[s\] into account the lexical category containing the dependency relation, the argument slot, the word associated with the lexical category, and the argument head word: All four must be correct to score a point" ([Clark & Curran, 2007](https://doi.org/10.1162/coli.2007.33.4.493)).
Besides the word forms, some popular parsers (like the C&C parser) take POS tags as input. For fair comparison, systems should use automatically obtained POS as input, though some papers additionally report performance with oracle gold-standard POS features.
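
The "all four must be correct" rule can be sketched as set matching over four-field dependencies. The field names, the example dependencies, and the function name below are illustrative assumptions, not the output format of any particular parser or evaluation script. A real evaluation would also carry token positions so that repeated words are kept apart.

```python
from typing import NamedTuple, Set


class Dependency(NamedTuple):
    """One labeled CCG dependency; a point is scored only if all four fields match."""
    word: str        # word associated with the lexical category
    category: str    # lexical category containing the dependency relation
    arg_slot: int    # argument slot of that category
    arg_head: str    # head word of the argument


def labeled_f_score(predicted: Set[Dependency], gold: Set[Dependency]) -> float:
    """Labeled F-score over exact four-field matches."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative dependencies only, not gold CCGBank annotation.
    gold = {
        Dependency("61", "N/N", 1, "years"),
        Dependency("old", r"(S[adj]\NP)\NP", 1, "years"),
    }
    predicted = {
        Dependency("61", "N/N", 1, "years"),
        # Wrong lexical category, so this dependency earns no credit.
        Dependency("old", r"S[adj]\NP", 1, "years"),
    }
    print(labeled_f_score(predicted, gold))
```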

### CCGBank

The CCGBank is a corpus of CCG derivations and dependency structures extracted from the Penn Treebank by
[Hockenmaier and Steedman (2007)](http://www.aclweb.org/anthology/J07-3004). Sections 2-21 are used for training,
section 00 for development, and section 23 as in-domain test set.

| Model           | Labeled F-score |  Paper / Source |
| ------------- | :-----:| --- |
| Prange et al. (2021), non-constructive | 90.91 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Bhargava and Penn (2020), constructive | 90.9 | [Supertagging with CCG primitives](https://www.aclweb.org/anthology/2020.repl4nlp-1.23/) |
| Prange et al. (2021), constructive | 90.79 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Vaswani et al. (2016) | 88.32 | [Supertagging with LSTMs](https://aclweb.org/anthology/N/N16/N16-1027.pdf) |
| Lewis et al. (2016) | 88.1 | [LSTM CCG Parsing](https://aclweb.org/anthology/N/N16/N16-1026.pdf) |
| Xu et al. (2015) | 87.04 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Kummerfeld et al. (2010), with additional unlabeled data | 85.95 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |
| Clark and Curran (2007) | 85.45 | [Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models](https://www.aclweb.org/anthology/J07-4004) |

### Wikipedia

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----:| --- |
| Xu et al. (2015) | 82.49 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Kummerfeld et al. (2010), with additional unlabeled data | 81.7 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |

### Bioinfer

| Model         | Bio-specific taggers? | Accuracy |  Paper / Source |
| ------------- | -------------------- | :-------:| --- |
| Kummerfeld et al. (2010), with additional unlabeled data | Yes | 82.3 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |
| Rimell and Clark (2008) | Yes | 81.5 | [Adapting a Lexicalized-Grammar Parser to Contrasting Domains](https://aclweb.org/anthology/papers/D/D08/D08-1050/) |
| Xu et al. (2015) | No | 77.74 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Kummerfeld et al. (2010), with additional unlabeled data | No | 76.1 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |
| Rimell and Clark (2008) | No | 76.0 | [Adapting a Lexicalized-Grammar Parser to Contrasting Domains](https://aclweb.org/anthology/papers/D/D08/D08-1050/) |

## Supertagging

To mitigate sparsity, CCG supertaggers have traditionally been trained only on categories that occur 10 times or more in the CCGBank training data, which amounts to the 425 most frequent categories. In more recent work, using this threshold is becoming less common. In any case, supertagging evaluation is always measured for all supertags occurring in the test set. Models are evaluated based on token accuracy.
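
A minimal sketch of the two ideas in this paragraph, assuming tags are plain strings: building the label set with a frequency cutoff, and computing token accuracy over every test token regardless of that cutoff. The function names and toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_categories(training_tags: Iterable[str], min_count: int = 10) -> Set[str]:
    """Keep only the categories seen at least `min_count` times in the training data."""
    counts = Counter(training_tags)
    return {tag for tag, n in counts.items() if n >= min_count}


def token_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted supertag equals the gold supertag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    rare = r"(S[adj]\NP)\NP"
    training_tags = ["N"] * 12 + ["N/N"] * 10 + [rare] * 3
    label_set = frequent_categories(training_tags)
    print(sorted(label_set))  # the rare category is not in the label set

    # The rare tag still counts at test time, so a tagger restricted to
    # `label_set` cannot get that token right.
    gold = ["N", ",", "N/N", "N", rare]
    predicted = ["N", ",", "N/N", "N", "N"]
    print(token_accuracy(predicted, gold))
```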

### Constructive supertagging

A constructive tagger models the internal structure of supertags rather than treating each supertag type as opaque ([Kogkalidis et al., 2019](https://www.aclweb.org/anthology/W19-4314/)). Supertags are constructed from minimal pieces (which for CCG are slashes and atomic categories) and there is no frequency cutoff.
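
To make "minimal pieces" concrete, the sketch below parses a category string into a binary tree and lists its slashes and atomic categories. It assumes the usual left-associative reading of unparenthesized slashes. The function names and the prefix-order output are illustrative choices, not the decoding scheme of any of the cited taggers.

```python
import re
from typing import List, Tuple, Union

# A category is either an atomic string or a (slash, result, argument) triple.
Category = Union[str, Tuple[str, "Category", "Category"]]

# Parentheses and slashes are single tokens; everything else is an atomic category.
_TOKEN = re.compile(r"[()/\\]|[^()/\\\s]+")


def parse_category(text: str) -> Category:
    """Parse a CCG category string into a nested (slash, result, argument) tree."""
    tokens = _TOKEN.findall(text)
    pos = 0

    def primary() -> Category:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expression()
            pos += 1  # skip the closing ")"
            return node
        return tok

    def expression() -> Category:
        nonlocal pos
        left = primary()
        while pos < len(tokens) and tokens[pos] in ("/", "\\"):
            slash = tokens[pos]
            pos += 1
            left = (slash, left, primary())
        return left

    return expression()


def pieces(category: Category) -> List[str]:
    """List slashes and atomic categories in prefix (top-down, left-to-right) order."""
    if isinstance(category, str):
        return [category]
    slash, result, argument = category
    return [slash] + pieces(result) + pieces(argument)


if __name__ == "__main__":
    tree = parse_category(r"(S[adj]\NP)\NP")
    print(tree)
    print(pieces(tree))
```

Because the tagger only has to predict pieces from a small inventory, it can in principle assemble categories it never saw as whole units during training.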

### CCGBank

As with parsing, sections 2-21 are used for training, section 00 for development, and section 23 as in-domain test set.

| Model           | Accuracy |  Paper / Source |
| ----------------- | :-----:| --- |
| Kogkalidis and Moortgat (2022), constructive | 96.29 | [Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions](https://arxiv.org/abs/2203.12235) | 
| Tian et al. (2020), non-constructive | 96.25 | [Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks](https://aclanthology.org/2020.emnlp-main.487/) |
| Prange et al. (2021), non-constructive | 96.22 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Prange et al. (2021), constructive | 96.09 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Clark et al. (2018) | 96.05 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) |
| Bhargava and Penn (2020), constructive | 96.00 | [Supertagging with CCG primitives](https://www.aclweb.org/anthology/2020.repl4nlp-1.23/) |
| Lewis et al. (2016) | 94.7 | [LSTM CCG Parsing](https://aclweb.org/anthology/N/N16/N16-1026.pdf) |
| Vaswani et al. (2016) | 94.24 | [Supertagging with LSTMs](https://aclweb.org/anthology/N/N16/N16-1027.pdf) |
| Low supervision (Søgaard and Goldberg, 2016) | 93.26 | [Deep multi-task learning with low level tasks supervised at lower layers](http://anthology.aclweb.org/P16-2038) |
| Xu et al. (2015) | 93.00 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Clark and Curran (2004) | 92.00 | [The Importance of Supertagging for Wide-Coverage CCG Parsing](https://aclweb.org/anthology/papers/C/C04/C04-1041/) (result from Lewis et al. (2016)) |

#### Rare and unseen supertags

| Model           | Acc on tags seen 1-9 times | Acc on unseen tags |  Paper / Source |
| ------------- | :-----: | :-----: | --- |
| Bhargava and Penn (2020), constructive | - | 5.00 | [Supertagging with CCG primitives](https://www.aclweb.org/anthology/2020.repl4nlp-1.23/) |
| Prange et al. (2021), constructive | 37.40 | 3.03 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Prange et al. (2021), non-constructive | 23.17 | 0.00 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Kogkalidis and Moortgat (2022), constructive | 34.45 | 4.55 | [Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions](https://arxiv.org/abs/2203.12235) |

### Wikipedia

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----: | --- |
| Prange et al. (2021), non-constructive | 92.54 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Prange et al. (2021), constructive | 92.46 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Xu et al. (2015) | 90.00 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |

## Conversion to PTB

There has been interest in converting CCG derivations to phrase structure parses for comparison with phrase structure parsers (since CCGBank is based on the PTB).

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----:| --- |
| Kummerfeld et al. (2012) | 96.30 | [Robust Conversion of CCG Derivations to Phrase Structure Trees](https://www.aclweb.org/anthology/P12-2021) |
| Zhang et al. (2012) | 95.71 | [A Machine Learning Approach to Convert CCGbank to Penn Treebank](https://www.aclweb.org/anthology/C12-3067) |
| Clark and Curran (2009) | 94.64 | [Comparing the Accuracy of CCG and Penn Treebank Parsers](https://aclweb.org/anthology/papers/P/P09/P09-2014/) |

[Go back to the README](../README.md)


================================================
FILE: english/common_sense.md
================================================
# Common sense

Common sense reasoning tasks are intended to require the model to go beyond pattern 
recognition. Instead, the model should use "common sense" or world knowledge
to make inferences.

### Event2Mind

Event2Mind is a crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations.
Given an event described in a short free-form text, a model should reason about the likely intents and reactions of the
event's participants. Models are evaluated based on average cross-entropy (lower is better).

| Model           | Dev  | Test  |  Paper / Source | Code | 
| ------------- | :-----:| :-----:|--- | --- | 
| BiRNN 100d (Rashkin et al., 2018) | 4.25 | 4.22 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |
| ConvNet (Rashkin et al., 2018) | 4.44 | 4.40 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |

### SWAG

Situations with Adversarial Generations (SWAG) is a dataset consisting of 113k multiple
choice questions about a rich spectrum of grounded situations.

| Model           | Dev  | Test  |  Paper / Source | Code | 
| ------------- | :-----:| :-----:|--- | --- | 
| BERT Large (Devlin et al., 2018) | 86.6 | 86.3 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| BERT Base (Devlin et al., 2018) | 81.6 | - | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| ESIM + ELMo (Zellers et al., 2018) | 59.1 | 59.2 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) |  |
| ESIM + GloVe (Zellers et al., 2018) | 51.9 | 52.7 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) |  |

### Winograd Schema Challenge

The [Winograd Schema Challenge](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492)
is a dataset for common sense reasoning. It employs Winograd Schema questions that
require the resolution of anaphora: the system must identify the antecedent of an ambiguous pronoun in a statement. Models
are evaluated based on accuracy.

Example:

The trophy doesn’t fit in the suitcase because _it_ is too big. What is too big?
Answer 0: the trophy. Answer 1: the suitcase

| Model           | Score  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Word-LM-partial (Trinh and Le, 2018) | 62.6 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) | |
| Char-LM-partial (Trinh and Le, 2018) | 57.9 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) | |
| USSM + Supervised DeepNet + KB (Liu et al., 2017) | 52.8 | [Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems](https://aaai.org/ocs/index.php/SSS/SSS17/paper/view/15392) | |

### Winograd NLI (WNLI)

WNLI is a relaxation of the Winograd Schema Challenge proposed as part of the [GLUE benchmark](https://arxiv.org/abs/1804.07461) and a conversion to the natural language inference (NLI) format. The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. While the training set is balanced between two classes (entailment and not entailment), the test set is imbalanced between them (35% entailment, 65% not entailment). The majority baseline is thus 65%, while for the Winograd Schema Challenge it is 50% ([Liu et al., 2017](https://www.aaai.org/ocs/index.php/SSS/SSS17/paper/view/15392)). The original Winograd Schema Challenge is the more challenging of the two.

Results are available at the [GLUE leaderboard](https://gluebenchmark.com/leaderboard). Here is a subset of results of recent models:

| Model           | Score  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | 90.4 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| MT-DNN-ensemble (Liu et al., 2019) | 89.0 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) |
| Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 65.1 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) |

### Visual Common Sense

Visual Commonsense Reasoning (VCR) is a task and large-scale dataset for cognition-level visual understanding.
With one glance at an image, humans can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered 
pancakes). While this task is easy for humans, it is very difficult for current vision systems, as it requires 
higher-order cognition and commonsense reasoning about the world. In addition to answering challenging visual 
questions expressed in natural language, a model must provide a rationale explaining why its answer is true.

| Model | Q->A  | QA->R  | Q->AR  | Paper / Source | Code |
| ------ | :-------:| :-------: | :-------:| ------ |  ------ | 
| Human Performance University of Washington (Zellers et al. '18) | 91.0 | 93.0 | 85.0 | [From Recognition to Cognition: Visual Commonsense Reasoning](https://arxiv.org/abs/1811.10830) | | 
| Recognition to Cognition Networks University of Washington | 65.1 | 67.3 | 44.0 | [From Recognition to Cognition: Visual Commonsense Reasoning](https://arxiv.org/abs/1811.10830) |  https://github.com/rowanz/r2c |
| BERT-Base Google AI Language (experiment by Rowan) | 53.9 | 64.5 | 35.0 | | https://github.com/google-research/bert |
| MLB Seoul National University (experiment by Rowan) | 46.2 | 36.8 | 17.2 | | https://github.com/jnhwkim/MulLowBiVQA |
| Random Performance | 25.0 | 25.0 | 6.2 | | | 

### ReCoRD

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news. The goal of ReCoRD is to evaluate a machine's ability of commonsense reasoning in reading comprehension. ReCoRD is pronounced as [ˈrɛkərd] and is part of the [SuperGLUE benchmark](https://arxiv.org/pdf/1905.00537.pdf).

| Model | EM  | F1  | Paper / Source | Code |
| ------ | ------- | ------- | ------ |  ------ | 
| Human Performance Johns Hopkins University (Zhang et al. '18) | 91.31 | 91.69 | [ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension](https://arxiv.org/pdf/1810.12885.pdf) | | 
| LUKE (Yamada et al., 2020) | 90.64 | 91.21 | [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://www.aclweb.org/anthology/2020.emnlp-main.523) | [Official](https://github.com/studio-ousia/luke) |
| RoBERTa (Facebook AI)  | 90.0 | 90.6 | [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf) | [Official](https://github.com/pytorch/fairseq/tree/master/examples/roberta) | 
| XLNet + MTL + Verifier (ensemble)  | 83.09 | 83.74 | | | 
| CSRLM (single model) | 81.78 | 82.58 | | |


================================================
FILE: english/constituency_parsing.md
================================================
# Constituency parsing

Constituency parsing aims to extract a constituency-based parse tree from a sentence that
represents its syntactic structure according to a [phrase structure grammar](https://en.wikipedia.org/wiki/Phrase_structure_grammar).

Example:

                 Sentence (S)
                     |
       +-------------+------------+
       |                          |
     Noun (N)                Verb Phrase (VP)
       |                          |
     John                 +-------+--------+
                          |                |
                        Verb (V)         Noun (N)
                          |                |
                        sees              Bill

[Recent approaches](https://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf)
convert the parse tree into a sequence following a depth-first traversal in order to
be able to apply sequence-to-sequence models to it. The linearized version of the
above parse tree looks as follows: (S (N) (VP V N)).
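
The depth-first linearization can be sketched as a short recursive function. The tree encoding (nested `(label, children)` tuples) and the `keep_words` switch are assumptions made for this example. Bracketing conventions vary between papers, so the output is close to, but not character-for-character identical with, the shorthand above.

```python
from typing import List, Tuple, Union

# A node is (label, word) for a preterminal or (label, [child nodes]) otherwise.
Tree = Tuple[str, Union[str, List["Tree"]]]


def linearize(node: Tree, keep_words: bool = True) -> str:
    """Emit a bracketed string by visiting the tree depth-first, left to right."""
    label, children = node
    if isinstance(children, str):
        # Preterminal: optionally drop the word and keep only the tag.
        return f"({label} {children})" if keep_words else f"({label})"
    inner = " ".join(linearize(child, keep_words) for child in children)
    return f"({label} {inner})"


if __name__ == "__main__":
    tree: Tree = ("S", [("N", "John"), ("VP", [("V", "sees"), ("N", "Bill")])])
    print(linearize(tree))
    print(linearize(tree, keep_words=False))
```

The resulting token sequence can then serve as the target side of a sequence-to-sequence model, with the sentence as the source side.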

### Penn Treebank

The Wall Street Journal section of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) is used for
evaluating constituency parsers. Section 22 is used for development and Section 23 is used for evaluation.
Models are evaluated based on F1. Most of the below models incorporate external data or features.
For a comparison of single models trained only on WSJ, refer to [Kitaev and Klein (2018)](https://arxiv.org/abs/1805.01052).

| Model                                                                              | F1 score | Paper / Source                                                                                                                    | Code                                                  |
| ---------------------------------------------------------------------------------- | :------: | --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| Span Attention + XLNet (Tian et al., 2020) | 96.40 | [Improving Constituency Parsing with Span Attention](https://aclanthology.org/2020.findings-emnlp.153/) | [Official](https://github.com/cuhksz-nlp/SAPar) |
| Label Attention Layer + HPSG + XLNet (Mrini et al., 2020)                          |  96.38   | [Rethinking Self-Attention: Towards Interpretability for Neural Parsing](https://www.aclweb.org/anthology/2020.findings-emnlp.65.pdf) | [Official](https://github.com/KhalilMrini/LAL-Parser) |
| Attach-Juxtapose Parser + XLNet (Yang and Deng, 2020)                              |  96.34   | [Strongly Incremental Constituency Parsing with Graph Neural Networks](https://arxiv.org/abs/2010.14568) | [Official](https://github.com/princeton-vl/attach-juxtapose-parser) |
| Head-Driven Phrase Structure Grammar Parsing (Joint) + XLNet (Zhou and Zhao, 2019) |  96.33   | [Head-Driven Phrase Structure Grammar Parsing on Penn Treebank](https://arxiv.org/pdf/1907.02684.pdf)                             |                                                       |
| Head-Driven Phrase Structure Grammar Parsing (Joint) + BERT (Zhou and Zhao, 2019)  |  95.84   | [Head-Driven Phrase Structure Grammar Parsing on Penn Treebank](https://arxiv.org/pdf/1907.02684.pdf)                             |                                                       |
| CRF Parser + BERT (Zhang et al., 2020)                                             |  95.69   | [Fast and Accurate Neural CRF Constituency Parsing](https://www.ijcai.org/Proceedings/2020/560)                                   | [Official](https://github.com/yzhangcs/crfpar)        |
| Self-attentive encoder + ELMo (Kitaev and Klein, 2018)                             |  95.13   | [Constituency Parsing with a Self-Attentive Encoder](https://arxiv.org/abs/1805.01052)                                            | [Official](https://github.com/nikitakit/self-attentive-parser) |
| Model combination (Fried et al., 2017)                                             |  94.66   | [Improving Neural Parsing by Disentangling Model Combination and Reranking Effects](https://arxiv.org/abs/1707.03058)             |                                                       |
| LSTM Encoder-Decoder + LSTM-LM (Takase et al., 2018)                               |  94.47   | [Direct Output Connection for a High-Rank Language Model](http://aclweb.org/anthology/D18-1489)                                   |                                                       |
| LSTM Encoder-Decoder + LSTM-LM (Suzuki et al., 2018)                               |  94.32   | [An Empirical Study of Building a Strong Baseline for Constituency Parsing](http://aclweb.org/anthology/P18-2097)                 |                                                       |
| In-order (Liu and Zhang, 2017)                                                     |   94.2   | [In-Order Transition-based Constituent Parsing](http://aclweb.org/anthology/Q17-1029)                                             |                                                       |
| CRF Parser (Zhang et al., 2020)                                                    |  94.12   | [Fast and Accurate Neural CRF Constituency Parsing](https://www.ijcai.org/Proceedings/2020/560)                                   | [Official](https://github.com/yzhangcs/crfpar)        |
| Semi-supervised LSTM-LM (Choe and Charniak, 2016)                                  |   93.8   | [Parsing as Language Modeling](http://www.aclweb.org/anthology/D16-1257)                                                          |                                                       |
| Stack-only RNNG (Kuncoro et al., 2017)                                             |   93.6   | [What Do Recurrent Neural Network Grammars Learn About Syntax?](https://arxiv.org/abs/1611.05774)                                 |                                                       |
| RNN Grammar (Dyer et al., 2016)                                                    |   93.3   | [Recurrent Neural Network Grammars](https://www.aclweb.org/anthology/N16-1024)                                                    |                                                       |
| Transformer (Vaswani et al., 2017)                                                 |   92.7   | [Attention Is All You Need](https://arxiv.org/abs/1706.03762)                                                                     |                                                       |
| Combining Constituent Parsers (Fossum and Knight, 2009)                            |   92.4   | [Combining constituent parsers via parse selection or parse hybridization](https://dl.acm.org/citation.cfm?id=1620923)            |                                                       |
| Semi-supervised LSTM (Vinyals et al., 2015)                                        |   92.1   | [Grammar as a Foreign Language](https://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf)                              |                                                       |
| Self-trained parser (McClosky et al., 2006)                                        |   92.1   | [Effective Self-Training for Parsing](https://pdfs.semanticscholar.org/6f0f/64f0dab74295e5eb139c160ed79ff262558a.pdf)             |                                                       |

[Go back to the README](../README.md)


================================================
FILE: english/coreference_resolution.md
================================================
# Coreference resolution

Coreference resolution is the task of clustering mentions in text that refer to the same underlying real world entities.

Example:

```
                +-----------+
                |           |
"I voted for Obama because he was most aligned with my values", she said.
 |                                                  |            |
 +--------------------------------------------------+------------+
```

"I", "my", and "she" belong to the same cluster and "Obama" and "he" belong to the same cluster.

### CoNLL 2012

Experiments are conducted on the data of the [CoNLL-2012 shared task](http://www.aclweb.org/anthology/W12-4501), which
uses OntoNotes coreference annotations. Papers
report the precision, recall, and F1 of the MUC, B3, and CEAFφ4 metrics using the official
CoNLL-2012 evaluation scripts. The main evaluation metric is the average F1 of the three metrics.
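
For intuition, the sketch below computes B3 on the example clusters and then averages three F1 values. It is not the official scorer: it assumes the predicted and gold clusterings cover exactly the same mentions, represents mentions as plain strings rather than spans, and uses hypothetical function names. MUC and CEAFφ4 are omitted.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def _cluster_of(clusters: List[Set[str]]) -> Dict[str, FrozenSet[str]]:
    """Map every mention to the cluster that contains it."""
    return {m: frozenset(c) for c in clusters for m in c}


def b_cubed(predicted: List[Set[str]], gold: List[Set[str]]) -> Tuple[float, float, float]:
    """B3 precision, recall, and F1, averaged over mentions.

    Assumes both clusterings are partitions of the same mention set.
    """
    pred_of = _cluster_of(predicted)
    gold_of = _cluster_of(gold)
    mentions = list(gold_of)
    precision = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def conll_average(muc_f1: float, b3_f1: float, ceaf_f1: float) -> float:
    """Main CoNLL-2012 score: the unweighted mean of the three F1 values."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3


if __name__ == "__main__":
    gold = [{"I", "my", "she"}, {"Obama", "he"}]
    # The system failed to link "she" to the speaker.
    predicted = [{"I", "my"}, {"she"}, {"Obama", "he"}]
    print(b_cubed(predicted, gold))
```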

| Model           | Avg F1 |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| wl-coref + RoBERTa | 81.0 | [Word-Level Coreference Resolution](https://arxiv.org/abs/2109.04127) | [Official](https://github.com/vdobrovolskii/wl-coref) |
| s2e+Longformer-Large | 80.3 | [Coreference Resolution without Span Representations](https://arxiv.org/abs/2101.00434) | [Official](https://github.com/yuvalkirstain/s2e-coref) |
| Xu et al. (2020) | 80.2 | [Revealing the Myth of Higher-Order Inference in Coreference Resolution](https://arxiv.org/abs/2009.12013) |[Official](https://github.com/emorynlp/coref-hoi) |
| Joshi et al. (2019)<sup>[1](#myfootnote1)</sup> | 79.6 | [SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/pdf/1907.10529) |[Official](https://github.com/facebookresearch/SpanBERT) |
| Joshi et al. (2019)<sup>[2](#myfootnote2)</sup> | 76.9 | [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091) | [Official](https://github.com/mandarjoshi90/coref) |
| Kantor and Globerson (2019) | 76.6 | [Coreference Resolution with Entity Equalization](https://www.aclweb.org/anthology/P19-1066/) | [Official](https://github.com/kkjawz/coref-ee) |
| Fei et al. (2019) | 73.8 | [End-to-end Deep Reinforcement Learning Based Coreference Resolution](https://www.aclweb.org/anthology/P19-1064/) | |
| (Lee et al., 2017)+ELMo (Peters et al., 2018)+coarse-to-fine & second-order inference (Lee et al., 2018) | 73.0 | [Higher-order Coreference Resolution with Coarse-to-fine Inference](http://aclweb.org/anthology/N18-2108) | [Official](https://github.com/kentonl/e2e-coref) |
| (Lee et al., 2017)+ELMo (Peters et al., 2018) | 70.4 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | |
| Lee et al. (2017) | 67.2 | [End-to-end Neural Coreference Resolution](https://arxiv.org/abs/1707.07045) | |

<a name="myfootnote1">[1]</a> Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+SpanBERT (Joshi et al., 2019)

<a name="myfootnote2">[2]</a> Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+BERT (Devlin et al., 2019)

### Gendered Ambiguous Pronoun Resolution

Experiments are conducted on the [GAP dataset](https://github.com/google-research-datasets/gap-coreference). 
Metrics used are F1 score on Masculine (M) and Feminine (F) examples, Overall, and a Bias factor calculated as F / M.
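
The bias factor is a plain ratio, so values below 1 indicate weaker performance on feminine examples. A small sketch, with a hypothetical function name and the F1 values taken from the table in this section:

```python
def bias_factor(feminine_f1: float, masculine_f1: float) -> float:
    """Bias = F / M; 1.0 means equal F1 on feminine and masculine examples."""
    return feminine_f1 / masculine_f1


if __name__ == "__main__":
    print(round(bias_factor(91.1, 94.0), 2))  # Attree et al. (2019)
    print(round(bias_factor(89.5, 90.9), 2))  # Chada et al. (2019)
```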

| Model           | Overall F1 | Masculine F1 (M) | Feminine F1 (F) | Bias (F/M) | Paper / Source | Code |
| ------------- | :-----:| :-----:| :-----:| :-----:| --- | --- |
| Attree et al. (2019) | 92.5 | 94.0 | 91.1 | 0.97 | [Gendered Ambiguous Pronouns Shared Task: Boosting Model Confidence by Evidence Pooling](https://arxiv.org/abs/1906.00839) | [GREP](https://github.com/sattree/gap) |
| Chada et al. (2019) | 90.2 | 90.9 | 89.5 | 0.98 | [Gendered Pronoun Resolution using BERT and an extractive question answering formulation](https://arxiv.org/abs/1906.03695) | [CorefQA](https://github.com/rakeshchada/corefqa) |


[Go back to the README](../README.md)


================================================
FILE: english/data_to_text_generation.md
================================================
# Data-to-Text Generation

**Data-to-Text Generation (D2T NLG)** can be described as Natural Language Generation from structured input.
<!-- is a task of NLG where the **textual output** is generated using **structured input** (such as tables or graphs). -->
Unlike other NLG tasks such as Machine Translation or Question Answering (also referred to as **Text-to-Text Generation or T2T NLG**), where the requirement is to generate textual output from unstructured textual input, in D2T NLG the requirement is to generate textual output from input provided in a structured format such as tables, knowledge graphs, or JSON <sup>[[1]](#myfootnote1)</sup>.

## RotoWire
The [dataset](https://github.com/harvardnlp/boxscore-data/blob/master/rotowire.tar.bz2) consists of articles summarizing NBA basketball games, paired with their corresponding box- and line-score tables. The articles are professionally written, medium-length game summaries targeted at fantasy basketball fans. The writing is colloquial but structured, and targets an audience primarily interested in game statistics <sup>[[2]](#myfootnote2)</sup>.

The performance is evaluated on two different automated metrics: first, **BLEU score**; and second, a family of **Extractive Evaluations (EE)**. EE contains three different submetrics evaluating three different aspects of the generation:

1. **Content Selection (CS)**: precision (P%) and recall (R%) of unique relations extracted from the generated text that are also extracted from the gold text. This measures how well the generated document matches the gold document in terms of selecting which records to generate.

2. **Relation Generation (RG)**: precision (P%) and number of unique relations (#) extracted from the generated text that also appear in the structured input. This measures how well the system is able to generate text containing factual (i.e., correct) records.

3. **Content Ordering (CO)**: normalized Damerau-Levenshtein Distance (DLD%) between the sequence of records extracted from the gold text and the sequence extracted from the generated text. This measures how well the system orders the records it chooses to discuss.
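
The three extractive metrics above can be sketched in a few lines once the records have been extracted. This is a minimal illustration, not the official evaluation script: it assumes records are already available as hypothetical `(entity, value, type)` tuples (in practice an information extraction model produces them), it uses the restricted (optimal string alignment) variant of Damerau-Levenshtein, and it assumes CO is reported as one minus the length-normalized distance so that higher is better. All record values below are made up.

```python
from typing import Hashable, Sequence, Set, Tuple

# Hypothetical record format: (entity, value, record type).
Record = Tuple[str, str, str]


def content_selection(generated: Set[Record], gold: Set[Record]) -> Tuple[float, float]:
    """Precision and recall of unique generated records against gold-text records."""
    overlap = len(generated & gold)
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


def relation_generation(generated: Set[Record], table: Set[Record]) -> Tuple[float, int]:
    """Precision and count of unique generated records that appear in the input table."""
    supported = len(generated & table)
    precision = supported / len(generated) if generated else 0.0
    return precision, supported


def dl_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[rows - 1][cols - 1]


def content_ordering(generated_seq: Sequence[Record], gold_seq: Sequence[Record]) -> float:
    """One minus the distance normalized by the longer sequence (assumed convention)."""
    longest = max(len(generated_seq), len(gold_seq))
    if longest == 0:
        return 1.0
    return 1.0 - dl_distance(generated_seq, gold_seq) / longest


# Toy records, invented for illustration only.
r_team = ("Team A", "110", "TEAM-PTS")
r_pts = ("Player X", "30", "PLAYER-PTS")
r_ast = ("Player X", "8", "PLAYER-AST")
r_reb_wrong = ("Player X", "12", "PLAYER-REB")

gold_seq = [r_pts, r_team, r_ast]
generated_seq = [r_team, r_pts, r_reb_wrong]
table = {r_team, r_pts, r_ast, ("Player X", "10", "PLAYER-REB")}

cs_precision, cs_recall = content_selection(set(generated_seq), set(gold_seq))
rg_precision, rg_count = relation_generation(set(generated_seq), table)
co_score = content_ordering(generated_seq, gold_seq)
print(cs_precision, cs_recall, rg_precision, rg_count, co_score)
```

In this toy case two of the three generated records are shared with the gold text and supported by the table, and the two sequences differ by one transposition plus one substitution.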

| Model           | BLEU | CS (P% & R%) | RG (P% & #) | CO (DLD%) |  Paper / Source | Code |
| ------------- | :-----: | :-----: | :-----: | :-----:| --- | --- |
| Rebuffel, Clément, et al. (2020)<sup>[[4]](#myfootnote4)</sup> | 17.50 | 39.47 & 51.64 | 89.46 & 21.17 | 18.90 | [A Hierarchical Model for Data-to-Text Generation](https://link.springer.com/chapter/10.1007/978-3-030-45439-5_5) |[Official](https://github.com/KaijuML/data-to-text-hierarchical) |
| Puduppully et al. (2019)<sup>[[3]](#myfootnote3)</sup> | 16.50 | 34.18 & 51.22 | 87.47 & 34.28 | 18.58 | [Data-to-text generation with content selection and planning](https://www.aaai.org/ojs/index.php/AAAI/article/view/4668) |[Official](https://github.com/ratishsp/data2text-plan-py) |
| Puduppully and Lapata (2021)<sup>[[10]](#myfootnote10)</sup> | 15.46 | 34.1 & 57.8 |  97.6 & 42.1 | 17.7 | [Data-to-text generation with macro planning](https://doi.org/10.1162/tacl_a_00381) |[Official](https://github.com/ratishsp/data2text-macro-plan-py) |
| Wiseman et al. (2017)<sup>[[2]](#myfootnote2)</sup> | 14.49 | 22.17 & 27.16 | 71.82 & 12.82 | 8.68 | [Challenges in Data-to-Document Generation](https://www.aclweb.org/anthology/D17-1239.pdf) |[Official](https://github.com/harvardnlp/data2text) |

## WebNLG
The [WebNLG challenge](https://webnlg-challenge.loria.fr/) consists in mapping data to text. The training data consists of Data/Text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalisation of these triples. For example, given the three DBpedia triples (as shown in [a]), the aim is to generate a text (as shown in [b]):

* **[a]**. (John_E_Blaha birthDate 1942_08_26) (John_E_Blaha birthPlace San_Antonio) (John_E_Blaha occupation Fighter_pilot)

* **[b]**. John E Blaha, born in San Antonio on 1942-08-26, worked as a fighter pilot.
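
As a rough illustration of the input side of this task, the parenthesized notation from example [a] can be parsed into subject/predicate/object tuples and flattened into a single string for a sequence-to-sequence model. This is only a sketch: the released dataset files may use a different serialization, and the `<S>`, `<P>`, `<O>` markers are hypothetical delimiters chosen for this example.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Matches "(subject predicate object)" with whitespace-free fields, as in example [a].
TRIPLE_PATTERN = re.compile(r"\(([^()\s]+)\s+([^()\s]+)\s+([^()\s]+)\)")


def parse_triples(text: str) -> List[Triple]:
    """Extract all parenthesized triples from a string."""
    return [Triple(*fields) for fields in TRIPLE_PATTERN.findall(text)]


def linearize(triples: List[Triple]) -> str:
    """Flatten triples into one string; the marker tokens are hypothetical."""
    parts = []
    for t in triples:
        subject = t.subject.replace("_", " ")
        obj = t.obj.replace("_", " ")
        parts.append(f"<S> {subject} <P> {t.predicate} <O> {obj}")
    return " ".join(parts)


example = (
    "(John_E_Blaha birthDate 1942_08_26) "
    "(John_E_Blaha birthPlace San_Antonio) "
    "(John_E_Blaha occupation Fighter_pilot)"
)
triples = parse_triples(example)
print(linearize(triples))
```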

The performance is evaluated on the basis of **BLEU, METEOR and TER scores**. The data from WebNLG Challenge 2017 can be downloaded [here](https://gitlab.com/shimorina/webnlg-dataset).

| Model           | BLEU | METEOR | TER |  Paper / Source | Code |
| ------------- | :-----: | :-----: | :-----: | --- | --- |
| Kale, Mihir. (2020) <sup>[[9]](#myfootnote9)</sup> | 57.1 | 0.44 |  | [Text-to-Text Pre-Training for Data-to-Text Tasks](https://arxiv.org/pdf/2005.10433v2.pdf) |  |
| Moryossef et al. (2019) <sup>[[5]](#myfootnote5)</sup> | 47.4 | 0.391 | 0.631 | [Step-by-Step: Separating Planning from Realization in Neural Data-to-Text Generation](https://www.aclweb.org/anthology/N19-1236.pdf) | [Official](https://github.com/AmitMY/chimera) |
| Baseline | 33.24 | 0.235436 | 0.613080 | [Baseline system provided during the challenge](https://webnlg-challenge.loria.fr/challenge_2017/#webnlg-baseline-system) |[Official](https://gitlab.com/webnlg/webnlg-baseline) |

**P.S.**: The **test dataset** of WebNLG consists of **15 categories in total**, of which 10 (**seen**) categories are used for training while 5 (**unseen**) are not. The results reported here are those obtained on the overall test data, i.e., all 15 categories.

## Meaning Representations

The dataset was first provided for the [E2E Challenge](http://www.macs.hw.ac.uk/InteractionLab/E2E/) in 2017. It is a crowd-sourced dataset of 50k instances in the restaurant domain. Each instance consists of a dialogue act-based meaning representation (MR) and up to 5 references in natural language (NL). For example:

* **MR**: name[The Eagle], eatType[coffee shop], food[French], priceRange[moderate], customerRating[3/5], area[riverside], kidsFriendly[yes], near[Burger King]

* **NL**: “The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King.”
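
The MR notation shown above is a flat list of `slot[value]` pairs, so it maps naturally onto a dictionary. The following sketch parses the example MR and serializes it back; the function names are hypothetical and the parser only covers the syntax visible in this example.

```python
import re
from typing import Dict

# One "slot[value]" pair; slot names contain no brackets or commas.
SLOT_PATTERN = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]*)\]")


def parse_mr(mr: str) -> Dict[str, str]:
    """Turn 'name[The Eagle], food[French]' into {'name': 'The Eagle', 'food': 'French'}."""
    return {slot: value for slot, value in SLOT_PATTERN.findall(mr)}


def format_mr(slots: Dict[str, str]) -> str:
    """Serialize a slot dictionary back into the MR notation."""
    return ", ".join(f"{slot}[{value}]" for slot, value in slots.items())


mr = (
    "name[The Eagle], eatType[coffee shop], food[French], priceRange[moderate], "
    "customerRating[3/5], area[riverside], kidsFriendly[yes], near[Burger King]"
)
slots = parse_mr(mr)
print(slots["name"], "|", slots["near"])
```

A dictionary view like this makes it straightforward to check whether a generated reference mentions every slot value, although such a check is not part of the challenge metrics listed here.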

The performance is evaluated using **BLEU, NIST, METEOR, ROUGE-L, CIDEr scores**. The data from E2E Challenge 2017 can be downloaded [here](https://github.com/tuetschek/e2e-dataset/releases/download/v1.0.0/e2e-dataset.zip).

| Model           | BLEU | NIST | METEOR | ROUGE-L | CIDEr |  Paper / Source | Code |
| ------------- | :-----: | :-----: |:-----: |:-----: | :-----: | --- | --- |
| Shen, Sheng, et al. (2019) <sup>[[7]](#myfootnote7)</sup> | 68.60 | 8.73 | 45.25 | 70.82 | 2.37 | [Pragmatically Informative Text Generation](https://www.aclweb.org/anthology/N19-1410.pdf) |[Official](https://github.com/sIncerass/prag_generation) |
| Elder, Henry, et al. (2019) <sup>[[8]](#myfootnote8)</sup> | 67.38 | 8.7277 | 45.72 | 71.52 | 2.2995 | [Designing a Symbolic Intermediate Representation for Neural Surface Realization](https://www.aclweb.org/anthology/W19-2308.pdf) | |
| Gehrmann, Sebastian, et al. (2018) <sup>[[6]](#myfootnote6)</sup> | 66.2 | 8.60 | 45.7 | 70.4 | 2.34 | [End-to-End Content and Plan Selection for Data-to-Text Generation](https://www.aclweb.org/anthology/W18-6505.pdf) |[Official](https://github.com/sebastianGehrmann/diverse_ensembling) |
| Baseline | 65.93 | 8.61 | 44.83 | 68.50 | 2.23 | [Baseline system provided during the challenge](http://www.macs.hw.ac.uk/InteractionLab/E2E/#baseline) |[Official](https://github.com/UFAL-DSG/tgen/tree/master/e2e-challenge) |

<!-- ## WikiBio  -->

## References
<a name="myfootnote1">[1]</a> Albert Gatt and Emiel Krahmer. 2018. [Survey of the state of the art in natural language generation: core tasks, applications and evaluation](https://www.jair.org/index.php/jair/article/download/11173/26378/). J. Artif. Int. Res. 61, 1 (January 2018), 65–170.

<a name="myfootnote2">[2]</a> Wiseman, Sam, Stuart M. Shieber, and Alexander M. Rush. "[Challenges in Data-to-Document Generation](https://www.aclweb.org/anthology/D17-1239.pdf)." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.

<a name="myfootnote3">[3]</a> Puduppully, Ratish, Li Dong, and Mirella Lapata. "[Data-to-text generation with content selection and planning](https://www.aaai.org/ojs/index.php/AAAI/article/view/4668)." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.

<a name="myfootnote4">[4]</a> Rebuffel, Clément, et al. "[A Hierarchical Model for Data-to-Text Generation](https://link.springer.com/chapter/10.1007/978-3-030-45439-5_5)." European Conference on Information Retrieval. Springer, Cham, 2020.

<a name="myfootnote5">[5]</a> Moryossef, Amit, Yoav Goldberg, and Ido Dagan. "[Step-by-Step: Separating Planning from Realization in Neural Data-to-Text Generation](https://www.aclweb.org/anthology/N19-1236.pdf)." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

<a name="myfootnote6">[6]</a> Gehrmann, Sebastian, et al. "[End-to-End Content and Plan Selection for Data-to-Text Generation](https://www.aclweb.org/anthology/W18-6505.pdf)." Proceedings of the 11th International Conference on Natural Language Generation. 2018.

<a name="myfootnote7">[7]</a> Shen, Sheng, et al. "[Pragmatically Informative Text Generation](https://www.aclweb.org/anthology/N19-1410.pdf)." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

<a name="myfootnote8">[8]</a> Elder, Henry, et al. "[Designing a Symbolic Intermediate Representation for Neural Surface Realization](https://www.aclweb.org/anthology/W19-2308.pdf)." Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. 2019.

<a name="myfootnote9">[9]</a> Kale, Mihir. "[Text-to-Text Pre-Training for Data-to-Text Tasks](https://arxiv.org/pdf/2005.10433v2.pdf)." arXiv preprint arXiv:2005.10433 (2020).

<a name="myfootnote10">[10]</a> Puduppully, Ratish, and Mirella Lapata. "[Data-to-text generation with macro planning](https://doi.org/10.1162/tacl_a_00381)." Transactions of the Association for Computational Linguistics 9 (2021): 510–527.

[Go back to the README](../README.md)


================================================
FILE: english/dependency_parsing.md
================================================
# Dependency parsing

Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical
structure and defines the relationships between "head" words and the words that modify those heads.

Example:

```
     root
      |
      | +-------dobj---------+
      | |                    |
nsubj | |   +------det-----+ | +-----nmod------+
+--+  | |   |              | | |               |
|  |  | |   |      +-nmod-+| | |      +-case-+ |
+  |  + |   +      +      || + |      +      | |
I  prefer  the  morning   flight  through  Denver
```

Relations among the words are illustrated above the sentence with directed, labeled
arcs from heads to dependents (+ indicates the dependent).
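
One common way to store such a parse is a list with one `(head, label)` pair per token. The sketch below encodes the example above this way and checks that the arcs form a single-rooted tree; the 1-based indexing with 0 as an artificial root and the `is_tree` helper are conventions assumed for this illustration.

```python
from typing import List

tokens = ["I", "prefer", "the", "morning", "flight", "through", "Denver"]

# One (head index, label) pair per token, read off the diagram above.
# Heads are 1-based token positions; 0 denotes the artificial root.
arcs = [
    (2, "nsubj"),  # I       <- prefer
    (0, "root"),   # prefer  <- ROOT
    (5, "det"),    # the     <- flight
    (5, "nmod"),   # morning <- flight
    (2, "dobj"),   # flight  <- prefer
    (7, "case"),   # through <- Denver
    (5, "nmod"),   # Denver  <- flight
]


def is_tree(heads: List[int]) -> bool:
    """Check that exactly one token attaches to the root and no token is in a cycle."""
    if any(h < 0 or h > len(heads) for h in heads):
        return False
    if heads.count(0) != 1:
        return False
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:  # revisiting a token means a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


heads = [head for head, _ in arcs]
for token, (head, label) in zip(tokens, arcs):
    head_word = "ROOT" if head == 0 else tokens[head - 1]
    print(f"{label}({head_word}, {token})")
print(is_tree(heads))
```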

### Penn Treebank

Models are evaluated on the [Stanford Dependency](https://nlp.stanford.edu/software/dependencies_manual.pdf)
conversion (**v3.3.0**) of the Penn Treebank with __predicted__ POS-tags. Punctuation symbols
are excluded from the evaluation. Evaluation metrics are unlabeled attachment score (UAS) and labeled attachment score (LAS). UAS does not consider the relation label (e.g. Subj) assigned to the attachment between the head and the child, while LAS additionally requires the correct label for each attachment. The accuracy of the predicted POS tags is also reported.
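
As a minimal sketch of the two metrics, the function below counts a token as a UAS hit when its predicted head is correct and as a LAS hit when the label is correct as well, skipping punctuation. The set of punctuation POS tags and the predicted parse are assumptions made for this example, and the code is not the official evaluation script.

```python
from typing import Sequence, Set, Tuple

Arc = Tuple[int, str]  # (head index, relation label), 0 = artificial root

# Assumed set of PTB punctuation tags excluded from scoring.
PUNCT_TAGS: Set[str] = {"``", "''", ",", ".", ":"}


def attachment_scores(
    gold: Sequence[Arc],
    pred: Sequence[Arc],
    pos_tags: Sequence[str],
    punct_tags: Set[str] = PUNCT_TAGS,
) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over non-punctuation tokens."""
    total = uas_hits = las_hits = 0
    for (gold_head, gold_label), (pred_head, pred_label), tag in zip(gold, pred, pos_tags):
        if tag in punct_tags:
            continue
        total += 1
        if gold_head == pred_head:
            uas_hits += 1
            if gold_label == pred_label:
                las_hits += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# "I prefer the morning flight through Denver ." with a made-up predicted parse.
pos_tags = ["PRP", "VBP", "DT", "NN", "NN", "IN", "NNP", "."]
gold = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nmod"),
        (2, "dobj"), (7, "case"), (5, "nmod"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (5, "det"), (2, "nmod"),   # wrong head for "morning"
        (2, "dobj"), (7, "mark"), (5, "nmod"), (5, "punct")]  # wrong label for "through"

uas, las = attachment_scores(gold, pred, pos_tags)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Here the misattached final period does not affect either score because punctuation is excluded, while the wrong label on "through" lowers LAS but not UAS.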

| Model                                                                        |  POS  |  UAS  |  LAS  | Paper / Source                                                                                                                    | Code                                                                           |
| ---------------------------------------------------------------------------- | :---: | :---: | :---: | --------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| Label Attention Layer + HPSG + XLNet (Mrini et al., 2019)                    | 97.3  | 97.42 | 96.26 | [Rethinking Self-Attention: Towards Interpretability for Neural Parsing](https://khalilmrini.github.io/Label_Attention_Layer.pdf) | [Official](https://github.com/KhalilMrini/LAL-Parser)                          |
| Pre-training + XLNet (Tian et al., 2022) | - | 97.30 | 95.92 | [Enhancing Structure-aware Encoder with Extremely Limited Data for Graph-based Dependency Parsing](https://aclanthology.org/2022.coling-1.483/) | [Official](https://github.com/synlp/DMPar) |
| ACE + fine-tune (Wang et al., 2020) | - | 97.20 | 95.80 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| HPSG Parser (Joint) + XLNet (Zhou et al., 2020)                           | 97.3  | 97.20 | 95.72 | [Head-Driven Phrase Structure Grammar Parsing on Penn Treebank](https://www.aclweb.org/anthology/2020.findings-emnlp.398.pdf)                        | [Official](https://github.com/DoodleJZ/HPSG-Neural-Parser)                     |
| Second-Order MFVI + BERT (Wang et al., 2020) | - | 96.91 | 95.34 | [Second-Order Neural Dependency Parsing with Message Passing and End-to-End Training](https://arxiv.org/pdf/2010.05003.pdf) | [Official](https://github.com/wangxinyu0922/Second_Order_Parsing)|
| CVT + Multi-Task (Clark et al., 2018)                                        | 97.74 | 96.61 | 95.02 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370)                                    | [Official](https://github.com/tensorflow/models/tree/master/research/cvt_text) |
| CRF Parser (Zhang et al., 2020)                                              |   -   | 96.14 | 94.49 | [Efficient Second-Order TreeCRF for Neural Dependency Parsing](https://www.aclweb.org/anthology/2020.acl-main.302)                | [Official](https://github.com/yzhangcs/crfpar)                                 |
| Second-Order MFVI (Wang et al., 2020) | - | 96.12 | 94.47 | [Second-Order Neural Dependency Parsing with Message Passing and End-to-End Training](https://arxiv.org/pdf/2010.05003.pdf) | [Official](https://github.com/wangxinyu0922/Second_Order_Parsing)|
| Left-to-Right Pointer Network (Fernández-González and Gómez-Rodríguez, 2019) | 97.3  | 96.04 | 94.43 | [Left-to-Right Dependency Parsing with Pointer Networks](https://www.aclweb.org/anthology/N19-1076)                               | [Official](https://github.com/danifg/Left2Right-Pointer-Parser)                |
| Graph-based parser with GNNs (Ji et al., 2019)                               | 97.3  | 95.97 | 94.31 | [Graph-based Dependency Parsing with Graph Neural Networks](https://www.aclweb.org/anthology/P19-1237)                            |                                                                                |
| Deep Biaffine (Dozat and Manning, 2017)                                      | 97.3  | 95.74 | 94.08 | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)                                         | [Official](https://github.com/tdozat/Parser-v1)                                |
| jPTDP (Nguyen and Verspoor, 2018)                                            | 97.97 | 94.51 | 92.87 | [An improved neural network model for joint POS tagging and dependency parsing](https://arxiv.org/abs/1807.03955)                 | [Official](https://github.com/datquocnguyen/jPTDP)                             |
| Andor et al. (2016)                                                          | 97.44 | 94.61 | 92.79 | [Globally Normalized Transition-Based Neural Networks](https://www.aclweb.org/anthology/P16-1231)                                 |                                                                                |
| Distilled neural FOG (Kuncoro et al., 2016)                                  | 97.3  | 94.26 | 92.06 | [Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser](https://arxiv.org/abs/1609.07561)                       |                                                                                |
| Distilled transition-based parser (Liu et al., 2018)                         | 97.3  | 94.05 | 92.14 | [Distilling Knowledge for Search-based Structured Prediction](http://aclweb.org/anthology/P18-1129)                               | [Official](https://github.com/Oneplus/twpipe)                                  |
| Weiss et al. (2015)                                                          | 97.44 | 93.99 | 92.05 | [Structured Training for Neural Network Transition-Based Parsing](http://anthology.aclweb.org/P/P15/P15-1032.pdf)                 |                                                                                |
| BIST transition-based parser (Kiperwasser and Goldberg, 2016)                | 97.3  | 93.9  | 91.9  | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023)  | [Official](https://github.com/elikip/bist-parser/tree/master/barchybrid/src)   |
| Arc-hybrid (Ballesteros et al., 2016)                                        | 97.3  | 93.56 | 91.42 | [Training with Exploration Improves a Greedy Stack-LSTM Parser](https://arxiv.org/abs/1603.03793)                                 |                                                                                |
| BIST graph-based parser (Kiperwasser and Goldberg, 2016)                     | 97.3  | 93.1  | 91.0  | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023)  | [Official](https://github.com/elikip/bist-parser/tree/master/bmstparser/src)   |

### Universal Dependencies

The focus of the task is learning syntactic dependency parsers that can work in a real-world setting, starting from raw text, and that can work over many typologically different languages, even low-resource languages for which there is little or no training data, by exploiting a common syntactic annotation standard. This task has been made possible by the Universal Dependencies initiative (UD, http://universaldependencies.org), which has developed treebanks for 60+ languages with cross-linguistically consistent annotation and recoverability of the original raw texts.

Participating systems had to find labeled syntactic dependencies between words, i.e. a syntactic head for each word, and a label classifying the type of the dependency relation. In addition to syntactic dependencies, prediction of morphology and lemmatization was evaluated. There were multiple test sets in various languages, but all data sets adhered to the common annotation style of UD. Participants were asked to parse raw text where no gold-standard pre-processing (tokenization, lemmas, morphology) was available. Data preprocessed by a baseline system (UDPipe, https://ufal.mff.cuni.cz/udpipe) was provided so that the participants could focus on improving just one part of the processing pipeline. The organizers believed that this made the task reasonably accessible for everyone.

| Model                     |  LAS  | MLAS  | BLEX  | Paper / Source                                                                                                                                              | Code                                                                 |
| ------------------------- | :---: | :---: | :---: | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| Stanford (Qi et al.)      | 74.16 | 62.08 | 65.28 | [Universal Dependency Parsing from Scratch](https://arxiv.org/pdf/1901.10457.pdf)                                                                           | [Official](https://github.com/stanfordnlp/stanfordnlp)               |
| UDPipe Future (Straka)    | 73.11 | 61.25 | 64.49 | [UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task](https://www.aclweb.org/anthology/K18-2020)                                                              | [Official](https://github.com/CoNLL-UD-2018/UDPipe-Future)           |
| HIT-SCIR (Che et al.)     | 75.84 | 59.78 | 65.33 | [Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation](https://arxiv.org/abs/1807.03121)                    |                                                                      |
| TurkuNLP (Kanerva et al.) | 73.28 | 60.99 | 66.09 | [Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/proceedings/pdf/K18-2013.pdf) | [Official](https://github.com/TurkuNLP/Turku-neural-parser-pipeline) |

The following results are just for reference:

| Model                                                                  |  UAS  |  LAS  | Note                           | Paper / Source                                                                                    |
| ---------------------------------------------------------------------- | :---: | :---: | ------------------------------ | ------------------------------------------------------------------------------------------------- |
| Stack-only RNNG (Kuncoro et al., 2017)                                 | 95.8  | 94.6  | Constituent parser             | [What Do Recurrent Neural Network Grammars Learn About Syntax?](https://arxiv.org/abs/1611.05774) |
| Deep Biaffine (Dozat and Manning, 2017)                                | 95.75 | 94.22 | Stanford conversion **v3.5.0** | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)         |
| Semi-supervised LSTM-LM (Choe and Charniak, 2016)                      | 95.9  | 94.1  | Constituent parser             | [Parsing as Language Modeling](http://www.aclweb.org/anthology/D16-1257)                          |

# Cross-lingual zero-shot dependency parsing

Cross-lingual zero-shot parsing is the task of inferring the dependency parse of sentences from one language without any labeled training trees for that language.

## Universal Dependency Treebank

Models are evaluated against the [Universal Dependency Treebank v2.0](https://github.com/ryanmcd/uni-dep-tb). For each of the 6 target languages, models can use the trees of all other languages and English and are evaluated by the UAS and LAS on the target. The final score is the average score across the 6 target languages. The most common evaluation setup is to use
gold POS-tags.

| Model                                      |  UAS  |  LAS  | Paper / Source                                                                                                                               | Code                                                          |
| ------------------------------------------ | :---: | :---: | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
| XLM-R + SubDP (Shi et al., 2022) | --- | 79.6* | [Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing](https://aclanthology.org/2022.acl-long.452/) | [Official](https://aclanthology.org/attachments/2022.acl-long.452.software.zip) |
| Cross-Lingual ELMo (Schuster et al., 2019) | 84.2  | 77.3  | [Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing](https://arxiv.org/abs/1902.09492) | [Official](https://github.com/TalSchuster/CrossLingualELMo)   |
| MALOPA (Ammar et al., 2016)                |       | 70.5  | [Many Languages, One Parser](https://www.transacl.org/ojs/index.php/tacl/article/view/892)                                                   | [Official](https://github.com/clab/language-universal-parser) |
| Guo et al. (2016)                          | 76.7  | 69.9  | [A representation learning framework for multi-source transfer parsing](https://dl.acm.org/citation.cfm?id=3016100.3016284)                  |                                                               |

*: Evaluated on four target languages.

# Unsupervised dependency parsing

Unsupervised dependency parsing is the task of inferring the dependency parse of sentences without any labeled training data.

## Penn Treebank

As with supervised parsing, models are evaluated against the Penn Treebank. The most common evaluation setup is to use
gold POS-tags as input and to evaluate systems using the unlabeled attachment score (also called 'directed dependency
accuracy').

| Model                                                |  UAS  | Paper / Source                                                                                                                                        |
| ---------------------------------------------------- | :---: | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| Iterative reranking (Le & Zuidema, 2015)             | 66.2  | [Unsupervised Dependency Parsing - Let’s Use Supervised Parsers](http://www.aclweb.org/anthology/N15-1067)                                            |
| Combined System (Spitkovsky et al., 2013)            | 64.4  | [Breaking Out of Local Optima with Count Transforms and Model Recombination - A Study in Grammar Induction](http://www.aclweb.org/anthology/D13-1204) |
| Tree Substitution Grammar DMV (Blunsom & Cohn, 2010) | 55.7  | [Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing](http://www.aclweb.org/anthology/D10-1117)                               |
| Shared Logistic Normal DMV (Cohen & Smith, 2009)     | 41.4  | [Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction](http://www.aclweb.org/anthology/N09-1009)           |
| DMV (Klein & Manning, 2004)                          | 35.9  | [Corpus-Based Induction of Syntactic Structure - Models of Dependency and Constituency](http://www.aclweb.org/anthology/P04-1061)                     |

[Go back to the README](../README.md)


================================================
FILE: english/dialogue.md
================================================
# Dialogue

Dialogue is notoriously hard to evaluate. Past approaches have used human evaluation.

## Dialogue act classification

Dialogue act classification is the task of classifying an utterance with respect to the function it serves in a dialogue, i.e. the act the speaker is performing. Dialogue acts are a type of speech act (for Speech Act Theory, see [Austin (1975)](http://www.hup.harvard.edu/catalog.php?isbn=9780674411524) and [Searle (1969)](https://www.cambridge.org/core/books/speech-acts/D2D7B03E472C8A390ED60B86E08640E7)).

### Switchboard corpus
The [Switchboard-1 corpus](https://catalog.ldc.upenn.edu/ldc97s62) is a telephone speech corpus, consisting of about 2,400 two-sided telephone conversations among 543 speakers with about 70 provided conversation topics. The dataset includes the audio files and the transcription files, as well as information about the speakers and the calls.

The Switchboard Dialogue Act Corpus (SwDA) [[download](https://web.stanford.edu/~jurafsky/swb1_dialogact_annot.tar.gz)] extends the Switchboard-1 corpus with tags from the [SWBD-DAMSL tagset](https://web.stanford.edu/~jurafsky/ws97/manual.august1.html), which is an augmentation to the Discourse Annotation and Markup System of Labeling (DAMSL) tagset. The 220 tags were reduced to 42 tags by clustering in order to improve the language model on the Switchboard corpus. A subset of the Switchboard-1 corpus consisting of 1155 conversations was used. The resulting tags include dialogue acts like statement-non-opinion, acknowledge, statement-opinion, agree/accept, etc.  
Annotated example:  
*Speaker:* A, *Dialogue Act:* Yes-No-Question, *Utterance:* So do you go to college right now?  

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| SGNN (Ravi et al., 2018) | 83.1 | [Self-Governing Neural Networks for On-Device Short Text Classification](https://www.aclweb.org/anthology/D18-1105.pdf) |[Link](https://github.com/glicerico/SGNN) |
| CASA (Raheja et al., 2019) | 82.9 | [Dialogue Act Classification with Context-Aware Self-Attention](https://www.aclweb.org/anthology/N19-1373.pdf)|[Link](https://github.com/macabdul9/CASA-Dialogue-Act-Classifier)|
| DAH-CRF (Li et al., 2019) | 82.3 | [A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification](https://www.aclweb.org/anthology/K19-1036.pdf) | |
| ALDMN (Wan et al., 2018) | 81.5 | [Improved Dynamic Memory Network for Dialogue Act Classification with Adversarial Training](https://arxiv.org/pdf/1811.05021.pdf) | |
| CRF-ASN (Chen et al., 2018) | 81.3 | [Dialogue Act Recognition via CRF-Attentive Structured Network](https://arxiv.org/abs/1711.05568) | |
| Bi-LSTM-CRF (Kumar et al., 2017) | 79.2 | [Dialogue Act Sequence Labeling using Hierarchical encoder with CRF](https://arxiv.org/abs/1709.04250) | [Link](https://github.com/YanWenqiang/HBLSTM-CRF) |
| RNN with 3 utterances in context (Bothe et al., 2018) | 77.34 | [A Context-based Approach for Dialogue Act Recognition using Simple Recurrent Neural Networks](https://arxiv.org/abs/1805.06280) | |


### ICSI Meeting Recorder Dialog Act (MRDA) corpus
The [MRDA corpus](http://www1.icsi.berkeley.edu/Speech/mr/) [[download](http://www.icsi.berkeley.edu/~ees/dadb/icsi_mrda+hs_corpus_050512.tar.gz)] consists of about 75 hours of speech from 75 naturally-occurring meetings among 53 speakers. The tagset used for labeling is a modified version of the SWBD-DAMSL tagset. It is annotated with three types of information: marking of the dialogue act segment boundaries, marking of the dialogue acts and marking of correspondences between dialogue acts.   
Annotated example:  
*Time:* 2804-2810, *Speaker:* c6, *Dialogue Act:* s^bd, *Transcript:* i mean these are just discriminative.  
Multiple dialogue acts are separated by "^".
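
A small sketch of how such an annotation could be handled in code: the label is split on "^" into its component tags, and accuracy is the share of utterances whose predicted label matches the gold label exactly. The `Utterance` container and the helper names are hypothetical, the predictions are made up, and the corpus files themselves may need more elaborate parsing than shown here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Utterance:
    speaker: str
    label: str        # full dialogue act label, e.g. "s^bd"
    transcript: str


def split_acts(label: str) -> List[str]:
    """Split a combined label into its individual dialogue act tags."""
    return [tag for tag in label.split("^") if tag]


def accuracy(gold_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Fraction of utterances whose predicted label equals the gold label."""
    if not gold_labels:
        return 0.0
    hits = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return hits / len(gold_labels)


utterance = Utterance(speaker="c6", label="s^bd",
                      transcript="i mean these are just discriminative.")
print(split_acts(utterance.label))

# Toy predictions, invented for illustration.
gold_labels = ["s", "b", "s^bd"]
predicted_labels = ["s", "s", "s^bd"]
print(accuracy(gold_labels, predicted_labels))
```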

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| DAH-CRF (Li et al., 2019) | 92.2 | [A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification](https://www.aclweb.org/anthology/K19-1036.pdf) | |
| CRF-ASN (Chen et al., 2018) | 91.7 | [Dialogue Act Recognition via CRF-Attentive Structured Network](https://arxiv.org/abs/1711.05568) | |
| CASA (Raheja et al., 2019) | 91.1 | [Dialogue Act Classification with Context-Aware Self-Attention](https://www.aclweb.org/anthology/N19-1373.pdf) | |
| Bi-LSTM-CRF (Kumar et al., 2017) | 90.9 | [Dialogue Act Sequence Labeling using Hierarchical encoder with CRF](https://arxiv.org/abs/1709.04250) | [Link](https://github.com/YanWenqiang/HBLSTM-CRF) |
| SGNN (Ravi et al., 2018) | 86.7 | [Self-Governing Neural Networks for On-Device Short Text Classification](https://www.aclweb.org/anthology/D18-1105.pdf) | |

## Dialogue state tracking

Dialogue state tracking consists of determining at each turn of a dialogue the
full representation of what the user wants at that point in the dialogue,
which contains a goal constraint, a set of requested slots, and the user's dialogue act.

### Second dialogue state tracking challenge

For goal-oriented dialogue, the dataset of the [second Dialog State Tracking Challenge](http://www.aclweb.org/anthology/W14-4337)
(DSTC2) is a common evaluation dataset. DSTC2 focuses on the restaurant search domain. Models are
evaluated based on accuracy on both individual and joint slot tracking.
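
The difference between individual and joint slot accuracy can be sketched as follows: a slot is correct at a turn if its predicted value matches the gold value, while the joint goal is correct only if every slot matches at that turn. This is a simplified illustration with made-up dialogue states and hypothetical function names, not the official DSTC2 scoring script.

```python
from typing import Dict, Sequence

State = Dict[str, str]  # slot name -> value; an absent slot means "not yet specified"


def slot_accuracy(gold_turns: Sequence[State], pred_turns: Sequence[State], slot: str) -> float:
    """Fraction of turns where the predicted value of one slot matches the gold value."""
    if not gold_turns:
        return 0.0
    hits = sum(g.get(slot) == p.get(slot) for g, p in zip(gold_turns, pred_turns))
    return hits / len(gold_turns)


def joint_accuracy(gold_turns: Sequence[State], pred_turns: Sequence[State],
                   slots: Sequence[str]) -> float:
    """Fraction of turns where all slots match simultaneously."""
    if not gold_turns:
        return 0.0
    hits = sum(all(g.get(s) == p.get(s) for s in slots)
               for g, p in zip(gold_turns, pred_turns))
    return hits / len(gold_turns)


slots = ["area", "food", "price"]

# Toy dialogue with four turns; values are invented for illustration.
gold_turns = [
    {"food": "italian"},
    {"food": "italian", "area": "north"},
    {"food": "italian", "area": "north", "price": "cheap"},
    {"food": "thai", "area": "north", "price": "cheap"},
]
pred_turns = [
    {"food": "italian"},
    {"food": "italian", "area": "south"},                    # wrong area
    {"food": "italian", "area": "north", "price": "cheap"},
    {"food": "thai", "area": "north"},                       # missed price
]

for slot in slots:
    print(slot, slot_accuracy(gold_turns, pred_turns, slot))
print("joint", joint_accuracy(gold_turns, pred_turns, slots))
```

Because a single wrong slot invalidates the whole turn, joint accuracy is never higher than the lowest individual slot accuracy, which is consistent with the gap between the per-slot and joint columns in the table below.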

| Model           | Request | Area  |  Food  |  Price  |  Joint  |  Paper / Source |
| ------------- | :-----: | :-----:| :-----:| :-----:| :-----:| --- |
| Zhong et al. (2018) | 97.5 | - | - | - | 74.5| [Global-locally Self-attentive Dialogue State Tracker](https://arxiv.org/abs/1805.09655) |
| Liu et al. (2018) | - | 90 | 84 | 92 | 72 | [Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems](https://arxiv.org/abs/1804.06512) |
| Neural belief tracker (Mrkšić et al., 2017) | 96.5 | 90 | 84 | 94 | 73.4 | [Neural Belief Tracker: Data-Driven Dialogue State Tracking](https://arxiv.org/abs/1606.03777) |
| RNN (Henderson et al., 2014) | 95.7 | 92 | 86 | 86 | 69 | [Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation](http://svr-ftp.eng.cam.ac.uk/~sjy/papers/htyo14.pdf) |

### Wizard-of-Oz

The [WoZ 2.0 dataset](https://arxiv.org/pdf/1606.03777.pdf) is a newer dialogue state tracking dataset whose evaluation is detached from the noisy output of speech recognition systems. Similar to DSTC2, it covers the restaurant search domain and has identical evaluation.


| Model           | Request  |  Joint  |  Paper / Source |
| ------------- |  :-----:| :-----:| --- |
| BERT-based tracker (Lai et al., 2020) | 97.6 | 90.5 | [A Simple but Effective BERT Model for Dialog State Tracking on Resource-Limited Systems](https://ieeexplore.ieee.org/document/9053975) |
| GCE (Nouri et al., 2018) | 97.4 | 88.5 | [Toward Scalable Neural Dialogue State Tracking Model](https://arxiv.org/abs/1812.00899) |
| Zhong et al. (2018) | 97.1 | 88.1 | [Global-locally Self-attentive Dialogue State Tracker](https://arxiv.org/abs/1805.09655) |
| Neural belief tracker (Mrkšić et al., 2017) | 96.5 | 84.4 | [Neural Belief Tracker: Data-Driven Dialogue State Tracking](https://arxiv.org/abs/1606.03777) |
| RNN (Henderson et al., 2014) | 87.1 | 70.8 | [Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation](http://svr-ftp.eng.cam.ac.uk/~sjy/papers/htyo14.pdf) |


### MultiWOZ

The [MultiWOZ dataset](https://arxiv.org/abs/1810.00278) is a fully-labeled collection of human-human written conversations spanning multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The dialogues are set between a tourist and a clerk at an information center, and they span 7 domains.

#### Belief Tracking
<div class="datagrid" style="width:500px;">
<table>
<thead><tr><th></th><th colspan="2">MultiWOZ 2.0</th><th colspan="2">MultiWOZ 2.1</th></tr></thead>
<thead><tr><th>Model</th><th>Joint Accuracy</th><th>Slot</th><th>Joint Accuracy</th><th>Slot</th></tr></thead>
<tbody>
<tr><td><a href="https://www.aclweb.org/anthology/P18-2069">MDBT</a> (Ramadan et al., 2018) </td><td>15.57 </td><td>89.53</td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/abs/1805.09655">GLAD</a> (Zhong et al., 2018)</td><td>35.57</td><td>95.44 </td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1812.00899.pdf">GCE</a> (Nouri and Hosseini-Asl, 2018)</td><td>36.27</td><td>98.42</td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1908.01946.pdf">Neural Reading</a> (Gao et al, 2019)</td><td>41.10</td><td></td><td></td><td></td></tr>

<tr><td><a href="https://arxiv.org/pdf/1907.00883.pdf">HyST</a> (Goel et al, 2019)</td><td>44.24</td><td></td><td></td><td></td></tr>
<tr><td><a href="https://www.aclweb.org/anthology/P19-1546/">SUMBT</a> (Lee et al, 2019)</td><td>46.65</td><td>96.44</td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1905.08743.pdf">TRADE</a> (Wu et al, 2019)</td><td>48.62</td><td>96.92</td><td>45.60</td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1909.00754.pdf">COMER</a> (Ren et al, 2019)</td><td>48.79</td><td></td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1911.06192.pdf">DSTQA</a> (Zhou et al, 2019)</td><td>51.44</td><td>97.24</td><td>51.17</td><td>97.21</td></tr>
<tr><td><a href="https://arxiv.org/pdf/1910.03544.pdf">DST-Picklist</a> (Zhang et al, 2019)</td><td></td><td></td><td>53.3</td><td></td></tr>
<tr><td><a href="https://www.aaai.org/Papers/AAAI/2020GB/AAAI-ChenL.10030.pdf">SST</a> (Chen et al. 2020)</td><td></td><td></td><td>55.23</td><td></td></tr>
<tr><td><a href="https://arxiv.org/abs/2005.02877">TripPy</a> (Heck et al. 2020)</td><td></td><td></td><td>55.3</td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/2005.00796.pdf">SimpleTOD</a> (Hosseini-Asl et al. 2020)</td><td></td><td></td><td>55.72</td><td></td></tr>

</tbody>
</table>
</div>

#### Policy Optimization
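
The header of the table below gives the combined score as `(INFORM + SUCCESS)*0.5 + BLEU`. A minimal sketch of that arithmetic follows; the function name `combined_score` is hypothetical, and the example values are the MarCo MultiWOZ 2.0 row from the table.

```python
def combined_score(inform, success, bleu):
    """Combined score from the table header: (INFORM + SUCCESS) * 0.5 + BLEU."""
    return 0.5 * (inform + success) + bleu


# MarCo (Wang et al. 2020) on MultiWOZ 2.0, values copied from the table.
marco_score = combined_score(inform=92.30, success=78.60, bleu=20.02)
```

Because the three columns are combined with fixed weights, a model can rank differently on the combined score than on any single column.
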
<div class="datagrid" style="width:500px;">
<table>
<thead><tr><th>(INFORM	+ SUCCESS)*0.5 +	BLEU</th><th colspan="3">MultiWOZ 2.0</th><th colspan="3">MultiWOZ 2.1</th></tr></thead>
<thead><tr><th>Model</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th></tr></thead>
<tbody>
 <tr><td><a href="https://arxiv.org/pdf/1907.05346.pdf">TokenMoE</a> (Pei et al. 2019)</td><td>75.30</td><td> 59.70</td><td> 16.81 </td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://pdfs.semanticscholar.org/47d0/1eb59cd37d16201fcae964bd1d2b49cfb55e.pdf">Baseline</a> (Budzianowski et al. 2018)</td><td>71.29</td><td> 60.96 </td><td> 18.8 </td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1907.10016.pdf">Structured Fusion</a> (Mehri et al. 2019)</td><td>82.70</td><td>72.10</td><td> 16.34</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/abs/1902.08858">LaRL</a> (Zhao et al. 2019)</td><td>82.8</td><td>79.2</td><td> 12.8</td><td> </td><td> </td><td> </td></tr>
  <tr><td><a href="https://arxiv.org/pdf/2005.00796.pdf">SimpleTOD</a> (Hosseini-Asl et al. 2020)</td><td>88.9</td><td>67.1</td><td> 16.9</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1911.08151.pdf">MoGNet</a> (Pei et al. 2019)</td><td>85.3</td><td>73.30</td><td> 20.13</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1905.12866.pdf">HDSA</a> (Chen et al. 2019)</td><td>82.9</td><td>68.9</td><td> 23.6</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/abs/1910.03756">ARDM</a> (Wu et al. 2019)</td><td>87.4</td><td>72.8</td><td> 20.6</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1911.10484.pdf">DAMD</a> (Zhang et al. 2019)</td><td>89.2</td><td>77.9</td><td> 18.6</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/2005.05298.pdf">SOLOIST</a> (Peng et al. 2020)</td><td>89.60</td><td> 79.30</td><td> 18.3</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/2004.12363.pdf">MarCo</a> (Wang et al. 2020)</td><td>92.30</td><td> 78.60</td><td> 20.02</td><td> 92.50</td><td> 77.80</td><td> 19.54</td></tr>
<tfoot> </tfoot>
</tbody>
</table>
</div>

#### Natural Language Generation
<div class="datagrid" style="width:500px;"><table>
<thead><tr><th>Model</th><th>SER</th><th>BLEU</th></tr></thead>
<tbody>
<tr><td><a href="https://pdfs.semanticscholar.org/47d0/1eb59cd37d16201fcae964bd1d2b49cfb55e.pdf">Baseline</a> (Budzianowski et al. 2018)</td><td>2.99 </td><td> 0.632</td></tr>
</tbody>
</table>
</div>

#### End-to-End Modelling
<div class="datagrid" style="width:500px;">
<table>
<thead><tr><th>(INFORM	+ SUCCESS)*0.5 +	BLEU</th><th colspan="3">MultiWOZ 2.0</th><th colspan="3">MultiWOZ 2.1</th></tr></thead>
<thead><tr><th>Model</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th></tr></thead>
<tbody>
<tr><td><a href="https://arxiv.org/pdf/1911.10484.pdf">DAMD</a> (Zhang et al. 2019)</td><td>76.3</td><td>60.4</td><td> 18.6</td><td> </td><td> </td><td> </td></tr>
 <tr><td><a href="https://arxiv.org/pdf/2005.00796.pdf">SimpleTOD</a> (Hosseini-Asl et al. 2020)</td><td>84.4</td><td>70.1</td><td> 15.01</td><td> </td><td></td><td></td></tr>
 <tr><td><a href="https://arxiv.org/pdf/2005.05298.pdf">SOLOIST</a> (Peng et al. 2020)</td><td>85.50</td><td>72.90</td><td> 16.54</td><td> </td><td></td><td> </td></tr>

<tfoot> </tfoot>
</tbody>
</table>
</div>

## Retrieval-based Chatbots
These systems take as input a context and a list of possible responses and rank the responses, returning the highest ranking one.

### Ubuntu IRC Data

There are several corpora based on the [Ubuntu IRC Channel Logs](https://irclogs.ubuntu.com):

- [Uthus and Aha (2013)](), available [here](https://daviduthus.org/UCC/), the first dataset to use the resource, but not for retrieval-based chatbot research.
- UDC v1, [Lowe et al. (2015)](https://arxiv.org/abs/1506.08909), available [here](http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/), the first version of the Ubuntu Dialogue Corpus.
- UDC v2, [Lowe et al. (2017)](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698), available [here](https://arxiv.org/abs/1506.08909), the second version of the Ubuntu Dialogue Corpus.
- DSTC 7, [Gunasekara et al. (2019)](http://workshop.colips.org/dstc7/papers/dstc7_task1_final_report.pdf), available [here](https://ibm.github.io/dstc-noesis/public/index.html), the data from DSTC 7 track 1.
- DSTC 8, [Gunasekara et al. (2020)](http://jkk.name/pub/dstc20task2.pdf), available [here](https://github.com/dstc8-track2/NOESIS-II/), the data from DSTC 8 track 2.

Each version of the dataset contains a set of dialogues from the IRC channel, extracted by automatically disentangling conversations occurring simultaneously. See below for results on the disentanglement process.

The exact tasks used vary slightly, but all consider variations of Recall_N@K, which means how often the true answer is in the top K options when there are N total candidates.
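
As a rough illustration of the metrics above, the sketch below computes Recall@K and MRR from a ranked candidate list. The function names and the toy data are hypothetical; the official DSTC scoring scripts may differ in details such as tie handling.

```python
def recall_at_k(ranked_candidates, true_answer, k):
    """Return 1.0 if the true answer is among the top-k candidates, else 0.0."""
    return 1.0 if true_answer in ranked_candidates[:k] else 0.0


def reciprocal_rank(ranked_candidates, true_answer):
    """Return 1/rank of the true answer (ranks start at 1), or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate == true_answer:
            return 1.0 / rank
    return 0.0


def evaluate(examples, ks=(1, 10, 50)):
    """Average Recall@K and MRR over (ranked_candidates, true_answer) pairs."""
    n = len(examples)
    scores = {f"R@{k}": sum(recall_at_k(r, t, k) for r, t in examples) / n for k in ks}
    scores["MRR"] = sum(reciprocal_rank(r, t) for r, t in examples) / n
    return scores


# Toy data: N = 4 candidates per example, ranked best-first by a hypothetical model.
toy_examples = [
    (["b", "a", "c", "d"], "a"),  # true answer at rank 2
    (["a", "b", "c", "d"], "a"),  # true answer at rank 1
]
toy_scores = evaluate(toy_examples, ks=(1, 2))
```

Here N is simply the length of each candidate list, so the same functions cover R_100@1, R_10@1 and R_2@1 once the candidate pool size is fixed by the dataset.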

| Data   | Model           |  R_100@1    |  R_100@10   |  R_100@50   |  MRR        |  Paper / Source |
| ------ | -------------   | :---------: | :---------: | :---------: | :---------: |---------------|
| DSTC 8 (main) | Wu et al. (2020) | 76.1 | 97.9 | - | 84.8 | Enhancing Response Selection with Advanced Context Modeling and Post-training |
| DSTC 8 (subtask 2) | Wu et al. (2020) | 70.6 | 95.7 | - | 79.9 | Enhancing Response Selection with Advanced Context Modeling and Post-training |
| DSTC 7 | Seq-Att-Network (Chen and Wang, 2019) | 64.5 | 90.2 | 99.4 | 73.5 | [Sequential Attention-based Network for Noetic End-to-End Response Selection](http://workshop.colips.org/dstc7/papers/07.pdf) |

| Data   | Model           | R_2@1       |  R_10@1      |  Paper / Source |
| ------ | -------------   | :---------: | :---------: |---------------|
| UDC v2 | DAM (Zhou et al. 2018) | 93.8 | 76.7 | [Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network](http://www.aclweb.org/anthology/P18-1103) |
| UDC v2  | SMN (Wu et al. 2017) | 92.3 | 72.3 | [Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots](https://arxiv.org/pdf/1612.01627.pdf) |
| UDC v2  | Multi-View (Zhou et al. 2017) | 90.8 | 66.2 | [Multi-view Response Selection for Human-Computer Conversation](https://aclweb.org/anthology/D16-1036) |
| UDC v2  | Bi-LSTM (Kadlec et al. 2015) | 89.5 | 63.0 | [Improved Deep Learning Baselines for Ubuntu Corpus Dialogs](https://arxiv.org/pdf/1510.03753.pdf) |

Additional results can be found in the DSTC task reports linked above.

### Reddit Corpus
The [Reddit Corpus](https://arxiv.org/abs/1904.06472) contains 726 million multi-turn dialogues from the Reddit board. Reddit is an American social news aggregation website where users can post links and take part in discussions on these posts. The task of the Reddit Corpus is to select the correct response from 100 candidates (the others are negatively sampled) by considering the previous conversation history. Models are evaluated with the Recall 1 at 100 metric (the 1-of-100 ranking accuracy). You can find more details [here](https://github.com/PolyAI-LDN/conversational-datasets).

| Model           |   R_1@100   |  Paper / Source |
| -------------   |   :---------:|---------------|
| PolyAI Encoder (Henderson et al. 2019) |  61.3 | [A Repository of Conversational Datasets](https://arxiv.org/pdf/1904.06472.pdf) |
| USE (Cer et al. 2018) | 47.7 | [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) |
| BERT (Devlin et al. 2018) | 24.0 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
| ELMo (Peters et al. 2018) | 19.3 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) |

### Advising Corpus
The [Advising Corpus](http://workshop.colips.org/dstc7/papers/dstc7_task1_final_report.pdf), available [here](https://ibm.github.io/dstc-noesis/public/index.html), contains a collection of conversations between a student and an advisor at the University of Michigan. They were released as part of DSTC 7 track 1 and used again in DSTC 8 track 2.

| Model           |  R_100@1    |  R_100@10   |  R_100@50   |  MRR        |  Paper / Source |
| -------------   | :---------: | :---------: | :---------: | :---------: |---------------|
| Yang et al. (2020) | 56.4 | 87.8 | - | 67.7 | Transformer-based Semantic Matching Model for Noetic Response Selection |
| Seq-Att-Network (Chen and Wang, 2019) | 21.4 | 63.0 | 94.8 | 33.9 | [Sequential Attention-based Network for Noetic End-to-End Response Selection](http://workshop.colips.org/dstc7/papers/07.pdf) |


## Generative-based Chatbots
The main task of a generative chatbot is to generate a consistent and engaging response given the context.
### Personalized Chit-chat

The task of personalized chit-chat dialogue generation was first proposed by [PersonaChat](https://arxiv.org/pdf/1801.07243.pdf). The motivation is to enhance the engagingness and consistency of chit-chat bots by endowing agents with explicit personas. Here the `persona` is defined as several natural language profile sentences like "I weight 300 pounds.". NIPS 2018 hosted a competition, [The Conversational Intelligence Challenge 2 (ConvAI2)](http://convai.io/), based on the dataset. The evaluation metrics are F1, Hits@1 and ppl. F1 is computed at the word level, Hits@1 is the probability that the real next utterance is ranked highest by the model, and ppl is the language modeling perplexity. The following results are reported on the dev set (the test set is still hidden); almost all of them are taken from the [ConvAI2 Leaderboard](https://github.com/DeepPavlov/convai/blob/master/leaderboards.md).
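
The three metrics described above can be sketched as follows. This is an illustration under stated assumptions, not the official ConvAI2 scorer: tokens are obtained by lowercasing and whitespace splitting, token log-probabilities are natural logs, and all function names are hypothetical.

```python
import math
from collections import Counter


def word_f1(prediction, reference):
    """Word-level F1 between a generated utterance and the reference utterance."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared word at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def hits_at_1(candidate_scores, true_index):
    """Return 1.0 if the true next utterance has the highest model score, else 0.0."""
    best_index = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
    return 1.0 if best_index == true_index else 0.0


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Averaging `word_f1` and `hits_at_1` over all dev-set turns and multiplying by 100 gives numbers on the same scale as the table; the perplexity needs no rescaling.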

| Model           | F1 | Hits@1 | ppl | Paper / Source | Code |
| -------------   | :---------: | :---------:| :--------: | ---------------| ------------- |
| P^2 Bot (Liu et al. 2020) | 19.77 | 81.9 | 15.12 | [You Impress Me: Dialogue Generation via Mutual Persona Perception](https://arxiv.org/pdf/2004.05388.pdf) | [Code](https://github.com/SivilTaram/Persona-Dialogue-Generation) |
| TransferTransfo (Wolf et al. 2019) | 19.09 | 82.1 | 17.51 | [TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents](https://arxiv.org/pdf/1901.08149.pdf) | [Code](https://github.com/huggingface/transfer-learning-conv-ai) |
| Lost In Conversation | 17.79 | - | 17.3 | [NIPS 2018 Workshop Presentation](http://convai.io/NeurIPSParticipantSlides.pptx) | [Code](https://github.com/atselousov/transformer_chatbot) |
| Seq2Seq + Attention (Bahdanau et al. 2014) | 16.18 | 12.6 | 29.8 | [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf) | [Code](https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/seq2seq) |
| KV Profile Memory (Zhang et al. 2018) | 11.9 | 55.2 | - | [Personalizing Dialogue Agents: I have a dog, do you have pets too?](https://arxiv.org/pdf/1801.07243.pdf) | [Code](https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/kvmemnn) |

## Disentanglement

As noted for the Ubuntu data above, sometimes multiple conversations are mixed together in a single channel. Work on conversation disentanglement aims to separate out conversations. There are two main resources for the task.

This can be formulated as a clustering problem, with no clear best metric. Several metrics are considered:

- Variation of Information
- F-1 over 1-1 matched clusters using max-flow
- Precision, Recall, and F-score on exact match for clusters
- Local overlap
- Another form of F-1 defined by [Shen et al. (2006)](https://dl.acm.org/citation.cfm?doid=1148170.1148180)
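
Two of the metrics listed above can be sketched directly from their definitions. The sketch assumes each message carries one gold and one predicted conversation id. It computes the raw Variation of Information, where lower is better; the VI column in the tables below appears to use a rescaled variant where higher is better, which is not reproduced here. Function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def variation_of_information(gold, pred):
    """Raw VI = H(gold | pred) + H(pred | gold), in bits, for per-message cluster ids."""
    n = len(gold)
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    joint = Counter(zip(gold, pred))
    vi = 0.0
    for (g, p), n_gp in joint.items():
        vi -= (n_gp / n) * (math.log2(n_gp / gold_sizes[g]) + math.log2(n_gp / pred_sizes[p]))
    return vi


def to_clusters(labels):
    """Turn per-message cluster ids into a set of frozensets of message indices."""
    groups = defaultdict(set)
    for index, label in enumerate(labels):
        groups[label].add(index)
    return {frozenset(members) for members in groups.values()}


def exact_match_prf(gold, pred):
    """Precision, recall and F-score over clusters that match exactly."""
    gold_clusters, pred_clusters = to_clusters(gold), to_clusters(pred)
    matched = len(gold_clusters & pred_clusters)
    precision = matched / len(pred_clusters)
    recall = matched / len(gold_clusters)
    f_score = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f_score
```

For example, with `gold = [0, 0, 1, 1]` and `pred = [0, 0, 1, 2]`, only the first conversation is recovered exactly, so one of three predicted clusters and one of two gold clusters match.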

### Ubuntu IRC

Manually labeled by [Kummerfeld et al. (2019)](https://www.aclweb.org/anthology/P19-1374), this data is available [here](https://jkk.name/irc-disentanglement/).

| Model                                            | VI   | 1-1  | Precision | Recall | F-Score | Paper / Source | Code      |
| ------------------------------------------------ | :--: | :--: | :-------: | :----: | :-----: | ---------------| --------- |
| BERT + BiLSTM                                    | 93.3 |    - |      44.3 |   49.6 |    46.8 | Pre-Trained and Attention-Based Neural Networks for Building Noetic Task-Oriented Dialogue Systems | - |
| FF ensemble: Vote      (Kummerfeld et al., 2019) | 91.5 | 76.0 |      36.3 |   39.7 |    38.0 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Feedforward            (Kummerfeld et al., 2019) | 91.3 | 75.6 |      34.6 |   38.0 |    36.2 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| FF ensemble: Intersect (Kummerfeld et al., 2019) | 69.3 | 26.6 |      67.0 |   21.1 |    32.1 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Linear               (Elsner and Charniak, 2008) | 82.1 | 51.4 |      12.1 |   21.5 |    15.5 | [You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement](https://www.aclweb.org/anthology/P08-1095/) | [Code](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz) |
| Heuristic            (Lowe et al., 2015)         | 80.6 | 53.7 |      10.8 |    7.6 |     8.9 | [Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698) | [Code](https://github.com/npow/ubuntu-corpus) |

### Linux IRC

This data has been manually annotated three times:

- By [Elsner and Charniak (2008)](https://www.aclweb.org/anthology/P08-1095), available [here](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz).
- A portion by [Mehri and Carenini (2017)](https://aclweb.org/anthology/I17-1062/), available [here](http://shikib.com/td_annotations).
- By [Kummerfeld et al. (2019)](https://www.aclweb.org/anthology/P19-1374), available [here](https://jkk.name/irc-disentanglement/).

| Data | Model           | 1-1        | Local | Shen F-1 | Paper / Source | Code          |
| ---- | -------------   | :---------:| :---: | :------: | ---------------| ------------- |
| Kummerfeld | Linear     (Elsner and Charniak, 2008) | 59.7 | 80.8 | 63.0 | [You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement](https://www.aclweb.org/anthology/P08-1095/) | [Code](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz) |
| Kummerfeld | Feedforward (Kummerfeld et al., 2019)  | 57.7 | 80.3 | 59.8 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Kummerfeld | Heuristic   (Lowe et al., 2015)        | 43.4 | 67.9 | 50.7 | [Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698) | [Code](https://github.com/npow/ubuntu-corpus) |
| Elsner | Linear     (Elsner and Charniak, 2008)     | 53.1 | 81.9 | 55.1 | [You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement](https://www.aclweb.org/anthology/P08-1095/) | [Code](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz) |
| Elsner | Feedforward (Kummerfeld et al., 2019)      | 52.1 | 77.8 | 53.8 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Elsner | Wang and Oard (2009) | 47.0 | 75.1 | 52.8 | [Context-based Message Expansion for Disentanglement of Interleaved Text Conversations](https://www.aclweb.org/anthology/N09-1023/) | - |
| Elsner | Heuristic   (Lowe et al., 2015)            | 45.1 | 73.8 | 51.8 | [Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698) | [Code](https://github.com/npow/ubuntu-corpus) |


================================================
FILE: english/domain_adaptation.md
================================================
# Domain adaptation

## Sentiment analysis

### Multi-Domain Sentiment Dataset

The [Multi-Domain Sentiment Dataset](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) is a common
evaluation dataset for domain adaptation for sentiment analysis. It contains product reviews from
Amazon.com from different product categories, which are treated as distinct domains.
Reviews contain star ratings (1 to 5 stars) that are generally converted into binary labels. Models are
typically evaluated on a target domain that is different from the source domain they were trained on, while only
having access to unlabeled examples of the target domain (unsupervised domain adaptation). The evaluation
metric is accuracy, and scores are averaged across domains.
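
The label conversion and averaging described above can be sketched as follows. The threshold is an assumption (more than 3 stars is positive, fewer than 3 is negative, 3-star reviews are skipped), and the function names are hypothetical.

```python
def star_to_binary(stars):
    """Map a 1-5 star rating to 1 (positive) or 0 (negative).

    Assumption: 3-star reviews are treated as ambiguous and return None.
    """
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


def accuracy(predictions, labels):
    """Percentage of predictions that equal the gold binary labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def average_over_domains(per_domain_accuracy):
    """Unweighted mean of the per-target-domain accuracies."""
    return sum(per_domain_accuracy.values()) / len(per_domain_accuracy)


# Per-domain accuracies of the first row of the table below.
example_average = average_over_domains(
    {"DVD": 78.14, "Books": 74.86, "Electronics": 81.45, "Kitchen": 82.14}
)
```

The average is unweighted, so each target domain contributes equally regardless of how many reviews it contains.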

| Model           | DVD | Books | Electronics | Kitchen | Average |  Paper / Source |
| ------------- | :-----:| :-----:| :-----:| :-----:| :-----:| --- |
| Multi-task tri-training (Ruder and Plank, 2018) | 78.14 | 74.86 | 81.45 | 82.14 | 79.15 | [Strong Baselines for Neural Semi-supervised Learning under Domain Shift](https://arxiv.org/abs/1804.09530) |
| Asymmetric tri-training (Saito et al., 2017) | 76.17 | 72.97 | 80.47 | 83.97 | 78.39 | [Asymmetric Tri-training for Unsupervised Domain Adaptation](https://arxiv.org/abs/1702.08400) |
| VFAE (Louizos et al., 2015) | 76.57 | 73.40 | 80.53 | 82.93 | 78.36 | [The Variational Fair Autoencoder](https://arxiv.org/abs/1511.00830) |
| DANN (Ganin et al., 2016) | 75.40 | 71.43 | 77.67 | 80.53 | 76.26 | [Domain-Adversarial Training of Neural Networks](https://arxiv.org/abs/1505.07818) |

## Financial Technology and Natural Language Processing (FinNLP) 

The [FinNLP Progress](https://github.com/YangLinyi/FinNLP-Progress) is a repository to track the progress in Natural Language Processing (NLP) related to the domain of Finance, including the datasets, papers, and current state-of-the-art results for the most popular tasks. Examples include Financial Event Prediction, Financial Index Forecasting, Financial Risk Analysis, Financial Text Mining, Fraud Detection, etc.

[Go back to the README](../README.md)


================================================
FILE: english/entity_linking.md
================================================
# Entity Linking

## Task

Entity Linking (EL) is the task of recognizing (cf. [Named Entity Recognition](named_entity_recognition.md)) and disambiguating (Named Entity Disambiguation) named entities to a knowledge base (e.g. Wikidata, DBpedia, or YAGO). It is sometimes also simply known as Named Entity Recognition and Disambiguation.

EL can be split into two classes of approaches:
* *End-to-End*: processing a piece of text to extract the entities (i.e. Named Entity Recognition) and then disambiguate these extracted entities to the correct entry in a given knowledge base (e.g. Wikidata, DBpedia, YAGO).
* *Disambiguation-Only*: contrary to the first approach, this one directly takes gold standard named entities as input and only disambiguates them to the correct entry in a given knowledge base.

Example:

| Barack | Obama | was | born | in | Hawaii |
| --- | ---| --- | --- | --- | --- |
| https://en.wikipedia.org/wiki/Barack_Obama | https://en.wikipedia.org/wiki/Barack_Obama | O | O | O | https://en.wikipedia.org/wiki/Hawaii |

More details can be found in this [survey](http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/TKDE14-entitylinking.pdf).

## Current SOTA
[Raiman][Raiman] is the current SOTA in Cross-lingual Entity Linking for the WikiDisamb30 and TAC KBP 2010 datasets (note: [Mulang’ et al. 2020](https://arxiv.org/pdf/2008.05190.pdf) is the current SOTA for the CoNLL-AIDA dataset). They construct a type system and use it to constrain the outputs of a neural network to respect the symbolic structure. They achieve this by reformulating the design problem as a mixed integer problem: create a type system and subsequently train a neural network with it. They propose a 2-step algorithm: 1) heuristic search or stochastic optimization over discrete variables that define a type system informed by an Oracle and a Learnability heuristic, 2) gradient descent to fit classifier parameters. They apply their system, DeepType, to the problem of Entity Linking on three standard datasets (i.e. WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) and find that it outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system or recent deep learning-based entity embeddings, while explicitly using symbolic information lets it integrate new entities without retraining.

## Evaluation

### Metrics

#### Disambiguation-Only Approach

* Micro-Precision: Fraction of correctly disambiguated named entities in the full corpus.
* Macro-Precision: Fraction of correctly disambiguated named entities, averaged by document.
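
The difference between the two definitions above can be made concrete with a short sketch. It assumes each document is a list of `(predicted_entity, gold_entity)` pairs, one per gold mention; the data layout and function names are hypothetical.

```python
def micro_precision(documents):
    """Fraction of correctly disambiguated mentions, pooled over the whole corpus."""
    correct = sum(pred == gold for doc in documents for pred, gold in doc)
    total = sum(len(doc) for doc in documents)
    return correct / total


def macro_precision(documents):
    """Per-document fraction of correct mentions, averaged over non-empty documents."""
    per_doc = [sum(pred == gold for pred, gold in doc) / len(doc) for doc in documents if doc]
    return sum(per_doc) / len(per_doc)


# Toy corpus: a long document that is fully correct and a short one that is wrong.
toy_corpus = [
    [("Barack_Obama", "Barack_Obama"), ("Hawaii", "Hawaii"),
     ("Honolulu", "Honolulu"), ("United_States", "United_States")],
    [("Paris", "Paris_Hilton")],
]
```

On this toy corpus the micro score is 0.8 while the macro score is 0.5, because the macro average gives the one-mention document the same weight as the four-mention one.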

#### End-to-End Approach

* GERBIL Micro-F1 - strong matching: micro InKB F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the GERBIL platform. InKB means only mentions with valid KB entities are used for evaluation.
* GERBIL Macro-F1 - strong matching: macro InKB F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the GERBIL platform. InKB means only mentions with valid KB entities are used for evaluation.
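
A simplified sketch of these two scores is given below. It is not GERBIL's implementation: it assumes annotations are `(start, end, entity)` tuples, that an entity of `None` marks a mention without a valid KB entry, that strong matching requires identical boundaries and entity, and that the macro score averages per-document F1. A document with no annotations scores 0 here, which may differ from the platform.

```python
def prf(true_positives, n_predicted, n_gold):
    """Precision, recall and F1 from counts, returning 0.0 where undefined."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def in_kb(annotations):
    """Keep only annotations that point to a KB entity (InKB filtering)."""
    return {a for a in annotations if a[2] is not None}


def strong_f1(documents):
    """Return (micro F1, macro F1) over a list of (predicted, gold) annotation sets."""
    total_tp = total_pred = total_gold = 0
    doc_f1s = []
    for predicted, gold in documents:
        predicted, gold = in_kb(predicted), in_kb(gold)
        tp = len(predicted & gold)  # exact span and exact entity must agree
        total_tp += tp
        total_pred += len(predicted)
        total_gold += len(gold)
        doc_f1s.append(prf(tp, len(predicted), len(gold))[2])
    micro = prf(total_tp, total_pred, total_gold)[2]
    macro = sum(doc_f1s) / len(doc_f1s)
    return micro, macro


toy_documents = [
    ({(0, 12, "Barack_Obama"), (25, 31, "Hawaii")},
     {(0, 12, "Barack_Obama"), (25, 31, "Hawaii")}),
    ({(0, 5, "Paris")},
     {(0, 5, "Paris_Hilton"), (10, 15, None)}),
]
toy_micro, toy_macro = strong_f1(toy_documents)
```

The gold mention with entity `None` in the second document is dropped before scoring, which is what the InKB restriction amounts to in this sketch.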

### Datasets

#### AIDA CoNLL-YAGO Dataset

The [AIDA CoNLL-YAGO][AIDACoNLLYAGO] Dataset by [[Hoffart]](http://www.aclweb.org/anthology/D11-1072) contains assignments of entities to the mentions of named entities annotated for the original [[CoNLL]](http://www.aclweb.org/anthology/W03-0419.pdf) 2003 NER task. The entities are identified by [YAGO2](http://yago-knowledge.org/) entity identifier, by [Wikipedia URL](https://en.wikipedia.org/), or by [Freebase mid](http://wiki.freebase.com/wiki/Machine_ID).

##### Disambiguation-Only Models
   
| Model | Micro-Precision | Macro-Precision | Paper / Source | Code |
| ------------- | :-----:| :----: | :----: | --- |
| Mulang’ et al. (2020) | 94.94 | - | [Evaluating the Impact of Knowledge Graph Context on Entity Disambiguation Models](https://arxiv.org/pdf/2008.05190.pdf) | -  |
| Raiman et al. (2018) | 94.88 | - | [DeepType: Multilingual Entity Linking by Neural Type System Evolution](https://arxiv.org/pdf/1802.01021.pdf) | [Official](https://github.com/openai/deeptype) |
| Sil et al. (2018) | 94.0 | - | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | |
| Radhakrishnan et al. (2018) | 93.0 | 93.7 | [ELDEN: Improved Entity Linking using Densified Knowledge Graphs](http://aclweb.org/anthology/N18-1167) | |
| Le et al. (2018) | 93.07 | - | [Improving Entity Linking by Modeling Latent Relations between Mentions](http://aclweb.org/anthology/P18-1148) | [Official](https://github.com/lephong/mulrel-nel) |
| Ganea and Hofmann (2017) | 92.22 | - | [Deep Joint Entity Disambiguation with Local Neural Attention](https://www.aclweb.org/anthology/D17-1277) | [Link](https://github.com/dalab/deep-ed) |
| Hoffart et al. (2011) | 82.29 | 82.02 | [Robust Disambiguation of Named Entities in Text](http://www.aclweb.org/anthology/D11-1072) |  |

##### End-to-End Models
   
| Model | Micro-F1-strong | Macro-F1-strong | Paper / Source | Code |
| ------------- | :-----:| :----: | :----: | --- |
| van Hulst et al. (2020) | **83.3** | 81.3  | [REL: An Entity Linker Standing on the Shoulders of Giants](https://arxiv.org/abs/2006.01969) | [Official](https://github.com/informagi/REL) |
| Kolitsas et al. (2018) | 82.6 | **82.4** | [End-to-End Neural Entity Linking](https://arxiv.org/pdf/1808.07699.pdf) | [Official](https://github.com/dalab/end2end_neural_el) |
| Kannan Ravi et al. (2021) | 83.1| - | [CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata](https://arxiv.org/pdf/2101.09969.pdf) | [Official](https://github.com/ManojPrabhakar/CHOLAN) |
| Piccinno et al. (2014) | 70.8 | 73.0 | [From TagME to WAT: a new entity annotator](https://dl.acm.org/citation.cfm?id=2634350) | |
| Hoffart et al. (2011) | 71.9 | 72.8 | [Robust Disambiguation of Named Entities in Text](http://www.aclweb.org/anthology/D11-1072) | |

#### TAC KBP English Entity Linking Comprehensive and Evaluation Data 2010 

The Knowledge Base Population (KBP) Track at [TAC 2010](https://tac.nist.gov/2010) explored the extraction of information about entities with reference to an external knowledge source. Using a basic schema for persons, organizations, and locations, nodes in an ontology had to be created and populated using unstructured information found in text. A collection of [Wikipedia Infoboxes](http://en.wikipedia.org/wiki/Help:Infobox) served as a rudimentary initial knowledge representation. You can download the dataset from [LDC](https://www.ldc.upenn.edu/) or [here](https://github.com/ChrisLeeJ/TAC_KBP_English_EL_2010).

##### Disambiguation-Only Models

| Model | Micro-Precision | Macro-Precision | Paper / Source | Code |
| ------------- | :-----:| :----: | :----: | --- |
| Raiman et al. (2018) | 90.85 | - | [DeepType: Multilingual Entity Linking by Neural Type System Evolution](https://arxiv.org/pdf/1802.01021.pdf) | [Official](https://github.com/openai/deeptype) |
| Sil et al. (2018) | 87.4 | - | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) |      |
| Yamada et al. (2016) | 85.2 | - | [Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation](https://arxiv.org/pdf/1601.01343.pdf) |      |

### Platforms

Evaluating Entity Linking systems in a manner that allows for direct comparison of performance can be difficult. The precise definition of a "correct" annotation can be somewhat subjective and it is easy to make mistakes. To provide a simple example, given the input surface form **"Tom Waits"**, an evaluation dataset might record the DBpedia resource `http://dbpedia.org/resource/Tom_Waits` as the correct referent. Yet an annotation system which returns a reference to `http://dbpedia.org/resource/PEHDTSCKJBMA` has technically provided an appropriate annotation, as this resource is a redirect to `http://dbpedia.org/resource/Tom_Waits`. Alternatively, when evaluating an End-to-End EL system, accuracy with respect to word boundaries must be considered: if a system only annotates **"Obama"** with the URI `http://dbpedia.org/resource/Barack_Obama` in the surface form **"Barack Obama"**, is the system correct or incorrect in its annotation?
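
Both issues can be made explicit in code. The sketch below is an illustration only: the redirect table is a hand-written dictionary rather than a real DBpedia lookup, spans are `(start, end)` character offsets, and the function names are hypothetical.

```python
def resolve(uri, redirects):
    """Follow redirects until a canonical URI is reached (guards against cycles)."""
    seen = set()
    while uri in redirects and uri not in seen:
        seen.add(uri)
        uri = redirects[uri]
    return uri


def same_entity(predicted_uri, gold_uri, redirects):
    """Compare two URIs after resolving redirects on both sides."""
    return resolve(predicted_uri, redirects) == resolve(gold_uri, redirects)


def spans_match(predicted_span, gold_span, strong=True):
    """Strong matching needs identical (start, end); weak matching accepts any overlap."""
    if strong:
        return predicted_span == gold_span
    return predicted_span[0] < gold_span[1] and gold_span[0] < predicted_span[1]


# Hand-written redirect table standing in for a KB lookup (assumption).
toy_redirects = {
    "http://dbpedia.org/resource/PEHDTSCKJBMA": "http://dbpedia.org/resource/Tom_Waits",
}

# "Barack Obama" occupies characters 0-12; a system that tags only "Obama" returns 7-12.
gold_span, predicted_span = (0, 12), (7, 12)
```

With this setup, the redirect URI counts as the same entity as `Tom_Waits`, and the "Obama" annotation is rejected under strong matching but accepted under weak matching, which is why the matching mode has to be stated alongside any reported score.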

Furthermore, the performance of an EL system can be strongly affected by the nature of the content on which the evaluation is performed, e.g. news content versus Tweets. Hence comparing the relative performance of two EL systems which have been tested on two different corpora can be fallacious. Rather than allowing these little subjective points to creep into the evaluation of EL systems, it is better to make use of a standard evaluation platform where these assumptions are known and made explicit in the configuration of the experiment.

[GERBIL][GERBIL], developed by [AKSW][AKSW], is an evaluation platform that is based on the [BAT framework][Cornolti]. It defines a number of standard experiments which may be run for any given EL service. These experiment types determine how strict the evaluation is with respect to measures such as word boundary alignment and also dictate how much responsibility is assigned to the EL service with respect to Entity Recognition, etc. GERBIL hosts 38 evaluation datasets obtained from a variety of different EL challenges. At present it also has hooks for 17 different EL services which may be included in an experiment.

GERBIL may be used to test your own EL system either by downloading the source code and deploying GERBIL locally, or by making your service available on the web and giving GERBIL a link to your API endpoint. The only condition is that your API must accept input and respond with output in [NIF][NIF] format. It is also possible to upload your own evaluation dataset if you would like to test these services on your own content. Note that the dataset must also be in NIF format. The [DBpedia Spotlight evaluation dataset][SpotlightEvaluation] is a good example of how to structure your content.

GERBIL does have a number of shortcomings, the most notable of which are:
1. There is no way to view the annotations returned by each system you test. These are handled internally by GERBIL and then discarded. This can make it difficult to determine the source of error with an EL system.
2. There is no way to observe the candidate list considered for each surface form. This is, of course, a standard problem with any third party EL API, but if one is conducting a detailed investigation into the performance of an EL system, it is important to know if the source of error was the EL algorithm itself, or the candidate retrieval process which failed to identify the correct referent as a candidate. This was listed as an important consideration by [Hachey et al][Hachey].

Nevertheless, GERBIL is an excellent resource for standardising how EL systems are tested and compared. It is also a good starting point for anyone new to Entity Linking as it contains links to a wide variety of EL resources. For more information, see the research paper by [[Usbeck]](http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf).

## References

[Hoffart] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust Disambiguation of Named Entities in Text. EMNLP 2011. http://www.aclweb.org/anthology/D11-1072

[CoNLL] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL 2003. http://www.aclweb.org/anthology/W03-0419.pdf

[Usbeck] Usbeck et al. GERBIL - General Entity Annotator Benchmarking Framework. WWW 2015. http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf

[Go back to the README](../README.md)

[Sil]: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101 "Neural Cross-Lingual Entity Linking"
[Shen]: http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/TKDE14-entitylinking.pdf "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions"
[AIDACoNLLYAGO]: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/ "AIDA CoNLL-YAGO Dataset"
[YAGO2]: http://yago-knowledge.org/ "YAGO2"
[Wikipedia]: https://en.wikipedia.org/ "Wikipedia"
[Freebase]: http://wiki.freebase.com/wiki/Machine_ID "Freebase"
[Radhakrishnan]: http://aclweb.org/anthology/N18-1167 "ELDEN: Improved Entity Linking using Densified Knowledge Graphs"
[Le]: https://arxiv.org/abs/1804.10637
[NIF]: http://persistence.uni-leipzig.org/nlp2rdf/ "NLP Interchange Format"
[SpotlightEvaluation]: http://apps.yovisto.com/labs/ner-benchmarks/data/dbpedia-spotlight-nif.ttl "GERBIL DBpedia Spotlight Dataset"
[Cornolti]: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40749.pdf "A Framework for Benchmarking Entity-Annotation Systems"
[GERBIL]: http://aksw.org/Projects/GERBIL.html "General Entity Annotator Benchmarking framework"
[AKSW]: http://aksw.org/About.html "Agile Knowledge Engineering and Semantic Web"
[Hachey]: http://benhachey.info/pubs/hachey-aij12-evaluating.pdf "Evaluating Entity Linking with Wikipedia"
[Raiman]: https://arxiv.org/pdf/1802.01021.pdf "DeepType: Multilingual Entity Linking by Neural Type System Evolution"


================================================
FILE: english/grammatical_error_correction.md
================================================
# Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors. 

GEC is typically formulated as a sentence correction task. A GEC system takes a potentially erroneous sentence as input and is expected to transform it to its corrected version. See the example given below: 

| Input (Erroneous)          | Output (Corrected)     |
| -------------------------  | ---------------------- |
|She see Tom is catched by policeman in park at last night. | She saw Tom caught by a policeman in the park last night.|

### CoNLL-2014 Shared Task

The [CoNLL-2014 shared task test set](https://www.comp.nus.edu.sg/~nlp/conll14st/conll14st-test-data.tar.gz) is the most widely used dataset to benchmark GEC systems. The test set contains 1,312 English sentences with error annotations by 2 expert annotators. Models are evaluated with the MaxMatch scorer ([Dahlmeier and Ng, 2012](http://www.aclweb.org/anthology/N12-1067)), which computes a span-based F<sub>β</sub>-score (β is set to 0.5 to weight precision twice as much as recall).
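
As a rough illustration of the F<sub>β</sub> formula only, the sketch below computes the score from edit counts. It does not implement the MaxMatch edit alignment itself; the function name, the argument names, and the convention of returning 1.0 for precision or recall when the denominator is zero are assumptions of this sketch.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from counts of correct (tp), spurious (fp), and missed (fn) edits."""
    # Assumed convention: an empty denominator yields a perfect precision/recall.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# 6 correct edits, 2 spurious edits, 6 missed edits:
# precision = 0.75, recall = 0.5
print(round(f_beta(6, 2, 6), 4))            # F0.5
print(round(f_beta(6, 2, 6, beta=1.0), 4))  # F1, for comparison
```

With β = 0.5 the score sits closer to precision than F1 does, so a system that proposes fewer but more reliable edits is rewarded.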

The shared task setting requires that systems use only publicly available datasets for training, to ensure a fair comparison between systems. The highest published scores on the CoNLL-2014 test set are given below. A distinction is made between papers that report results in the restricted CoNLL-2014 shared task setting of training using publicly available training datasets only (_**Restricted**_) and those that made use of large, non-public datasets (_**Unrestricted**_).

**Restricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Majority-voting ensemble (7 systems) (Omelianchuk et al., BEA 2024) | 72.8 | [Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models](https://arxiv.org/abs/2404.14914) | [official](https://github.com/grammarly/pillars-of-gec) |
| GRECO (Qorib and Ng, EMNLP 2023) | 71.12 | [System Combination via Quality Estimation for Grammatical Error Correction](https://aclanthology.org/2023.emnlp-main.785) | [official](https://github.com/nusnlp/greco) |
| ESC (Qorib et al., NAACL 2022) | 69.51 | [Frustratingly Easy System Combination for Grammatical Error Correction](https://aclanthology.org/2022.naacl-main.143/) | [official](https://github.com/nusnlp/esc) |
| T5 ([t5.1.1.xxl](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md)) trained on [cLang-8](https://github.com/google-research-datasets/clang8) (Rothe et al., ACL-IJCNLP 2021) | 68.87 | [A Simple Recipe for Multilingual Grammatical Error Correction](https://arxiv.org/pdf/2106.03830.pdf) | [T5](https://github.com/google-research/text-to-text-transfer-transformer), [cLang-8](https://github.com/google-research-datasets/clang8) |
| Tagged corruptions - ensemble (Stahlberg and Kumar, 2021)| 68.3 | [Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models](https://www.aclweb.org/anthology/2021.bea-1.4.pdf)| [Official](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa + ELECTRA + RoBERTa ensemble (Mesham et al., EACL 2023) | 67.93 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| TMTC (Lai et al., ACL Findings 2022) | 67.02 | [Type-Driven Multi-Turn Corrections for Grammatical Error Correction](https://aclanthology.org/2022.findings-acl.254) | [official](https://github.com/DeepLearnXMU/TMTC) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + (BERT, RoBERTa, XLNet), ensemble (Omelianchuk et al., BEA 2020) | 66.5 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| Shallow Aggressive Decoding with BART (12+2), single model (beam=1) (Sun et al., ACL 2021) | 66.4 | [Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding](https://aclanthology.org/2021.acl-long.462.pdf) | [Official](https://github.com/AutoTemp/Shallow-Aggressive-Decoding) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa (Mesham et al., EACL 2023) | 66.06 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| DeBERTa(L) + RoBERTa(L) + XLNet (Tarnavskyi et al., ACL 2022) | 65.3 | [Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction](https://aclanthology.org/2022.acl-long.266) | [Official](https://github.com/MaksTarnavskyi/gector-large) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + XLNet, single model (Omelianchuk et al., BEA 2020) | 65.3 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| Transformer + Pre-train with Pseudo Data + BERT (Kaneko et al., ACL 2020) | 65.2 | [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction](https://arxiv.org/pdf/2005.00987.pdf) | [Official](https://github.com/kanekomasahiro/bert-gec) |
| Transformer + Pre-train with Pseudo Data (Kiyono et al., EMNLP 2019) | 65.0 | [An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction](https://arxiv.org/abs/1909.00502) | [Official](https://github.com/butsugiri/gec-pseudodata) |
| Seq2Edits ensemble + Full sequence rescoring  (Stahlberg and Kumar, EMNLP 2020) | 62.7 | [Seq2Edits: Sequence Transduction Using Span-level Edit Operations](https://aclanthology.org/2020.emnlp-main.418.pdf) | [Official](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/transformer_seq2edits.py) |
| Sequence Labeling with edits using BERT, Faster inference (Ensemble)  (Awasthi et al., EMNLP 2019) | 61.2 | [Parallel Iterative Edit Models for Local Sequence Transduction](https://www.aclweb.org/anthology/D19-1435.pdf) | [Official](https://github.com/awasthiabhijeet/PIE) |
| Copy-Augmented Transformer + Pre-train (Zhao and Wang, NAACL 2019) | 61.15 | [Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data](https://arxiv.org/pdf/1903.00138.pdf) | [Official](https://github.com/zhawe01/fairseq-gec) |
| Sequence Labeling with edits using BERT, Faster inference (Single Model) (Awasthi et al., EMNLP 2019) | 59.7 | [Parallel Iterative Edit Models for Local Sequence Transduction](https://www.aclweb.org/anthology/D19-1435.pdf) | [Official](https://github.com/awasthiabhijeet/PIE) |
| CNN Seq2Seq + Quality Estimation (Chollampatt and Ng, EMNLP 2018) | 56.52 | [Neural Quality Estimation of Grammatical Error Correction](http://aclweb.org/anthology/D18-1274) | [Official](https://github.com/nusnlp/neuqe/) |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  56.25 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
| Transformer (Junczys-Dowmunt et al., 2018) | 55.8 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| [Official](https://github.com/grammatical/neural-naacl2018) |
| CNN Seq2Seq (Chollampatt and Ng, 2018)| 54.79 | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |

**Unrestricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| CNN Seq2Seq + Fluency Boost (Ge et al., 2018) |  61.34 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |

_**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.


### CoNLL-2014 10 Annotations

[Bryant and Ng, 2015](http://aclweb.org/anthology/P15-1068) released 8 additional annotations (in addition to the two official annotations) for the CoNLL-2014 shared task test set ([link](http://www.comp.nus.edu.sg/~nlp/sw/10gec_annotations.zip)).

**Restricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| GRECO (Qorib and Ng, EMNLP 2023) | 85.21 | [System Combination via Quality Estimation for Grammatical Error Correction](https://aclanthology.org/2023.emnlp-main.785/) | [official](https://github.com/nusnlp/greco) |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  72.04 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
| CNN Seq2Seq (Chollampatt and Ng, 2018)| 70.14 (measured by Ge et al., 2018) | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |

**Unrestricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| CNN Seq2Seq + Fluency Boost (Ge et al., 2018) |  76.88 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |

_**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.


### JFLEG

The [JFLEG test set](https://github.com/keisks/jfleg), released by [Napoles et al., 2017](http://aclweb.org/anthology/E17-2037), consists of 747 English sentences with 4 references for each sentence. Models are evaluated with the [GLEU](https://github.com/cnap/gec-ranking/) metric ([Napoles et al., 2016](https://arxiv.org/pdf/1605.02592.pdf)).


**Restricted**:  

| Model           | GLEU  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Tagged corruptions (Stahlberg and Kumar, 2021)| 64.7 | [Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models](https://www.aclweb.org/anthology/2021.bea-1.4.pdf)| [Official](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) |
| Transformer + Pre-train with Pseudo Data + BERT (Kaneko et al., ACL 2020) | 62.0 | [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction](https://arxiv.org/pdf/2005.00987.pdf) | [Official](https://github.com/kanekomasahiro/bert-gec) |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  61.50 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
| Transformer (Junczys-Dowmunt et al., 2018) | 59.9 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| NA |
| CNN Seq2Seq (Chollampatt and Ng, 2018)| 57.47 | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |


**Unrestricted**:

| Model           | GLEU  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| CNN Seq2Seq + Fluency Boost and inference (Ge et al., 2018) |  62.42 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |

_**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.


### BEA Shared Task - 2019
The [BEA shared task - 2019 dataset](https://www.cl.cam.ac.uk/research/nl/bea2019st/), released for the BEA Shared Task on Grammatical Error Correction, provides a newer and bigger dataset for evaluating GEC models in 3 tracks, based on the datasets used for training:
- [Restricted track](https://competitions.codalab.org/competitions/20228)
- [Unrestricted track](https://competitions.codalab.org/competitions/20229)
- [Low-resource track](https://competitions.codalab.org/competitions/20230)   


Training and dev sets are released publicly, and a GEC model's performance is evaluated by F0.5 score. The model outputs on the test set have to be uploaded to Codalab (publicly available), where category-wise error metrics are displayed. The test set consists of 4477 sentences (larger and more diverse than the CoNLL-14 dataset) and the outputs are scored via the [ERRANT](https://github.com/chrisjbryant/errant) toolkit. The released data are collected from 2 sources: 
  - Write & Improve, an online web platform that assists non-native English students with their writing.
  - LOCNESS, a corpus consisting of essays written by native English students.   



The description of tracks from the BEA [site](https://www.cl.cam.ac.uk/research/nl/bea2019st/#tracks) is given below:   


_**Restricted Track:**_
In the restricted track, participants may only use the following learner datasets:
  - FCE (Yannakoudakis et al., 2011)
  - Lang-8 Corpus of Learner English (Mizumoto et al., 2011; Tajiri et al., 2012)
  - NUCLE (Dahlmeier et al., 2013)
  - W&I+LOCNESS (Bryant et al., 2019; Granger, 1998)   

Note that we restrict participants to the preprocessed Lang-8 Corpus of Learner English rather than the raw, multilingual Lang-8 Learner Corpus because participants would otherwise need to filter the raw corpus themselves. We also do not allow the use of the CoNLL 2013/2014 shared task test sets in this track.   


_**Unrestricted Track:**_
In the unrestricted track, participants may use anything and everything to build their systems. This includes proprietary datasets and software.   


_**Low Resource Track (formerly Unsupervised Track):**_
In the low resource track, participants may only use the following learner dataset: W&I+LOCNESS development set.   

Since current state-of-the-art systems rely on as much annotated learner data as possible to reach the best performance, the goal of the low resource track is to encourage research into systems that do not rely on large amounts of learner data. This track should be of particular interest to researchers working on GEC for languages where large learner corpora do not exist.   


### Results on WI-LOCNESS test set:
**Restricted track**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Majority-voting ensemble (7 systems) (Omelianchuk et al., BEA 2024) | 81.4 | [Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models](https://arxiv.org/abs/2404.14914) | [official](https://github.com/grammarly/pillars-of-gec) |
| GRECO (Qorib and Ng, EMNLP 2023) | 80.84 | [System Combination via Quality Estimation for Grammatical Error Correction](https://aclanthology.org/2023.emnlp-main.785) | [official](https://github.com/nusnlp/greco) |
| ESC (Qorib et al., NAACL 2022) | 79.90| [Frustratingly Easy System Combination for Grammatical Error Correction](https://aclanthology.org/2022.naacl-main.143/) | [official](https://github.com/nusnlp/esc) |
| TMTC (Lai et al., ACL Findings 2022) | 77.93 | [Type-Driven Multi-Turn Corrections for Grammatical Error Correction](https://aclanthology.org/2022.findings-acl.254) | [official](https://github.com/DeepLearnXMU/TMTC) |
| RedPenNet (Didenko & Sameliuk, UNLP 2023) | 77.60 | [RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans](https://aclanthology.org/2023.unlp-1.15/) | [official](https://github.com/WebSpellChecker/unlp-2023-shared-task) |
| RoBERTa(L) + EditScorer (Sorokin, EMNLP 2022) | 77.1 | [Improved grammatical error correction by ranking elementary edits](https://aclanthology.org/2022.emnlp-main.785) | [official](https://github.com/AlexeySorokin/EditScorer) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa + ELECTRA + RoBERTa ensemble (Mesham et al., EACL 2023) | 76.17 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| DeBERTa(L) + RoBERTa(L) + XLNet (Tarnavskyi et al., ACL 2022) | 76.05 | [Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction](https://aclanthology.org/2022.acl-long.266) | [Official](https://github.com/MaksTarnavskyi/gector-large) |
| GECToR large without synthetic pre-training - ensemble (Tarnavskyi and Omelianchuk, 2021) | 76.05 | [Improving Sequence Tagging for Grammatical Error Correction](https://er.ucu.edu.ua/handle/1/2707) | [Official](https://github.com/MaksTarnavskyi/gector-large) |
| T5 ([t5.1.1.xxl](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md)) trained on [cLang-8](https://github.com/google-research-datasets/clang8) (Rothe et al., ACL-IJCNLP 2021) | 75.88 | [A Simple Recipe for Multilingual Grammatical Error Correction](https://arxiv.org/pdf/2106.03830.pdf) | [T5](https://github.com/google-research/text-to-text-transfer-transformer), [cLang-8](https://github.com/google-research-datasets/clang8) |
| Tagged corruptions - ensemble (Stahlberg and Kumar, 2021)| 74.9 | [Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models](https://www.aclweb.org/anthology/2021.bea-1.4.pdf)| [Official](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + (BERT, RoBERTa, XLNet), ensemble (Omelianchuk et al., BEA 2020) | 73.6 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| BEA Combination | 73.18 | [Learning to Combine Grammatical Error Corrections](https://www.aclweb.org/anthology/W19-4414/) | [official](https://github.com/IBM/learning-to-combine-grammatical-error-corrections) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa (Mesham et al., EACL 2023) | 73.09 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| Shallow Aggressive Decoding with BART (12+2), single model (beam=1) (Sun et al., ACL 2021) | 72.9 | [Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding](https://aclanthology.org/2021.acl-long.462.pdf) | [Official](https://github.com/AutoTemp/Shallow-Aggressive-Decoding) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + XLNet, single model (Omelianchuk et al., BEA 2020) | 72.4 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| Transformer + Pre-train with Pseudo Data (Kiyono et al., EMNLP 2019) | 70.2 | [An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction](https://arxiv.org/abs/1909.00502) | NA |
| Transformer + Pre-train with Pseudo Data + BERT (Kaneko et al., ACL 2020) | 69.8 | [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction](https://arxiv.org/pdf/2005.00987.pdf) | [Official](https://github.com/kanekomasahiro/bert-gec) |
| Transformer | 69.47  | [Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data](https://www.aclweb.org/anthology/W19-4427)| [Official: Code to be updated soon](https://github.com/grammatical/pretraining-bea2019) |
| Transformer | 69.00  | [A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning](https://www.aclweb.org/anthology/W19-4423)| [Official](https://github.com/kakaobrain/helo_word/) |
| Ensemble of models | 66.78  | [The LAIX Systems in the BEA-2019 GEC Shared Task](https://www.aclweb.org/anthology/W19-4416)| NA |

**Low-resource track**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Transformer | 64.24  | [Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data](https://www.aclweb.org/anthology/W19-4427)| [Official: Code to be updated soon](https://github.com/grammatical/pretraining-bea2019) |
| Transformer | 58.80  | [A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning](https://www.aclweb.org/anthology/W19-4423)| [Official](https://github.com/kakaobrain/helo_word/) |
| Ensemble of models | 51.81  | [The LAIX Systems in the BEA-2019 GEC Shared Task](https://www.aclweb.org/anthology/W19-4416)| NA |

 
 **Reference**:
 - Helen Yannakoudakis, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, Torsten Zesch, in [Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications](https://www.aclweb.org/anthology/W19-44)
 - Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of Error Types for Grammatical Error Correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.


================================================
FILE: english/information_extraction.md
================================================
# Information Extraction

## Open Knowledge Graph Canonicalization

Open Information Extraction approaches lead to the creation of large knowledge bases (KBs) from the web. The problem with such methods is that their entities and relations are not canonicalized, which leads to the storage of redundant and ambiguous facts. For example, an Open KB storing *\<Barack Obama, was born in, Honolulu\>* and *\<Obama, took birth in, Honolulu\>* doesn't know that *Barack Obama* and *Obama* refer to the same entity. Similarly, *took birth in* and *was born in* refer to the same relation. The problem of Open KB canonicalization involves identifying groups of equivalent entities and relations in the KB.
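
The effect of canonicalization on the example above can be sketched as follows. The clusters are written by hand here, whereas discovering them automatically is the actual task; the function name and the choice of the first cluster member as the canonical form are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def canonicalize(
    triples: List[Triple],
    np_clusters: List[List[str]],
    rel_clusters: List[List[str]],
) -> List[Triple]:
    """Rewrite triples with one representative per cluster and drop duplicates."""
    # Assumption: the first member of each cluster serves as its canonical form.
    np_rep: Dict[str, str] = {m: c[0] for c in np_clusters for m in c}
    rel_rep: Dict[str, str] = {m: c[0] for c in rel_clusters for m in c}

    seen = set()
    result: List[Triple] = []
    for subj, rel, obj in triples:
        # Phrases that belong to no cluster are kept unchanged.
        canon = (np_rep.get(subj, subj), rel_rep.get(rel, rel), np_rep.get(obj, obj))
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result


triples = [
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "took birth in", "Honolulu"),
]
np_clusters = [["Barack Obama", "Obama"]]
rel_clusters = [["was born in", "took birth in"]]
print(canonicalize(triples, np_clusters, rel_clusters))
```

Once both noun phrases and relation phrases are mapped to representatives, the two redundant facts collapse into a single triple.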

### Datasets 

| Datasets                                 | # Gold Entities | #NPs  | #Relations | #Triples |
| ---------------------------------------- | :-------------: | ----- | ---------- | -------- |
| [Base](https://suchanek.name/work/publications/cikm2014.pdf) |       150       | 290   | 3K         | 9K       |
| [Ambiguous](https://suchanek.name/work/publications/cikm2014.pdf) |       446       | 717   | 11K        | 37K      |
| [ReVerb45K](https://github.com/malllabiisc/cesi) |      7.5K       | 15.5K | 22K        | 45K      |

### Noun Phrase Canonicalization

| **Model**                     |               | Base Dataset |        |               | Ambiguous dataset |        |               | ReVerb45k  |        | **Paper**/Source                         |
| :---------------------------- | :-----------: | :----------: | :----: | :-----------: | :---------------: | ------ | :-----------: | :--------: | :----: | ---------------------------------------- |
|                               | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** |                                          |
| CESI (Vashishth et al., 2018) |     98.2      |     99.8     |  99.9  |     66.2      |       92.4        | 91.9   |     62.7      |    84.4    |  81.9  | [CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information](https://github.com/malllabiisc/cesi) |
| Galárraga et al., 2014 (IDF)  |     94.8      |     97.9     |  98.3  |     67.9      |       82.9        | 79.3   |     71.6      |    50.8    |  0.5   | [Canonicalizing Open Knowledge Bases](https://suchanek.name/work/publications/cikm2014.pdf) |

[Go back to the README](../README.md)


================================================
FILE: english/intent_detection_slot_filling.md
================================================
# Intent Detection and Slot Filling
Intent Detection and Slot Filling is the task of interpreting user commands/queries by extracting the intent and the relevant slots.

Example (from ATIS):
```
Query: What flights are available from pittsburgh to baltimore on thursday morning
Intent: flight info
Slots: 
    - from_city: pittsburgh
    - to_city: baltimore
    - depart_date: thursday
    - depart_time: morning
```

## ATIS
ATIS (Air Travel Information System) (Hemphill et al.) is a dataset distributed with Microsoft CNTK and available from the [GitHub page](https://github.com/microsoft/CNTK/tree/master/Examples/LanguageUnderstanding/ATIS). The slots are labeled in the BIO ([Inside Outside Beginning](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))) format (similar to NER). This dataset contains only air travel related commands. Most of the ATIS results are based on the work [here](https://github.com/zhenwenzhang/Slot_Filling).
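
To illustrate how BIO tags map back to slots, here is a minimal sketch that decodes a tagged query. The slot names follow the example at the top of this page rather than the label set shipped with the dataset, and the function name is hypothetical.

```python
from typing import List, Tuple


def bio_to_slots(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (slot_name, text) pairs from token-level BIO tags."""
    slots: List[Tuple[str, str]] = []
    name, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new slot begins; close the one in progress, if any.
            if name is not None:
                slots.append((name, " ".join(words)))
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            # Continuation of the current slot.
            words.append(token)
        else:
            # "O" tag or an I- tag that does not continue the open slot:
            # close the open slot and ignore this token.
            if name is not None:
                slots.append((name, " ".join(words)))
            name, words = None, []
    if name is not None:
        slots.append((name, " ".join(words)))
    return slots


tokens = "what flights are available from pittsburgh to baltimore on thursday morning".split()
tags = ["O", "O", "O", "O", "O", "B-from_city", "O", "B-to_city", "O",
        "B-depart_date", "B-depart_time"]
print(bio_to_slots(tokens, tags))
```

Slot F1 in the tables below is computed over such decoded spans, so a slot counts as correct only when both its name and its boundaries match.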

| Model | Slot F1 Score | Intent Accuracy | Paper / Source | Code |
| ------ | ------ | ------ | ------ | ------ |
| Bi-model with decoder | 96.89 | 98.99  | [A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling](https://arxiv.org/abs/1812.10235) | |
| CTRAN | 98.46 | 98.07  | [CTRAN: CNN-Transformer-based network for natural language understanding](https://www.sciencedirect.com/science/article/abs/pii/S0952197623011971) | [Official](https://github.com/rafiepour/CTran/)|
| SlotRefine + BERT | 96.16 | 97.74  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| SlotRefine | 96.22 | 97.11  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| Stack-Propagation + BERT | 96.10 | 97.50 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| JointBERT-CAE | 96.1 | 97.50 | [CAE: Mechanism to Diminish the Class Imbalanced in SLU Slot Filling Task](https://link.springer.com/chapter/10.1007/978-3-031-16210-7_12)|[Official](https://github.com/phuongnm94/JointBERT_CAE)|
| Co-interactive Transformer | 95.90 | 97.70 | [A Co-Interactive Transformer for Joint Slot Filling and Intent Detection](https://arxiv.org/abs/2010.03880)|[Official](https://github.com/kangbrilliant/DCA-Net)|
| Heterogeneous Attention | 95.58 | 97.76 | [Joint agricultural intent detection and slot filling based on enhanced heterogeneous attention mechanism](https://www.sciencedirect.com/science/article/abs/pii/S0168169923001448)|  |
| Stack-Propagation | 95.90 | 96.90 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| Attention Encoder-Decoder NN | 95.87 | 98.43 | [Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling](https://arxiv.org/abs/1609.01454)| |
| SF-ID (BLSTM) network | 95.80 | 97.76 | [A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling](https://arxiv.org/abs/1907.00390) | [Official](https://github.com/ZephyrChenzf/SF-ID-Network-For-NLU) |
| Context Encoder | 95.80 | NA | [Improving Slot Filling by Utilizing Contextual Information](https://arxiv.org/pdf/1911.01680.pdf) | |
| Capsule-NLU | 95.20 | 95.00 | [Joint Slot Filling and Intent Detection via Capsule Neural Networks](https://arxiv.org/abs/1812.09471) | [Official](https://github.com/czhang99/Capsule-NLU) |
| Joint GRU model(W) | 95.49 | 98.10  |[A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding](https://www.ijcai.org/Proceedings/16/Papers/425.pdf)| |
| Slot-Gated BLSTM with Attention | 95.20 | 94.10 | [Slot-Gated Modeling for Joint Slot Filling and Intent Prediction](https://www.csie.ntu.edu.tw/~yvchen/doc/NAACL18_SlotGated.pdf)| [Official](https://github.com/MiuLab/SlotGated-SLU) |
| Joint model with recurrent slot label context  | 94.64 |  98.40 | [Joint Online Spoken Language Understanding and Language Modeling with Recurrent Neural Networks](https://arxiv.org/pdf/1609.01462.pdf) | [Official](https://github.com/HadoopIt/joint-slu-lm) |
| Recursive NN  | 93.96 | 95.40 | [JOINT SEMANTIC UTTERANCE CLASSIFICATION AND SLOT FILLING WITH RECURSIVE NEURAL NETWORKS](https://www.microsoft.com/en-us/research/wp-content/uploads/2014/12/RecNNSLU.pdf) | |
| Encoder-labeler Deep LSTM | 95.66 | NA  | [Leveraging Sentence-level Information with Encoder LSTM for Natural Language Understanding](https://arxiv.org/abs/1601.01530) | |
| RNN with Label Sampling  | 94.89 | NA | [Recurrent Neural Network Structured Output Prediction for Spoken Language Understanding](http://speech.sv.cmu.edu/publications/liu-nipsslu-2015.pdf) | |
| Hybrid RNN | 95.06 | NA | [Using recurrent neural networks for slot filling in spoken language understanding.](http://www.iro.umontreal.ca/~lisa/pointeurs/taslp_RNNSLU_final_doubleColumn.pdf) | |
| RNN-EM | 95.25 |  NA  | [Recurrent neural networks with external memory for language understanding](https://arxiv.org/abs/1506.00195) |
| CNN-CRF | 94.35 | NA  | [Convolutional neural network based triangular crf for joint intent detection and slot filling](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/12/IEEE-ASRU-2013.pdf) | |


## SNIPS
SNIPS is a dataset by Snips.ai for benchmarking intent detection and slot filling. It is available from the [GitHub page](https://github.com/snipsco/nlu-benchmark). The dataset contains several day-to-day user command categories (e.g. play a song, book a restaurant).

| Model | Slot F1 Score | Intent Accuracy | Paper / Source | Code |
| ------ | ------ | ------ | ------ | ------ |
| CTRAN | 98.30 | 99.42  | [CTRAN: CNN-Transformer-based Network for Natural Language Understanding](https://www.sciencedirect.com/science/article/abs/pii/S0952197623011971) | [Official](https://github.com/rafiepour/CTran/)|
| SlotRefine + BERT | 97.05 | 99.04  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| Stack-Propagation + BERT | 97.00 | 99.00 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| JointBERT-CAE | 97.00 | 98.30 | [CAE: Mechanism to Diminish the Class Imbalanced in SLU Slot Filling Task](https://link.springer.com/chapter/10.1007/978-3-031-16210-7_12)|[Official](https://github.com/phuongnm94/JointBERT_CAE)|
| Heterogeneous Attention | 96.32 | 98.29 | [Joint agricultural intent detection and slot filling based on enhanced heterogeneous attention mechanism](https://www.sciencedirect.com/science/article/abs/pii/S0168169923001448)|  |
| Co-interactive Transformer | 95.90 | 98.80 | [A Co-Interactive Transformer for Joint Slot Filling and Intent Detection](https://arxiv.org/abs/2010.03880)|[Official](https://github.com/kangbrilliant/DCA-Net)|
| Stack-Propagation | 94.20 | 98.00 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| SlotRefine | 93.72 | 97.44  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| Context Encoder | 93.60 | NA | [Improving Slot Filling by Utilizing Contextual Information](https://arxiv.org/pdf/1911.01680.pdf) |
| SF-ID (BLSTM) network | 92.23 | 97.43 | [A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling](https://arxiv.org/abs/1907.00390) | [Official](https://github.com/ZephyrChenzf/SF-ID-Network-For-NLU) |
| Capsule-NLU | 91.80 | 97.70 | [Joint Slot Filling and Intent Detection via Capsule Neural Networks](https://arxiv.org/abs/1812.09471) | [Official](https://github.com/czhang99/Capsule-NLU) |
| Slot-Gated BLSTM with Attention | 88.80 | 97.00 | [Slot-Gated Modeling for Joint Slot Filling and Intent Prediction](https://www.csie.ntu.edu.tw/~yvchen/doc/NAACL18_SlotGated.pdf)| [Official](https://github.com/MiuLab/SlotGated-SLU) |


================================================
FILE: english/keyphrase_extraction_generation.md
================================================
# Keyphrase Extraction and Generation

Keyphrase extraction is the NLP task of identifying **key** phrases in a document, and has a wide range of applications such as information retrieval, question answering, and text summarization. There are two kinds of keyphrases. Some occur directly in the document and are termed **present** keyphrases in the literature. Others don't occur in the document but can still function as appropriate summaries/tags for it, and are termed **absent** keyphrases. Traditionally, NLP research addressed extracting the **present** keyphrases, while post-deep learning approaches also consider **absent** keyphrases. Thus, while Keyphrase Extraction (KPE) can be framed as a "sequence labeling" problem, Keyphrase Generation (KPG) is treated as a "sequence to sequence" generation problem. Another dominant approach is to treat both together as a single generation problem.

Two recent surveys summarize research on this topic:
1. "A Survey on Recent Advances in Keyphrase Extraction from Pre-trained Language Models". [Song et.al., 2023](https://aclanthology.org/2023.findings-eacl.161/). EACL 2023. 
2. "From statistical methods to deep learning, automatic keyphrase prediction: A survey". [Xie et.al., 2023](https://www.sciencedirect.com/science/article/pii/S030645732300119X). Information Processing and Management 60(4). 

### Standard Datasets and Evaluation Measures

There are several open datasets for this task, and they generally consist of text instances, each followed by a list of assigned keyphrases. Keyphrases are either manually annotated or extracted automatically from pre-tagged web content in the training data. Keyphrases can be either *present* or *absent* in the text itself.
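
As a rough illustration of the present/absent distinction, the sketch below splits a list of gold keyphrases by checking whether each phrase occurs as a contiguous token sequence in the text. The function names `tokenize` and `split_present_absent`, the example document, and the simple lowercase whitespace tokenization are assumptions made for this example; published work typically applies stemming before matching, which is omitted here.

```python
def tokenize(text):
    """Lowercase and split on whitespace, stripping surrounding punctuation (a simplification)."""
    return [tok.strip(".,;:!?()\"'") for tok in text.lower().split()]


def split_present_absent(document, keyphrases):
    """Return (present, absent) keyphrases for one document.

    A keyphrase counts as present when its tokens appear as a
    contiguous sub-sequence of the document tokens.
    """
    doc_tokens = tokenize(document)
    present, absent = [], []
    for phrase in keyphrases:
        phrase_tokens = tokenize(phrase)
        n = len(phrase_tokens)
        found = n > 0 and any(
            doc_tokens[i:i + n] == phrase_tokens
            for i in range(len(doc_tokens) - n + 1)
        )
        (present if found else absent).append(phrase)
    return present, absent


# Hypothetical example instance: one text plus its assigned keyphrases.
document = "We propose a neural model for keyphrase generation from scientific abstracts."
keyphrases = ["keyphrase generation", "neural model", "sequence to sequence"]
present, absent = split_present_absent(document, keyphrases)
print("present:", present)
print("absent:", absent)
```

Matching on token sequences rather than raw substrings avoids counting a phrase as present when it only appears inside a longer word.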

#### **Commonly used Datasets**

#### KP20K
This dataset was first described in [Meng et.al., 2017](https://aclanthology.org/P17-1054/) and contains the titles, abstracts, and keyphrases of 20,000 scientific articles in computer science, extracted automatically. It can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/kp20k).

#### Inspec
The dataset consists of 2000 English scientific abstracts from the [Inspec](https://en.wikipedia.org/wiki/Inspec) database, with keyphrases annotated by professional indexers. The dataset is described in [Hulth, 2003](https://aclanthology.org/W03-1028/) and can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/inspec). 

#### Krapivin
Krapivin consists of 2000 English scientific articles (full text) from the computer science domain, with keyphrases annotated by the authors and verified by the reviewers. The dataset is described in [Krapivin et.al., 2010](https://link.springer.com/chapter/10.1007/978-3-642-13654-2_12) and can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/krapivin).

#### NUS
NUS consists of about 200 English scientific publications (full text), with keyphrases annotated by the authors, as well as an independent set of annotators. The dataset is described in [Nguyen and Kan, 2007](https://link.springer.com/chapter/10.1007/978-3-540-77094-7_41) and can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/nus). 

#### SemEval
The SemEval dataset was originally used in [SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications](https://aclanthology.org/S17-2091/), and consists of 500 English open-access scientific publications from ScienceDirect. Keyphrases were annotated by a set of student volunteers, followed by a second annotation by an expert annotator. It can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/semeval2017).

#### Other Datasets

#### DUC
This dataset [Wan and Xiao, 2008](https://dl.acm.org/doi/10.5555/1620163.1620205) consists of around 300 English news articles with their keyphrases, and is hosted on [Huggingface hub](https://huggingface.co/datasets/midas/duc2001).

#### KPTimes
KPTimes [Gallina et.al., 2019](https://aclanthology.org/W19-8617/) is a large dataset of 279,923 news articles from NYTimes and 10,000 articles from JPTimes, with curated keyphrase annotations by editors, and is hosted on [Huggingface hub](https://huggingface.co/datasets/midas/kptimes).

#### OpenKP
OpenKP [Xiong et.al., 2019](https://aclanthology.org/D19-1521/) consists of approximately 150K web documents with manually annotated keyphrases, and is hosted on [Huggingface hub](https://huggingface.co/datasets/midas/openkp). 

**Evaluation Measures**

Macro Precision/Recall/F1 scores are calculated for the top-k matches when comparing the ground-truth keyphrases with the model output. While F1\@k with k = 5 or 10 is commonly reported, variants such as F1\@O and F1\@M are also reported. F1\@O uses the number of gold keyphrases as k, and F1\@M uses the number of predicted keyphrases as k. For "absent" keyphrases, some papers also report R\@10/50. The following tables will rank the models in terms of F1\@5 for the five most commonly reported datasets: KP20K, Inspec, Krapivin, NUS, and SemEval. Most recent research reports experiments using KP20K as training data and testing on KP20K, NUS, SemEval, Inspec, and Krapivin.
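
The three F1 variants differ only in how k is chosen, which the following minimal sketch tries to make concrete for a single document. The function names and the example phrases are made up for illustration. The sketch assumes exact string matching on deduplicated phrases and computes precision over the predictions actually kept; some papers instead divide by k or match on stemmed phrases, so numbers from this sketch are not directly comparable to published scores.

```python
def f1_at_k(predicted, gold, k):
    """F1 over the top-k predictions for a single document.

    Assumes `predicted` is ranked and contains no duplicates.
    Precision is computed over the predictions actually kept
    (an assumption of this sketch; some papers divide by k).
    """
    top_k = predicted[:k]
    gold_set = set(gold)
    correct = sum(1 for phrase in top_k if phrase in gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(top_k)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def f1_at_o(predicted, gold):
    """F1@O: k is the number of gold keyphrases."""
    return f1_at_k(predicted, gold, k=len(set(gold)))


def f1_at_m(predicted, gold):
    """F1@M: k is the number of predicted keyphrases."""
    return f1_at_k(predicted, gold, k=len(predicted))


# Hypothetical ranked model output and gold annotations for one document.
predicted = [
    "neural network", "keyphrase generation", "deep learning",
    "attention", "copy mechanism", "beam search",
]
gold = ["keyphrase generation", "copy mechanism", "sequence to sequence"]

print("F1@5:", f1_at_k(predicted, gold, k=5))
print("F1@O:", f1_at_o(predicted, gold))
print("F1@M:", f1_at_m(predicted, gold))
```

The same functions can be applied separately to the present and absent subsets of the gold keyphrases to obtain the two columns reported in the tables.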

Here are a few notes on results:
 - Asterisk indicates the paper reported **Micro** scores, instead of Macro.  
 - Exclamation indicates the paper does not mention whether they report a macro or a micro measure.  
 - All results are taken from the paper that describes the model.    
 - Some papers report on a scale of 0-1 and some on 0-100 (sometimes, the same paper uses different scales in different tables!). All results below are changed to 0-1 to maintain uniformity. 
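
To make the macro/micro distinction in the notes above concrete, here is a small sketch under stated assumptions: each document is summarized by a hypothetical `(correct, n_predicted, n_gold)` tuple, macro averaging takes the mean of per-document F1 scores, and micro averaging pools the counts before computing a single F1. The helper `rescale` shows the kind of conversion used to bring scores reported on a 0-100 scale to 0-1. All names and counts are illustrative.

```python
def f1_from_counts(correct, n_predicted, n_gold):
    """F1 from raw counts, returning 0.0 when it is undefined."""
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Mean of per-document F1 scores; every document has equal weight."""
    return sum(f1_from_counts(*c) for c in counts) / len(counts)


def micro_f1(counts):
    """F1 over pooled counts; documents with more keyphrases weigh more."""
    correct = sum(c[0] for c in counts)
    n_predicted = sum(c[1] for c in counts)
    n_gold = sum(c[2] for c in counts)
    return f1_from_counts(correct, n_predicted, n_gold)


def rescale(score, reported_max):
    """Convert a score reported on a 0..reported_max scale to 0..1."""
    return score / reported_max


# Two hypothetical documents: (correct, n_predicted, n_gold).
counts = [(2, 5, 2), (1, 5, 10)]
print("macro:", round(macro_f1(counts), 3))
print("micro:", round(micro_f1(counts), 3))
print("rescaled:", rescale(35.1, 100))
```

The two averages can differ noticeably when documents have very different numbers of gold keyphrases, which is why the tables flag micro and unspecified scores separately.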

#### KP20K 

| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|0.232 (!) |0.044 (!) |[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.351(!) | 0.032(!)  | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) |0.370 | 0.050 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.363 | 0.067 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |  
|SetTrans (Ye et.al., 2021) | 0.358| 0.036| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| 0.408 (!) | 0.047 (!)| [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) |0.311 |0.016 |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) |- |- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.333 | -| [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |  


#### SemEval
| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|-|- |[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.329 (!)| 0.028 (!) | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) |0.360 | 0.043 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.343 | 0.053 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |
|SetTrans (Ye et.al., 2021) | 0.331| 0.026| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| 0.416 (!) | 0.030 (!)| [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) |
|ExHiRD-h (Chen et.al., 2020) |0.284 | 0.017|[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) |0.320 | - | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.293 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

#### Inspec

| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|0.352 (!)|0.049 (!)|[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.26 (!) | 0.017(!) | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) | 0.330 | 0.025 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.322 | 0.036 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |  
|SetTrans (Ye et.al., 2021) | 0.285| 0.021| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|  
| UniKeyphrase (Wu et.al., 2021)| 0.29 (!) | 0.029 (!)| [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) |0.253 |0.011 |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) | - |- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.278 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

#### Krapivin
| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
| P-AKG (Wu et.al., 2022) | - |- | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|ChatGPT (Martinez et.al., 2023)|-|-|[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
|WR-SetTrans (Xie et.al., 2022) | 0.360|0.057 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.323 | 0.078 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) | 
|SetTrans (Ye et.al., 2021) | 0.326| 0.047| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| - | - | [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) | 0.286 | 0.022 |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) | 0.318|- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.311 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

#### NUS 
| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|-|-|[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.412 (!)| 0.036(!) | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) |0.428 |0.057 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.418 | 0.079 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |
|SetTrans (Ye et.al., 2021) | 0.406| 0.042| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| 0.434 (!) | 0.037 (!) | [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) | - | - |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) |0.358 |- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)| 
|CopyRNN (Meng et.al., 2017) |0.334 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

[Go back to the README](../README.md)


================================================
FILE: english/language_modeling.md
================================================
# Language modeling

Language modeling is the task of predicting the next word or character in a document.

\* indicates models using dynamic evaluation, where, at test time, models may adapt to seen tokens in order to improve performance on following tokens ([Mikolov et al., (2010)](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf), [Krause et al., (2017)](https://arxiv.org/pdf/1709.07432)).
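
The idea of adapting to already-seen test tokens can be illustrated with a deliberately simple stand-in. The sketch below is not the gradient-based method of the cited papers: it swaps the neural model for an add-one-smoothed unigram model whose counts are updated after each test token has been scored. The function name `evaluate`, the toy vocabulary, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def evaluate(train_tokens, test_tokens, vocab, adapt):
    """Return the perplexity of an add-one-smoothed unigram model on test_tokens.

    When adapt is True, each test token is added to the counts after it
    has been scored, so later predictions benefit from earlier test tokens.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)
    log_prob_sum = 0.0
    for token in test_tokens:
        prob = (counts[token] + 1) / (total + len(vocab))
        log_prob_sum += math.log(prob)
        if adapt:
            # Update only after scoring, so the model never sees a
            # token before it has to predict it.
            counts[token] += 1
            total += 1
    return math.exp(-log_prob_sum / len(test_tokens))


# Toy data: the test text uses words that are rare in training.
vocab = {"the", "cat", "dog", "sat", "ran"}
train = ["the", "cat", "sat", "the", "cat", "sat"]
test = ["the", "dog", "ran", "the", "dog", "ran", "the", "dog", "ran"]

static_ppl = evaluate(train, test, vocab, adapt=False)
dynamic_ppl = evaluate(train, test, vocab, adapt=True)
print(f"static: {static_ppl:.2f}, dynamic: {dynamic_ppl:.2f}")
```

On this toy data the adaptive variant reaches a lower perplexity because the test text repeats words that were rare in training. This is also why dynamic-evaluation results are marked separately in the tables.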

## Word Level Models

### Penn Treebank

A common evaluation dataset for language modeling is the Penn Treebank,
as pre-processed by [Mikolov et al., (2011)](https://www.isca-speech.org/archive/archive_papers/interspeech_2011/i11_0605.pdf).
The dataset consists of 929k training words, 73k validation words, and
82k test words. As part of the pre-processing, words were lower-cased, numbers
were replaced with N, newlines were replaced with `<eos>`,
and all other punctuation was removed. The vocabulary is
the most frequent 10k words with the rest of the tokens replaced by an `<unk>` token.
Models are evaluated based on perplexity, which is the exponentiated average
negative per-word log-probability (lower is better).
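
For reference, perplexity can be computed from the probabilities a model assigns to each test token by exponentiating the average negative log-probability. The sketch below assumes those per-token probabilities are already available as a list; the function name `perplexity` and the example values are illustrative.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each test token."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# A model that spreads probability uniformly over a 10k-word vocabulary
# has a perplexity of about 10,000, the vocabulary size.
uniform = [1 / 10_000] * 5
print(perplexity(uniform))

# A model that is more confident on the observed tokens scores lower.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

Intuitively, a perplexity of P means the model is on average as uncertain as if it were choosing uniformly among P words.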

| Model           | Validation perplexity | Test perplexity | Number of params |  Paper / Source | Code |
| ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)            | 42.9  | 42.9  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)      | 44.9  | 44.8  | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 46.63 | 46.01 | 22M | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 47.38 | 46.54 | 22M | [FRAGE: Frequency-Agnostic Word Representation](https://arxiv.org/abs/1809.06858) | [Official](https://github.com/ChengyueGongR/Frequency-Agnostic) |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 48.63 | 47.17 | 185M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | 22M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| Mogrifier RLSTM (Melis, 2022)                           | 48.9  | 47.9  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)                     | 51.4  | 50.1  | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | 24M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) ***preprint*** | 53.79 | 52.00 | 23M | [Partially Shuffling the Training Data to Improve Language Models](https://arxiv.org/abs/1903.04167) | [Official](https://github.com/ofirpress/PartialShuffle) |
| AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | 23M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | 24M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| Trellis Network (Bai et al., 2019) |   -   | 54.19 | 34M | [Trellis Networks for Sequence Modeling](https://openreview.net/pdf?id=HyeVtoRqtQ) | [Official](https://github.com/locuslab/trellisnet) |
| AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 56.44 | 54.33 | 22M | [Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes](https://arxiv.org/abs/1909.08700) | [Official](https://github.com/nkcr/overlap-ml) |
| AWD-LSTM-MoS + finetune (Yang et al., 2018) | 56.54 | 54.44 | 22M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| Transformer-XL (Dai et al., 2018) ***under review*** | 56.72 | 54.52 | 24M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| AWD-LSTM-MoS (Yang et al., 2018) | 58.08 | 55.97 | 22M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) |  58.9 | 56.8 | 24M | [Fraternal dropout](https://arxiv.org/pdf/1711.00066.pdf) | [Official](https://github.com/kondiz/fraternal-dropout) |
| AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | 24M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |

### WikiText-2

[WikiText-2](https://arxiv.org/abs/1609.07843) has been proposed as a more realistic
benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2
consists of around 2 million words extracted from Wikipedia articles.

| Model           | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)            | 39.3  | 38.0  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)      | 40.2  | 38.6  | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 40.27 | 38.65 | 35M | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 40.85 | 39.14 | 35M | [FRAGE: Frequency-Agnostic Word Representation](https://arxiv.org/abs/1809.06858) | [Official](https://github.com/ChengyueGongR/Frequency-Agnostic) |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | 35M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | 33M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | 33M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 54.19 | 53.09 | 185M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| Mogrifier RLSTM (Melis, 2022)                           | 56.7  | 55.0  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)                     | 57.3  | 55.1  | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) ***preprint*** | 60.16 | 57.85 | 37M | [Partially Shuffling the Training Data to Improve Language Models](https://arxiv.org/abs/1903.04167) | [Official](https://github.com/ofirpress/PartialShuffle) |
| AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | 37M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | 35M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) |  66.8 | 64.1 | 34M | [Fraternal dropout](https://arxiv.org/pdf/1711.00066.pdf) | [Official](https://github.com/kondiz/fraternal-dropout) |
| AWD-LSTM + ATOI (Kocher et al., 2019) | 67.47 | 64.73 | 33M | [Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes](https://arxiv.org/abs/1909.08700) | [Official](https://github.com/nkcr/overlap-ml) |
| AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | 33M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |

### WikiText-103

The [WikiText-103](https://arxiv.org/abs/1609.07843) corpus contains 267,735 unique words, and each word occurs at least three times in the training set.

| Model           | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| ------------- | :---:| :---:| :---:| -------- | --- |
| Routing Transformer (Roy et al., 2020)* ***arxiv preprint*** | - | 15.8 | - | [Efficient Content-Based Sparse Attention with Routing Transformers](https://arxiv.org/pdf/2003.05997.pdf) | - |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* ***arxiv preprint*** | 15.8 | 16.4 | 257M | [Dynamic Evaluation of Transformer Language Models](https://arxiv.org/pdf/1904.08378.pdf) | [Official](https://github.com/benkrause/dynamiceval-transformer) |
| Compressive Transformer (Rae et al., 2019)* ***arxiv preprint*** | 16.0 | 17.1(16.1 with basic dynamic evaluation) | ~257M | [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/pdf/1911.05507.pdf) | - |
| SegaTransformer-XL (Bai et al., 2020) | - | 17.1 | 257M | [Segatron: Segment-Aware Transformer for Language Modeling and Understanding](https://arxiv.org/abs/2004.14996) | [Official](https://github.com/rsvp-ai/segatron_aaai) |
| Transformer-XL Large (Dai et al., 2018) ***under review*** | 17.7 | 18.3 | 257M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Transformer with tied adaptive embeddings (Baevski and Auli, 2018) | 19.8 | 20.5 | 247M | [Adaptive Input Representations for Neural Language Modeling](https://arxiv.org/pdf/1809.10853.pdf) | [Link](https://github.com/AranKomat/adapinp) |
| TaLK Convolutions (Lioutas et al., 2020)| - | 23.3 | 240M | [Time-aware Large Kernel Convolutions](https://arxiv.org/abs/2002.03184) | [Official](https://github.com/lioutasb/TaLKConvolutions) |
| Transformer-XL Standard (Dai et al., 2018) ***under review*** | 23.1 | 24.0 | 151M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| AdvSoft + 4 layer QRNN + dynamic eval (Wang et al., 2019) | 27.2 | 28.0 |  | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
| LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | | [Fast Parametric Learning with Activation Memorization](http://arxiv.org/abs/1803.10049) ||
| Trellis Network (Bai et al., 2019) |   -   | 30.35 | 180M | [Trellis Networks for Sequence Modeling](https://openreview.net/pdf?id=HyeVtoRqtQ) | [Official](https://github.com/locuslab/trellisnet) |
| AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 31.92 | 32.85 | | [Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes](https://arxiv.org/abs/1909.08700) | [Official](https://github.com/nkcr/overlap-ml) |
| LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | | [Fast Parametric Learning with Activation Memorization](http://arxiv.org/abs/1803.10049) ||
| LSTM (Rae et al., 2018) | 36.0 | 36.4 | | [Fast Parametric Learning with Activation Memorization](http://arxiv.org/abs/1803.10049) ||
| Gated CNN (Dauphin et al., 2016) | - | 37.2 | | [Language modeling with gated convolutional networks](https://arxiv.org/abs/1612.08083) ||
| Neural cache model (size = 2,000) (Grave et al., 2017) | - | 40.8 | | [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/pdf/1612.04426.pdf) | [Link](https://github.com/kaishengtai/torch-ntm) |
| Temporal CNN (Bai et al., 2018) | - | 45.2 | | [Convolutional sequence modeling revisited](https://openreview.net/forum?id=BJEX-H1Pf) ||
| LSTM (Grave et al., 2017) | - | 48.7 | | [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/pdf/1612.04426.pdf) | [Link](https://github.com/kaishengtai/torch-ntm) |

### 1B Words / Google Billion Word benchmark

[The One-Billion Word benchmark](https://arxiv.org/pdf/1312.3005.pdf) is a large dataset derived from a news-commentary site.
The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words.
Importantly, sentences in this dataset are shuffled and hence context is limited.

| Model         | Test perplexity | Number of params | Paper / Source | Code |
| ------------- | :-----:| :-----:| --------- | --- |
| Transformer-XL Large (Dai et al., 2018) ***under review*** | 21.8 | 0.8B | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Transformer-XL Base (Dai et al., 2018) ***under review*** | 23.5 | 0.46B | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Transformer with shared adaptive embeddings - Very large (Baevski and Auli, 2018) | 23.7 | 0.8B | [Adaptive Input Representations for Neural Language Modeling](https://arxiv.org/pdf/1809.10853.pdf) | [Link](https://github.com/AranKomat/adapinp) |
| 10 LSTM+CNN inputs + SNM10-SKIP (Jozefowicz et al., 2016) ***ensemble*** | 23.7 | 43B? | [Exploring the Limits of Language Modeling](https://arxiv.org/pdf/1602.02410.pdf) | [Official](https://github.com/rafaljozefowicz/lm) |
| Transformer with shared adaptive embeddings (Baevski and Auli, 2018) | 24.1 | 0.46B | [Adaptive Input Representations for Neural Language Modeling](https://arxiv.org/pdf/1809.10853.pdf) | [Link](https://github.com/AranKomat/adapinp) |
| Big LSTM+CNN inputs (Jozefowicz et al., 2016) | 30.0 | 1.04B | [Exploring the Limits of Language Modeling](https://arxiv.org/pdf/1602.02410.pdf) ||
| Gated CNN-14Bottleneck (Dauphin et al., 2017) | 31.9 | ? | [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083.pdf) ||
| BIGLSTM baseline (Kuchaiev and Ginsburg, 2018) | 35.1 | 0.151B | [Factorization tricks for LSTM networks](https://arxiv.org/pdf/1703.10722.pdf) | [Official](https://github.com/okuchaiev/f-lm) |
| BIG F-LSTM F512 (Kuchaiev and Ginsburg, 2018) | 36.3 | 0.052B | [Factorization tricks for LSTM networks](https://arxiv.org/pdf/1703.10722.pdf) | [Official](https://github.com/okuchaiev/f-lm) |
| BIG G-LSTM G-8 (Kuchaiev and Ginsburg, 2018) | 39.4 | 0.035B | [Factorization tricks for LSTM networks](https://arxiv.org/pdf/1703.10722.pdf) | [Official](https://github.com/okuchaiev/f-lm) |


## Character Level Models

### Hutter Prize

[The Hutter Prize](http://prize.hutter1.net) Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the
first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset.
Within these 100 million bytes are 205 unique tokens.
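
The metric in the table below is bits per character (BPC): the average negative base-2 log-probability the model assigns to each character. The sketch below computes it and, as a rough reading aid, converts a BPC value into the compressed size it would imply for 100 million bytes. The helper names are illustrative assumptions, and the size estimate ignores the cost of storing the model itself.

```python
import math
from typing import Sequence


def bits_per_character(char_probs: Sequence[float]) -> float:
    """Average negative log2-probability per character (or byte)."""
    if not char_probs:
        raise ValueError("need at least one probability")
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


def implied_compressed_bytes(bpc: float, num_chars: int) -> float:
    """Size in bytes if every character cost `bpc` bits on average."""
    return bpc * num_chars / 8


# A model that assigns probability 1/2 to every character spends
# exactly one bit per character.
half_bpc = bits_per_character([0.5, 0.5, 0.5, 0.5])

# Reading a table entry of 0.935 BPC against 100 million bytes.
approx_size = implied_compressed_bytes(0.935, 100_000_000)
```

Under these assumptions, 0.935 BPC corresponds to roughly 11.7 million bytes for the 100-million-byte file.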

| Model           | Bits per Character (BPC) |  Number of params | Paper / Source | Code |
| ---------------- | :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)                                  | 0.935 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* ***arxiv preprint*** | 0.94 | 277M | [Dynamic Evaluation of Transformer Language Models](https://arxiv.org/pdf/1904.08378.pdf) | [Official](https://github.com/benkrause/dynamiceval-transformer) |
| Compressive Transformer (Rae et al., 2019) ***arxiv preprint*** | 0.97 | - | [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/pdf/1911.05507.pdf) | - |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)            | 0.988 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| 24-layer Transformer-XL (Dai et al., 2018) ***under review*** | 0.99 | 277M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Longformer Large (Beltagy, Peters, and Cohan; 2020) | 0.99 | 102M | [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) | [Official](https://github.com/allenai/longformer) |
| Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.00 | 41M | [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) | [Official](https://github.com/allenai/longformer) |
| 18-layer Transformer-XL (Dai et al., 2018) ***under review*** | 1.03 | 88M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Mogrifier RLSTM (Melis, 2022)                                 | 1.042 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| 12-layer Transformer-XL (Dai et al., 2018) ***under review*** | 1.06 | 41M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.11 | 44M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| Mogrifier LSTM (Melis et al., 2019)            | 1.122 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| 3-layer AWD-LSTM (Merity et al., 2018)  | 1.232 | 47M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.24 | 46M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) | [Official](https://github.com/benkrause/mLSTM) |
| Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) | [Official](https://github.com/amujika/Fast-Slow-LSTM) |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | [Recurrent Highway Networks](https://arxiv.org/abs/1607.03474) | [Official](https://github.com/jzilly/RecurrentHighwayNetworks) |
| FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) | [Official](https://github.com/amujika/Fast-Slow-LSTM) |

### Text8
[The text8 dataset](http://mattmahoney.net/dc/textdata.html) is also derived from Wikipedia text, but has all XML removed and is lowercased, leaving only the 26 characters of English text plus spaces.
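
To make the resulting alphabet concrete, here is a simplified sketch of this kind of normalization: it lowercases the input and collapses every run of characters outside `a-z` into a single space. The function name `to_text8_alphabet` is an illustrative assumption; the real dataset was produced by its own preprocessing script, which may treat cases such as digits differently.

```python
import re


def to_text8_alphabet(text: str) -> str:
    """Lowercase and keep only a-z, separating words by single spaces."""
    lowered = text.lower()
    # Every run of characters outside a-z becomes one space.
    return re.sub(r"[^a-z]+", " ", lowered).strip()


sample = to_text8_alphabet("Hello, World! It's 2024.")
```

The output uses at most 27 distinct symbols, which is why BPC values on text8 are not directly comparable with those on the byte-level Hutter Prize data.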

| Model           | Bits per Character (BPC) |  Number of params | Paper / Source | Code |
| ---------------- | :-----: | :-----: | -------------- | ---- |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* ***arxiv preprint*** | 1.038 | 277M | [Dynamic Evaluation of Transformer Language Models](https://arxiv.org/pdf/1904.08378.pdf) | [Official](https://github.com/benkrause/dynamiceval-transformer) |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)               | 1.044 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Transformer-XL Large (Dai et al., 2018) ***under review*** | 1.08 | 277M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Mogrifier RLSTM (Melis, 2022)                              | 1.096 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.10 | 41M | [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) | [Official](https://github.com/allenai/longformer) |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.13 | 235M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.18 | 44M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) | [Official](https://github.com/benkrause/mLSTM) |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | [Recurrent Highway Networks](https://arxiv.org/abs/1607.03474) | [Official](https://github.com/jzilly/RecurrentHighwayNetworks) |
| LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 |  35M | [Hierarchical Multiscale Recurrent Neural Networks](https://arxiv.org/abs/1609.01704) ||
| BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | [Recurrent Batch Normalization](https://arxiv.org/abs/1603.09025) | [Official](https://github.com/cooijmanstim/recurrent-batch-normalization) |
| Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) | [Official](https://github.com/benkrause/mLSTM) |

### Penn Treebank
The vocabulary of the words in the character-level dataset is limited to 10,000 words, the same vocabulary as used in the word-level dataset. This vastly simplifies the task of character-level language modeling, as character transitions are limited to those found within the restricted word-level vocabulary.
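
The effect of the vocabulary limit can be sketched as follows: any word outside the vocabulary is replaced by a placeholder before the text is split into characters, so the model only ever sees character sequences from in-vocabulary words. The `<unk>` placeholder, the `_` word separator, and the three-word toy vocabulary are assumptions made for illustration.

```python
from typing import Iterable, List, Set


def words_to_chars(
    words: Iterable[str],
    vocab: Set[str],
    unk: str = "<unk>",  # assumed placeholder for out-of-vocabulary words
    sep: str = "_",      # assumed symbol marking word boundaries
) -> List[str]:
    """Map words to a character stream after applying the vocabulary."""
    chars: List[str] = []
    for word in words:
        token = word if word in vocab else unk
        chars.extend(token)  # one list element per character
        chars.append(sep)
    return chars


toy_vocab = {"the", "cat", "sat"}  # stands in for the 10,000-word vocabulary
stream = words_to_chars(["the", "zebra", "sat"], toy_vocab)
```

Here the characters of `zebra` never reach the model, because the word falls outside the toy vocabulary and is replaced by the placeholder.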

| Model           | Bit per Character (BPC) |  Number of params | Paper / Source | Code |
| ---------------- | :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)      | 1.061 | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.083 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier RLSTM (Melis, 2022)                     | 1.096 | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)               | 1.120 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| Trellis Network (Bai et al., 2019) | 1.159 | 13.4M | [Trellis

SYMBOL INDEX (13 symbols across 1 file)

FILE: structured/export.py
  function extract_dataset_desc_links (line 10) | def extract_dataset_desc_links(desc:List[str]) -> List:
  function sanitize_subdataset_name (line 33) | def sanitize_subdataset_name(name:str):
  function extract_lines_before_tables (line 48) | def extract_lines_before_tables(lines:List[str]):
  function handle_multiple_sota_table_exceptions (line 76) | def handle_multiple_sota_table_exceptions(section:List[str], sota_tables...
  function extract_title_and_link (line 111) | def extract_title_and_link(md_link:str) -> Tuple:
  function extract_model_name_and_author (line 124) | def extract_model_name_and_author(md_name:str) -> Tuple:
  function extract_paper_title_and_link (line 145) | def extract_paper_title_and_link(paper_md:str) -> Tuple:
  function extract_code_links (line 166) | def extract_code_links(code_md:str) -> List[Dict]:
  function extract_sota_table (line 187) | def extract_sota_table(table_lines:List[str]) -> Dict:
  function get_line_no (line 272) | def get_line_no(sections:List[str], section_index:int, section_line=0) -...
  function extract_dataset_desc_and_sota_table (line 287) | def extract_dataset_desc_and_sota_table(md_lines:List[str]) -> Tuple:
  function parse_markdown_file (line 321) | def parse_markdown_file(md_file:str) -> List:
  function parse_markdown_directory (line 447) | def parse_markdown_directory(path:str):
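
The index above lists signatures only. As a rough illustration of the kind of parsing such helpers perform on the leaderboard tables, the sketch below splits one Markdown table row into cells and pulls the title and URL out of a link cell. `split_table_row` and `parse_md_link` are hypothetical names written for this example; they are not the functions in `structured/export.py`, whose bodies are not shown here.

```python
import re
from typing import List, Optional, Tuple

# Matches the first [title](url) pair in a string.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def split_table_row(row: str) -> List[str]:
    """Split a Markdown table row into stripped cell strings."""
    row = row.strip()
    # Remove exactly one outer pipe on each side so that an empty
    # trailing cell (a row ending in "||") is preserved.
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|"):
        row = row[:-1]
    return [cell.strip() for cell in row.split("|")]


def parse_md_link(cell: str) -> Optional[Tuple[str, str]]:
    """Return (title, url) for the first Markdown link, or None."""
    match = MD_LINK.search(cell)
    if match is None:
        return None
    return match.group(1), match.group(2)


row = (
    "| Large RHN (Zilly et al., 2016) | 1.27 | 46M "
    "| [Recurrent Highway Networks](https://arxiv.org/abs/1607.03474) "
    "| [Official](https://github.com/jzilly/RecurrentHighwayNetworks) |"
)
cells = split_table_row(row)
paper = parse_md_link(cells[3])
```

Splitting on `|` is only safe when no cell contains a literal pipe; a row that does would need a real Markdown parser.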

Condensed preview: 78 files, each showing path, character count, and a content snippet.

[
  {
    "path": ".gitignore",
    "chars": 43,
    "preview": "_site/\nGemfile*\nvenv\n.idea\nstructured.json\n"
  },
  {
    "path": "CITATION.cff",
    "chars": 261,
    "preview": "cff-version: 1.2.0\nmessage: \"If you use this software, please cite it as below.\"\nauthors:\n- family-names: \"Ruder\"\n  give"
  },
  {
    "path": "CNAME",
    "chars": 15,
    "preview": "nlpprogress.com"
  },
  {
    "path": "LICENSE",
    "chars": 1072,
    "preview": "MIT License\n\nCopyright (c) 2018 Sebastian Ruder\n\nPermission is hereby granted, free of charge, to any person obtaining a"
  },
  {
    "path": "README.md",
    "chars": 8883,
    "preview": "# Tracking Progress in Natural Language Processing\n\n## Table of contents\n\n### English\n\n- [Automatic speech recognition]("
  },
  {
    "path": "_config.yml",
    "chars": 25,
    "preview": "theme: jekyll-theme-slate"
  },
  {
    "path": "_includes/chart.html",
    "chars": 506,
    "preview": "<style>\n\n.chart div {\n  font: 18px sans-serif;\n  background-color: steelblue;\n  padding: 6px;\n  margin: 2px;\n  color: wh"
  },
  {
    "path": "_includes/table.html",
    "chars": 792,
    "preview": "{% assign scores = include.scores | split: \",\" %}\n\n<table>\n  <thead>\n    <tr>\n      <th>Model</th>\n      {% for score in"
  },
  {
    "path": "arabic/language_modeling.md",
    "chars": 670,
    "preview": "# Language modeling\n\nLanguage modeling is the task of predicting the next word or character in a document.\n\n\n| Model    "
  },
  {
    "path": "bengali/emotion_detection.md",
    "chars": 851,
    "preview": "# Fine-grained Emotion Detection\n\nFine-grained Emotion Detection is the task of detecting one or multiple emotion of a g"
  },
  {
    "path": "bengali/part_of_speech_tagging.md",
    "chars": 936,
    "preview": "\n# Part-of-speech Tagging\nPart-of-speech tagging (POS tagging) is the task of tagging a word in a text with its part of "
  },
  {
    "path": "bengali/question_answering.md",
    "chars": 1344,
    "preview": "# Question answering\n\nQuestion answering is the task of answering a question.\n\n### Table of contents\n\n- [Reading compreh"
  },
  {
    "path": "bengali/sentiment_analysis.md",
    "chars": 805,
    "preview": "# Sentiment analysis\n\nSentiment Analysis is the task of classifying polarity of a given text.\n\n## SentNoB\n\n[SentNoB: A D"
  },
  {
    "path": "chinese/chinese.md",
    "chars": 611,
    "preview": "# Chinese NLP tasks\n\n## Entity linking\n\nSee [here](../english/entity_linking.md) for more information about the task.\n\n#"
  },
  {
    "path": "chinese/chinese_word_segmentation.md",
    "chars": 9144,
    "preview": "# Chinese Word Segmentation\n\n## Task\nChinese word segmentation is the task of\nsplitting Chinese text (a sequence of Chin"
  },
  {
    "path": "chinese/question_answering.md",
    "chars": 2634,
    "preview": "# Question answering\n\nQuestion answering is the task of answering a question.\n\n### Table of contents\n\n- [Reading compreh"
  },
  {
    "path": "english/automatic_speech_recognition.md",
    "chars": 217,
    "preview": "# Automatic speech recognition (ASR)\n\nAutomatic speech recognition is the task of automatically recognizing speech. You "
  },
  {
    "path": "english/ccg.md",
    "chars": 8834,
    "preview": "# Combinatory Categorical Grammar\n\nCombinatory Categorical Grammar (CCG; [Steedman, 2000](http://www.citeulike.org/group"
  },
  {
    "path": "english/common_sense.md",
    "chars": 7499,
    "preview": "# Common sense\n\nCommon sense reasoning tasks are intended to require the model to go beyond pattern \nrecognition. Instea"
  },
  {
    "path": "english/constituency_parsing.md",
    "chars": 7438,
    "preview": "# Constituency parsing\n\nConstituency parsing aims to extract a constituency-based parse tree from a sentence that\nrepres"
  },
  {
    "path": "english/coreference_resolution.md",
    "chars": 4051,
    "preview": "# Coreference resolution\n\nCoreference resolution is the task of clustering mentions in text that refer to the same under"
  },
  {
    "path": "english/data_to_text_generation.md",
    "chars": 10158,
    "preview": "# Data-to-Text Generation\n\n**Data-to-Text Generation (D2T NLG)** can be described as Natural Language Generation from st"
  },
  {
    "path": "english/dependency_parsing.md",
    "chars": 16206,
    "preview": "# Dependency parsing\n\nDependency parsing is the task of extracting a dependency parse of a sentence that represents its "
  },
  {
    "path": "english/dialogue.md",
    "chars": 26313,
    "preview": "# Dialogue\n\nDialogue is notoriously hard to evaluate. Past approaches have used human evaluation.\n\n## Dialogue act class"
  },
  {
    "path": "english/domain_adaptation.md",
    "chars": 2118,
    "preview": "# Domain adaptation\n\n## Sentiment analysis\n\n### Multi-Domain Sentiment Dataset\n\nThe [Multi-Domain Sentiment Dataset](htt"
  },
  {
    "path": "english/entity_linking.md",
    "chars": 13356,
    "preview": "# Entity Linking\n\n## Task\n\nEntity Linking (EL) is the task of recognizing (cf. [Named Entity Recognition](named_entity_r"
  },
  {
    "path": "english/grammatical_error_correction.md",
    "chars": 21904,
    "preview": "# Grammatical Error Correction\n\nGrammatical Error Correction (GEC) is the task of correcting different kinds of errors i"
  },
  {
    "path": "english/information_extraction.md",
    "chars": 2435,
    "preview": "# Information Extraction\n\n## Open Knowledge Graph Canonicalization\n\nOpen Information Extraction approaches leads to crea"
  },
  {
    "path": "english/intent_detection_slot_filling.md",
    "chars": 8643,
    "preview": "# Intent Detection and Slot Filling\nIntent Detection and Slot Filling is the task of interpreting user commands/queries "
  },
  {
    "path": "english/keyphrase_extraction_generation.md",
    "chars": 13585,
    "preview": "# Keyphrase Extraction and Generation\n\nKeyphrase extraction is the NLP task of identifying **key** phrases in the docume"
  },
  {
    "path": "english/language_modeling.md",
    "chars": 27940,
    "preview": "# Language modeling\n\nLanguage modeling is the task of predicting the next word or character in a document.\n\n\\* indicates"
  },
  {
    "path": "english/lexical_normalization.md",
    "chars": 3453,
    "preview": "# Lexical Normalization\n\nLexical normalization is the task of translating/transforming a non standard text to a standard"
  },
  {
    "path": "english/machine_translation.md",
    "chars": 4458,
    "preview": "# Machine translation\n\nMachine translation is the task of translating a sentence in a source language to a different tar"
  },
  {
    "path": "english/missing_elements.md",
    "chars": 3332,
    "preview": "# Missing Elements\n\nMissing elements are a collection of phenomenon that deals with things that are meant, but not expli"
  },
  {
    "path": "english/multi-task_learning.md",
    "chars": 969,
    "preview": "# Multi-task learning\n\nMulti-task learning aims to learn multiple different tasks simultaneously while maximizing\nperfor"
  },
  {
    "path": "english/multimodal.md",
    "chars": 7659,
    "preview": "# Multimodal\n\n`Multimodal` NLP involves the **combination of different types of information, such as text, speech, image"
  },
  {
    "path": "english/named_entity_recognition.md",
    "chars": 14760,
    "preview": "# Named entity recognition\n\nNamed entity recognition (NER) is the task of tagging entities in text with their correspond"
  },
  {
    "path": "english/natural_language_inference.md",
    "chars": 4399,
    "preview": "# Natural language inference\n\nNatural language inference is the task of determining whether a \"hypothesis\" is \ntrue (ent"
  },
  {
    "path": "english/paraphrase-generation.md",
    "chars": 3377,
    "preview": "# Paraphrase Generation\n[Paraphrase generation](https://arxiv.org/abs/1908.07831) is the task of generating an output se"
  },
  {
    "path": "english/part-of-speech_tagging.md",
    "chars": 5968,
    "preview": "# Part-of-speech tagging\n\nPart-of-speech tagging (POS tagging) is the task of tagging a word in a text with its part of "
  },
  {
    "path": "english/question_answering.md",
    "chars": 35963,
    "preview": "# Question answering\n\nQuestion answering is the task of answering a question.\n\n### Table of contents\n\n- [ARC](#arc)\n- [S"
  },
  {
    "path": "english/relation_prediction.md",
    "chars": 9098,
    "preview": "# Relation Prediction\n\n## Task\n\nRelation Prediction is the task of recognizing a named relation between two named semant"
  },
  {
    "path": "english/relationship_extraction.md",
    "chars": 20380,
    "preview": "# Relationship Extraction\n\nRelationship extraction is the task of extracting semantic relationships from a text. Extract"
  },
  {
    "path": "english/semantic_parsing.md",
    "chars": 37896,
    "preview": "# Semantic parsing\n\n### Table of contents\n\n- [AMR parsing](#amr-parsing)\n  - [LDC2014T12](#ldc2014t12)\n  - [LDC2015E86]("
  },
  {
    "path": "english/semantic_role_labeling.md",
    "chars": 2565,
    "preview": "# Semantic role labeling\n\nSemantic role labeling aims to model the predicate-argument structure of a sentence\nand is oft"
  },
  {
    "path": "english/semantic_textual_similarity.md",
    "chars": 5174,
    "preview": "# Semantic textual similarity\n\nSemantic textual similarity deals with determining how similar two pieces of texts are.\nT"
  },
  {
    "path": "english/sentiment_analysis.md",
    "chars": 20016,
    "preview": "# Sentiment analysis\n\nSentiment analysis is the task of classifying the polarity of a given text.\n\n### IMDb\n\nThe [IMDb d"
  },
  {
    "path": "english/shallow_syntax.md",
    "chars": 3554,
    "preview": "# Shallow syntax\n\nShallow syntactic tasks provide an analysis of a text on the level of the syntactic structure \nof the "
  },
  {
    "path": "english/simplification.md",
    "chars": 26400,
    "preview": "# Simplification\n\nSimplification consists of modifying the content and structure of a text in order to make it easier to"
  },
  {
    "path": "english/stance_detection.md",
    "chars": 1332,
    "preview": "# Stance detection\n\nStance detection is the extraction of a subject's reaction to a claim made by a primary actor. It is"
  },
  {
    "path": "english/summarization.md",
    "chars": 29392,
    "preview": "# Summarization\n\nSummarization is the task of producing a shorter version of one or several documents that preserves mos"
  },
  {
    "path": "english/taxonomy_learning.md",
    "chars": 13012,
    "preview": "# Taxonomy Learning\n\nTaxonomy learning is the task of hierarchically classifying concepts in an automatic manner from te"
  },
  {
    "path": "english/temporal_processing.md",
    "chars": 7775,
    "preview": "# Temporal Processing\n\n## Document Dating (Time-stamping)\n\nDocument Dating is the problem of automatically predicting th"
  },
  {
    "path": "english/text_classification.md",
    "chars": 6180,
    "preview": "# Text classification\n\nText classification is the task of assigning a sentence or document an appropriate category.\nThe "
  },
  {
    "path": "english/word_sense_disambiguation.md",
    "chars": 14687,
    "preview": "# Word Sense Disambiguation\n\nThe task of Word Sense Disambiguation (WSD) consists of associating words in context with t"
  },
  {
    "path": "french/question_answering.md",
    "chars": 1389,
    "preview": "# Question answering\n\nQuestion answering is the task of answering a question.\n\n### Table of contents\n\n- [Reading compreh"
  },
  {
    "path": "french/summarization.md",
    "chars": 6387,
    "preview": "# Summarization\n\nSummarization is the task of producing a shorter version of one or several documents that preserves mos"
  },
  {
    "path": "german/question_answering.md",
    "chars": 594,
    "preview": "# Question answering\n\nQuestion answering is the task of answering a question.\n\n\n### Table of contents\n\n- [GermanQuAD](#g"
  },
  {
    "path": "german/summarization.md",
    "chars": 3120,
    "preview": "# Summarization\n\nSummarization is the task of producing a shorter version of one or several documents that preserves mos"
  },
  {
    "path": "hindi/hindi.md",
    "chars": 4391,
    "preview": "# Hindi\n\n## Chunking\n\n| Model           | Dev accuracy  | Test F1 | Paper / Source | Code | \n| ------------- | :-----:| "
  },
  {
    "path": "jekyll_instructions.md",
    "chars": 1077,
    "preview": "# Instructions for building the site locally\n\nYou can build the site locally using Jekyll by following the steps detaile"
  },
  {
    "path": "korean/question_answering.md",
    "chars": 541,
    "preview": "# Question answering\n\nQuestion answering is the task of answering a question.\n\n### Table of contents\n\n- [Reading compreh"
  },
  {
    "path": "nepali/nepali.md",
    "chars": 393,
    "preview": "# Nepali\n\n## Machine Translation\n\n| Model           | BLEU  | Paper / Source | Code | \n| ------------- | :-----:|  --- |"
  },
  {
    "path": "persian/named_entity_recognition.md",
    "chars": 2968,
    "preview": "# Named entity recognition\n\nNamed entity recognition (NER) is the task of tagging entities in text with their correspond"
  },
  {
    "path": "persian/natural_language_inference.md",
    "chars": 1714,
    "preview": "# Natural Language Inference\n\nNatural Language Inference (NLI) is the task of determining the inference relationship bet"
  },
  {
    "path": "persian/summarization.md",
    "chars": 4031,
    "preview": "# Summarization\n\nSummarization is the task of producing a shorter version of one or several documents that preserves mos"
  },
  {
    "path": "portuguese/question_answering.md",
    "chars": 1037,
    "preview": "# Question Answering\n\nSee [here](../english/question_answering.md) for more information about the task.\n\n### Datasets\n\n#"
  },
  {
    "path": "russian/question_answering.md",
    "chars": 1089,
    "preview": "# Question answering\n\nQuestion answering is the task of answering a question.\n\n### Table of contents\n\n- [Reading compreh"
  },
  {
    "path": "russian/sentiment-analysis.md",
    "chars": 3834,
    "preview": "# Sentiment Analysis\n\nSentiment analysis is the task of classifying the polarity of a given text.\n\n## RuSentRel\n\nThe [Ru"
  },
  {
    "path": "russian/summarization.md",
    "chars": 3112,
    "preview": "# Summarization\n\nSummarization is the task of producing a shorter version of one or several documents that preserves mos"
  },
  {
    "path": "spanish/entity_linking.md",
    "chars": 591,
    "preview": "# Entity Linking\n\nSee [here](../english/entity_linking.md) for more information about the task.\n\n### Datasets\n\n#### AIDA"
  },
  {
    "path": "spanish/named_entity_recognition.md",
    "chars": 4746,
    "preview": "# Named entity recognition\n\nNamed entity recognition (NER) is the task of tagging entities in text with their correspond"
  },
  {
    "path": "spanish/summarization.md",
    "chars": 3120,
    "preview": "# Summarization\n\nSummarization is the task of producing a shorter version of one or several documents that preserves mos"
  },
  {
    "path": "structured/README.md",
    "chars": 752,
    "preview": "# Exporting NLP-progress into a structure format\n\nParse and export the unstructured information from Markdown into a str"
  },
  {
    "path": "structured/export.py",
    "chars": 13681,
    "preview": "import argparse\nimport os\nimport pprint\nfrom typing import Dict, Tuple, List\nimport re\nimport sys\nimport json\n\n\ndef extr"
  },
  {
    "path": "structured/requirements.txt",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "turkish/summarization.md",
    "chars": 3122,
    "preview": "# Summarization\n\nSummarization is the task of producing a shorter version of one or several documents that preserves mos"
  },
  {
    "path": "vietnamese/vietnamese.md",
    "chars": 20692,
    "preview": "# Vietnamese NLP tasks\n\n## Dependency parsing\n\n* Experiments employ the [benchmark Vietnamese dependency treebank VnDT]("
  }
]
