Repository: sebastianruder/NLP-progress
Branch: master
Commit: 379f03ff7568
Files: 78
Total size: 550.3 KB

Directory structure:
gitextract_k09iiau2/

├── .gitignore
├── CITATION.cff
├── CNAME
├── LICENSE
├── README.md
├── _config.yml
├── _includes/
│   ├── chart.html
│   └── table.html
├── arabic/
│   └── language_modeling.md
├── bengali/
│   ├── emotion_detection.md
│   ├── part_of_speech_tagging.md
│   ├── question_answering.md
│   └── sentiment_analysis.md
├── chinese/
│   ├── chinese.md
│   ├── chinese_word_segmentation.md
│   └── question_answering.md
├── english/
│   ├── automatic_speech_recognition.md
│   ├── ccg.md
│   ├── common_sense.md
│   ├── constituency_parsing.md
│   ├── coreference_resolution.md
│   ├── data_to_text_generation.md
│   ├── dependency_parsing.md
│   ├── dialogue.md
│   ├── domain_adaptation.md
│   ├── entity_linking.md
│   ├── grammatical_error_correction.md
│   ├── information_extraction.md
│   ├── intent_detection_slot_filling.md
│   ├── keyphrase_extraction_generation.md
│   ├── language_modeling.md
│   ├── lexical_normalization.md
│   ├── machine_translation.md
│   ├── missing_elements.md
│   ├── multi-task_learning.md
│   ├── multimodal.md
│   ├── named_entity_recognition.md
│   ├── natural_language_inference.md
│   ├── paraphrase-generation.md
│   ├── part-of-speech_tagging.md
│   ├── question_answering.md
│   ├── relation_prediction.md
│   ├── relationship_extraction.md
│   ├── semantic_parsing.md
│   ├── semantic_role_labeling.md
│   ├── semantic_textual_similarity.md
│   ├── sentiment_analysis.md
│   ├── shallow_syntax.md
│   ├── simplification.md
│   ├── stance_detection.md
│   ├── summarization.md
│   ├── taxonomy_learning.md
│   ├── temporal_processing.md
│   ├── text_classification.md
│   └── word_sense_disambiguation.md
├── french/
│   ├── question_answering.md
│   └── summarization.md
├── german/
│   ├── question_answering.md
│   └── summarization.md
├── hindi/
│   └── hindi.md
├── jekyll_instructions.md
├── korean/
│   └── question_answering.md
├── nepali/
│   └── nepali.md
├── persian/
│   ├── named_entity_recognition.md
│   ├── natural_language_inference.md
│   └── summarization.md
├── portuguese/
│   └── question_answering.md
├── russian/
│   ├── question_answering.md
│   ├── sentiment-analysis.md
│   └── summarization.md
├── spanish/
│   ├── entity_linking.md
│   ├── named_entity_recognition.md
│   └── summarization.md
├── structured/
│   ├── README.md
│   ├── export.py
│   └── requirements.txt
├── turkish/
│   └── summarization.md
└── vietnamese/
    └── vietnamese.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
_site/
Gemfile*
venv
.idea
structured.json


================================================
FILE: CITATION.cff
================================================
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Ruder"
  given-names: "Sebastian"
title: "NLP-progress"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2022-02-06
url: "https://nlpprogress.com/"


================================================
FILE: CNAME
================================================
nlpprogress.com

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Sebastian Ruder

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Tracking Progress in Natural Language Processing

## Table of contents

### English

- [Automatic speech recognition](english/automatic_speech_recognition.md)
- [CCG](english/ccg.md)
- [Common sense](english/common_sense.md)
- [Constituency parsing](english/constituency_parsing.md)
- [Coreference resolution](english/coreference_resolution.md)
- [Data-to-Text Generation](english/data_to_text_generation.md)
- [Dependency parsing](english/dependency_parsing.md)
- [Dialogue](english/dialogue.md)
- [Domain adaptation](english/domain_adaptation.md)
- [Entity linking](english/entity_linking.md)
- [Grammatical error correction](english/grammatical_error_correction.md)
- [Information extraction](english/information_extraction.md)
- [Intent Detection and Slot Filling](english/intent_detection_slot_filling.md) 
- [Keyphrase Extraction and Generation](english/keyphrase_extraction_generation.md)
- [Language modeling](english/language_modeling.md)
- [Lexical normalization](english/lexical_normalization.md)
- [Machine translation](english/machine_translation.md)
- [Missing elements](english/missing_elements.md)
- [Multi-task learning](english/multi-task_learning.md)
- [Multi-modal](english/multimodal.md)
- [Named entity recognition](english/named_entity_recognition.md)
- [Natural language inference](english/natural_language_inference.md)
- [Part-of-speech tagging](english/part-of-speech_tagging.md)
- [Paraphrase Generation](english/paraphrase-generation.md)
- [Question answering](english/question_answering.md)
- [Relation prediction](english/relation_prediction.md)
- [Relationship extraction](english/relationship_extraction.md)
- [Semantic textual similarity](english/semantic_textual_similarity.md)
- [Semantic parsing](english/semantic_parsing.md)
- [Semantic role labeling](english/semantic_role_labeling.md)
- [Sentiment analysis](english/sentiment_analysis.md)
- [Shallow syntax](english/shallow_syntax.md)
- [Simplification](english/simplification.md)
- [Stance detection](english/stance_detection.md)
- [Summarization](english/summarization.md)
- [Taxonomy learning](english/taxonomy_learning.md)
- [Temporal processing](english/temporal_processing.md)
- [Text classification](english/text_classification.md)
- [Word sense disambiguation](english/word_sense_disambiguation.md)

### Vietnamese

- [Dependency parsing](vietnamese/vietnamese.md#dependency-parsing)
- [Intent detection and Slot filling](vietnamese/vietnamese.md#intent-detection-and-slot-filling)
- [Machine translation](vietnamese/vietnamese.md#machine-translation)
- [Named entity recognition](vietnamese/vietnamese.md#named-entity-recognition)
- [Part-of-speech tagging](vietnamese/vietnamese.md#part-of-speech-tagging)
- [Semantic parsing](vietnamese/vietnamese.md#semantic-parsing)
- [Word segmentation](vietnamese/vietnamese.md#word-segmentation)

### Hindi

- [Chunking](hindi/hindi.md#chunking)
- [Part-of-speech tagging](hindi/hindi.md#part-of-speech-tagging)
- [Machine Translation](hindi/hindi.md#machine-translation)

### Chinese

- [Entity linking](chinese/chinese.md#entity-linking)
- [Chinese word segmentation](chinese/chinese_word_segmentation.md)
- [Question answering](chinese/question_answering.md)

For more tasks, datasets and results in Chinese, check out the [Chinese NLP](https://chinesenlp.xyz/#/) website.

### French

- [Question answering](french/question_answering.md)
- [Summarization](french/summarization.md)

### Russian

- [Question answering](russian/question_answering.md)
- [Sentiment Analysis](russian/sentiment-analysis.md)
- [Summarization](russian/summarization.md)

### Spanish

- [Named Entity Recognition](spanish/named_entity_recognition.md)
- [Entity linking](spanish/entity_linking.md#entity-linking)
- [Summarization](spanish/summarization.md)

### Portuguese

- [Question Answering](portuguese/question_answering.md)

### Korean

- [Question Answering](korean/question_answering.md)

### Nepali

- [Machine Translation](nepali/nepali.md#machine-translation)

### Bengali
- [Part-of-speech Tagging](bengali/part_of_speech_tagging.md)
- [Emotion Detection](bengali/emotion_detection.md)
- [Sentiment Analysis](bengali/sentiment_analysis.md)

### Persian
- [Named entity recognition](persian/named_entity_recognition.md)
- [Natural language inference](persian/natural_language_inference.md)
- [Summarization](persian/summarization.md)

### Turkish

- [Summarization](turkish/summarization.md)

### German

- [Question Answering](german/question_answering.md)
- [Summarization](german/summarization.md)

### Arabic
- [Language modeling](arabic/language_modeling.md)


This document aims to track the progress in Natural Language Processing (NLP) and give an overview
of the state-of-the-art (SOTA) across the most common NLP tasks and their corresponding datasets.

It aims to cover both traditional and core NLP tasks such as dependency parsing and part-of-speech tagging
as well as more recent ones such as reading comprehension and natural language inference. The main objective
is to provide the reader with a quick overview of benchmark datasets and the state-of-the-art for their
task of interest, which serves as a stepping stone for further research. To this end, if there is a 
place where results for a task are already published and regularly maintained, such as a public leaderboard,
the reader will be pointed there.

If you want to find this document again in the future, just go to [`nlpprogress.com`](https://nlpprogress.com/)
or [`nlpsota.com`](http://nlpsota.com/) in your browser.

### Contributing

#### Guidelines

**Results** &nbsp; Results reported in published papers are preferred; an exception may be made for influential preprints.

**Datasets** &nbsp; Datasets should have been used for evaluation in at least one published paper besides 
the one that introduced the dataset.

**Code** &nbsp; We recommend adding a link to an implementation 
if available. You can add a `Code` column (see below) to the table if it does not exist.
In the `Code` column, indicate an official implementation with [Official](http://link_to_implementation).
If an unofficial implementation is available, use [Link](http://link_to_implementation) (see below).
If no implementation is available, you can leave the cell empty.

#### Adding a new result

If you would like to add a new result, you can just click on the small edit button in the top-right
corner of the file for the respective task (see below).

![Click on the edit button to add a file](img/edit_file.png)

This allows you to edit the file in Markdown. Simply add a row to the corresponding table in the
same format. Make sure that the table stays sorted (with the best result on top). 
After you've made your change, make sure that the table still looks ok by clicking on the
"Preview changes" tab at the top of the page. If everything looks good, go to the bottom of the page,
where you will see the form below.

![Fill out the file change information](img/propose_file_change.png)

Add a name for your proposed change, an optional description, indicate that you would like to
"Create a new branch for this commit and start a pull request", and click on "Propose file change".

#### Adding a new dataset or task

For adding a new dataset or task, you can also follow the steps above. Alternatively, you can fork the repository.
In both cases, follow the steps below:

1. If your task is completely new, create a new file and link to it in the table of contents above.
2. If not, add your task or dataset to the respective section of the corresponding file (in alphabetical order).
3. Briefly describe the dataset/task and include relevant references. 
4. Describe the evaluation setting and evaluation metric.
5. Show what an annotated example of the dataset/task looks like.
6. Add a download link if available.
7. Copy the table below and fill in at least two results (including the state-of-the-art)
  for your dataset/task (change Score to the metric of your dataset). If your dataset/task
  has multiple metrics, add them to the right of `Score`.
8. Submit your change as a pull request.
  
| Model           | Score  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
|  |  |  | |


### Wish list

These are tasks and datasets that are still missing:

- Bilingual dictionary induction
- Discourse parsing
- Keyphrase extraction
- Knowledge base population (KBP)
- More dialogue tasks
- Semi-supervised learning
- Frame-semantic parsing (FrameNet full-sentence analysis)

### Exporting into a structured format

You can extract all the data into a structured, machine-readable JSON format with parsed tasks, descriptions and SOTA tables. 

The instructions are in [structured/README.md](structured/README.md).
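
As a rough illustration of what "structured" means here, the sketch below turns one results table into a list of JSON-ready dictionaries. It is not the logic of `structured/export.py`; the function name `parse_markdown_table` and the sample rows are hypothetical, and cells containing escaped pipes are not handled.

```python
import json


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited Markdown table into one dict per row."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the first table has ended
            continue
        # Drop the outer pipes, then split the remainder into cells.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in inner.split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, cells)) for cells in body]


# Hypothetical table in the format used throughout this repository.
table = """
| Model | Score | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Model A | 97.6 | Paper A ||
| Model B | 97.3 | Paper B | [Official](http://link_to_implementation) |
"""

results = parse_markdown_table(table)
print(json.dumps(results, indent=2))
```

Keeping every cell as a string leaves the conversion of scores to numbers to the consumer, which avoids guessing how footnote markers or ranges in a score cell should be interpreted.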

### Instructions for building the site locally

Instructions for building the website locally using Jekyll can be found [here](jekyll_instructions.md).


================================================
FILE: _config.yml
================================================
theme: jekyll-theme-slate

================================================
FILE: _includes/chart.html
================================================
<style>

.chart div {
  font: 18px sans-serif;
  background-color: steelblue;
  padding: 6px;
  margin: 2px;
  color: white;
  height: 40px;
}

.alignleft {
	float: left;
}
.alignright {
	float: right;
}

</style>

<div class="chart">
{% for result in include.results %}
{% assign score = result[include.score] %}
  <div style="width: {{ score | times: 6.0 }}px;">
    <p class="alignleft">{{ result.authors }} ({{ result.year }})</p>
    <p class="alignright">{{ score }}</p>
  </div>
{% endfor %}
</div>


================================================
FILE: _includes/table.html
================================================
{% assign scores = include.scores | split: "," %}

<table>
  <thead>
    <tr>
      <th>Model</th>
      {% for score in scores %}
      <th style="text-align: center">{{ score }}</th>
      {% endfor %}
      <th>Paper / Source</th>
      <th>Code</th>
    </tr>
  </thead>
  <tbody>
  {% for result in include.results %}
    <tr>
      <td>{% if result.model %} {{ result.model }} by {% endif %} {{ result.authors }} ({{ result.year }})</td>
      {% for score in scores %}
      <td style="text-align: center">{{ result[score] }}</td>
      {% endfor %}
      <td><a href="{{ result.url }}">{{ result.paper }}</a></td>
      <td>
      {% for el in result.code %}
        <a href="{{ el.url }}">{{ el.name }}</a>
      {% endfor %}
      </td>
    </tr>
  {% endfor %}
  </tbody>
</table>


================================================
FILE: arabic/language_modeling.md
================================================
# Language modeling

Language modeling is the task of predicting the next word or character in a document.


| Model           | Paper / Source | Code |
| ------------- | :-----:| :-----: |
| Zen 2.0: Continue training and adaption for n-gram enhanced text encoders | [ZEN](https://arxiv.org/abs/2105.01279) | [Official](https://github.com/sinovation/ZEN2) |
|hULMonA: The Universal Language Model in Arabic|[hULMonA](https://aclanthology.org/W19-4608/) | [Official](https://github.com/aub-mind/hULMonA) |
|AraBERT: Transformer-based Model for Arabic Language Understanding|[AraBERT](https://arxiv.org/abs/2003.00104) | [Official](https://github.com/aub-mind/araBERT) |


================================================
FILE: bengali/emotion_detection.md
================================================
# Fine-grained Emotion Detection

Fine-grained Emotion Detection is the task of detecting one or more emotions in a given text.

## EmoNoBa

[EmoNoBa: A Dataset for Analyzing Fine-Grained Emotions on Noisy Bangla Texts](https://aclanthology.org/2022.aacl-short.17.pdf) is a dataset of 22,698 instances, each labeled with one or more of 6 emotions. The dataset is available [here](https://www.kaggle.com/datasets/saifsust/emonoba). Models are evaluated using the macro-averaged F1 score.

| Model | F1-score | Paper / Source | Code |
| ------------ | ------------- | ------------ | ------------- |
| W1 + W2 + W3+ W4 + C1 + C2 + C3 | 42.81 | [EmoNoBa: A Dataset for Analyzing Fine-Grained Emotions on Noisy Bangla Texts](https://aclanthology.org/2022.aacl-short.17.pdf) | [Official](https://github.com/KhondokerIslam/EmoNoBa) |
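
For readers unfamiliar with macro-averaged F1 in a multi-label setting, here is a minimal sketch. It assumes that F1 is computed separately for each emotion label and then averaged without weighting; the label names and the toy data are hypothetical, and the dataset's official evaluation may differ in details.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; treated as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: list[set[str]], pred: list[set[str]], labels: list[str]) -> float:
    """Unweighted mean of per-label F1 over multi-label predictions."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        scores.append(f1_from_counts(tp, fp, fn))
    return sum(scores) / len(labels)


# Hypothetical labels and annotations: each instance carries a set of emotions.
labels = ["joy", "anger", "sadness"]
gold = [{"joy"}, {"anger", "sadness"}, {"joy", "sadness"}]
pred = [{"joy"}, {"anger"}, {"sadness"}]

print(round(macro_f1(gold, pred, labels), 4))
```

Because every label contributes equally to the mean, rare emotions influence the score as much as frequent ones.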


================================================
FILE: bengali/part_of_speech_tagging.md
================================================

# Part-of-speech Tagging
Part-of-speech tagging (POS tagging) is the task of tagging a word in a text with its part of speech.
A part of speech is a category of words with similar grammatical properties. Common English
parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.

## Linguistic Data Consortium: Indian Bengali
Indian Language Part-of-Speech Tagset: Bengali (Linguistic Data Consortium (LDC) catalog number LDC2010T16, ISBN 1-58563-561-8) is a corpus developed by Microsoft Research (MSR) India to support part-of-speech (POS) tagging and other data-driven linguistic research on Indian languages in general.

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Deep Learning (Fasihul et al., 2016) | 93.33 | [Deep learning based parts of speech tagger for Bengali](https://ieeexplore.ieee.org/abstract/document/7760098) | --- |


================================================
FILE: bengali/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [Reading comprehension](#reading-comprehension)
  - [Bangla-SQuAD](#bangla-squad)
  
## Reading comprehension
  
### Bangla-SQuAD

The [Bengali Question Answering Dataset (Bengali-SQuAD)](https://zenodo.org/record/4557874#.YaEUp9BBxPY) is an automatically translated (using Google Translate) and preprocessed subset of the large-scale English reading comprehension
dataset [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/). It was introduced in the paper ["Deep Learning Based Question Answering System
in Bengali"](https://www.tandfonline.com/doi/full/10.1080/24751839.2020.1833136).


Example:

| Document  | Question | Answer |
| ------------- | -----:| -----: |
| চার্লসটন আমেরিকা যুক্তরাষ্ট্রের দক্ষিণ ক্যারোলাইনা রাজ্যের প্রাচীনতম এবং দ্বিতীয় বৃহত্তম শহর, চার্লসটন কাউন্টির কাউন্টি আসন এবং চার্লসটন – নর্থ চার্লসটন – সামারভিলে মেট্রোপলিটন স্ট্যাটিস্টিকাল এরিয়ার প্রধান শহর  শহরটি দক্ষিণ ক্যারোলিনার উপকূলরেখার ভৌগলিক মিডপয়েন্টের ঠিক দক্ষিণে অবস্থিত এবং অ্যাশলে এবং কুপার নদীর নদীর সংগম দ্বারা গঠিত আটলান্টিক মহাসাগরের একটি খাঁটি চার্লস্টন হারবারে অবস্থিত, অথবা স্থানীয়ভাবে প্রকাশিত হয়েছে, \"যেখানে কুপার এবং অ্যাশলে রয়েছে। নদীগুলি একত্র হয়ে আটলান্টিক মহাসাগর গঠনে আসে।|চার্লসটন হারবার কোন মহাসাগরের খাঁড়ি? |আটলান্টিক মহাসাগরের|


================================================
FILE: bengali/sentiment_analysis.md
================================================
# Sentiment analysis

Sentiment Analysis is the task of classifying the polarity of a given text.

## SentNoB

[SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts](https://aclanthology.org/2021.findings-emnlp.278.pdf) is a dataset of 15,728 instances, each labeled with one of three classes. This work also proposes <em>unique word percentage</em>, a new evaluation metric for datasets. Models are evaluated using the micro-averaged F1 score.

| Model | F1-score | Paper / Source | Code |
| ------------ | ------------- | ------------ | ------------- |
| U + B + T + C2 + C3 + C4 + C5 | 64.61 | [SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts](https://aclanthology.org/2021.findings-emnlp.278.pdf) | [Official](https://github.com/KhondokerIslam/SentNoB) |
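
The sketch below shows one common way to compute micro-averaged F1 for a single-label, three-class task: true positives, false positives and false negatives are pooled across classes before precision and recall are computed. The class names and the toy data are assumptions for illustration, not taken from the dataset.

```python
def micro_f1(gold: list[str], pred: list[str], labels: list[str]) -> float:
    """Micro-averaged F1: pool the counts over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this class, gold says otherwise
            elif g == label:
                fn += 1  # gold is this class, prediction missed it
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical class names and predictions.
labels = ["positive", "negative", "neutral"]
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(micro_f1(gold, pred, labels), accuracy)
```

When every instance has exactly one gold label and one predicted label, each error counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with accuracy.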


================================================
FILE: chinese/chinese.md
================================================
# Chinese NLP tasks

## Entity linking

See [here](../english/entity_linking.md) for more information about the task.

### Datasets

#### AIDA CoNLL-YAGO Dataset

##### Disambiguation-Only Models

|  Model | Micro-Precision | Paper / Source | Code | 
| ------------- | :-----:| :----: | :----: |
| Sil et al. (2018) | 84.4 | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | |
| Tsai & Roth (2016) | 83.6 | [Cross-lingual wikification using multilingual embeddings](http://cogcomp.org/papers/TsaiRo16b.pdf) | |

[Go back to the README](../README.md)


================================================
FILE: chinese/chinese_word_segmentation.md
================================================
# Chinese Word Segmentation

## Task
Chinese word segmentation is the task of
splitting Chinese text (a sequence of Chinese characters)
into words.

Example:
```
'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
```

## Systems
&spades; marks systems that use character unigrams as input.
&clubs; marks systems that use character bigrams as input.

- Tian et al. (2020): ZEN + key-value memory networks &spades;
- Huang et al. (2019): BERT + model compression + multi-criteria learning &spades;
- Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings &spades;&clubs; 
- Ma et al. (2018): BiLSTM-CRF + hyper-params search &spades;&clubs;
- Yang et al. (2017): Transition-based + Beam-search + Rich pretrain &spades;&clubs;
- Zhou et al. (2017): Greedy Search + word context &spades;
- Chen et al. (2017): BiLSTM-CRF + adv. loss &spades;&clubs;
- Cai et al. (2017): Greedy Search + Span representation &spades;
- Kurita et al. (2017): Transition-based + Joint model &spades;
- Liu et al. (2016): neural semi-CRF &spades;
- Cai and Zhao (2016): Greedy Search &spades;
- Chen et al. (2015a): Gated Recursive NN &spades;&clubs;
- Chen et al. (2015b): BiLSTM-CRF &spades;&clubs;
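
To make the unigram/bigram distinction in the list above concrete, the following sketch extracts character n-grams from a sentence. It only illustrates the two input granularities; how each system pads, windows or embeds these n-grams varies by paper and is not modeled here. The helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all contiguous character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "上海浦东"
unigrams = char_ngrams(sentence, 1)  # ['上', '海', '浦', '东']
bigrams = char_ngrams(sentence, 2)   # ['上海', '海浦', '浦东']
print(unigrams, bigrams)
```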

## Evaluation

### Metrics

F1-score
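
A minimal sketch of how word-level F1 is typically computed for segmentation: each word is mapped to its character span, and a predicted word counts as correct only if the same span appears in the gold segmentation. The gold segmentation is the example from the task description; the predicted segmentation is a made-up system output, and official scoring scripts may differ in details.

```python
def to_spans(words: list[str]) -> set[tuple[int, int]]:
    """Map a word sequence to the set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold: list[str], pred: list[str]) -> float:
    """Word-level F1 between a gold and a predicted segmentation of the same text."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["上海", "浦东", "开发", "与", "建设", "同步"]
pred = ["上海浦东", "开发", "与", "建设", "同步"]  # hypothetical output that merges two words
print(round(segmentation_f1(gold, pred), 4))
```

In this example one merge error costs two gold words, since neither `上海` nor `浦东` is recovered, while only one predicted word is wrong.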

### Dataset
#### Chinese Treebank 6

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Huang et al. (2019) | 97.6 |[Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)||
| Tian et al. (2020) | 97.3 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Ma et al. (2018) | 96.7 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2018) | 96.3 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)|
| Yang et al. (2017) | 96.2 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Zhou et al. (2017) | 96.2 | [Word-Context Character Embeddings for Chinese Word Segmentation](https://www.aclweb.org/anthology/D17-1079)| |
| Chen et al. (2017) | 96.2 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |
| Liu et al. (2016) | 95.5 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) |
| Chen et al. (2015b) | 96.0 | [Long Short-Term Memory Neural Networks for Chinese Word Segmentation](http://www.aclweb.org/anthology/D15-1141) | [Github](https://github.com/FudanNLP/CWS_LSTM) |

#### Chinese Treebank 7

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Ma et al. (2018) | 96.6 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Kurita et al. (2017) | 96.2 | [Neural Joint Model for Transition-based Chinese Syntactic Analysis](http://www.aclweb.org/anthology/P17-1111) | |

#### AS

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Tian et al. (2020) | 96.6 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Huang et al. (2019) | 96.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Ma et al. (2018) | 96.2 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2017) | 95.7 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) |[Github](https://github.com/jiesutd/RichWordSegmentor) |
| Cai et al. (2017) | 95.3 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 94.8 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |

#### CityU

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Tian et al. (2020) | 97.9 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Huang et al. (2019) | 97.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Ma et al. (2018) | 97.2 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2017) | 96.9 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Cai et al. (2017) | 95.6 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 95.6 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |

#### PKU

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Huang et al. (2019) | 96.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Tian et al. (2020) | 96.5 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Yang et al. (2017) | 96.3 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Ma et al. (2018) | 96.1 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Yang et al. (2018) | 95.9 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)|
| Cai et al. (2017) | 95.8 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 94.3 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |
| Liu et al. (2016) | 95.7 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) |
| Cai and Zhao (2016) | 95.7 | [Neural Word Segmentation Learning for Chinese](http://www.aclweb.org/anthology/P16-1039) | [Github](https://github.com/jcyk/CWS) |

#### MSR

| Model         | F1 | Paper / Source | Code |
| ------------- | :-----: |  --- | --- |
| Tian et al. (2020) | 98.4 | [Improving Chinese Word Segmentation with Wordhood Memory Networks](https://www.aclweb.org/anthology/2020.acl-main.734/)| [Github](https://github.com/SVAIGBA/WMSeg)|
| Ma et al. (2018) | 98.1 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| |
| Huang et al. (2019) | 97.9 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| |
| Yang et al. (2018) | 97.8 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)|
| Yang et al. (2017) | 97.5 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)|
| Cai et al. (2017) | 97.1 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) |
| Chen et al. (2017) | 96.0 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) |
| Liu et al. (2016) | 97.6 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) |
| Cai and Zhao (2016) | 96.4 | [Neural Word Segmentation Learning for Chinese](http://www.aclweb.org/anthology/P16-1039) | [Github](https://github.com/jcyk/CWS) |

[Go back to the README](../README.md)


================================================
FILE: chinese/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [Reading comprehension](#reading-comprehension)
  - [CMRC 2018](#cmrc-2018)
  - [DRCD](#drcd)
  - [DuReader](#dureader)
  
## Reading comprehension

### CMRC 2018

The [Chinese Machine Reading Comprehension (CMRC 2018)](https://www.aclweb.org/anthology/D19-1600/) is a SQuAD-like
reading comprehension dataset that consists of 20,000 questions annotated on Wikipedia paragraphs by human experts. The
dataset can be downloaded [here](https://github.com/ymcui/cmrc2018). Below we show the F1 and EM scores both on the
test set and the challenge set. 

| Model           | Test F1 | Test EM | Challenge F1 | Challenge EM | Paper |
| ------------- | :-----:| :-----:| :-----:| :-----:| --- |
| Human performance | 97.9 | 92.4 | 95.2 | 90.4 | [A Span-Extraction Dataset for Chinese Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1600/) |
| Dual BERT (w / SQuAD; Cui et al., 2019) | 90.2 | 73.6 | 55.2 | 27.8 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |
| Dual BERT (Cui et al., 2019) | 88.1 | 70.4 | 47.9 | 23.8 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |
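
The EM and F1 columns above can be read with the following simplified sketch in mind. It assumes a single reference answer, treats each character as a token (a common simplification for Chinese), and uses a SQuAD-style bag-of-tokens overlap for F1. The official evaluation scripts for these datasets apply their own normalization and matching rules, so numbers from this sketch are not comparable to the table.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the answers are identical after trimming whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    """F1 over the multiset of characters shared by prediction and reference."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and a partially correct prediction.
reference = "上海浦东"
prediction = "浦东"
print(exact_match(prediction, reference), round(char_f1(prediction, reference), 4))
```

The example shows why F1 is usually much higher than EM: a partially correct span earns no exact-match credit but still receives partial overlap credit.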

### DRCD

The [Delta Reading Comprehension Dataset (DRCD)](https://arxiv.org/abs/1806.00920) is a SQuAD-like reading 
comprehension dataset that contains 30,000+ questions on 10,014 paragraphs from 2,108 Wikipedia articles. The dataset
can be downloaded [here](https://github.com/DRCKnowledgeTeam/DRCD).

| Model           | F1 | EM |  Paper |
| ------------- | :-----:| :-----:| --- |
| Human performance | 93.3 | 80.4 | [DRCD: a Chinese Machine Reading Comprehension Dataset](https://arxiv.org/abs/1806.00920) |
| Dual BERT (w / SQuAD; Cui et al., 2019) | 91.6 | 85.4 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |
| Dual BERT (Cui et al., 2019) | 90.3 | 83.7 | [Cross-Lingual Machine Reading Comprehension](https://www.aclweb.org/anthology/D19-1169/) |
  
### DuReader

[DuReader](https://www.aclweb.org/anthology/W18-2605/) is a large-scale reading comprehension dataset that is based on
the logs of Baidu Search and contains 200k questions, 420k answers, and 1M documents. For more information, refer to
[its website](https://ai.baidu.com/broad/introduction?dataset=dureader). You can download the
dataset [here](https://ai.baidu.com/broad/download?dataset=dureader). The best models can be viewed on the
[public leaderboard](https://ai.baidu.com/broad/leaderboard?dataset=dureader).


================================================
FILE: english/automatic_speech_recognition.md
================================================
# Automatic speech recognition (ASR)

Automatic speech recognition is the task of automatically recognizing speech. You
can find a repository tracking the state of the art [here](https://github.com/syhw/wer_are_we).


================================================
FILE: english/ccg.md
================================================
# Combinatory Categorial Grammar

Combinatory Categorial Grammar (CCG; [Steedman, 2000](http://www.citeulike.org/group/14833/article/8971002)) is a
highly lexicalized formalism. The standard parsing model of [Clark and Curran (2007)](https://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.4.493)
uses over 400 lexical categories (or _supertags_), compared to about 50 part-of-speech tags for typical parsers.

Example:

| Vinken | , | 61 | years | old |
| --- | ---| --- | --- | --- |
| N| , | N/N | N | (S[adj]\ NP)\ NP |

## Parsing

CCG parsing is evaluated in terms of labeled dependency F-score, which "take\[s\] into account the lexical category containing the dependency relation, the argument slot, the word associated with the lexical category, and the argument head word: All four must be correct to score a point" ([Clark & Curran, 2007](https://doi.org/10.1162/coli.2007.33.4.493)).
Besides the word forms, some popular parsers (like the C&C parser) take POS tags as input. For fair comparison, systems should use automatically obtained POS as input, though some papers additionally report performance with oracle gold-standard POS features.
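
The scoring rule quoted above can be sketched in a few lines. This is a minimal illustration, not the official evaluation script: the `Dependency` tuple, the function name, and the toy dependencies are assumptions made for this example, and a real scorer would identify words by their position in the sentence rather than by their form.

```python
from typing import NamedTuple, Set, Tuple


class Dependency(NamedTuple):
    """Hypothetical container for the four fields that must all match."""
    category: str  # lexical category containing the dependency relation
    slot: int      # argument slot within that category
    word: str      # word associated with the lexical category
    head: str      # argument head word


def labeled_f_score(gold: Set[Dependency],
                    predicted: Set[Dependency]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); only exact four-way matches count."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy dependencies invented for illustration (not taken from CCGBank).
gold = {
    Dependency("N/N", 1, "61", "years"),
    Dependency("(S[adj]\\NP)\\NP", 1, "old", "years"),
    Dependency("(S[adj]\\NP)\\NP", 2, "old", "Vinken"),
}
predicted = {
    Dependency("N/N", 1, "61", "years"),
    Dependency("(S[adj]\\NP)\\NP", 1, "old", "years"),
    Dependency("(S[adj]\\NP)\\NP", 2, "old", "61"),  # wrong head word
}

precision, recall, f1 = labeled_f_score(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case one predicted dependency has the wrong head word, so it earns no credit even though its other three fields are correct.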

### CCGBank

The CCGBank is a corpus of CCG derivations and dependency structures extracted from the Penn Treebank by
[Hockenmaier and Steedman (2007)](http://www.aclweb.org/anthology/J07-3004). Sections 2-21 are used for training,
section 00 for development, and section 23 as in-domain test set.

| Model           | Labeled F-score |  Paper / Source |
| ------------- | :-----:| --- |
| Prange et al. (2021), non-constructive | 90.91 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Bhargava and Penn (2020), constructive | 90.9 | [Supertagging with CCG primitives](https://www.aclweb.org/anthology/2020.repl4nlp-1.23/) |
| Prange et al. (2021), constructive | 90.79 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Vaswani et al. (2016) | 88.32 | [Supertagging with LSTMs](https://aclweb.org/anthology/N/N16/N16-1027.pdf) |
| Lewis et al. (2016) | 88.1 | [LSTM CCG Parsing](https://aclweb.org/anthology/N/N16/N16-1026.pdf) |
| Xu et al. (2015) | 87.04 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Kummerfeld et al. (2010), with additional unlabeled data | 85.95 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |
| Clark and Curran (2007) | 85.45 | [Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models](https://www.aclweb.org/anthology/J07-4004) |

### Wikipedia

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----:| --- |
| Xu et al. (2015) | 82.49 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Kummerfeld et al. (2010), with additional unlabeled data | 81.7 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |

### Bioinfer

| Model         | Bio-specific taggers? | Accuracy |  Paper / Source |
| ------------- | -------------------- | :-------:| --- |
| Kummerfeld et al. (2010), with additional unlabeled data | Yes | 82.3 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |
| Rimell and Clark (2008) | Yes | 81.5 | [Adapting a Lexicalized-Grammar Parser to Contrasting Domains](https://aclweb.org/anthology/papers/D/D08/D08-1050/) |
| Xu et al. (2015) | No | 77.74 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Kummerfeld et al. (2010), with additional unlabeled data | No | 76.1 | [Faster Parsing by Supertagger Adaptation](https://www.aclweb.org/anthology/papers/P/P10/P10-1036/) |
| Rimell and Clark (2008) | No | 76.0 | [Adapting a Lexicalized-Grammar Parser to Contrasting Domains](https://aclweb.org/anthology/papers/D/D08/D08-1050/) |

## Supertagging

To mitigate sparsity, CCG supertaggers have traditionally been trained only on categories that occur 10 times or more in the CCGBank training data, which amounts to the 425 most frequent categories. In more recent work, using this threshold is becoming less common. In any case, supertagging evaluation is always measured for all supertags occurring in the test set. Models are evaluated based on token accuracy.
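
As a rough sketch of the setup described above, the label set can be built by counting categories in the training data, while accuracy is still computed over every test token. The function names and the toy tag lists are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_categories(train_tags: Iterable[str], min_count: int = 10) -> Set[str]:
    """Keep only the categories seen at least `min_count` times in training."""
    counts = Counter(train_tags)
    return {tag for tag, n in counts.items() if n >= min_count}


def token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of tokens whose predicted supertag equals the gold supertag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy training data: the third category falls below the cutoff.
train_tags = ["N"] * 12 + ["N/N"] * 10 + ["(S[adj]\\NP)\\NP"] * 3
label_set = frequent_categories(train_tags)

# The rare category is outside the label set but still counts at test time.
gold = ["N", "N/N", "N", "(S[adj]\\NP)\\NP"]
predicted = ["N", "N/N", "N", "N"]
accuracy = token_accuracy(gold, predicted)
print(sorted(label_set), accuracy)
```

A tagger restricted to `label_set` can never get the last token right, which is why the cutoff puts a ceiling on accuracy for rare categories.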

### Constructive supertagging

A constructive tagger models the internal structure of supertags rather than treating each supertag type as opaque ([Kogkalidis et al., 2019](https://www.aclweb.org/anthology/W19-4314/)). Supertags are constructed from minimal pieces (which for CCG are slashes and atomic categories) and there is no frequency cutoff.
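
To make the notion of "minimal pieces" concrete, the sketch below splits a category string into atomic categories, slashes, and parentheses. The regular expression and function name are assumptions for illustration; the cited constructive taggers build categories with their own (for example, tree-structured) decoders rather than by tokenizing strings.

```python
import re
from typing import List

# Atomic categories (optionally with a feature such as [adj]), slashes,
# parentheses, and punctuation categories such as ",".
_PIECE = re.compile(r"[A-Za-z]+(?:\[[a-z]+\])?|[/\\()]|[,.;:]")


def supertag_pieces(category: str) -> List[str]:
    """Split a CCG category string into its minimal pieces."""
    compact = category.replace(" ", "")
    pieces = _PIECE.findall(compact)
    if "".join(pieces) != compact:
        raise ValueError(f"unrecognized symbols in category: {category!r}")
    return pieces


print(supertag_pieces("N/N"))
print(supertag_pieces("(S[adj]\\ NP)\\ NP"))
```

Because any category is assembled from a small inventory of pieces, a model of this kind can in principle produce supertags it never saw during training.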

### CCGBank

As with parsing, sections 2-21 are used for training, section 00 for development, and section 23 as in-domain test set.

| Model           | Accuracy |  Paper / Source |
| ----------------- | :-----:| --- |
| Kogkalidis and Moortgat (2022), constructive | 96.29 | [Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions](https://arxiv.org/abs/2203.12235) | 
| Tian et al. (2020), non-constructive | 96.25 | [Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks](https://aclanthology.org/2020.emnlp-main.487/) |
| Prange et al. (2021), non-constructive | 96.22 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Prange et al. (2021), constructive | 96.09 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Clark et al. (2018) | 96.05 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) |
| Bhargava and Penn (2020), constructive | 96.00 | [Supertagging with CCG primitives](https://www.aclweb.org/anthology/2020.repl4nlp-1.23/) |
| Lewis et al. (2016) | 94.7 | [LSTM CCG Parsing](https://aclweb.org/anthology/N/N16/N16-1026.pdf) |
| Vaswani et al. (2016) | 94.24 | [Supertagging with LSTMs](https://aclweb.org/anthology/N/N16/N16-1027.pdf) |
| Low supervision (Søgaard and Goldberg, 2016) | 93.26 | [Deep multi-task learning with low level tasks supervised at lower layers](http://anthology.aclweb.org/P16-2038) |
| Xu et al. (2015) | 93.00 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |
| Clark and Curran (2004) | 92.00 | [The Importance of Supertagging for Wide-Coverage CCG Parsing](https://aclweb.org/anthology/papers/C/C04/C04-1041/) (result from Lewis et al. (2016)) |

#### Rare and unseen supertags

| Model           | Acc on tags seen 1-9 times | Acc on unseen tags |  Paper / Source |
| ------------- | :-----: | :-----: | --- |
| Bhargava and Penn (2020), constructive | - | 5.00 | [Supertagging with CCG primitives](https://www.aclweb.org/anthology/2020.repl4nlp-1.23/) |
| Prange et al. (2021), constructive | 37.40 | 3.03 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Prange et al. (2021), non-constructive | 23.17 | 0.00 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Kogkalidis and Moortgat (2022), constructive | 34.45 | 4.55 | [Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions](https://arxiv.org/abs/2203.12235) |

### Wikipedia

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----: | --- |
| Prange et al. (2021), non-constructive | 92.54 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Prange et al. (2021), constructive | 92.46 | [Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories](https://doi.org/10.1162/tacl_a_00364) |
| Xu et al. (2015) | 90.00 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) |

## Conversion to PTB

There has been interest in converting CCG derivations to phrase structure parses for comparison with phrase structure parsers (since CCGBank is based on the PTB).

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----:| --- |
| Kummerfeld et al. (2012) | 96.30 | [Robust Conversion of CCG Derivations to Phrase Structure Trees](https://www.aclweb.org/anthology/P12-2021) |
| Zhang et al. (2012) | 95.71 | [A Machine Learning Approach to Convert CCGbank to Penn Treebank](https://www.aclweb.org/anthology/C12-3067) |
| Clark and Curran (2009) | 94.64 | [Comparing the Accuracy of CCG and Penn Treebank Parsers](https://aclweb.org/anthology/papers/P/P09/P09-2014/) |

[Go back to the README](../README.md)


================================================
FILE: english/common_sense.md
================================================
# Common sense

Common sense reasoning tasks are intended to require the model to go beyond pattern 
recognition. Instead, the model should use "common sense" or world knowledge
to make inferences.

### Event2Mind

Event2Mind is a crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations.
Given an event described in a short free-form text, a model should reason about the likely intents and reactions of the
event's participants. Models are evaluated based on average cross-entropy (lower is better).

| Model           | Dev  | Test  |  Paper / Source | Code | 
| ------------- | :-----:| :-----:|--- | --- | 
| BiRNN 100d (Rashkin et al., 2018) | 4.25 | 4.22 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |
| ConvNet (Rashkin et al., 2018) | 4.44 | 4.40 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | |

### SWAG

Situations with Adversarial Generations (SWAG) is a dataset consisting of 113k multiple
choice questions about a rich spectrum of grounded situations.

| Model           | Dev  | Test  |  Paper / Source | Code | 
| ------------- | :-----:| :-----:|--- | --- | 
| BERT Large (Devlin et al., 2018) | 86.6 | 86.3 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| BERT Base (Devlin et al., 2018) | 81.6 | - | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| ESIM + ELMo (Zellers et al., 2018) | 59.1 | 59.2 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) |  |
| ESIM + GloVe (Zellers et al., 2018) | 51.9 | 52.7 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) |  |

### Winograd Schema Challenge

The [Winograd Schema Challenge](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492)
is a dataset for common sense reasoning. It employs Winograd Schema questions that
require the resolution of anaphora: the system must identify the antecedent of an ambiguous pronoun in a statement. Models
are evaluated based on accuracy.

Example:

The trophy doesn’t fit in the suitcase because _it_ is too big. What is too big?
Answer 0: the trophy. Answer 1: the suitcase

| Model           | Score  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Word-LM-partial (Trinh and Le, 2018) | 62.6 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) | |
| Char-LM-partial (Trinh and Le, 2018) | 57.9 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) | |
| USSM + Supervised DeepNet + KB (Liu et al., 2017) | 52.8 | [Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems](https://aaai.org/ocs/index.php/SSS/SSS17/paper/view/15392) | |

### Winograd NLI (WNLI)

WNLI is a relaxation of the Winograd Schema Challenge proposed as part of the [GLUE benchmark](https://arxiv.org/abs/1804.07461) and a conversion to the natural language inference (NLI) format. The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. While the training set is balanced between two classes (entailment and not entailment), the test set is imbalanced between them (35% entailment, 65% not entailment). The majority baseline is thus 65%, while for the Winograd Schema Challenge it is 50% ([Liu et al., 2017](https://www.aaai.org/ocs/index.php/SSS/SSS17/paper/view/15392)). The latter is more challenging.

Results are available at the [GLUE leaderboard](https://gluebenchmark.com/leaderboard). Here is a subset of results of recent models:

| Model           | Score  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | 90.4 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| MT-DNN-ensemble (Liu et al., 2019) | 89.0 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | 65.1 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) |

### Visual Common Sense

Visual Commonsense Reasoning (VCR) is a task and large-scale dataset for cognition-level visual understanding.
With one glance at an image, humans can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered
pancakes). While this is easy for humans, it is difficult for current vision systems, as it requires
higher-order cognition and commonsense reasoning about the world. In addition to answering challenging visual
questions expressed in natural language, a model must provide a
rationale explaining why its answer is true.

| Model | Q->A  | QA->R  | Q->AR  | Paper / Source | Code |
| ------ | :-------:| :-------: | :-------:| ------ |  ------ | 
| Human Performance University of Washington (Zellers et al. '18) | 91.0 | 93.0 | 85.0 | [From Recognition to Cognition: Visual Commonsense Reasoning](https://arxiv.org/abs/1811.10830) | | 
| Recognition to Cognition Networks University of Washington | 65.1 | 67.3 | 44.0 | [From Recognition to Cognition: Visual Commonsense Reasoning](https://arxiv.org/abs/1811.10830) |  https://github.com/rowanz/r2c |
| BERT-Base Google AI Language (experiment by Rowan) | 53.9 | 64.5 | 35.0 | | https://github.com/google-research/bert |
| MLB Seoul National University (experiment by Rowan) | 46.2 | 36.8 | 17.2 | | https://github.com/jnhwkim/MulLowBiVQA |
| Random Performance | 25.0 | 25.0 | 6.2 | | | 

### ReCoRD

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news article. The goal of ReCoRD is to evaluate a machine's ability to perform commonsense reasoning in reading comprehension. ReCoRD is pronounced as [ˈrɛkərd] and is part of the [SuperGLUE benchmark](https://arxiv.org/pdf/1905.00537.pdf).

| Model | EM  | F1  | Paper / Source | Code |
| ------ | ------- | ------- | ------ |  ------ | 
| Human Performance Johns Hopkins University (Zhang et al. '18) | 91.31 | 91.69 | [ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension](https://arxiv.org/pdf/1810.12885.pdf) | | 
| LUKE (Yamada et al., 2020) | 90.64 | 91.21 | [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://www.aclweb.org/anthology/2020.emnlp-main.523) | [Official](https://github.com/studio-ousia/luke) |
| RoBERTa (Facebook AI)  | 90.0 | 90.6 | [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf) | [Official](https://github.com/pytorch/fairseq/tree/master/examples/roberta) | 
| XLNet + MTL + Verifier (ensemble)  | 83.09 | 83.74 | | | 
| CSRLM (single model) | 81.78 | 82.58 | | |


================================================
FILE: english/constituency_parsing.md
================================================
# Constituency parsing

Constituency parsing aims to extract a constituency-based parse tree from a sentence that
represents its syntactic structure according to a [phrase structure grammar](https://en.wikipedia.org/wiki/Phrase_structure_grammar).

Example:

                 Sentence (S)
                     |
       +-------------+------------+
       |                          |
     Noun (N)                Verb Phrase (VP)
       |                          |
     John                 +-------+--------+
                          |                |
                        Verb (V)         Noun (N)
                          |                |
                        sees              Bill

[Recent approaches](https://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf)
convert the parse tree into a sequence following a depth-first traversal in order to
be able to apply sequence-to-sequence models to it. The linearized version of the
above parse tree looks as follows: (S (N) (VP V N)).
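
The depth-first linearization can be sketched as follows. The `Node` class and the exact bracketing convention are assumptions for this example; papers differ in whether words are kept and how preterminals are bracketed, so the output is close to, but not character-for-character the same as, the string shown above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a label, an optional word, and child nodes."""
    label: str
    word: Optional[str] = None  # set for leaves only
    children: List["Node"] = field(default_factory=list)


def linearize(node: Node, keep_words: bool = True) -> str:
    """Depth-first, left-to-right bracketing of a constituency tree."""
    if not node.children:
        if keep_words and node.word is not None:
            return f"({node.label} {node.word})"
        return f"({node.label})"
    inner = " ".join(linearize(child, keep_words) for child in node.children)
    return f"({node.label} {inner})"


tree = Node("S", children=[
    Node("N", word="John"),
    Node("VP", children=[Node("V", word="sees"), Node("N", word="Bill")]),
])

print(linearize(tree))                    # (S (N John) (VP (V sees) (N Bill)))
print(linearize(tree, keep_words=False))  # (S (N) (VP (V) (N)))
```

The resulting string is an ordinary token sequence, so it can serve as the target side of a sequence-to-sequence model.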

### Penn Treebank

The Wall Street Journal section of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) is used for
evaluating constituency parsers. Section 22 is used for development and Section 23 is used for evaluation.
Models are evaluated based on F1. Most of the below models incorporate external data or features.
For a comparison of single models trained only on WSJ, refer to [Kitaev and Klein (2018)](https://arxiv.org/abs/1805.01052).

| Model                                                                              | F1 score | Paper / Source                                                                                                                    | Code                                                  |
| ---------------------------------------------------------------------------------- | :------: | --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| Span Attention + XLNet (Tian et al., 2020) | 96.40 | [Improving Constituency Parsing with Span Attention](https://aclanthology.org/2020.findings-emnlp.153/) | [Official](https://github.com/cuhksz-nlp/SAPar) |
| Label Attention Layer + HPSG + XLNet (Mrini et al., 2020)                          |  96.38   | [Rethinking Self-Attention: Towards Interpretability for Neural Parsing](https://www.aclweb.org/anthology/2020.findings-emnlp.65.pdf) | [Official](https://github.com/KhalilMrini/LAL-Parser) |
| Attach-Juxtapose Parser + XLNet (Yang and Deng, 2020)                              |  96.34   | [Strongly Incremental Constituency Parsing with Graph Neural Networks](https://arxiv.org/abs/2010.14568) | [Official](https://github.com/princeton-vl/attach-juxtapose-parser) |
| Head-Driven Phrase Structure Grammar Parsing (Joint) + XLNet (Zhou and Zhao, 2019) |  96.33   | [Head-Driven Phrase Structure Grammar Parsing on Penn Treebank](https://arxiv.org/pdf/1907.02684.pdf)                             |                                                       |
| Head-Driven Phrase Structure Grammar Parsing (Joint) + BERT (Zhou and Zhao, 2019)  |  95.84   | [Head-Driven Phrase Structure Grammar Parsing on Penn Treebank](https://arxiv.org/pdf/1907.02684.pdf)                             |                                                       |
| CRF Parser + BERT (Zhang et al., 2020)                                             |  95.69   | [Fast and Accurate Neural CRF Constituency Parsing](https://www.ijcai.org/Proceedings/2020/560)                                   | [Official](https://github.com/yzhangcs/crfpar)        |
| Self-attentive encoder + ELMo (Kitaev and Klein, 2018)                             |  95.13   | [Constituency Parsing with a Self-Attentive Encoder](https://arxiv.org/abs/1805.01052)                                            | [Official](https://github.com/nikitakit/self-attentive-parser) |
| Model combination (Fried et al., 2017)                                             |  94.66   | [Improving Neural Parsing by Disentangling Model Combination and Reranking Effects](https://arxiv.org/abs/1707.03058)             |                                                       |
| LSTM Encoder-Decoder + LSTM-LM (Takase et al., 2018)                               |  94.47   | [Direct Output Connection for a High-Rank Language Model](http://aclweb.org/anthology/D18-1489)                                   |                                                       |
| LSTM Encoder-Decoder + LSTM-LM (Suzuki et al., 2018)                               |  94.32   | [An Empirical Study of Building a Strong Baseline for Constituency Parsing](http://aclweb.org/anthology/P18-2097)                 |                                                       |
| In-order (Liu and Zhang, 2017)                                                     |   94.2   | [In-Order Transition-based Constituent Parsing](http://aclweb.org/anthology/Q17-1029)                                             |                                                       |
| CRF Parser (Zhang et al., 2020)                                                    |  94.12   | [Fast and Accurate Neural CRF Constituency Parsing](https://www.ijcai.org/Proceedings/2020/560)                                   | [Official](https://github.com/yzhangcs/crfpar)        |
| Semi-supervised LSTM-LM (Choe and Charniak, 2016)                                  |   93.8   | [Parsing as Language Modeling](http://www.aclweb.org/anthology/D16-1257)                                                          |                                                       |
| Stack-only RNNG (Kuncoro et al., 2017)                                             |   93.6   | [What Do Recurrent Neural Network Grammars Learn About Syntax?](https://arxiv.org/abs/1611.05774)                                 |                                                       |
| RNN Grammar (Dyer et al., 2016)                                                    |   93.3   | [Recurrent Neural Network Grammars](https://www.aclweb.org/anthology/N16-1024)                                                    |                                                       |
| Transformer (Vaswani et al., 2017)                                                 |   92.7   | [Attention Is All You Need](https://arxiv.org/abs/1706.03762)                                                                     |                                                       |
| Combining Constituent Parsers (Fossum and Knight, 2009)                            |   92.4   | [Combining constituent parsers via parse selection or parse hybridization](https://dl.acm.org/citation.cfm?id=1620923)            |                                                       |
| Semi-supervised LSTM (Vinyals et al., 2015)                                        |   92.1   | [Grammar as a Foreign Language](https://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf)                              |                                                       |
| Self-trained parser (McClosky et al., 2006)                                        |   92.1   | [Effective Self-Training for Parsing](https://pdfs.semanticscholar.org/6f0f/64f0dab74295e5eb139c160ed79ff262558a.pdf)             |                                                       |

[Go back to the README](../README.md)


================================================
FILE: english/coreference_resolution.md
================================================
# Coreference resolution

Coreference resolution is the task of clustering mentions in text that refer to the same underlying real world entities.

Example:

```
               +-----------+
               |           |
"I voted for Obama because he was most aligned with my values", she said.
 |                                                 |            |
 +-------------------------------------------------+------------+
```

"I", "my", and "she" belong to the same cluster and "Obama" and "he" belong to the same cluster.

### CoNLL 2012

Experiments are conducted on the data of the [CoNLL-2012 shared task](http://www.aclweb.org/anthology/W12-4501), which
uses OntoNotes coreference annotations. Papers
report the precision, recall, and F1 of the MUC, B3, and CEAFφ4 metrics using the official
CoNLL-2012 evaluation scripts. The main evaluation metric is the average F1 of the three metrics.
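
For clarity, the main metric is a plain unweighted mean of the three F1 values. The function name and the scores below are made up for illustration; actual numbers come from the official CoNLL-2012 scorer.

```python
def conll_average_f1(muc_f1: float, b_cubed_f1: float, ceaf_phi4_f1: float) -> float:
    """Unweighted mean of the MUC, B3, and CEAF-phi4 F1 scores."""
    return (muc_f1 + b_cubed_f1 + ceaf_phi4_f1) / 3.0


# Illustrative values only, not results from any paper.
avg_f1 = conll_average_f1(80.0, 70.0, 66.0)
print(avg_f1)  # 72.0
```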

| Model           | Avg F1 |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| wl-coref + RoBERTa | 81.0 | [Word-Level Coreference Resolution](https://arxiv.org/abs/2109.04127) | [Official](https://github.com/vdobrovolskii/wl-coref) |
| s2e+Longformer-Large | 80.3 | [Coreference Resolution without Span Representations](https://arxiv.org/abs/2101.00434) | [Official](https://github.com/yuvalkirstain/s2e-coref) |
| Xu et al. (2020) | 80.2 | [Revealing the Myth of Higher-Order Inference in Coreference Resolution](https://arxiv.org/abs/2009.12013) |[Official](https://github.com/emorynlp/coref-hoi) |
| Joshi et al. (2019)<sup>[1](#myfootnote1)</sup> | 79.6 | [SpanBERT: Improving Pre-training by Representing and Predicting Spans](https://arxiv.org/pdf/1907.10529) |[Official](https://github.com/facebookresearch/SpanBERT) |
| Joshi et al. (2019)<sup>[2](#myfootnote2)</sup> | 76.9 | [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091) | [Official](https://github.com/mandarjoshi90/coref) |
| Kantor and Globerson (2019) | 76.6 | [Coreference Resolution with Entity Equalization](https://www.aclweb.org/anthology/P19-1066/) | [Official](https://github.com/kkjawz/coref-ee) |
| Fei et al. (2019) | 73.8 | [End-to-end Deep Reinforcement Learning Based Coreference Resolution](https://www.aclweb.org/anthology/P19-1064/) | |
| (Lee et al., 2017)+ELMo (Peters et al., 2018)+coarse-to-fine & second-order inference (Lee et al., 2018) | 73.0 | [Higher-order Coreference Resolution with Coarse-to-fine Inference](http://aclweb.org/anthology/N18-2108) | [Official](https://github.com/kentonl/e2e-coref) |
| (Lee et al., 2017)+ELMo (Peters et al., 2018) | 70.4 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | |
| Lee et al. (2017) | 67.2 | [End-to-end Neural Coreference Resolution](https://arxiv.org/abs/1707.07045) | |

<a name="myfootnote1">[1]</a> Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+SpanBERT (Joshi et al., 2019)

<a name="myfootnote2">[2]</a> Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+BERT (Devlin et al., 2019)

### Gendered Ambiguous Pronoun Resolution

Experiments are conducted on the [GAP dataset](https://github.com/google-research-datasets/gap-coreference).
The metrics are F1 on masculine (M) and feminine (F) examples, overall F1, and a bias factor calculated as F / M.

| Model           | Overall F1 | Masculine F1 (M) | Feminine F1 (F) | Bias (F/M) | Paper / Source | Code |
| ------------- | :-----:| :-----:| :-----:| :-----:| --- | --- |
| Attree et al. (2019) | 92.5 | 94.0 | 91.1 | 0.97 | [Gendered Ambiguous Pronouns Shared Task: Boosting Model Confidence by Evidence Pooling](https://arxiv.org/abs/1906.00839) | [GREP](https://github.com/sattree/gap) |
| Chada et al. (2019) | 90.2 | 90.9 | 89.5 | 0.98 | [Gendered Pronoun Resolution using BERT and an extractive question answering formulation](https://arxiv.org/abs/1906.03695) | [CorefQA](https://github.com/rakeshchada/corefqa) |
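
As a quick check, the bias column can be recomputed from the masculine and feminine F1 scores in the table above. The function name is an assumption for this sketch; the input numbers are copied from the table.

```python
def gap_bias(feminine_f1: float, masculine_f1: float) -> float:
    """Bias factor F / M; values below 1 mean lower F1 on feminine examples."""
    if masculine_f1 <= 0:
        raise ValueError("masculine F1 must be positive")
    return feminine_f1 / masculine_f1


# (masculine F1, feminine F1) pairs taken from the table above.
rows = {
    "Attree et al. (2019)": (94.0, 91.1),
    "Chada et al. (2019)": (90.9, 89.5),
}
for name, (masculine, feminine) in rows.items():
    print(f"{name}: {gap_bias(feminine, masculine):.2f}")
```

Rounded to two decimals, this gives 0.97 and 0.98, matching the reported bias values.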


[Go back to the README](../README.md)


================================================
FILE: english/data_to_text_generation.md
================================================
# Data-to-Text Generation

**Data-to-Text Generation (D2T NLG)** can be described as Natural Language Generation from structured input.
<!-- is a task of NLG where the **textual output** is generated using **structured input** (such as tables or graphs). -->
Unlike other NLG tasks such as Machine Translation or Question Answering (also referred to as **Text-to-Text Generation or T2T NLG**), where the requirement is to generate textual output from unstructured textual input, in D2T NLG the requirement is to generate textual output from input provided in a structured format such as tables, knowledge graphs, or JSON <sup>[[1]](#myfootnote1)</sup>.

## RotoWire
The [dataset](https://github.com/harvardnlp/boxscore-data/blob/master/rotowire.tar.bz2) consists of articles summarizing NBA basketball games, paired with their corresponding box- and line-score tables. The summaries are professionally written, medium-length game recaps targeted at fantasy basketball fans. The writing is colloquial but structured, and targets an audience primarily interested in game statistics <sup>[[2]](#myfootnote2)</sup>.

The performance is evaluated on two different automated metrics: first, **BLEU score**; and second, a family of **Extractive Evaluations (EE)**. EE contains three different submetrics evaluating three different aspects of the generation:

1. **Content Selection (CS)**: precision (P%) and recall (R%) of unique relations extracted from the generated text that are also extracted from the gold text. This measures how well the generated document matches the gold document in terms of selecting which records to generate.

2. **Relation Generation (RG)**: precision (P%) and number of unique relations (#) extracted from the generated text that also appear in the structured input. This measures how well the system is able to generate text containing factual (i.e., correct) records.

3. **Content Ordering (CO)**: normalized Damerau-Levenshtein Distance (DLD%) between the sequence of records extracted from the gold text and the sequence extracted from the generated text. This measures how well the system orders the records it chooses to discuss.
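
The Content Selection and Content Ordering metrics above can be sketched as follows, assuming the relations have already been extracted from both texts. The `Record` type, the toy records, the use of the restricted (optimal string alignment) variant of Damerau-Levenshtein, and the normalization as one minus distance over the longer sequence length are all assumptions for illustration; the published numbers come from the extraction pipeline of Wiseman et al. (2017).

```python
from typing import Hashable, Sequence, Set, Tuple

Record = Tuple[str, str, str]  # hypothetical (entity, record type, value)


def content_selection(generated: Set[Record], gold: Set[Record]) -> Tuple[float, float]:
    """Precision and recall of generated relations against gold relations."""
    overlap = len(generated & gold)
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


def damerau_levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Edit distance with insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # swap
    return dist[-1][-1]


def content_ordering(generated: Sequence[Record], gold: Sequence[Record]) -> float:
    """Similarity in [0, 1]: one minus the length-normalized distance."""
    longest = max(len(generated), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - damerau_levenshtein(generated, gold) / longest


# Toy records invented for illustration.
a = ("Player A", "PTS", "20")
b = ("Player A", "REB", "10")
c = ("Player B", "PTS", "15")
d = ("Team X", "WINS", "30")
gold_seq = [a, b, c, d]
generated_seq = [a, c, b]  # two records swapped, one omitted

cs_precision, cs_recall = content_selection(set(generated_seq), set(gold_seq))
co_score = content_ordering(generated_seq, gold_seq)
print(cs_precision, cs_recall, co_score)
```

Here every generated record is also in the gold text (precision 1.0), one gold record is missing (recall 0.75), and the swap plus the omission cost two edits over four positions (ordering score 0.5).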

| Model           | BLEU | CS (P% & R%) | RG (P% & #) | CO (DLD%) |  Paper / Source | Code |
| ------------- | :-----: | :-----: | :-----: | :-----:| --- | --- |
| Rebuffel, Clément, et al. (2020)<sup>[[4]](#myfootnote4)</sup> | 17.50 | 39.47 & 51.64 | 89.46 & 21.17 | 18.90 | [A Hierarchical Model for Data-to-Text Generation](https://link.springer.com/chapter/10.1007/978-3-030-45439-5_5) |[Official](https://github.com/KaijuML/data-to-text-hierarchical) |
| Puduppully et al. (2019)<sup>[[3]](#myfootnote3)</sup> | 16.50 | 34.18 & 51.22 | 87.47 & 34.28 | 18.58 | [Data-to-text generation with content selection and planning](https://www.aaai.org/ojs/index.php/AAAI/article/view/4668) |[Official](https://github.com/ratishsp/data2text-plan-py) |
| Puduppully and Lapata (2021)<sup>[[10]](#myfootnote10)</sup> | 15.46 | 34.1 & 57.8 |  97.6 & 42.1 | 17.7 | [Data-to-text generation with macro planning](https://doi.org/10.1162/tacl_a_00381) |[Official](https://github.com/ratishsp/data2text-macro-plan-py) |
| Wiseman et al. (2017)<sup>[[2]](#myfootnote2)</sup> | 14.49 | 22.17 & 27.16 | 71.82 & 12.82 | 8.68 | [Challenges in Data-to-Document Generation](https://www.aclweb.org/anthology/D17-1239.pdf) |[Official](https://github.com/harvardnlp/data2text) |

## WebNLG
The [WebNLG challenge](https://webnlg-challenge.loria.fr/) consists of mapping data to text. The training data consists of data/text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalisation of these triples. For example, given the three DBpedia triples (as shown in [a]), the aim is to generate a text (as shown in [b]):

* **[a]**. (John_E_Blaha birthDate 1942_08_26) (John_E_Blaha birthPlace San_Antonio) (John_E_Blaha occupation Fighter_pilot)

* **[b]**. John E Blaha, born in San Antonio on 1942-08-26, worked as a fighter pilot.
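
As a rough sketch of how such input might be prepared for a sequence model, the snippet below parses the parenthesised notation used in [a] into tuples and flattens them into a single string. The parenthesised format is only the notation of this page, and the `<S>`, `<P>`, `<O>` markers are hypothetical; actual systems use their own input encodings.

```python
import re


def parse_triples(text):
    """Split '(subj pred obj) (subj pred obj)' into (subject, predicate, object) tuples."""
    triples = []
    for body in re.findall(r"\(([^()]*)\)", text):
        subject, predicate, obj = body.split(maxsplit=2)
        triples.append((subject, predicate, obj))
    return triples


def linearize(triples):
    """Flatten triples into one string; the <S>/<P>/<O> markers are hypothetical."""
    parts = []
    for subject, predicate, obj in triples:
        parts.append(
            f"<S> {subject.replace('_', ' ')} <P> {predicate} <O> {obj.replace('_', ' ')}"
        )
    return " ".join(parts)


example = (
    "(John_E_Blaha birthDate 1942_08_26) "
    "(John_E_Blaha birthPlace San_Antonio) "
    "(John_E_Blaha occupation Fighter_pilot)"
)
triples = parse_triples(example)
model_input = linearize(triples)
```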

The performance is evaluated on the basis of **BLEU, METEOR and TER scores**. The data from WebNLG Challenge 2017 can be downloaded [here](https://gitlab.com/shimorina/webnlg-dataset).

| Model           | BLEU | METEOR | TER |  Paper / Source | Code |
| ------------- | :-----: | :-----: | :-----: | --- | --- |
| Kale, Mihir. (2020) <sup>[[9]](#myfootnote9)</sup> | 57.1 | 0.44 |  | [Text-to-Text Pre-Training for Data-to-Text Tasks](https://arxiv.org/pdf/2005.10433v2.pdf) |  |
| Moryossef et al. (2019) <sup>[[5]](#myfootnote5)</sup> | 47.4 | 0.391 | 0.631 | [Step-by-Step: Separating Planning from Realization in Neural Data-to-Text Generation](https://www.aclweb.org/anthology/N19-1236.pdf) | [Official](https://github.com/AmitMY/chimera) |
| Baseline | 33.24 | 0.235436 | 0.613080 | [Baseline system provided during the challenge](https://webnlg-challenge.loria.fr/challenge_2017/#webnlg-baseline-system) |[Official](https://gitlab.com/webnlg/webnlg-baseline) |

**P.S.**: The **test dataset** of WebNLG consists of a **total of 15 categories**, of which 10 (**seen**) categories are used for training while 5 (**unseen**) are not. The results reported here are those obtained on the overall test data, i.e., all 15 categories.

## Meaning Representations

The dataset was first provided for the [E2E Challenge](http://www.macs.hw.ac.uk/InteractionLab/E2E/) in 2017. It is a crowd-sourced data set of 50k instances in the restaurant domain. Each instance consists of a dialogue act-based meaning representation (MR) and up to 5 references in natural language (NL). For example:

* **MR**: name[The Eagle], eatType[coffee shop], food[French], priceRange[moderate], customerRating[3/5], area[riverside], kidsFriendly[yes], near[Burger King]

* **NL**: “The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King.”
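
The MR string has a regular `slot[value]` structure, so it can be read into a dictionary with a short regular expression. The following is a minimal sketch that assumes values never contain square brackets, which holds for the example above; the function name `parse_mr` is illustrative and not part of any released tooling.

```python
import re


def parse_mr(mr):
    """Turn 'slot[value], slot[value]' into a dict of slot -> value, in input order."""
    pairs = re.findall(r"([^,\[\]]+)\[([^\]]*)\]", mr)
    return {slot.strip(): value for slot, value in pairs}


mr = (
    "name[The Eagle], eatType[coffee shop], food[French], priceRange[moderate], "
    "customerRating[3/5], area[riverside], kidsFriendly[yes], near[Burger King]"
)
slots = parse_mr(mr)
```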

The performance is evaluated using **BLEU, NIST, METEOR, ROUGE-L, CIDEr scores**. The data from E2E Challenge 2017 can be downloaded [here](https://github.com/tuetschek/e2e-dataset/releases/download/v1.0.0/e2e-dataset.zip).

| Model           | BLEU | NIST | METEOR | ROUGE-L | CIDEr |  Paper / Source | Code |
| ------------- | :-----: | :-----: |:-----: |:-----: | :-----: | --- | --- |
| Shen, Sheng, et al. (2019) <sup>[[7]](#myfootnote7)</sup> | 68.60 | 8.73 | 45.25 | 70.82 | 2.37 | [Pragmatically Informative Text Generation](https://www.aclweb.org/anthology/N19-1410.pdf) |[Official](https://github.com/sIncerass/prag_generation) |
| Elder, Henry, et al. (2019) <sup>[[8]](#myfootnote8)</sup> | 67.38 | 8.7277 | 45.72 | 71.52 | 2.2995 | [Designing a Symbolic Intermediate Representation for Neural Surface Realization](https://www.aclweb.org/anthology/W19-2308.pdf) | |
| Gehrmann, Sebastian, et al. (2018) <sup>[[6]](#myfootnote6)</sup> | 66.2 | 8.60 | 45.7 | 70.4 | 2.34 | [End-to-End Content and Plan Selection for Data-to-Text Generation](https://www.aclweb.org/anthology/W18-6505.pdf) |[Official](https://github.com/sebastianGehrmann/diverse_ensembling) |
| Baseline | 65.93 | 8.61 | 44.83 | 68.50 | 2.23 | [Baseline system provided during the challenge](http://www.macs.hw.ac.uk/InteractionLab/E2E/#baseline) |[Official](https://github.com/UFAL-DSG/tgen/tree/master/e2e-challenge) |

<!-- ## WikiBio  -->

## References
<a name="myfootnote1">[1]</a> Albert Gatt and Emiel Krahmer. 2018. [Survey of the state of the art in natural language generation: core tasks, applications and evaluation](https://www.jair.org/index.php/jair/article/download/11173/26378/). J. Artif. Int. Res. 61, 1 (January 2018), 65–170.

<a name="myfootnote2">[2]</a> Wiseman, Sam, Stuart M. Shieber, and Alexander M. Rush. "[Challenges in Data-to-Document Generation](https://www.aclweb.org/anthology/D17-1239.pdf)." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.

<a name="myfootnote3">[3]</a> Puduppully, Ratish, Li Dong, and Mirella Lapata. "[Data-to-text generation with content selection and planning](https://www.aaai.org/ojs/index.php/AAAI/article/view/4668)." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.

<a name="myfootnote4">[4]</a> Rebuffel, Clément, et al. "[A Hierarchical Model for Data-to-Text Generation](https://link.springer.com/chapter/10.1007/978-3-030-45439-5_5)." European Conference on Information Retrieval. Springer, Cham, 2020.

<a name="myfootnote5">[5]</a> Moryossef, Amit, Yoav Goldberg, and Ido Dagan. "[Step-by-Step: Separating Planning from Realization in Neural Data-to-Text Generation](https://www.aclweb.org/anthology/N19-1236.pdf)." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

<a name="myfootnote6">[6]</a> Gehrmann, Sebastian, et al. "[End-to-End Content and Plan Selection for Data-to-Text Generation](https://www.aclweb.org/anthology/W18-6505.pdf)." Proceedings of the 11th International Conference on Natural Language Generation. 2018.

<a name="myfootnote7">[7]</a> Shen, Sheng, et al. "[Pragmatically Informative Text Generation](https://www.aclweb.org/anthology/N19-1410.pdf)." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

<a name="myfootnote8">[8]</a> Elder, Henry, et al. "[Designing a Symbolic Intermediate Representation for Neural Surface Realization](https://www.aclweb.org/anthology/W19-2308.pdf)." Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. 2019.

<a name="myfootnote9">[9]</a> Kale, Mihir. "[Text-to-Text Pre-Training for Data-to-Text Tasks](https://arxiv.org/pdf/2005.10433v2.pdf)" arXiv preprint arXiv:2005.10433 (2020).

<a name="myfootnote10">[10]</a> Puduppully, Ratish and Mirella Lapata. "[Data-to-text generation with macro planning](https://doi.org/10.1162/tacl_a_00381)." Transactions of the Association for Computational Linguistics 2021; 9 510–527.

[Go back to the README](../README.md)


================================================
FILE: english/dependency_parsing.md
================================================
# Dependency parsing

Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical
structure and defines the relationships between "head" words and the words that modify those heads.

Example:

```
     root
      |
      | +-------dobj---------+
      | |                    |
nsubj | |   +------det-----+ | +-----nmod------+
+--+  | |   |              | | |               |
|  |  | |   |      +-nmod-+| | |      +-case-+ |
+  |  + |   +      +      || + |      +      | |
I  prefer  the  morning   flight  through  Denver
```

Relations among the words are illustrated above the sentence with directed, labeled
arcs from heads to dependents (+ indicates the dependent).
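
One common way to store such a parse is as a list of tokens, each with the index of its head and the relation label. The sketch below encodes the example sentence this way, with head indices read off the diagram above; the 1-based indexing with `0` for the artificial root is a convention chosen for illustration.

```python
# (word, head index, relation); indices are 1-based and 0 marks the artificial root.
PARSE = [
    ("I", 2, "nsubj"),
    ("prefer", 0, "root"),
    ("the", 5, "det"),
    ("morning", 5, "nmod"),
    ("flight", 2, "dobj"),
    ("through", 7, "case"),
    ("Denver", 5, "nmod"),
]


def dependents(parse, head_word):
    """Return the words whose head is the first occurrence of head_word."""
    head_index = [word for word, _, _ in parse].index(head_word) + 1
    return [word for word, head, _ in parse if head == head_index]


print(dependents(PARSE, "flight"))  # ['the', 'morning', 'Denver']
print(dependents(PARSE, "prefer"))  # ['I', 'flight']
```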

### Penn Treebank

Models are evaluated on the [Stanford Dependency](https://nlp.stanford.edu/software/dependencies_manual.pdf)
conversion (**v3.3.0**) of the Penn Treebank with __predicted__ POS-tags. Punctuation symbols
are excluded from the evaluation. Evaluation metrics are unlabeled attachment score (UAS) and labeled attachment score (LAS). UAS does not consider the semantic relation (e.g. Subj) used to label the attachment between the head and the child, while LAS requires a semantically correct label for each attachment. Here, we also mention the predicted POS tagging accuracy.
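
As a minimal sketch of how the two scores relate, the function below counts a token as correct for UAS when its predicted head matches, and for LAS when both head and label match. Punctuation is skipped based on the gold POS tag; the particular set of punctuation tags and the toy sentence are assumptions made for illustration, not the official evaluation script.

```python
# Assumed set of Penn Treebank punctuation tags to exclude from scoring.
PUNCT_TAGS = frozenset({"``", "''", ",", ".", ":"})


def attachment_scores(gold, predicted, punct_tags=PUNCT_TAGS):
    """Return (UAS, LAS) over non-punctuation tokens.

    gold:      list of (head, label, pos_tag) per token
    predicted: list of (head, label) per token
    """
    total = head_correct = both_correct = 0
    for (gold_head, gold_label, pos), (pred_head, pred_label) in zip(gold, predicted):
        if pos in punct_tags:
            continue  # punctuation is excluded from the evaluation
        total += 1
        if gold_head == pred_head:
            head_correct += 1
            if gold_label == pred_label:
                both_correct += 1
    if total == 0:
        return 0.0, 0.0
    return head_correct / total, both_correct / total


# Toy five-token sentence; the last token is punctuation.
gold = [(2, "nsubj", "PRP"), (0, "root", "VBP"), (4, "det", "DT"),
        (2, "dobj", "NN"), (2, "punct", ".")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod"), (4, "punct")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

LAS can never exceed UAS under this definition, since a correct label only counts when the head is also correct.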

| Model                                                                        |  POS  |  UAS  |  LAS  | Paper / Source                                                                                                                    | Code                                                                           |
| ---------------------------------------------------------------------------- | :---: | :---: | :---: | --------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| Label Attention Layer + HPSG + XLNet (Mrini et al., 2019)                    | 97.3  | 97.42 | 96.26 | [Rethinking Self-Attention: Towards Interpretability for Neural Parsing](https://khalilmrini.github.io/Label_Attention_Layer.pdf) | [Official](https://github.com/KhalilMrini/LAL-Parser)                          |
| Pre-training + XLNet (Tian et al. 2022) | - | 97.30 | 95.92 | [Enhancing Structure-aware Encoder with Extremely Limited Data for Graph-based Dependency Parsing](https://aclanthology.org/2022.coling-1.483/) | [Official](https://github.com/synlp/DMPar) |
| ACE + fine-tune (Wang et al., 2020) | - | 97.20 | 95.80 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| HPSG Parser (Joint) + XLNet (Zhou et al, 2020)                            | 97.3  | 97.20 | 95.72 | [Head-Driven Phrase Structure Grammar Parsing on Penn Treebank](https://www.aclweb.org/anthology/2020.findings-emnlp.398.pdf)                        | [Official](https://github.com/DoodleJZ/HPSG-Neural-Parser)                     |
| Second-Order MFVI + BERT (Wang et al., 2020) | - | 96.91 | 95.34 | [Second-Order Neural Dependency Parsing with Message Passing and End-to-End Training](https://arxiv.org/pdf/2010.05003.pdf) | [Official](https://github.com/wangxinyu0922/Second_Order_Parsing)|
| CVT + Multi-Task (Clark et al., 2018)                                        | 97.74 | 96.61 | 95.02 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370)                                    | [Official](https://github.com/tensorflow/models/tree/master/research/cvt_text) |
| CRF Parser (Zhang et al., 2020)                                              |   -   | 96.14 | 94.49 | [Efficient Second-Order TreeCRF for Neural Dependency Parsing](https://www.aclweb.org/anthology/2020.acl-main.302)                | [Official](https://github.com/yzhangcs/crfpar)                                 |
| Second-Order MFVI (Wang et al., 2020) | - | 96.12 | 94.47 | [Second-Order Neural Dependency Parsing with Message Passing and End-to-End Training](https://arxiv.org/pdf/2010.05003.pdf) | [Official](https://github.com/wangxinyu0922/Second_Order_Parsing)|
| Left-to-Right Pointer Network (Fernández-González and Gómez-Rodríguez, 2019) | 97.3  | 96.04 | 94.43 | [Left-to-Right Dependency Parsing with Pointer Networks](https://www.aclweb.org/anthology/N19-1076)                               | [Official](https://github.com/danifg/Left2Right-Pointer-Parser)                |
| Graph-based parser with GNNs (Ji et al., 2019)                               | 97.3  | 95.97 | 94.31 | [Graph-based Dependency Parsing with Graph Neural Networks](https://www.aclweb.org/anthology/P19-1237)                            |                                                                                |
| Deep Biaffine (Dozat and Manning, 2017)                                      | 97.3  | 95.74 | 94.08 | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)                                         | [Official](https://github.com/tdozat/Parser-v1)                                |
| jPTDP (Nguyen and Verspoor, 2018)                                            | 97.97 | 94.51 | 92.87 | [An improved neural network model for joint POS tagging and dependency parsing](https://arxiv.org/abs/1807.03955)                 | [Official](https://github.com/datquocnguyen/jPTDP)                             |
| Andor et al. (2016)                                                          | 97.44 | 94.61 | 92.79 | [Globally Normalized Transition-Based Neural Networks](https://www.aclweb.org/anthology/P16-1231)                                 |                                                                                |
| Distilled neural FOG (Kuncoro et al., 2016)                                  | 97.3  | 94.26 | 92.06 | [Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser](https://arxiv.org/abs/1609.07561)                       |                                                                                |
| Distilled transition-based parser (Liu et al., 2018)                         | 97.3  | 94.05 | 92.14 | [Distilling Knowledge for Search-based Structured Prediction](http://aclweb.org/anthology/P18-1129)                               | [Official](https://github.com/Oneplus/twpipe)                                  |
| Weiss et al. (2015)                                                          | 97.44 | 93.99 | 92.05 | [Structured Training for Neural Network Transition-Based Parsing](http://anthology.aclweb.org/P/P15/P15-1032.pdf)                 |                                                                                |
| BIST transition-based parser (Kiperwasser and Goldberg, 2016)                | 97.3  | 93.9  | 91.9  | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023)  | [Official](https://github.com/elikip/bist-parser/tree/master/barchybrid/src)   |
| Arc-hybrid (Ballesteros et al., 2016)                                        | 97.3  | 93.56 | 91.42 | [Training with Exploration Improves a Greedy Stack-LSTM Parser](https://arxiv.org/abs/1603.03793)                                 |                                                                                |
| BIST graph-based parser (Kiperwasser and Goldberg, 2016)                     | 97.3  | 93.1  | 91.0  | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023)  | [Official](https://github.com/elikip/bist-parser/tree/master/bmstparser/src)   |

### Universal Dependencies

The focus of the task is learning syntactic dependency parsers that can work in a real-world setting, starting from raw text, and that can work over many typologically different languages, even low-resource languages for which there is little or no training data, by exploiting a common syntactic annotation standard. This task has been made possible by the Universal Dependencies initiative (UD, http://universaldependencies.org), which has developed treebanks for 60+ languages with cross-linguistically consistent annotation and recoverability of the original raw texts.

Participating systems had to find labeled syntactic dependencies between words, i.e. a syntactic head for each word, and a label classifying the type of the dependency relation. In addition to syntactic dependencies, prediction of morphology and lemmatization was evaluated. There were multiple test sets in various languages, but all data sets adhered to the common annotation style of UD. Participants were asked to parse raw text where no gold-standard pre-processing (tokenization, lemmas, morphology) was available. Data preprocessed by a baseline system (UDPipe, https://ufal.mff.cuni.cz/udpipe) was provided so that the participants could focus on improving just one part of the processing pipeline. The organizers believed that this made the task reasonably accessible for everyone.

| Model                     |  LAS  | MLAS  | BLEX  | Paper / Source                                                                                                                                              | Code                                                                 |
| ------------------------- | :---: | :---: | :---: | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| Stanford (Qi et al.)      | 74.16 | 62.08 | 65.28 | [Universal Dependency Parsing from Scratch](https://arxiv.org/pdf/1901.10457.pdf)                                                                           | [Official](https://github.com/stanfordnlp/stanfordnlp)               |
| UDPipe Future (Straka)    | 73.11 | 61.25 | 64.49 | [UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task](https://www.aclweb.org/anthology/K18-2020)                                                              | [Official](https://github.com/CoNLL-UD-2018/UDPipe-Future)           |
| HIT-SCIR (Che et al.)     | 75.84 | 59.78 | 65.33 | [Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation](https://arxiv.org/abs/1807.03121)                    |                                                                      |
| TurkuNLP (Kanerva et al.) | 73.28 | 60.99 | 66.09 | [Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/proceedings/pdf/K18-2013.pdf) | [Official](https://github.com/TurkuNLP/Turku-neural-parser-pipeline) |

The following results are just for reference:

| Model                                                                  |  UAS  |  LAS  | Note                           | Paper / Source                                                                                    |
| ---------------------------------------------------------------------- | :---: | :---: | ------------------------------ | ------------------------------------------------------------------------------------------------- |
| Stack-only RNNG (Kuncoro et al., 2017)                                 | 95.8  | 94.6  | Constituent parser             | [What Do Recurrent Neural Network Grammars Learn About Syntax?](https://arxiv.org/abs/1611.05774) |
| Deep Biaffine (Dozat and Manning, 2017)                                | 95.75 | 94.22 | Stanford conversion **v3.5.0** | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)         |
| Semi-supervised LSTM-LM (Choe and Charniak, 2016)                      | 95.9  | 94.1  | Constituent parser             | [Parsing as Language Modeling](http://www.aclweb.org/anthology/D16-1257)                          |

# Cross-lingual zero-shot dependency parsing

Cross-lingual zero-shot parsing is the task of inferring the dependency parse of sentences from one language without any labeled training trees for that language.

## Universal Dependency Treebank

Models are evaluated against the [Universal Dependency Treebank v2.0](https://github.com/ryanmcd/uni-dep-tb). For each of the 6 target languages, models can use the trees of all other languages and English and are evaluated by the UAS and LAS on the target. The final score is the average score across the 6 target languages. The most common evaluation setup is to use
gold POS-tags.

| Model                                      |  UAS  |  LAS  | Paper / Source                                                                                                                               | Code                                                          |
| ------------------------------------------ | :---: | :---: | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
| XLM-R + SubDP (Shi et al., 2022) | --- | 79.6* | [Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing](https://aclanthology.org/2022.acl-long.452/) | [Official](https://aclanthology.org/attachments/2022.acl-long.452.software.zip) |
| Cross-Lingual ELMo (Schuster et al., 2019) | 84.2  | 77.3  | [Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing](https://arxiv.org/abs/1902.09492) | [Official](https://github.com/TalSchuster/CrossLingualELMo)   |
| MALOPA (Ammar et al., 2016)                |       | 70.5  | [Many Languages, One Parser](https://www.transacl.org/ojs/index.php/tacl/article/view/892)                                                   | [Official](https://github.com/clab/language-universal-parser) |
| Guo et al. (2016)                          | 76.7  | 69.9  | [A representation learning framework for multi-source transfer parsing](https://dl.acm.org/citation.cfm?id=3016100.3016284)                  |                                                               |

*: Evaluated on four target languages.

# Unsupervised dependency parsing

Unsupervised dependency parsing is the task of inferring the dependency parse of sentences without any labeled training data.

## Penn Treebank

As with supervised parsing, models are evaluated against the Penn Treebank. The most common evaluation setup is to use
gold POS-tags as input and to evaluate systems using the unlabeled attachment score (also called 'directed dependency
accuracy').

| Model                                                |  UAS  | Paper / Source                                                                                                                                        |
| ---------------------------------------------------- | :---: | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| Iterative reranking (Le & Zuidema, 2015)             | 66.2  | [Unsupervised Dependency Parsing - Let’s Use Supervised Parsers](http://www.aclweb.org/anthology/N15-1067)                                            |
| Combined System (Spitkovsky et al., 2013)            | 64.4  | [Breaking Out of Local Optima with Count Transforms and Model Recombination - A Study in Grammar Induction](http://www.aclweb.org/anthology/D13-1204) |
| Tree Substitution Grammar DMV (Blunsom & Cohn, 2010) | 55.7  | [Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing](http://www.aclweb.org/anthology/D10-1117)                               |
| Shared Logistic Normal DMV (Cohen & Smith, 2009)     | 41.4  | [Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction](http://www.aclweb.org/anthology/N09-1009)           |
| DMV (Klein & Manning, 2004)                          | 35.9  | [Corpus-Based Induction of Syntactic Structure - Models of Dependency and Constituency](http://www.aclweb.org/anthology/P04-1061)                     |

[Go back to the README](../README.md)


================================================
FILE: english/dialogue.md
================================================
# Dialogue

Dialogue is notoriously hard to evaluate. Past approaches have used human evaluation.

## Dialogue act classification

Dialogue act classification is the task of classifying an utterance with respect to the function it serves in a dialogue, i.e. the act the speaker is performing. Dialogue acts are a type of speech act (for Speech Act Theory, see [Austin (1975)](http://www.hup.harvard.edu/catalog.php?isbn=9780674411524) and [Searle (1969)](https://www.cambridge.org/core/books/speech-acts/D2D7B03E472C8A390ED60B86E08640E7)).

### Switchboard corpus
The [Switchboard-1 corpus](https://catalog.ldc.upenn.edu/ldc97s62) is a telephone speech corpus, consisting of about 2,400 two-sided telephone conversations among 543 speakers with about 70 provided conversation topics. The dataset includes the audio files and the transcription files, as well as information about the speakers and the calls.

The Switchboard Dialogue Act Corpus (SwDA) [[download](https://web.stanford.edu/~jurafsky/swb1_dialogact_annot.tar.gz)] extends the Switchboard-1 corpus with tags from the [SWBD-DAMSL tagset](https://web.stanford.edu/~jurafsky/ws97/manual.august1.html), which is an augmentation to the Discourse Annotation and Markup System of Labeling (DAMSL) tagset. The 220 tags were reduced to 42 tags by clustering in order to improve the language model on the Switchboard corpus. A subset of the Switchboard-1 corpus consisting of 1155 conversations was used. The resulting tags include dialogue acts like statement-non-opinion, acknowledge, statement-opinion, agree/accept, etc.  
Annotated example:  
*Speaker:* A, *Dialogue Act:* Yes-No-Question, *Utterance:* So do you go to college right now?  

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| SGNN (Ravi et al., 2018) | 83.1 | [Self-Governing Neural Networks for On-Device Short Text Classification](https://www.aclweb.org/anthology/D18-1105.pdf) |[Link](https://github.com/glicerico/SGNN) |
| CASA (Raheja et al., 2019) | 82.9 | [Dialogue Act Classification with Context-Aware Self-Attention](https://www.aclweb.org/anthology/N19-1373.pdf)|[Link](https://github.com/macabdul9/CASA-Dialogue-Act-Classifier)|
| DAH-CRF (Li et al., 2019) | 82.3 | [A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification](https://www.aclweb.org/anthology/K19-1036.pdf) | |
| ALDMN (Wan et al., 2018) | 81.5 | [Improved Dynamic Memory Network for Dialogue Act Classification with Adversarial Training](https://arxiv.org/pdf/1811.05021.pdf) | |
| CRF-ASN (Chen et al., 2018) | 81.3 | [Dialogue Act Recognition via CRF-Attentive Structured Network](https://arxiv.org/abs/1711.05568) | |
| Bi-LSTM-CRF (Kumar et al., 2017) | 79.2 | [Dialogue Act Sequence Labeling using Hierarchical encoder with CRF](https://arxiv.org/abs/1709.04250) | [Link](https://github.com/YanWenqiang/HBLSTM-CRF) |
| RNN with 3 utterances in context (Bothe et al., 2018) | 77.34 | [A Context-based Approach for Dialogue Act Recognition using Simple Recurrent Neural Networks](https://arxiv.org/abs/1805.06280) | |


### ICSI Meeting Recorder Dialog Act (MRDA) corpus
The [MRDA corpus](http://www1.icsi.berkeley.edu/Speech/mr/) [[download](http://www.icsi.berkeley.edu/~ees/dadb/icsi_mrda+hs_corpus_050512.tar.gz)] consists of about 75 hours of speech from 75 naturally-occurring meetings among 53 speakers. The tagset used for labeling is a modified version of the SWBD-DAMSL tagset. It is annotated with three types of information: marking of the dialogue act segment boundaries, marking of the dialogue acts and marking of correspondences between dialogue acts.   
Annotated example:  
*Time:* 2804-2810, *Speaker:* c6, *Dialogue Act:* s^bd, *Transcript:* i mean these are just discriminative.  
Multiple dialogue acts are separated by "^".
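
For instance, a label such as `s^bd` can be split into its component tags before classification or scoring. This is a small sketch; the helper name `split_dialogue_acts` is illustrative and not part of the corpus tooling.

```python
def split_dialogue_acts(label):
    """Split a MRDA label such as 's^bd' into its individual dialogue act tags."""
    return [tag for tag in label.split("^") if tag]


tags = split_dialogue_acts("s^bd")
print(tags)  # ['s', 'bd']
```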

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| DAH-CRF (Li et al., 2019) | 92.2 | [A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification](https://www.aclweb.org/anthology/K19-1036.pdf) | |
| CRF-ASN (Chen et al., 2018) | 91.7 | [Dialogue Act Recognition via CRF-Attentive Structured Network](https://arxiv.org/abs/1711.05568) | |
| CASA (Raheja et al., 2019) | 91.1 | [Dialogue Act Classification with Context-Aware Self-Attention](https://www.aclweb.org/anthology/N19-1373.pdf) | |
| Bi-LSTM-CRF (Kumar et al., 2017) | 90.9 | [Dialogue Act Sequence Labeling using Hierarchical encoder with CRF](https://arxiv.org/abs/1709.04250) | [Link](https://github.com/YanWenqiang/HBLSTM-CRF) |
| SGNN (Ravi et al., 2018) | 86.7 | [Self-Governing Neural Networks for On-Device Short Text Classification](https://www.aclweb.org/anthology/D18-1105.pdf) | |

## Dialogue state tracking

Dialogue state tracking consists of determining at each turn of a dialogue the
full representation of what the user wants at that point in the dialogue,
which contains a goal constraint, a set of requested slots, and the user's dialogue act.

### Second dialogue state tracking challenge

For goal-oriented dialogue, the dataset of the [second Dialog State Tracking Challenge](http://www.aclweb.org/anthology/W14-4337)
(DSTC2) is a common evaluation dataset. The DSTC2 focuses on the restaurant search domain. Models are
evaluated based on accuracy on both individual and joint slot tracking.
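
The difference between individual and joint accuracy can be sketched as follows: a slot counts as correct on a turn when its predicted value matches the gold value, while the joint score only credits turns on which every slot is correct. The slot names, the toy turns, and the dictionary-per-turn representation are assumptions made for illustration and do not reproduce the official DSTC2 scoring script.

```python
def tracking_accuracy(gold_turns, predicted_turns, slots=("area", "food", "price")):
    """Return (per-slot accuracy dict, joint accuracy) over aligned dialogue turns."""
    if not gold_turns:
        return {slot: 0.0 for slot in slots}, 0.0
    slot_correct = {slot: 0 for slot in slots}
    joint_correct = 0
    for gold, pred in zip(gold_turns, predicted_turns):
        all_correct = True
        for slot in slots:
            if gold.get(slot) == pred.get(slot):
                slot_correct[slot] += 1
            else:
                all_correct = False
        if all_correct:
            joint_correct += 1  # every slot must match for the turn to count
    n = len(gold_turns)
    return {slot: count / n for slot, count in slot_correct.items()}, joint_correct / n


# Toy two-turn dialogue with made-up values.
gold_turns = [
    {"area": "north", "food": "thai", "price": "cheap"},
    {"area": "south", "food": "thai", "price": "cheap"},
]
predicted_turns = [
    {"area": "north", "food": "thai", "price": "cheap"},
    {"area": "south", "food": "indian", "price": "cheap"},
]

per_slot, joint = tracking_accuracy(gold_turns, predicted_turns)
print(per_slot)  # {'area': 1.0, 'food': 0.5, 'price': 1.0}
print(joint)     # 0.5
```

Because a single wrong slot invalidates the whole turn, joint accuracy is never higher than the lowest individual slot accuracy.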

| Model           | Request | Area  |  Food  |  Price  |  Joint  |  Paper / Source |
| ------------- | :-----: | :-----:| :-----:| :-----:| :-----:| --- |
| Zhong et al. (2018) | 97.5 | - | - | - | 74.5| [Global-locally Self-attentive Dialogue State Tracker](https://arxiv.org/abs/1805.09655) |
| Liu et al. (2018) | - | 90 | 84 | 92 | 72 | [Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems](https://arxiv.org/abs/1804.06512) |
| Neural belief tracker (Mrkšić et al., 2017) | 96.5 | 90 | 84 | 94 | 73.4 | [Neural Belief Tracker: Data-Driven Dialogue State Tracking](https://arxiv.org/abs/1606.03777) |
| RNN (Henderson et al., 2014) | 95.7 | 92 | 86 | 86 | 69 | [Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised gate](http://svr-ftp.eng.cam.ac.uk/~sjy/papers/htyo14.pdf) |

### Wizard-of-Oz

The [WoZ 2.0 dataset](https://arxiv.org/pdf/1606.03777.pdf) is a newer dialogue state tracking dataset whose evaluation is detached from the noisy output of speech recognition systems. Similar to DSTC2, it covers the restaurant search domain and has identical evaluation.


| Model           | Request  |  Joint  |  Paper / Source |
| ------------- |  :-----:| :-----:| --- |
| BERT-based tracker (Lai et al., 2020) | 97.6 | 90.5 | [A Simple but Effective BERT Model for Dialog State Tracking on Resource-Limited Systems](https://ieeexplore.ieee.org/document/9053975) |
| GCE (Nouri et al., 2018) | 97.4 | 88.5 | [Toward Scalable Neural Dialogue State Tracking Model](https://arxiv.org/abs/1812.00899) |
| Zhong et al. (2018) | 97.1 | 88.1 | [Global-locally Self-attentive Dialogue State Tracker](https://arxiv.org/abs/1805.09655) |
| Neural belief tracker (Mrkšić et al., 2017) | 96.5 | 84.4 | [Neural Belief Tracker: Data-Driven Dialogue State Tracking](https://arxiv.org/abs/1606.03777) |
| RNN (Henderson et al., 2014) | 87.1 | 70.8 | [Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised gate](http://svr-ftp.eng.cam.ac.uk/~sjy/papers/htyo14.pdf) |


### MultiWOZ

The [MultiWOZ dataset](https://arxiv.org/abs/1810.00278) is a fully-labeled collection of human-human written conversations spanning multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The dialogues are set between a tourist and a clerk at an information centre. The dataset spans 7 domains.

#### Belief Tracking
<div class="datagrid" style="width:500px;">
<table>
<thead><tr><th></th><th colspan="2">MultiWOZ 2.0</th><th colspan="2">MultiWOZ 2.1</th></tr></thead>
<thead><tr><th>Model</th><th>Joint Accuracy</th><th>Slot</th><th>Joint Accuracy</th><th>Slot</th></tr></thead>
<tbody>
<tr><td><a href="https://www.aclweb.org/anthology/P18-2069">MDBT</a> (Ramadan et al., 2018) </td><td>15.57 </td><td>89.53</td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/abs/1805.09655">GLAD</a> (Zhong et al., 2018)</td><td>35.57</td><td>95.44 </td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1812.00899.pdf">GCE</a> (Nouri and Hosseini-Asl, 2018)</td><td>36.27</td><td>98.42</td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1908.01946.pdf">Neural Reading</a> (Gao et al, 2019)</td><td>41.10</td><td></td><td></td><td></td></tr>

<tr><td><a href="https://arxiv.org/pdf/1907.00883.pdf">HyST</a> (Goel et al, 2019)</td><td>44.24</td><td></td><td></td><td></td></tr>
<tr><td><a href="https://www.aclweb.org/anthology/P19-1546/">SUMBT</a> (Lee et al, 2019)</td><td>46.65</td><td>96.44</td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1905.08743.pdf">TRADE</a> (Wu et al, 2019)</td><td>48.62</td><td>96.92</td><td>45.60</td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1909.00754.pdf">COMER</a> (Ren et al, 2019)</td><td>48.79</td><td></td><td></td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/1911.06192.pdf">DSTQA</a> (Zhou et al, 2019)</td><td>51.44</td><td>97.24</td><td>51.17</td><td>97.21</td></tr>
<tr><td><a href="https://arxiv.org/pdf/1910.03544.pdf">DST-Picklist</a> (Zhang et al, 2019)</td><td></td><td></td><td>53.3</td><td></td></tr>
<tr><td><a href="https://www.aaai.org/Papers/AAAI/2020GB/AAAI-ChenL.10030.pdf">SST</a> (Chen et al. 2020)</td><td></td><td></td><td>55.23</td><td></td></tr>
<tr><td><a href="https://arxiv.org/abs/2005.02877">TripPy</a> (Heck et al. 2020)</td><td></td><td></td><td>55.3</td><td></td></tr>
<tr><td><a href="https://arxiv.org/pdf/2005.00796.pdf">SimpleTOD</a> (Hosseini-Asl et al. 2020)</td><td></td><td></td><td>55.72</td><td></td></tr>

</tbody>
</table>
</div>

#### Policy Optimization
<div class="datagrid" style="width:500px;">
<table>
<thead><tr><th>(INFORM + SUCCESS)*0.5 + BLEU</th><th colspan="3">MultiWOZ 2.0</th><th colspan="3">MultiWOZ 2.1</th></tr></thead>
<thead><tr><th>Model</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th></tr></thead>
<tbody>
 <tr><td><a href="https://arxiv.org/pdf/1907.05346.pdf">TokenMoE</a> (Pei et al. 2019)</td><td>75.30</td><td> 59.70</td><td> 16.81 </td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://pdfs.semanticscholar.org/47d0/1eb59cd37d16201fcae964bd1d2b49cfb55e.pdf">Baseline</a> (Budzianowski et al. 2018)</td><td>71.29</td><td> 60.96 </td><td> 18.8 </td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1907.10016.pdf">Structured Fusion</a> (Mehri et al. 2019)</td><td>82.70</td><td>72.10</td><td> 16.34</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/abs/1902.08858">LaRL</a> (Zhao et al. 2019)</td><td>82.8</td><td>79.2</td><td> 12.8</td><td> </td><td> </td><td> </td></tr>
  <tr><td><a href="https://arxiv.org/pdf/2005.00796.pdf">SimpleTOD</a> (Hosseini-Asl et al. 2020)</td><td>88.9</td><td>67.1</td><td> 16.9</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1911.08151.pdf">MoGNet</a> (Pei et al. 2019)</td><td>85.3</td><td>73.30</td><td> 20.13</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1905.12866.pdf">HDSA</a> (Chen et al. 2019)</td><td>82.9</td><td>68.9</td><td> 23.6</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/abs/1910.03756">ARDM</a> (Wu et al. 2019)</td><td>87.4</td><td>72.8</td><td> 20.6</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/1911.10484.pdf">DAMD</a> (Zhang et al. 2019)</td><td>89.2</td><td>77.9</td><td> 18.6</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/2005.05298.pdf">SOLOIST</a> (Peng et al. 2020)</td><td>89.60</td><td> 79.30</td><td> 18.3</td><td> </td><td> </td><td> </td></tr>
<tr><td><a href="https://arxiv.org/pdf/2004.12363.pdf">MarCo</a> (Wang et al. 2020)</td><td>92.30</td><td> 78.60</td><td> 20.02</td><td> 92.50</td><td> 77.80</td><td> 19.54</td></tr>
<tfoot> </tfoot>
</tbody>
</table>
</div>

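The header of the Policy Optimization table gives the combined score used to compare systems: `(INFORM + SUCCESS)*0.5 + BLEU`. A minimal sketch of that arithmetic follows; the function name `combined_score` is hypothetical and chosen only for illustration, and the input values are copied from the MarCo row (MultiWOZ 2.0) of the table.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined score as written in the table header: (INFORM + SUCCESS) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu


# Values copied from the MarCo row (MultiWOZ 2.0) of the table above.
marco = combined_score(inform=92.30, success=78.60, bleu=20.02)
print(f"{marco:.2f}")
```

Because INFORM and SUCCESS are averaged before BLEU is added, a system can trade a lower BLEU for higher task success and still obtain a higher combined score.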
#### Natural Language Generation
<div class="datagrid" style="width:500px;"><table>
<thead><tr><th>Model</th><th>SER</th><th>BLEU</th></tr></thead>
<tbody>
<tr><td><a href="https://pdfs.semanticscholar.org/47d0/1eb59cd37d16201fcae964bd1d2b49cfb55e.pdf">Baseline</a> (Budzianowski et al. 2018)</td><td>2.99 </td><td> 0.632</td></tr>
</tbody>
</table>
</div>

#### End-to-End Modelling
<div class="datagrid" style="width:500px;">
<table>
<thead><tr><th>(INFORM + SUCCESS)*0.5 + BLEU</th><th colspan="3">MultiWOZ 2.0</th><th colspan="3">MultiWOZ 2.1</th></tr></thead>
<thead><tr><th>Model</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th><th>INFORM</th><th>SUCCESS</th><th>BLEU</th></tr></thead>
<tbody>
<tr><td><a href="https://arxiv.org/pdf/1911.10484.pdf">DAMD</a> (Zhang et al. 2019)</td><td>76.3</td><td>60.4</td><td> 18.6</td><td> </td><td> </td><td> </td></tr>
 <tr><td><a href="https://arxiv.org/pdf/2005.00796.pdf">SimpleTOD</a> (Hosseini-Asl et al. 2020)</td><td>84.4</td><td>70.1</td><td> 15.01</td><td> </td><td></td><td></td></tr>
 <tr><td><a href="https://arxiv.org/pdf/2005.05298.pdf">SOLOIST</a> (Peng et al. 2020)</td><td>85.50</td><td>72.90</td><td> 16.54</td><td> </td><td></td><td> </td></tr>

<tfoot> </tfoot>
</tbody>
</table>
</div>

## Retrieval-based Chatbots
These systems take a context and a list of candidate responses as input and rank the responses, returning the highest-ranking one.

### Ubuntu IRC Data

There are several corpora based on the [Ubuntu IRC Channel Logs](https://irclogs.ubuntu.com):

- [Uthus and Aha (2013)](), available [here](https://daviduthus.org/UCC/), the first dataset to use the resource, but not for retrieval-based chatbot research.
- UDC v1, [Lowe et al. (2015)](https://arxiv.org/abs/1506.08909), available [here](http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/), the first version of the Ubuntu Dialogue Corpus.
- UDC v2, [Lowe et al. (2017)](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698), available [here](https://arxiv.org/abs/1506.08909), the second version of the Ubuntu Dialogue Corpus.
- DSTC 7, [Gunasekara et al. (2019)](http://workshop.colips.org/dstc7/papers/dstc7_task1_final_report.pdf), available [here](https://ibm.github.io/dstc-noesis/public/index.html), the data from DSTC 7 track 1.
- DSTC 8, [Gunasekara et al. (2020)](http://jkk.name/pub/dstc20task2.pdf), available [here](https://github.com/dstc8-track2/NOESIS-II/), the data from DSTC 8 track 2.

Each version of the dataset contains a set of dialogues from the IRC channel, extracted by automatically disentangling conversations occurring simultaneously. See below for results on the disentanglement process.

The exact tasks vary slightly, but all use variations of Recall_N@K: how often the true answer is among the top K options when there are N candidates in total.

| Data   | Model           |  R_100@1    |  R_100@10   |  R_100@50   |  MRR        |  Paper / Source |
| ------ | -------------   | :---------: | :---------: | :---------: | :---------: |---------------|
| DSTC 8 (main) | Wu et al. (2020) | 76.1 | 97.9 | - | 84.8 | Enhancing Response Selection with Advanced Context Modeling and Post-training |
| DSTC 8 (subtask 2) | Wu et al. (2020) | 70.6 | 95.7 | - | 79.9 | Enhancing Response Selection with Advanced Context Modeling and Post-training |
| DSTC 7 | Seq-Att-Network (Chen and Wang, 2019) | 64.5 | 90.2 | 99.4 | 73.5 | [Sequential Attention-based Network for Noetic End-to-End Response Selection](http://workshop.colips.org/dstc7/papers/07.pdf) |

| Data   | Model           | R_2@1       |  R_10@1      |  Paper / Source |
| ------ | -------------   | :---------: | :---------: |---------------|
| UDC v2 | DAM (Zhou et al. 2018) | 93.8 | 76.7 | [Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network](http://www.aclweb.org/anthology/P18-1103) |
| UDC v2  | SMN (Wu et al. 2017) | 92.3 | 72.3 | [Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots](https://arxiv.org/pdf/1612.01627.pdf) |
| UDC v2  | Multi-View (Zhou et al. 2016) | 90.8 | 66.2 | [Multi-view Response Selection for Human-Computer Conversation](https://aclweb.org/anthology/D16-1036) |
| UDC v2  | Bi-LSTM (Kadlec et al. 2015) | 89.5 | 63.0 | [Improved Deep Learning Baselines for Ubuntu Corpus Dialogs](https://arxiv.org/pdf/1510.03753.pdf) |

Additional results can be found in the DSTC task reports linked above.

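The Recall_N@K and MRR columns in the tables above can be computed from the rank that a model assigns to the true response in each candidate list. The following is a minimal sketch, not the official DSTC scoring script: the function names and the example ranks are hypothetical, and ranks are assumed to be 1-based.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of contexts whose true response is ranked within the top k.

    `ranks` holds the 1-based rank of the true response among the N candidates.
    """
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1 / rank of the true response over all contexts."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the true response among N = 100 candidates for four contexts.
ranks = [1, 3, 12, 1]
print(recall_at_k(ranks, 1))    # R_100@1
print(recall_at_k(ranks, 10))   # R_100@10
print(mean_reciprocal_rank(ranks))
```

N does not appear in the computation itself; it only determines how hard the ranking problem is, which is why R_2@1 and R_10@1 are not comparable with R_100@1.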
### Reddit Corpus
The [Reddit Corpus](https://arxiv.org/abs/1904.06472) contains 726 million multi-turn dialogues from Reddit, an American social news aggregation website where users can post links and take part in discussions on these posts. The task is to select the correct response from 100 candidates (the others are negatively sampled) given the previous conversation history. Models are evaluated with the Recall 1 at 100 metric (the 1-of-100 ranking accuracy). More details are available [here](https://github.com/PolyAI-LDN/conversational-datasets).

| Model           |   R_1@100   |  Paper / Source |
| -------------   |   :---------:|---------------|
| PolyAI Encoder (Henderson et al. 2019) |  61.3 | [A Repository of Conversational Datasets](https://arxiv.org/pdf/1904.06472.pdf) |
| USE (Cer et al. 2018) | 47.7 | [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) |
| BERT (Devlin et al. 2018) | 24.0 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
| ELMo (Peters et al. 2018) | 19.3 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) |

### Advising Corpus
The [Advising Corpus](http://workshop.colips.org/dstc7/papers/dstc7_task1_final_report.pdf), available [here](https://ibm.github.io/dstc-noesis/public/index.html), contains a collection of conversations between a student and an advisor at the University of Michigan. It was released as part of DSTC 7 track 1 and used again in DSTC 8 track 2.

| Model           |  R_100@1    |  R_100@10   |  R_100@50   |  MRR        |  Paper / Source |
| -------------   | :---------: | :---------: | :---------: | :---------: |---------------|
| Yang et al. (2020) | 56.4 | 87.8 | - | 67.7 | Transformer-based Semantic Matching Model for Noetic Response Selection |
| Seq-Att-Network (Chen and Wang, 2019) | 21.4 | 63.0 | 94.8 | 33.9 | [Sequential Attention-based Network for Noetic End-to-End Response Selection](http://workshop.colips.org/dstc7/papers/07.pdf) |


## Generative-based Chatbots
The main task of a generative chatbot is to generate consistent and engaging responses given the context.
### Personalized Chit-chat

The task of personalized chit-chat dialogue generation was first proposed by [PersonaChat](https://arxiv.org/pdf/1801.07243.pdf). The motivation is to enhance the engagingness and consistency of chit-chat bots by endowing agents with explicit personas. Here the `persona` is defined as several natural language profile sentences like "I weight 300 pounds.". NIPS 2018 hosted a competition, [The Conversational Intelligence Challenge 2 (ConvAI2)](http://convai.io/), based on the dataset. The evaluation metrics are F1, Hits@1 and ppl. F1 is computed at the word level, Hits@1 is the probability that the real next utterance is ranked highest by the model, and ppl is the perplexity for language modeling. The following results are reported on the dev set (the test set is still hidden); most of them are taken from the [ConvAI2 Leaderboard](https://github.com/DeepPavlov/convai/blob/master/leaderboards.md).

| Model           | F1 | Hits@1 | ppl | Paper / Source | Code |
| -------------   | :---------: | :---------:| :--------: | ---------------| ------------- |
| P^2 Bot (Liu et al. 2020) | 19.77 | 81.9 | 15.12 | [You Impress Me: Dialogue Generation via Mutual Persona Perception](https://arxiv.org/pdf/2004.05388.pdf) | [Code](https://github.com/SivilTaram/Persona-Dialogue-Generation) |
| TransferTransfo (Wolf et al. 2019) | 19.09 | 82.1 | 17.51 | [TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents](https://arxiv.org/pdf/1901.08149.pdf) | [Code](https://github.com/huggingface/transfer-learning-conv-ai) |
| Lost In Conversation | 17.79 | - | 17.3 | [NIPS 2018 Workshop Presentation](http://convai.io/NeurIPSParticipantSlides.pptx) | [Code](https://github.com/atselousov/transformer_chatbot) |
| Seq2Seq + Attention (Bahdanau et al. 2014) | 16.18 | 12.6 | 29.8 | [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf) | [Code](https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/seq2seq) |
| KV Profile Memory (Zhang et al. 2018) | 11.9 | 55.2 | - | [Personalizing Dialogue Agents: I have a dog, do you have pets too?](https://arxiv.org/pdf/1801.07243.pdf) | [Code](https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/kvmemnn) |

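The three metrics in the table above (F1, Hits@1, ppl) can be sketched as follows. This is an illustration only, not the ConvAI2 evaluation code: the function names are hypothetical, and tokenizing by lowercasing and splitting on whitespace is an assumption (official scripts may normalize text differently).

```python
import math
from collections import Counter


def word_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a generated utterance and the reference.

    Assumption: tokens are obtained by lowercasing and whitespace splitting.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def hits_at_1(candidate_scores, gold_index: int) -> float:
    """1.0 if the real next utterance receives the highest model score, else 0.0."""
    best = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
    return float(best == gold_index)


def perplexity(token_log_probs) -> float:
    """exp of the mean negative natural-log probability per reference token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


print(word_f1("i have a dog", "i have two dogs"))
print(hits_at_1([0.1, 0.7, 0.2], gold_index=1))
print(perplexity([math.log(0.25)] * 4))
```

Corpus-level figures such as those in the table are averages of these per-example values (for perplexity, the average is taken over tokens before exponentiating).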
## Disentanglement

As noted for the Ubuntu data above, sometimes multiple conversations are mixed together in a single channel. Work on conversation disentanglement aims to separate out conversations. There are two main resources for the task.

This can be formulated as a clustering problem, with no clear best metric. Several metrics are considered:

- Variation of Information
- F-1 over 1-1 matched clusters using max-flow
- Precision, Recall, and F-score on exact match for clusters
- Local overlap
- Another form of F-1 defined by [Shen et al. (2006)](https://dl.acm.org/citation.cfm?doid=1148170.1148180)

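Two of the metrics listed above, Variation of Information and exact-match Precision/Recall/F-score, can be sketched for a pair of clusterings given as one conversation id per message. This is a minimal illustration under stated assumptions, not the evaluation script of any of the cited papers: the function names are hypothetical, the VI value returned is the raw quantity in nats (lower is better, whereas leaderboard values are often rescaled so that higher is better), and every cluster is counted, including single-message ones, which published evaluations may treat differently.

```python
import math
from collections import Counter, defaultdict


def variation_of_information(gold, pred) -> float:
    """Raw VI = H(gold) + H(pred) - 2 * I(gold; pred), in nats."""
    n = len(gold)
    gold_sizes = Counter(gold)
    pred_sizes = Counter(pred)
    joint_sizes = Counter(zip(gold, pred))
    h_gold = -sum(c / n * math.log(c / n) for c in gold_sizes.values())
    h_pred = -sum(c / n * math.log(c / n) for c in pred_sizes.values())
    mutual_info = sum(
        c / n * math.log(c * n / (gold_sizes[g] * pred_sizes[p]))
        for (g, p), c in joint_sizes.items()
    )
    return h_gold + h_pred - 2 * mutual_info


def as_clusters(labels):
    """Turn per-message cluster ids into a set of frozensets of message indices."""
    groups = defaultdict(set)
    for index, label in enumerate(labels):
        groups[label].add(index)
    return {frozenset(members) for members in groups.values()}


def exact_match_prf(gold, pred):
    """Precision, recall and F-score over clusters that match exactly."""
    gold_clusters, pred_clusters = as_clusters(gold), as_clusters(pred)
    matched = len(gold_clusters & pred_clusters)
    precision = matched / len(pred_clusters)
    recall = matched / len(gold_clusters)
    f_score = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f_score


# Hypothetical example: four messages, the system splits the second conversation in two.
gold = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(variation_of_information(gold, pred))
print(exact_match_prf(gold, pred))
```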
### Ubuntu IRC

Manually labeled by [Kummerfeld et al. (2019)](https://www.aclweb.org/anthology/P19-1374), this data is available [here](https://jkk.name/irc-disentanglement/).

| Model                                            | VI   | 1-1  | Precision | Recall | F-Score | Paper / Source | Code      |
| ------------------------------------------------ | :--: | :--: | :-------: | :----: | :-----: | ---------------| --------- |
| BERT + BiLSTM                                    | 93.3 |    - |      44.3 |   49.6 |    46.8 | Pre-Trained and Attention-Based Neural Networks for Building Noetic Task-Oriented Dialogue Systems | - |
| FF ensemble: Vote      (Kummerfeld et al., 2019) | 91.5 | 76.0 |      36.3 |   39.7 |    38.0 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Feedforward            (Kummerfeld et al., 2019) | 91.3 | 75.6 |      34.6 |   38.0 |    36.2 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| FF ensemble: Intersect (Kummerfeld et al., 2019) | 69.3 | 26.6 |      67.0 |   21.1 |    32.1 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Linear               (Elsner and Charniak, 2008) | 82.1 | 51.4 |      12.1 |   21.5 |    15.5 | [You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement](https://www.aclweb.org/anthology/P08-1095/) | [Code](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz) |
| Heuristic            (Lowe et al., 2015)         | 80.6 | 53.7 |      10.8 |    7.6 |     8.9 | [Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698) | [Code](https://github.com/npow/ubuntu-corpus) |

### Linux IRC

This data has been manually annotated three times:

- By [Elsner and Charniak (2008)](https://www.aclweb.org/anthology/P08-1095), available [here](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz).
- A portion by [Mehri and Carenini (2017)](https://aclweb.org/anthology/I17-1062/), available [here](http://shikib.com/td_annotations).
- By [Kummerfeld et al. (2019)](https://www.aclweb.org/anthology/P19-1374), available [here](https://jkk.name/irc-disentanglement/).

| Data | Model           | 1-1        | Local | Shen F-1 | Paper / Source | Code          |
| ---- | -------------   | :---------:| :---: | :------: | ---------------| ------------- |
| Kummerfeld | Linear     (Elsner and Charniak, 2008) | 59.7 | 80.8 | 63.0 | [You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement](https://www.aclweb.org/anthology/P08-1095/) | [Code](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz) |
| Kummerfeld | Feedforward (Kummerfeld et al., 2019)  | 57.7 | 80.3 | 59.8 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Kummerfeld | Heuristic   (Lowe et al., 2015)        | 43.4 | 67.9 | 50.7 | [Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698) | [Code](https://github.com/npow/ubuntu-corpus) |
| Elsner | Linear     (Elsner and Charniak, 2008)     | 53.1 | 81.9 | 55.1 | [You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement](https://www.aclweb.org/anthology/P08-1095/) | [Code](https://www.asc.ohio-state.edu/elsner.14/resources/chat-distr.tgz) |
| Elsner | Feedforward (Kummerfeld et al., 2019)      | 52.1 | 77.8 | 53.8 | [A Large-Scale Corpus for Conversation Disentanglement](https://www.aclweb.org/anthology/P19-1374/) | [Code](https://jkk.name/irc-disentanglement) |
| Elsner | Wang and Oard (2009) | 47.0 | 75.1 | 52.8 | [Context-based Message Expansion for Disentanglement of Interleaved Text Conversations](https://www.aclweb.org/anthology/N09-1023/) | - |
| Elsner | Heuristic   (Lowe et al., 2015)            | 45.1 | 73.8 | 51.8 | [Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus](http://dad.uni-bielefeld.de/index.php/dad/article/view/3698) | [Code](https://github.com/npow/ubuntu-corpus) |


================================================
FILE: english/domain_adaptation.md
================================================
# Domain adaptation

## Sentiment analysis

### Multi-Domain Sentiment Dataset

The [Multi-Domain Sentiment Dataset](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) is a common
evaluation dataset for domain adaptation for sentiment analysis. It contains product reviews from
Amazon.com from different product categories, which are treated as distinct domains.
Reviews contain star ratings (1 to 5 stars) that are generally converted into binary labels. Models are
typically evaluated on a target domain that is different from the source domain they were trained on, while only
having access to unlabeled examples of the target domain (unsupervised domain adaptation). The evaluation
metric is accuracy and scores are averaged across each domain.

| Model           | DVD | Books | Electronics | Kitchen | Average |  Paper / Source |
| ------------- | :-----:| :-----:| :-----:| :-----:| :-----:| --- |
| Multi-task tri-training (Ruder and Plank, 2018) | 78.14 | 74.86 | 81.45 | 82.14 | 79.15 | [Strong Baselines for Neural Semi-supervised Learning under Domain Shift](https://arxiv.org/abs/1804.09530) |
| Asymmetric tri-training (Saito et al., 2017) | 76.17 | 72.97 | 80.47 | 83.97 | 78.39 | [Asymmetric Tri-training for Unsupervised Domain Adaptation](https://arxiv.org/abs/1702.08400) |
| VFAE (Louizos et al., 2015) | 76.57 | 73.40 | 80.53 | 82.93 | 78.36 | [The Variational Fair Autoencoder](https://arxiv.org/abs/1511.00830) |
| DANN (Ganin et al., 2016) | 75.40 | 71.43 | 77.67 | 80.53 | 76.26 | [Domain-Adversarial Training of Neural Networks](https://arxiv.org/abs/1505.07818) |

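The label conversion and the Average column described above can be sketched as follows. The thresholds in `star_to_binary` are an assumption (ratings above 3 stars positive, below 3 negative, 3-star reviews dropped), since the text only says that ratings are "generally converted into binary labels"; the function name is hypothetical. The accuracies are copied from the Multi-task tri-training row of the table.

```python
from statistics import mean
from typing import Optional


def star_to_binary(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary sentiment label.

    Assumed convention: >3 stars is positive (1), <3 stars is negative (0),
    and neutral 3-star reviews are discarded (None).
    """
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


# Per-domain accuracies copied from the Multi-task tri-training row above.
accuracies = {"DVD": 78.14, "Books": 74.86, "Electronics": 81.45, "Kitchen": 82.14}
average = mean(list(accuracies.values()))
print(f"{average:.2f}")  # 79.15, matching the Average column
```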
## Financial Technology and Natural Language Processing (FinNLP) 

[FinNLP Progress](https://github.com/YangLinyi/FinNLP-Progress) is a repository that tracks progress in Natural Language Processing (NLP) for the finance domain, including the datasets, papers, and current state-of-the-art results for the most popular tasks. Examples include Financial Event Prediction, Financial Index Forecasting, Financial Risk Analysis, Financial Text Mining, and Fraud Detection.

[Go back to the README](../README.md)


================================================
FILE: english/entity_linking.md
================================================
# Entity Linking

## Task

Entity Linking (EL) is the task of recognizing (cf. [Named Entity Recognition](named_entity_recognition.md)) and disambiguating (Named Entity Disambiguation) named entities to a knowledge base (e.g. Wikidata, DBpedia, or YAGO). It is sometimes also simply known as Named Entity Recognition and Disambiguation.

EL can be split into two classes of approaches:
* *End-to-End*: processing a piece of text to extract the entities (i.e. Named Entity Recognition) and then disambiguate these extracted entities to the correct entry in a given knowledge base (e.g. Wikidata, DBpedia, YAGO).
* *Disambiguation-Only*: contrary to the first approach, this one directly takes gold standard named entities as input and only disambiguates them to the correct entry in a given knowledge base.

Example:

| Barack | Obama | was | born | in | Hawaii |
| --- | ---| --- | --- | --- | --- |
| https://en.wikipedia.org/wiki/Barack_Obama | https://en.wikipedia.org/wiki/Barack_Obama | O | O | O | https://en.wikipedia.org/wiki/Hawaii |

More details can be found in this [survey](http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/TKDE14-entitylinking.pdf).

## Current SOTA
[Raiman et al. (2018)](https://arxiv.org/pdf/1802.01021.pdf) is the current SOTA in Cross-lingual Entity Linking for the WikiDisamb30 and TAC KBP 2010 datasets (note: [Mulang’ et al. 2020](https://arxiv.org/pdf/2008.05190.pdf) is the current SOTA for the CoNLL-AIDA dataset). They construct a type system and use it to constrain the outputs of a neural network to respect the symbolic structure. They achieve this by reformulating the design problem as a mixed integer problem: create a type system and subsequently train a neural network with it. They propose a 2-step algorithm: 1) heuristic search or stochastic optimization over discrete variables that define a type system informed by an Oracle and a Learnability heuristic, 2) gradient descent to fit classifier parameters. They apply DeepType to the problem of Entity Linking on three standard datasets (i.e. WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) and find that it outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system or recent deep learning-based entity embeddings, while explicitly using symbolic information lets it integrate new entities without retraining.

## Evaluation

### Metrics

#### Disambiguation-Only Approach

* Micro-Precision: Fraction of correctly disambiguated named entities in the full corpus.
* Macro-Precision: Fraction of correctly disambiguated named entities, averaged by document.

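The difference between the two precision variants can be sketched as follows. The function name and the toy data are hypothetical; each document is assumed to be a non-empty list of `(predicted_entity, gold_entity)` pairs, one per gold mention.

```python
def micro_macro_precision(docs):
    """Return (micro, macro) precision for disambiguation-only evaluation.

    docs: list of documents, each a non-empty list of (predicted, gold) pairs.
    Micro pools all mentions in the corpus; macro averages per-document precision.
    """
    correct_total = sum(pred == gold for doc in docs for pred, gold in doc)
    mention_total = sum(len(doc) for doc in docs)
    micro = correct_total / mention_total
    per_doc = [sum(pred == gold for pred, gold in doc) / len(doc) for doc in docs]
    macro = sum(per_doc) / len(per_doc)
    return micro, macro


# Hypothetical corpus: a long document with one error and a short, fully correct one.
docs = [
    [("Barack_Obama", "Barack_Obama"), ("Hawaii", "Hawaii"),
     ("Paris", "Paris"), ("Paris_Hilton", "Paris")],
    [("Tom_Waits", "Tom_Waits")],
]
print(micro_macro_precision(docs))
```

In this toy corpus micro-precision is 4/5 while macro-precision is (3/4 + 1)/2, which shows how macro averaging gives short documents the same weight as long ones.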
#### End-to-End Approach

* Gerbil Micro-F1 - strong matching: micro InKB F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the Gerbil platform. InKB means only mentions with valid KB entities are used for evaluation.
* Gerbil Macro-F1 - strong matching: macro InKB F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the Gerbil platform. InKB means only mentions with valid KB entities are used for evaluation.

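The idea behind the strong-matching InKB micro-F1 can be sketched as follows. This is an illustration of the definition above, not GERBIL's implementation: the function name and the annotation format `(doc_id, start, end, entity)` are assumptions, with `entity` set to `None` for mentions that have no valid KB entry.

```python
def micro_f1_strong_inkb(gold, pred):
    """Micro precision, recall and F1 with strong (exact-boundary) matching.

    gold, pred: iterables of (doc_id, start, end, entity) tuples.
    InKB: annotations whose entity is None (no valid KB entry) are ignored.
    A prediction counts only if document, boundaries and entity all match.
    """
    gold_in_kb = {a for a in gold if a[3] is not None}
    pred_in_kb = {a for a in pred if a[3] is not None}
    true_positives = len(gold_in_kb & pred_in_kb)
    precision = true_positives / len(pred_in_kb) if pred_in_kb else 0.0
    recall = true_positives / len(gold_in_kb) if gold_in_kb else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# "Barack Obama was born in Hawaii": character offsets of the two gold mentions.
gold = [("d1", 0, 12, "Barack_Obama"), ("d1", 25, 31, "Hawaii")]
# The system links only "Obama" (offsets 7-12), so the first mention fails strong matching.
pred = [("d1", 7, 12, "Barack_Obama"), ("d1", 25, 31, "Hawaii")]
print(micro_f1_strong_inkb(gold, pred))
```

The macro variant would compute the same quantity per document and then average over documents.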
### Datasets

#### AIDA CoNLL-YAGO Dataset

The [AIDA CoNLL-YAGO][AIDACoNLLYAGO] Dataset by [[Hoffart]](http://www.aclweb.org/anthology/D11-1072) contains assignments of entities to the mentions of named entities annotated for the original [[CoNLL]](http://www.aclweb.org/anthology/W03-0419.pdf) 2003 NER task. The entities are identified by [YAGO2](http://yago-knowledge.org/) entity identifier, by [Wikipedia URL](https://en.wikipedia.org/), or by [Freebase mid](http://wiki.freebase.com/wiki/Machine_ID).

##### Disambiguation-Only Models
   
| Model | Micro-Precision | Macro-Precision | Paper / Source | Code |
| ------------- | :-----:| :----: | :----: | --- |
| Mulang’ et al. (2020) | 94.94 | - | [Evaluating the Impact of Knowledge Graph Context on Entity Disambiguation Models](https://arxiv.org/pdf/2008.05190.pdf) | -  |
| Raiman et al. (2018) | 94.88 | - | [DeepType: Multilingual Entity Linking by Neural Type System Evolution](https://arxiv.org/pdf/1802.01021.pdf) | [Official](https://github.com/openai/deeptype) |
| Sil et al. (2018) | 94.0 | - | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | |
| Radhakrishnan et al. (2018) | 93.0 | 93.7 | [ELDEN: Improved Entity Linking using Densified Knowledge Graphs](http://aclweb.org/anthology/N18-1167) | |
| Le et al. (2018) | 93.07 | - | [Improving Entity Linking by Modeling Latent Relations between Mentions](http://aclweb.org/anthology/P18-1148) | [Official](https://github.com/lephong/mulrel-nel) |
| Ganea and Hofmann (2017) | 92.22 | - | [Deep Joint Entity Disambiguation with Local Neural Attention](https://www.aclweb.org/anthology/D17-1277) | [Link](https://github.com/dalab/deep-ed) |
| Hoffart et al. (2011) | 82.29 | 82.02 | [Robust Disambiguation of Named Entities in Text](http://www.aclweb.org/anthology/D11-1072) |  |

##### End-to-End Models
   
| Model | Micro-F1-strong | Macro-F1-strong | Paper / Source | Code |
| ------------- | :-----:| :----: | :----: | --- |
| van Hulst et al. (2020) | **83.3** | 81.3  | [REL: An Entity Linker Standing on the Shoulders of Giants](https://arxiv.org/abs/2006.01969) | [Official](https://github.com/informagi/REL) |
| Kolitsas et al. (2018) | 82.6 | **82.4** | [End-to-End Neural Entity Linking](https://arxiv.org/pdf/1808.07699.pdf) | [Official](https://github.com/dalab/end2end_neural_el) |
| Kannan Ravi et al. (2021) | 83.1| - | [CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata](https://arxiv.org/pdf/2101.09969.pdf) | [Official](https://github.com/ManojPrabhakar/CHOLAN) |
| Piccinno et al. (2014) | 70.8 | 73.0 | [From TagME to WAT: a new entity annotator](https://dl.acm.org/citation.cfm?id=2634350) | |
| Hoffart et al. (2011) | 71.9 | 72.8 | [Robust Disambiguation of Named Entities in Text](http://www.aclweb.org/anthology/D11-1072) | |

#### TAC KBP English Entity Linking Comprehensive and Evaluation Data 2010 

The Knowledge Base Population (KBP) Track at [TAC 2010](https://tac.nist.gov/2010) explored the extraction of information about entities with reference to an external knowledge source. Using a basic schema for persons, organizations, and locations, nodes in an ontology must be created and populated using unstructured information found in text. A collection of [Wikipedia Infoboxes](http://en.wikipedia.org/wiki/Help:Infobox) serves as a rudimentary initial knowledge representation. You can download the dataset from [LDC](https://www.ldc.upenn.edu/) or [here](https://github.com/ChrisLeeJ/TAC_KBP_English_EL_2010).

##### Disambiguation-Only Models

| Model | Micro-Precision | Macro-Precision | Paper / Source | Code |
| ------------- | :-----:| :----: | :----: | --- |
| Raiman et al. (2018) | 90.85 | - | [DeepType: Multilingual Entity Linking by Neural Type System Evolution](https://arxiv.org/pdf/1802.01021.pdf) | [Official](https://github.com/openai/deeptype) |
| Sil et al. (2018) | 87.4 | - | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) |      |
| Yamada et al. (2016) | 85.2 | - | [Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation](https://arxiv.org/pdf/1601.01343.pdf) |      |

### Platforms

Evaluating Entity Linking systems in a manner that allows for direct comparison of performance can be difficult. The precise definition of a "correct" annotation can be somewhat subjective and it is easy to make mistakes. To provide a simple example, given the input surface form **"Tom Waits"**, an evaluation dataset might record the DBpedia resource `http://dbpedia.org/resource/Tom_Waits` as the correct referent. Yet an annotation system which returns a reference to `http://dbpedia.org/resource/PEHDTSCKJBMA` has technically provided an appropriate annotation, as this resource is a redirect to `http://dbpedia.org/resource/Tom_Waits`. Alternatively, when evaluating an End-to-End EL system, accuracy with respect to word boundaries must be considered: if a system only annotates **"Obama"** with the URI `http://dbpedia.org/resource/Barack_Obama` in the surface form **"Barack Obama"**, is the system correct or incorrect in its annotation?

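The two pitfalls above, redirects and word boundaries, can be made explicit in an evaluation script. The sketch below is hypothetical: the `REDIRECTS` table stands in for redirect data that a real evaluation would load from the knowledge base, and the function names are chosen for illustration only.

```python
# Hypothetical redirect table; a real evaluation would load redirects from the knowledge base.
REDIRECTS = {
    "http://dbpedia.org/resource/PEHDTSCKJBMA": "http://dbpedia.org/resource/Tom_Waits",
}


def canonical(uri: str, redirects=REDIRECTS) -> str:
    """Follow redirects until a non-redirect resource is reached (guards against cycles)."""
    seen = set()
    while uri in redirects and uri not in seen:
        seen.add(uri)
        uri = redirects[uri]
    return uri


def same_entity(predicted_uri: str, gold_uri: str) -> bool:
    """Compare two URIs after resolving redirects on both sides."""
    return canonical(predicted_uri) == canonical(gold_uri)


def spans_overlap(a, b) -> bool:
    """Weak boundary matching: half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


# The redirect resolves to the gold resource, so the annotation is counted as correct.
print(same_entity("http://dbpedia.org/resource/PEHDTSCKJBMA",
                  "http://dbpedia.org/resource/Tom_Waits"))
# "Obama" (7-12) inside "Barack Obama" (0-12): correct under weak matching only.
print(spans_overlap((7, 12), (0, 12)), (7, 12) == (0, 12))
```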
Furthermore, the performance of an EL system can be strongly affected by the nature of the content on which the evaluation is performed, e.g. news content versus Tweets. Hence, comparing the relative performance of two EL systems which have been tested on two different corpora can be fallacious. Rather than allowing these subjective points to creep into the evaluation of EL systems, it is better to make use of a standard evaluation platform where these assumptions are known and made explicit in the configuration of the experiment.

[GERBIL][GERBIL], developed by [AKSW][AKSW], is an evaluation platform based on the [BAT framework][Cornolti]. It defines a number of standard experiments which may be run for any given EL service. These experiment types determine how strict the evaluation is with respect to measures such as word boundary alignment and also dictate how much responsibility is assigned to the EL service with respect to Entity Recognition, etc. GERBIL hosts 38 evaluation datasets obtained from a variety of different EL challenges. At present it also has hooks for 17 different EL services which may be included in an experiment.

GERBIL may be used to test your own EL system either by downloading the source code and deploying GERBIL locally, or by making your service available on the web and giving GERBIL a link to your API endpoint. The only condition is that your API must accept input and respond with output in [NIF][NIF] format. It is also possible to upload your own evaluation dataset if you would like to test these services on your own content. Note that the dataset must also be in NIF format. The [DBpedia Spotlight evaluation dataset][SpotlightEvaluation] is a good example of how to structure your content.

GERBIL does have a number of shortcomings, the most notable of which are:
1. There is no way to view the annotations returned by each system you test. These are handled internally by GERBIL and then discarded. This can make it difficult to determine the source of error with an EL system.
2. There is no way to observe the candidate list considered for each surface form. This is, of course, a standard problem with any third party EL API, but if one is conducting a detailed investigation into the performance of an EL system, it is important to know if the source of error was the EL algorithm itself, or the candidate retrieval process which failed to identify the correct referent as a candidate. This was listed as an important consideration by [Hachey et al][Hachey].

Nevertheless, GERBIL is an excellent resource for standardising how EL systems are tested and compared. It is also a good starting point for anyone new to Entity Linking as it contains links to a wide variety of EL resources. For more information, see the research paper by [[Usbeck]](http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf).

## References

[Hoffart] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust Disambiguation of Named Entities in Text. EMNLP 2011. http://www.aclweb.org/anthology/D11-1072

[CoNLL] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL 2003. http://www.aclweb.org/anthology/W03-0419.pdf

[Usbeck] Usbeck et al. GERBIL - General Entity Annotator Benchmarking Framework. WWW 2015. http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf

[Go back to the README](../README.md)

[Sil]: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101 "Neural Cross-Lingual Entity Linking"
[Shen]: http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/TKDE14-entitylinking.pdf "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions"
[AIDACoNLLYAGO]: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/ "AIDA CoNLL-YAGO Dataset"
[YAGO2]: http://yago-knowledge.org/ "YAGO2"
[Wikipedia]: https://en.wikipedia.org/ "Wikipedia"
[Freebase]: http://wiki.freebase.com/wiki/Machine_ID "Freebase"
[Radhakrishnan]: http://aclweb.org/anthology/N18-1167 "ELDEN: Improved Entity Linking using Densified Knowledge Graphs"
[Le]: https://arxiv.org/abs/1804.10637
[NIF]: http://persistence.uni-leipzig.org/nlp2rdf/ "NLP Interchange Format"
[SpotlightEvaluation]: http://apps.yovisto.com/labs/ner-benchmarks/data/dbpedia-spotlight-nif.ttl "GERBIL DBpedia Spotlight Dataset"
[Cornolti]: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40749.pdf "A Framework for Benchmarking Entity-Annotation Systems"
[GERBIL]: http://aksw.org/Projects/GERBIL.html "General Entity Annotator Benchmarking framework"
[AKSW]: http://aksw.org/About.html "Agile Knowledge Engineering and Semantic Web"
[Hachey]: http://benhachey.info/pubs/hachey-aij12-evaluating.pdf "Evaluating Entity Linking with Wikipedia"
[Raiman]: https://arxiv.org/pdf/1802.01021.pdf "DeepType: Multilingual Entity Linking by Neural Type System Evolution"


================================================
FILE: english/grammatical_error_correction.md
================================================
# Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors. 

GEC is typically formulated as a sentence correction task. A GEC system takes a potentially erroneous sentence as input and is expected to transform it to its corrected version. See the example given below: 

| Input (Erroneous)          | Output (Corrected)     |
| -------------------------  | ---------------------- |
|She see Tom is catched by policeman in park at last night. | She saw Tom caught by a policeman in the park last night.|
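
As a rough illustration of what transforming a sentence into its corrected version involves, the sketch below lists the token-level differences between an input and its correction using Python's standard `difflib`. The function name `token_edits` is made up for this example, and the alignment is a generic diff, not the alignment used by GEC scorers such as MaxMatch or ERRANT.

```python
from difflib import SequenceMatcher


def token_edits(source, target):
    """Return (start, end, original, replacement) for each differing token span.

    start/end are token offsets into the whitespace-tokenized source sentence.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replace / insert / delete spans
            edits.append((i1, i2, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


# The example pair from the table above.
for edit in token_edits(
    "She see Tom is catched by policeman in park at last night.",
    "She saw Tom caught by a policeman in the park last night.",
):
    print(edit)
```

An insertion shows up as an empty source span (`start == end`) and a deletion as an empty replacement string.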

### CoNLL-2014 Shared Task

The [CoNLL-2014 shared task test set](https://www.comp.nus.edu.sg/~nlp/conll14st/conll14st-test-data.tar.gz) is the most widely used dataset to benchmark GEC systems. The test set contains 1,312 English sentences with error annotations by 2 expert annotators. Models are evaluated with the MaxMatch scorer ([Dahlmeier and Ng, 2012](http://www.aclweb.org/anthology/N12-1067)), which computes a span-based F<sub>β</sub>-score (β set to 0.5 to weight precision twice as much as recall).
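
For reference, the F<sub>β</sub> computation itself can be sketched as follows. This is only the final scoring formula applied to edit counts that are already known. It is not the MaxMatch scorer, which also has to find the system edits that best match the gold annotations. The function name `f_beta` and the handling of empty denominators are assumptions made for this example.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from counts of correct (tp), spurious (fp) and missed (fn) edits.

    Assumption for this sketch: precision (recall) is taken to be 1.0 when
    there are no proposed (gold) edits at all.
    """
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# 3 correct edits, 1 spurious edit, 3 missed edits: P = 0.75, R = 0.50
print(round(f_beta(3, 1, 3), 4))          # F0.5 leans towards precision
print(round(f_beta(3, 1, 3, beta=1), 4))  # F1 for comparison
```

With β = 0.5 the score sits closer to precision than F1 does, which is why systems that propose fewer but more reliable edits tend to fare better under this metric.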

The shared task setting requires that systems use only publicly available datasets for training, to ensure a fair comparison between systems. The highest published scores on the CoNLL-2014 test set are given below. A distinction is made between papers that report results in the restricted CoNLL-2014 shared task setting of training using publicly-available training datasets only (_**Restricted**_) and those that made use of large, non-public datasets (_**Unrestricted**_).

**Restricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Majority-voting ensemble (7 systems) (Omelianchuk et al., BEA 2024) | 72.8 | [Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models](https://arxiv.org/abs/2404.14914) | [official](https://github.com/grammarly/pillars-of-gec) |
| GRECO (Qorib and Ng, EMNLP 2023) | 71.12 | [System Combination via Quality Estimation for Grammatical Error Correction](https://aclanthology.org/2023.emnlp-main.785) | [official](https://github.com/nusnlp/greco) |
| ESC (Qorib et al., NAACL 2022) | 69.51 | [Frustratingly Easy System Combination for Grammatical Error Correction](https://aclanthology.org/2022.naacl-main.143/) | [official](https://github.com/nusnlp/esc) |
| T5 ([t5.1.1.xxl](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md)) trained on [cLang-8](https://github.com/google-research-datasets/clang8) (Rothe et al., ACL-IJCNLP 2021) | 68.87 | [A Simple Recipe for Multilingual Grammatical Error Correction](https://arxiv.org/pdf/2106.03830.pdf) | [T5](https://github.com/google-research/text-to-text-transfer-transformer), [cLang-8](https://github.com/google-research-datasets/clang8) |
| Tagged corruptions - ensemble (Stahlberg and Kumar, 2021)| 68.3 | [Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models](https://www.aclweb.org/anthology/2021.bea-1.4.pdf)| [Official](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa + ELECTRA + RoBERTa ensemble (Mesham et al., EACL 2023) | 67.93 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| TMTC (Lai et al., ACL Findings 2022) | 67.02 | [Type-Driven Multi-Turn Corrections for Grammatical Error Correction](https://aclanthology.org/2022.findings-acl.254) | [official](https://github.com/DeepLearnXMU/TMTC) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + (BERT, RoBERTa, XLNet), ensemble (Omelianchuk et al., BEA 2020) | 66.5 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| Shallow Aggressive Decoding with BART (12+2), single model (beam=1) (Sun et al., ACL 2021) | 66.4 | [Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding](https://aclanthology.org/2021.acl-long.462.pdf) | [Official](https://github.com/AutoTemp/Shallow-Aggressive-Decoding) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa (Mesham et al., EACL 2023) | 66.06 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| DeBERTa(L) + RoBERTa(L) + XLNet (Tarnavskyi et al., ACL 2022) | 65.3 | [Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction](https://aclanthology.org/2022.acl-long.266) | [Official](https://github.com/MaksTarnavskyi/gector-large) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + XLNet, single model (Omelianchuk et al., BEA 2020) | 65.3 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| Transformer + Pre-train with Pseudo Data + BERT (Kaneko et al., ACL 2020) | 65.2 | [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction](https://arxiv.org/pdf/2005.00987.pdf) | [Official](https://github.com/kanekomasahiro/bert-gec) |
| Transformer + Pre-train with Pseudo Data (Kiyono et al., EMNLP 2019) | 65.0 | [An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction](https://arxiv.org/abs/1909.00502) | [Official](https://github.com/butsugiri/gec-pseudodata) |
| Seq2Edits ensemble + Full sequence rescoring  (Stahlberg and Kumar, EMNLP 2020) | 62.7 | [Seq2Edits: Sequence Transduction Using Span-level Edit Operations](https://aclanthology.org/2020.emnlp-main.418.pdf) | [Official](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/transformer_seq2edits.py) |
| Sequence Labeling with edits using BERT, Faster inference (Ensemble)  (Awasthi et al., EMNLP 2019) | 61.2 | [Parallel Iterative Edit Models for Local Sequence Transduction](https://www.aclweb.org/anthology/D19-1435.pdf) | [Official](https://github.com/awasthiabhijeet/PIE) |
| Copy-Augmented Transformer + Pre-train (Zhao and Wang, NAACL 2019) | 61.15 | [Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data](https://arxiv.org/pdf/1903.00138.pdf) | [Official](https://github.com/zhawe01/fairseq-gec) |
| Sequence Labeling with edits using BERT, Faster inference (Single Model) (Awasthi et al., EMNLP 2019) | 59.7 | [Parallel Iterative Edit Models for Local Sequence Transduction](https://www.aclweb.org/anthology/D19-1435.pdf) | [Official](https://github.com/awasthiabhijeet/PIE) |
| CNN Seq2Seq + Quality Estimation (Chollampatt and Ng, EMNLP 2018) | 56.52 | [Neural Quality Estimation of Grammatical Error Correction](http://aclweb.org/anthology/D18-1274) | [Official](https://github.com/nusnlp/neuqe/) |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  56.25 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
| Transformer (Junczys-Dowmunt et al., 2018) | 55.8 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| [Official](https://github.com/grammatical/neural-naacl2018) |
| CNN Seq2Seq (Chollampatt and Ng, 2018)| 54.79 | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |

**Unrestricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| CNN Seq2Seq + Fluency Boost (Ge et al., 2018) |  61.34 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |

_**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.


### CoNLL-2014 10 Annotations

[Bryant and Ng, 2015](http://aclweb.org/anthology/P15-1068) released 8 additional annotations (in addition to the two official annotations) for the CoNLL-2014 shared task test set ([link](http://www.comp.nus.edu.sg/~nlp/sw/10gec_annotations.zip)).

**Restricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| GRECO (Qorib and Ng, EMNLP 2023) | 85.21 | [System Combination via Quality Estimation for Grammatical Error Correction](https://aclanthology.org/2023.emnlp-main.785/) | [official](https://github.com/nusnlp/greco) |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  72.04 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
| CNN Seq2Seq (Chollampatt and Ng, 2018)| 70.14 (measured by Ge et al., 2018) | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |

**Unrestricted**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| CNN Seq2Seq + Fluency Boost (Ge et al., 2018) |  76.88 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |

_**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.


### JFLEG

The [JFLEG test set](https://github.com/keisks/jfleg) released by [Napoles et al., 2017](http://aclweb.org/anthology/E17-2037) consists of 747 English sentences with 4 references for each sentence. Models are evaluated with the [GLEU](https://github.com/cnap/gec-ranking/) metric ([Napoles et al., 2016](https://arxiv.org/pdf/1605.02592.pdf)).


**Restricted**:  

| Model           | GLEU  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Tagged corruptions (Stahlberg and Kumar, 2021)| 64.7 | [Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models](https://www.aclweb.org/anthology/2021.bea-1.4.pdf)| [Official](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) |
| Transformer + Pre-train with Pseudo Data + BERT (Kaneko et al., ACL 2020) | 62.0 | [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction](https://arxiv.org/pdf/2005.00987.pdf) | [Official](https://github.com/kanekomasahiro/bert-gec) |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) |  61.50 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA |
| Transformer (Junczys-Dowmunt et al., 2018) | 59.9 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| NA |
| CNN Seq2Seq (Chollampatt and Ng, 2018)| 57.47 | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) |


**Unrestricted**:

| Model           | GLEU  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| CNN Seq2Seq + Fluency Boost and inference (Ge et al., 2018) |  62.42 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA |

_**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets.


### BEA Shared Task - 2019
The [BEA shared task - 2019 dataset](https://www.cl.cam.ac.uk/research/nl/bea2019st/), released for the BEA Shared Task on Grammatical Error Correction, provides a newer and bigger dataset for evaluating GEC models in 3 tracks, based on the datasets used for training:
- [Restricted track](https://competitions.codalab.org/competitions/20228)
- [Unrestricted track](https://competitions.codalab.org/competitions/20229)
- [Low-resource track](https://competitions.codalab.org/competitions/20230)   


Training and dev sets are released publicly and a GEC model's performance is evaluated by F0.5 score. The model outputs on the test set have to be uploaded to CodaLab (publicly available), where category-wise error metrics are displayed. The test set consists of 4477 sentences (larger and more diverse than the CoNLL-2014 dataset) and the outputs are scored via the [ERRANT](https://github.com/chrisjbryant/errant) toolkit. The released data are collected from 2 sources: 
  - Write & Improve, an online web platform that assists non-native English students with their writing.
  - LOCNESS, a corpus consisting of essays written by native English students.   


The description of tracks from the BEA [site](https://www.cl.cam.ac.uk/research/nl/bea2019st/#tracks) is given below:   


_**Restricted Track:**_
In the restricted track, participants may only use the following learner datasets:
  - FCE (Yannakoudakis et al., 2011)
  - Lang-8 Corpus of Learner English (Mizumoto et al., 2011; Tajiri et al., 2012)
  - NUCLE (Dahlmeier et al., 2013)
  - W&I+LOCNESS (Bryant et al., 2019; Granger, 1998)

Note that we restrict participants to the preprocessed Lang-8 Corpus of Learner English rather than the raw, multilingual Lang-8 Learner Corpus because participants would otherwise need to filter the raw corpus themselves. We also do not allow the use of the CoNLL 2013/2014 shared task test sets in this track.   


_**Unrestricted Track:**_
In the unrestricted track, participants may use anything and everything to build their systems. This includes proprietary datasets and software.   


_**Low Resource Track (formerly Unsupervised Track):**_
In the low resource track, participants may only use the following learner dataset: W&I+LOCNESS development set.   

Since current state-of-the-art systems rely on as much annotated learner data as possible to reach the best performance, the goal of the low resource track is to encourage research into systems that do not rely on large amounts of learner data. This track should be of particular interest to researchers working on GEC for languages where large learner corpora do not exist.   


### Results on WI-LOCNESS test set:
**Restricted track**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Majority-voting ensemble (7 systems) (Omelianchuk et al., BEA 2024) | 81.4 | [Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models](https://arxiv.org/abs/2404.14914) | [official](https://github.com/grammarly/pillars-of-gec) |
| GRECO (Qorib and Ng, EMNLP 2023) | 80.84 | [System Combination via Quality Estimation for Grammatical Error Correction](https://aclanthology.org/2023.emnlp-main.785) | [official](https://github.com/nusnlp/greco) |
| ESC (Qorib et al., NAACL 2022) | 79.90| [Frustratingly Easy System Combination for Grammatical Error Correction](https://aclanthology.org/2022.naacl-main.143/) | [official](https://github.com/nusnlp/esc) |
| TMTC (Lai et al., ACL Findings 2022) | 77.93 | [Type-Driven Multi-Turn Corrections for Grammatical Error Correction](https://aclanthology.org/2022.findings-acl.254) | [official](https://github.com/DeepLearnXMU/TMTC) |
| RedPenNet (Didenko & Sameliuk, UNLP 2023) | 77.60 | [RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans](https://aclanthology.org/2023.unlp-1.15/) | [official](https://github.com/WebSpellChecker/unlp-2023-shared-task) |
| RoBERTa(L) + EditScorer (Sorokin, EMNLP 2022) | 77.1 | [Improved grammatical error correction by ranking elementary edits](https://aclanthology.org/2022.emnlp-main.785) | [official](https://github.com/AlexeySorokin/EditScorer) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa + ELECTRA + RoBERTa ensemble (Mesham et al., EACL 2023) | 76.17 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| DeBERTa(L) + RoBERTa(L) + XLNet (Tarnavskyi et al., ACL 2022) | 76.05 | [Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction](https://aclanthology.org/2022.acl-long.266) | [Official](https://github.com/MaksTarnavskyi/gector-large) |
| GECToR large without synthetic pre-training - ensemble (Tarnavskyi and Omelianchuk, 2021) | 76.05 | [Improving Sequence Tagging for Grammatical Error Correction](https://er.ucu.edu.ua/handle/1/2707) | [Official](https://github.com/MaksTarnavskyi/gector-large) |
| T5 ([t5.1.1.xxl](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md)) trained on [cLang-8](https://github.com/google-research-datasets/clang8) (Rothe et al., ACL-IJCNLP 2021) | 75.88 | [A Simple Recipe for Multilingual Grammatical Error Correction](https://arxiv.org/pdf/2106.03830.pdf) | [T5](https://github.com/google-research/text-to-text-transfer-transformer), [cLang-8](https://github.com/google-research-datasets/clang8) |
| Tagged corruptions - ensemble (Stahlberg and Kumar, 2021)| 74.9 | [Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models](https://www.aclweb.org/anthology/2021.bea-1.4.pdf)| [Official](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + (BERT, RoBERTa, XLNet), ensemble (Omelianchuk et al., BEA 2020) | 73.6 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| BEA Combination | 73.18 | [Learning to Combine Grammatical Error Corrections](https://www.aclweb.org/anthology/W19-4414/) | [official](https://github.com/IBM/learning-to-combine-grammatical-error-corrections) |
| Sequence tagging + token-level transformations + two-stage fine-tuning, DeBERTa (Mesham et al., EACL 2023) | 73.09 | [An Extended Sequence Tagging Vocabulary for Grammatical Error Correction](https://aclanthology.org/2023.findings-eacl.119.pdf) | [Official](https://github.com/StuartMesham/gector_experiment_public) |
| Shallow Aggressive Decoding with BART (12+2), single model (beam=1) (Sun et al., ACL 2021) | 72.9 | [Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding](https://aclanthology.org/2021.acl-long.462.pdf) | [Official](https://github.com/AutoTemp/Shallow-Aggressive-Decoding) |
| Sequence tagging + token-level transformations + two-stage fine-tuning + XLNet, single model (Omelianchuk et al., BEA 2020) | 72.4 | [GECToR – Grammatical Error Correction: Tag, Not Rewrite](https://arxiv.org/pdf/2005.12592.pdf) | [Official](https://github.com/grammarly/gector) |
| Transformer + Pre-train with Pseudo Data (Kiyono et al., EMNLP 2019) | 70.2 | [An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction](https://arxiv.org/abs/1909.00502) | NA |
| Transformer + Pre-train with Pseudo Data + BERT (Kaneko et al., ACL 2020) | 69.8 | [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction](https://arxiv.org/pdf/2005.00987.pdf) | [Official](https://github.com/kanekomasahiro/bert-gec) |
| Transformer | 69.47  | [Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data](https://www.aclweb.org/anthology/W19-4427)| [Official: Code to be updated soon](https://github.com/grammatical/pretraining-bea2019) |
| Transformer | 69.00  | [A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning](https://www.aclweb.org/anthology/W19-4423)| [Official](https://github.com/kakaobrain/helo_word/) |
| Ensemble of models | 66.78  | [The LAIX Systems in the BEA-2019 GEC Shared Task](https://www.aclweb.org/anthology/W19-4416)| NA |

**Low-resource track**:

| Model           | F0.5  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Transformer | 64.24  | [Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data](https://www.aclweb.org/anthology/W19-4427)| [Official: Code to be updated soon](https://github.com/grammatical/pretraining-bea2019) |
| Transformer | 58.80  | [A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning](https://www.aclweb.org/anthology/W19-4423)| [Official](https://github.com/kakaobrain/helo_word/) |
| Ensemble of models | 51.81  | [The LAIX Systems in the BEA-2019 GEC Shared Task](https://www.aclweb.org/anthology/W19-4416)| NA |

 
 **Reference**:
 - Helen Yannakoudakis, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, Torsten Zesch, in [Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications](https://www.aclweb.org/anthology/W19-44)
 - Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of Error Types for Grammatical Error Correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.


================================================
FILE: english/information_extraction.md
================================================
# Information Extraction

## Open Knowledge Graph Canonicalization

Open Information Extraction approaches lead to the creation of large Knowledge Bases (KBs) from the web. The problem with such methods is that their entities and relations are not canonicalized, which leads to the storage of redundant and ambiguous facts. For example, an Open KB storing *\<Barack Obama, was born in, Honolulu\>* and *\<Obama, took birth in, Honolulu\>* doesn't know that *Barack Obama* and *Obama* refer to the same entity. Similarly, *took birth in* and *was born in* refer to the same relation. The problem of Open KB canonicalization involves identifying groups of equivalent entities and relations in the KB.
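
To make the effect of canonicalization concrete, here is a minimal sketch of what happens once such groups are known. It does not show how the groups are found, which is the actual task. It only applies given clusters to a list of triples and drops the resulting duplicates. The function name `canonicalize` and the cluster dictionaries are hypothetical and written by hand for this example.

```python
def canonicalize(triples, np_clusters, rel_clusters):
    """Rewrite triples with canonical names and drop duplicates.

    np_clusters / rel_clusters map a canonical name to the set of surface
    forms assumed to be equivalent to it. Unknown forms are left unchanged.
    """
    np_canon = {m: c for c, members in np_clusters.items() for m in members}
    rel_canon = {m: c for c, members in rel_clusters.items() for m in members}

    seen, result = set(), []
    for subj, rel, obj in triples:
        triple = (np_canon.get(subj, subj),
                  rel_canon.get(rel, rel),
                  np_canon.get(obj, obj))
        if triple not in seen:  # redundant facts collapse into one
            seen.add(triple)
            result.append(triple)
    return result


triples = [
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "took birth in", "Honolulu"),
]
np_clusters = {"Barack Obama": {"Barack Obama", "Obama"}}
rel_clusters = {"was born in": {"was born in", "took birth in"}}

print(canonicalize(triples, np_clusters, rel_clusters))
```

With these hand-written clusters, the two facts from the example above reduce to a single stored triple.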

### Datasets 

| Datasets                                 | # Gold Entities | #NPs  | #Relations | #Triples |
| ---------------------------------------- | :-------------: | ----- | ---------- | -------- |
| [Base](https://suchanek.name/work/publications/cikm2014.pdf) |       150       | 290   | 3K         | 9K       |
| [Ambiguous](https://suchanek.name/work/publications/cikm2014.pdf) |       446       | 717   | 11K        | 37K      |
| [ReVerb45K](https://github.com/malllabiisc/cesi) |      7.5K       | 15.5K | 22K        | 45K      |

### Noun Phrase Canonicalization

| **Model**                     |               | Base Dataset |        |               | Ambiguous dataset |        |               | ReVerb45k  |        | **Paper**/Source                         |
| :---------------------------- | :-----------: | :----------: | :----: | :-----------: | :---------------: | ------ | :-----------: | :--------: | :----: | ---------------------------------------- |
|                               | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** |                                          |
| CESI (Vashishth et al., 2018) |     98.2      |     99.8     |  99.9  |     66.2      |       92.4        | 91.9   |     62.7      |    84.4    |  81.9  | [CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information](https://github.com/malllabiisc/cesi) |
| Galárraga et al., 2014 (IDF) |     94.8      |     97.9     |  98.3  |     67.9      |       82.9        | 79.3   |     71.6      |    50.8    |  0.5   | [Canonicalizing Open Knowledge Bases](https://suchanek.name/work/publications/cikm2014.pdf) |

[Go back to the README](../README.md)


================================================
FILE: english/intent_detection_slot_filling.md
================================================
# Intent Detection and Slot Filling
Intent Detection and Slot Filling is the task of interpreting user commands/queries by extracting the intent and the relevant slots.

Example (from ATIS):
```
Query: What flights are available from pittsburgh to baltimore on thursday morning
Intent: flight info
Slots: 
    - from_city: pittsburgh
    - to_city: baltimore
    - depart_date: thursday
    - depart_time: morning
```

## ATIS
ATIS (Air Travel Information System) (Hemphill et al.) is a dataset by Microsoft CNTK, available from the [GitHub page](https://github.com/microsoft/CNTK/tree/master/Examples/LanguageUnderstanding/ATIS). The slots are labeled in the BIO ([Inside Outside Beginning](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))) format (similar to NER). This dataset contains only air travel related commands. Most of the ATIS results are based on the work [here](https://github.com/zhenwenzhang/Slot_Filling).
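
As an illustration of the BIO format, the sketch below turns a per-token tag sequence back into slot/value pairs for the example query above. The slot names follow the simplified names used in that example rather than the label set shipped with the dataset, and the function name `bio_to_slots` is made up for this sketch.

```python
def bio_to_slots(tokens, tags):
    """Group BIO-tagged tokens into (slot_name, value) pairs.

    "B-x" starts a slot x, "I-x" continues the current slot x, "O" is outside.
    Simplification: an "I-" tag that does not continue the current slot is
    treated like "O".
    """
    slots, name, words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if name is not None:
                slots.append((name, " ".join(words)))
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            words.append(token)
        else:
            if name is not None:
                slots.append((name, " ".join(words)))
            name, words = None, []
    if name is not None:  # flush a slot that ends with the sentence
        slots.append((name, " ".join(words)))
    return slots


tokens = ("What flights are available from pittsburgh to baltimore "
          "on thursday morning").split()
tags = ["O", "O", "O", "O", "O", "B-from_city", "O", "B-to_city",
        "O", "B-depart_date", "B-depart_time"]

print(bio_to_slots(tokens, tags))
```

A multi-word value such as "new york" would be tagged `B-to_city I-to_city` and come back as a single slot.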

| Model | Slot F1 Score | Intent Accuracy | Paper / Source | Code |
| ------ | ------ | ------ | ------ | ------ |
| Bi-model with decoder | 96.89 | 98.99  | [A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling](https://arxiv.org/abs/1812.10235) |
| CTRAN | 98.46 | 98.07  | [CTRAN: CNN-Transformer-based network for natural language understanding](https://www.sciencedirect.com/science/article/abs/pii/S0952197623011971) | [Official](https://github.com/rafiepour/CTran/)|
| SlotRefine + BERT | 96.16 | 97.74  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| SlotRefine | 96.22 | 97.11  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| Stack-Propagation + BERT | 96.10 | 97.50 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| JointBERT-CAE | 96.1 | 97.50 | [CAE: Mechanism to Diminish the Class Imbalanced in SLU Slot Filling Task](https://link.springer.com/chapter/10.1007/978-3-031-16210-7_12)|[Official](https://github.com/phuongnm94/JointBERT_CAE)|
| Co-interactive Transformer | 95.90 | 97.70 | [A Co-Interactive Transformer for Joint Slot Filling and Intent Detection](https://arxiv.org/abs/2010.03880)|[Official](https://github.com/kangbrilliant/DCA-Net)|
| Heterogeneous Attention | 95.58 | 97.76 | [Joint agricultural intent detection and slot filling based on enhanced heterogeneous attention mechanism](https://www.sciencedirect.com/science/article/abs/pii/S0168169923001448)|  |
| Stack-Propagation | 95.90 | 96.90 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| Attention Encoder-Decoder NN | 95.87 | 98.43 | [Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling](https://arxiv.org/abs/1609.01454)|
| SF-ID (BLSTM) network | 95.80 | 97.76 | [A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling](https://arxiv.org/abs/1907.00390) | [Official](https://github.com/ZephyrChenzf/SF-ID-Network-For-NLU) |
| Context Encoder | 95.80 | NA | [Improving Slot Filling by Utilizing Contextual Information](https://arxiv.org/pdf/1911.01680.pdf) |
| Capsule-NLU | 95.20 | 95.00 | [Joint Slot Filling and Intent Detection via Capsule Neural Networks](https://arxiv.org/abs/1812.09471) | [Official](https://github.com/czhang99/Capsule-NLU) |
| Joint GRU model(W) | 95.49 | 98.10  |[A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding](https://www.ijcai.org/Proceedings/16/Papers/425.pdf)|
| Slot-Gated BLSTM with Attention | 95.20 | 94.10 | [Slot-Gated Modeling for Joint Slot Filling and Intent Prediction](https://www.csie.ntu.edu.tw/~yvchen/doc/NAACL18_SlotGated.pdf)| [Official](https://github.com/MiuLab/SlotGated-SLU) |
| Joint model with recurrent slot label context  | 94.64 |  98.40 | [Joint Online Spoken Language Understanding and Language Modeling with Recurrent Neural Networks](https://arxiv.org/pdf/1609.01462.pdf) | [Official](https://github.com/HadoopIt/joint-slu-lm) |
| Recursive NN  | 93.96 | 95.40 | [JOINT SEMANTIC UTTERANCE CLASSIFICATION AND SLOT FILLING WITH RECURSIVE NEURAL NETWORKS](https://www.microsoft.com/en-us/research/wp-content/uploads/2014/12/RecNNSLU.pdf) | |
| Encoder-labeler Deep LSTM | 95.66 | NA  | [Leveraging Sentence-level Information with Encoder LSTM for Natural Language Understanding](https://arxiv.org/abs/1601.01530) |
| RNN with Label Sampling  | 94.89 | NA | [Recurrent Neural Network Structured Output Prediction for Spoken Language Understanding](http://speech.sv.cmu.edu/publications/liu-nipsslu-2015.pdf) | |
| Hybrid RNN | 95.06 | NA | [Using recurrent neural networks for slot filling in spoken language understanding.](http://www.iro.umontreal.ca/~lisa/pointeurs/taslp_RNNSLU_final_doubleColumn.pdf) | |
| RNN-EM | 95.25 |  NA  | [Recurrent neural networks with external memory for language understanding](https://arxiv.org/abs/1506.00195) |
| CNN-CRF | 94.35 | NA  | [Convolutional neural network based triangular crf for joint intent detection and slot filling](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/12/IEEE-ASRU-2013.pdf) | |


## SNIPS
SNIPS is a dataset by Snips.ai for benchmarking intent detection and slot filling. It is available from the [GitHub page](https://github.com/snipsco/nlu-benchmark). The dataset contains several categories of day-to-day user commands (e.g. play a song, book a restaurant).

| Model | Slot F1 Score | Intent Accuracy | Paper / Source | Code |
| ------ | ------ | ------ | ------ | ------ |
| CTRAN | 98.30 | 99.42  | [CTRAN: CNN-Transformer-based Network for Natural Language Understanding](https://www.sciencedirect.com/science/article/abs/pii/S0952197623011971) | [Official](https://github.com/rafiepour/CTran/)|
| SlotRefine + BERT | 97.05 | 99.04  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| Stack-Propagation + BERT | 97.00 | 99.00 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| JointBERT-CAE | 97.00 | 98.30 | [CAE: Mechanism to Diminish the Class Imbalanced in SLU Slot Filling Task](https://link.springer.com/chapter/10.1007/978-3-031-16210-7_12)|[Official](https://github.com/phuongnm94/JointBERT_CAE)|
| Heterogeneous Attention | 96.32 | 98.29 | [Joint agricultural intent detection and slot filling based on enhanced heterogeneous attention mechanism](https://www.sciencedirect.com/science/article/abs/pii/S0168169923001448)|  |
| Co-interactive Transformer | 95.90 | 98.80 | [A Co-Interactive Transformer for Joint Slot Filling and Intent Detection](https://arxiv.org/abs/2010.03880)|[Official](https://github.com/kangbrilliant/DCA-Net)|
| Stack-Propagation | 94.20 | 98.00 | [A Stack-Propagation Framework with Token-level Intent Detection for Spoken Language Understanding](https://arxiv.org/abs/1909.02188)|[Official](https://github.com/LeePleased/StackPropagation-SLU)|
| SlotRefine | 93.72 | 97.44  | [SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling](https://aclanthology.org/2020.emnlp-main.152.pdf) | [Official](https://github.com/moore3930/SlotRefine)|
| Context Encoder | 93.60 | NA | [Improving Slot Filling by Utilizing Contextual Information](https://arxiv.org/pdf/1911.01680.pdf) | |
| SF-ID (BLSTM) network | 92.23 | 97.43 | [A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling](https://arxiv.org/abs/1907.00390) | [Official](https://github.com/ZephyrChenzf/SF-ID-Network-For-NLU) |
| Capsule-NLU | 91.80 | 97.70 | [Joint Slot Filling and Intent Detection via Capsule Neural Networks](https://arxiv.org/abs/1812.09471) | [Official](https://github.com/czhang99/Capsule-NLU) |
| Slot-Gated BLSTM with Attention | 88.80 | 97.00 | [Slot-Gated Modeling for Joint Slot Filling and Intent Prediction](https://www.csie.ntu.edu.tw/~yvchen/doc/NAACL18_SlotGated.pdf)| [Official](https://github.com/MiuLab/SlotGated-SLU) |


================================================
FILE: english/keyphrase_extraction_generation.md
================================================
# Keyphrase Extraction and Generation

Keyphrase extraction is the NLP task of identifying **key** phrases in a document, and has a wide range of applications such as information retrieval, question answering, and text summarization. Keyphrases fall into two groups: those that occur directly in the document are termed **present** keyphrases in the literature, while those that do not occur in the document, but can still function as appropriate summaries/tags for it, are termed **absent** keyphrases. Traditionally, NLP research addressed extracting the **present** keyphrases, while post-deep-learning approaches also consider **absent** keyphrases. Thus, while Keyphrase Extraction (KPE) can be framed as a "sequence labeling" problem, Keyphrase Generation (KPG) is treated as a "sequence to sequence" generation problem. Another dominant approach treats both together as a single generation problem.
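The present/absent distinction can be illustrated with a minimal sketch. It assumes lowercased whitespace tokenization and exact token-sequence matching; published evaluations typically also apply stemming, which is omitted here. The function name `split_present_absent` and the example text are hypothetical.

```python
def split_present_absent(document, keyphrases):
    """Split keyphrases into those occurring verbatim in the document and the rest.

    Assumption: lowercased whitespace tokens, no stemming or punctuation handling.
    """
    doc_tokens = document.lower().split()
    present, absent = [], []
    for phrase in keyphrases:
        phrase_tokens = phrase.lower().split()
        n = len(phrase_tokens)
        # A phrase is "present" if its tokens appear as a contiguous run in the document.
        found = n > 0 and any(
            doc_tokens[i:i + n] == phrase_tokens
            for i in range(len(doc_tokens) - n + 1)
        )
        (present if found else absent).append(phrase)
    return present, absent


doc = "We study neural keyphrase generation with sequence to sequence models"
gold = ["keyphrase generation", "sequence to sequence models", "text summarization"]
present, absent = split_present_absent(doc, gold)
```

In this toy example the first two phrases are present and "text summarization" is absent, so only a generation model could produce it.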

Two recent surveys summarize research on this topic:
1. "A Survey on Recent Advances in Keyphrase Extraction from Pre-trained Language Models". [Song et.al., 2023](https://aclanthology.org/2023.findings-eacl.161/). EACL 2023. 
2. "From statistical methods to deep learning, automatic keyphrase prediction: A survey". [Xie et.al., 2023](https://www.sciencedirect.com/science/article/pii/S030645732300119X). Information Processing and Management 60(4). 

### Standard Datasets and Evaluation Measures

There are several open datasets for this task, and they generally consist of text instances, each followed by a list of assigned keyphrases. Keyphrases are either manually annotated or extracted automatically from pre-tagged web content in the training data. Keyphrases can be either *present* or *absent* in the text itself.

#### **Commonly used Datasets**

#### KP20K
This dataset was first described in [Meng et.al., 2017](https://aclanthology.org/P17-1054/) and contains the titles, abstracts, and keyphrases of 20,000 scientific articles in computer science, extracted automatically. It can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/kp20k).

#### Inspec
The dataset consists of 2000 English scientific abstracts from the [Inspec](https://en.wikipedia.org/wiki/Inspec) database, with keyphrases annotated by professional indexers. The dataset is described in [Hulth, 2003](https://aclanthology.org/W03-1028/) and can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/inspec). 

#### Krapivin
Krapivin consists of 2000 English scientific articles (full text) from the computer science domain, with keyphrases annotated by the authors and verified by the reviewers. The dataset is described in [Krapivin et.al., 2010](https://link.springer.com/chapter/10.1007/978-3-642-13654-2_12) and can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/krapivin).

#### NUS
NUS consists of about 200 English scientific publications (full text), with keyphrases annotated by the authors, as well as an independent set of annotators. The dataset is described in [Nguyen and Kan, 2007](https://link.springer.com/chapter/10.1007/978-3-540-77094-7_41) and can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/nus). 

#### SemEval
The SemEval dataset was originally used in the [SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications](https://aclanthology.org/S17-2091/), and consists of 500 English open-access scientific publications from ScienceDirect. Keyphrases were annotated by a set of student volunteers, followed by a second annotation by an expert annotator. It can be accessed from [Huggingface hub](https://huggingface.co/datasets/midas/semeval2017).

#### Other Datasets

#### DUC
This dataset [Wan and Xiao, 2008](https://dl.acm.org/doi/10.5555/1620163.1620205) consists of around 300 English news articles with their keyphrases, and is hosted on [Huggingface hub](https://huggingface.co/datasets/midas/duc2001).

#### KPTimes
KPTimes [Gallina et.al., 2019](https://aclanthology.org/W19-8617/) is a large dataset of 279,923 news articles from NYTimes and 10,000 articles from JPTimes, with curated keyphrase annotations by editors, and is hosted on [Huggingface hub](https://huggingface.co/datasets/midas/kptimes).

#### OpenKP
OpenKP [Xiong et.al., 2019](https://aclanthology.org/D19-1521/) consists of approximately 150K web documents with manually annotated keyphrases, and is hosted on [Huggingface hub](https://huggingface.co/datasets/midas/openkp). 

**Evaluation Measures**

Macro precision, recall, and F1 scores are calculated over the top-k predictions by comparing the model output against the ground-truth keyphrases. F1\@k with k = 5 or 10 is most commonly reported, along with variants such as F1\@O and F1\@M. F1\@O uses the number of gold keyphrases as k, and F1\@M uses the number of predicted keyphrases as k. For "absent" keyphrases, some papers also report R\@10/50. The following tables list the models with their F1\@5 scores for the five most commonly reported datasets: KP20K, Inspec, Krapivin, NUS, and SemEval. (Most recent research reports experiments using KP20K as training data and testing on KP20K, NUS, SemEval, Inspec, and Krapivin.)
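The top-k measures described above can be sketched as follows. This is a simplified illustration, not the evaluation script of any listed paper: it assumes phrases are already normalized so that exact string match is meaningful, and it divides precision by the number of predictions actually kept (some papers divide by k instead). The function names are hypothetical.

```python
def f1_at_k(predicted, gold, k):
    """F1 over the top-k ranked predictions for a single document."""
    # Drop duplicate predictions while keeping rank order, then truncate to k.
    top_k = list(dict.fromkeys(predicted))[:k]
    gold_set = set(gold)
    correct = sum(1 for phrase in top_k if phrase in gold_set)
    # Assumption: precision is taken over the predictions kept, not padded to k.
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1_at_k(all_predicted, all_gold, k):
    """Macro average: compute F1@k per document, then take the mean."""
    scores = [f1_at_k(p, g, k) for p, g in zip(all_predicted, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0


predicted = ["a", "b", "c", "d", "e", "f"]
gold = ["a", "c", "x"]
f1_5 = f1_at_k(predicted, gold, 5)               # F1@5
f1_o = f1_at_k(predicted, gold, len(gold))       # F1@O: k = number of gold keyphrases
f1_m = f1_at_k(predicted, gold, len(predicted))  # F1@M: k = number of predictions
```

A micro-averaged variant would instead pool the counts of correct, predicted, and gold phrases over all documents before computing a single F1.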

Here are a few notes on results:
 - Asterisk indicates the paper reported **Micro** scores, instead of Macro.  
 - Exclamation indicates the paper does not mention whether they report a macro or a micro measure.  
 - All results are from the original results reported in the paper that describes the model.    
 - Some papers report on a scale of 0-1 and some on 0-100 (sometimes, the same paper uses different scales in different tables!). All results below are changed to 0-1 to maintain uniformity. 

#### KP20K 

| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|0.232 (!) |0.044 (!) |[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.351(!) | 0.032(!)  | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) |0.370 | 0.050 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.363 | 0.067 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |  
|SetTrans (Ye et.al., 2021) | 0.358| 0.036| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| 0.408 (!) | 0.047 (!)| [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) |0.311 |0.016 |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) |- |- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.333 | -| [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |  


#### SemEval
| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|-|- |[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.329 (!)| 0.028 (!) | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) |0.360 | 0.043 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.343 | 0.053 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |
|SetTrans (Ye et.al., 2021) | 0.331| 0.026| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| 0.416 (!) | 0.030 (!)| [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) |
|ExHiRD-h (Chen et.al., 2020) |0.284 | 0.017|[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) |0.320 | - | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.293 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

#### Inspec

| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|0.352 (!)|0.049 (!)|[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.26 (!) | 0.017(!) | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) | 0.330 | 0.025 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.322 | 0.036 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |  
|SetTrans (Ye et.al., 2021) | 0.285| 0.021| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|  
| UniKeyphrase (Wu et.al., 2021)| 0.29 (!) | 0.029 (!)| [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) |0.253 |0.011 |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) | - |- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.278 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

#### Krapivin
| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
| P-AKG (Wu et.al., 2022) | - |- | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|ChatGPT (Martinez et.al., 2023)|-|-|[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
|WR-SetTrans (Xie et.al., 2022) | 0.360|0.057 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.323 | 0.078 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) | 
|SetTrans (Ye et.al., 2021) | 0.326| 0.047| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| - | - | [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) | 0.286 | 0.022 |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) | 0.318|- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)|
|CopyRNN (Meng et.al., 2017) |0.311 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

#### NUS 
| Model           | Present-F1\@5 | Absent-F1\@5 | Paper / Source | Code |
| --------------- | :-----: |  :-----: | -------------- | ---- |
|ChatGPT (Martinez et.al., 2023)|-|-|[ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task](https://arxiv.org/abs/2304.14177) | - |
| P-AKG (Wu et.al., 2022) |  0.412 (!)| 0.036(!) | [Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning](https://ojs.aaai.org/index.php/AAAI/article/view/21402) | -  |
|WR-SetTrans (Xie et.al., 2022) |0.428 |0.057 |[WR-One2Set: Towards Well-Calibrated Keyphrase Generation](https://aclanthology.org/2022.emnlp-main.491/)|
| Beam+KPD-A (Chowdhury et.al., 2022) | 0.418 | 0.079 | [KPDROP: Improving Absent Keyphrase Generation](https://aclanthology.org/2022.findings-emnlp.357) |
|SetTrans (Ye et.al., 2021) | 0.406| 0.042| [One2Set: Generating Diverse Keyphrases as a Set](https://aclanthology.org/2021.acl-long.354/)|
| UniKeyphrase (Wu et.al., 2021)| 0.434 (!) | 0.037 (!) | [UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction](https://aclanthology.org/2021.findings-acl.73/) | 
|ExHiRD-h (Chen et.al., 2020) | - | - |[Exclusive Hierarchical Decoding for Deep Keyphrase Generation](https://aclanthology.org/2020.acl-main.103) |
|CorrRNN (Chen et.al., 2018) |0.358 |- | [Keyphrase Generation with Correlation Constraints](https://aclanthology.org/D18-1439)| 
|CopyRNN (Meng et.al., 2017) |0.334 | - | [Deep Keyphrase Generation](https://aclanthology.org/P17-1054) |

[Go back to the README](../README.md)


================================================
FILE: english/language_modeling.md
================================================
# Language modeling

Language modeling is the task of predicting the next word or character in a document.

\* indicates models using dynamic evaluation, where, at test time, models may adapt to seen tokens in order to improve performance on following tokens ([Mikolov et al., (2010)](https://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf), [Krause et al., (2017)](https://arxiv.org/pdf/1709.07432)).
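The idea of adapting to seen test tokens can be illustrated with a toy count-based analogue. This is only a sketch: the cited papers adapt neural model parameters with gradient updates, whereas the example below uses an add-one-smoothed unigram model whose counts are updated after each test token has been scored. The function name and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens, vocab_size, adapt=False):
    """Perplexity of an add-one-smoothed unigram model on test_tokens.

    With adapt=True the model updates its counts with each test token,
    but only after that token has been scored (a toy stand-in for
    dynamic evaluation, not the gradient-based method of the cited papers).
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)
    neg_log_likelihood = 0.0
    for token in test_tokens:
        prob = (counts[token] + 1) / (total + vocab_size)
        neg_log_likelihood -= math.log(prob)
        if adapt:
            counts[token] += 1
            total += 1
    return math.exp(neg_log_likelihood / len(test_tokens))


train = "a b c a b c".split()
test = "c c c c".split()
static_ppl = unigram_perplexity(train, test, vocab_size=3)
dynamic_ppl = unigram_perplexity(train, test, vocab_size=3, adapt=True)
```

Because the test sequence repeats a token, the adapting model assigns it increasing probability, so `dynamic_ppl` comes out lower than `static_ppl`.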

## Word Level Models

### Penn Treebank

A common evaluation dataset for language modeling is the Penn Treebank,
as pre-processed by [Mikolov et al., (2011)](https://www.isca-speech.org/archive/archive_papers/interspeech_2011/i11_0605.pdf).
The dataset consists of 929k training words, 73k validation words, and
82k test words. As part of the pre-processing, words were lower-cased, numbers
were replaced with N, newlines were replaced with `<eos>`,
and all other punctuation was removed. The vocabulary is
the most frequent 10k words with the rest of the tokens replaced by an `<unk>` token.
Models are evaluated based on perplexity, which is the exponential of the average
negative per-word log-probability (lower is better).
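Perplexity can be computed from the per-word log-probabilities a model assigns to the test set, as the exponential of their negative mean. The sketch below assumes natural-log probabilities are already available; the function name `perplexity` is illustrative.

```python
import math


def perplexity(log_probs):
    """Perplexity from natural-log probabilities of each test word."""
    avg_neg_log_prob = -sum(log_probs) / len(log_probs)
    return math.exp(avg_neg_log_prob)


# A model that spreads probability uniformly over a 10k-word vocabulary
# assigns each word probability 1/10000.
uniform_log_probs = [math.log(1 / 10000)] * 5
uniform_ppl = perplexity(uniform_log_probs)
```

For a uniform model the perplexity equals the vocabulary size, which gives a reference point for the values in the table below: a test perplexity near 50 means the model is, on average, about as uncertain as a uniform choice among 50 words.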

| Model           | Validation perplexity | Test perplexity | Number of params |  Paper / Source | Code |
| ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)            | 42.9  | 42.9  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)      | 44.9  | 44.8  | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 46.63 | 46.01 | 22M | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 47.38 | 46.54 | 22M | [FRAGE: Frequency-Agnostic Word Representation](https://arxiv.org/abs/1809.06858) | [Official](https://github.com/ChengyueGongR/Frequency-Agnostic) |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 48.63 | 47.17 | 185M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | 22M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| Mogrifier RLSTM (Melis, 2022)                           | 48.9  | 47.9  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)                     | 51.4  | 50.1  | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | 24M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) ***preprint*** | 53.79 | 52.00 | 23M | [Partially Shuffling the Training Data to Improve Language Models](https://arxiv.org/abs/1903.04167) | [Official](https://github.com/ofirpress/PartialShuffle) |
| AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | 23M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | 24M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| Trellis Network (Bai et al., 2019) |   -   | 54.19 | 34M | [Trellis Networks for Sequence Modeling](https://openreview.net/pdf?id=HyeVtoRqtQ) | [Official](https://github.com/locuslab/trellisnet) |
| AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 56.44 | 54.33 | 22M | [Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes](https://arxiv.org/abs/1909.08700) | [Official](https://github.com/nkcr/overlap-ml) |
| AWD-LSTM-MoS + finetune (Yang et al., 2018) | 56.54 | 54.44 | 22M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| Transformer-XL (Dai et al., 2018) ***under review*** | 56.72 | 54.52 | 24M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| AWD-LSTM-MoS (Yang et al., 2018) | 58.08 | 55.97 | 22M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) |  58.9 | 56.8 | 24M | [Fraternal dropout](https://arxiv.org/pdf/1711.00066.pdf) | [Official](https://github.com/kondiz/fraternal-dropout) |
| AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | 24M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |

### WikiText-2

[WikiText-2](https://arxiv.org/abs/1609.07843) has been proposed as a more realistic
benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2
consists of around 2 million words extracted from Wikipedia articles.

| Model           | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)            | 39.3  | 38.0  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)      | 40.2  | 38.6  | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 40.27 | 38.65 | 35M | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 40.85 | 39.14 | 35M | [FRAGE: Frequency-Agnostic Word Representation](https://arxiv.org/abs/1809.06858) | [Official](https://github.com/ChengyueGongR/Frequency-Agnostic) |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | 35M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | 33M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | 33M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 54.19 | 53.09 | 185M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| Mogrifier RLSTM (Melis, 2022)                           | 56.7  | 55.0  | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)                     | 57.3  | 55.1  | 35M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) ***preprint*** | 60.16 | 57.85 | 37M | [Partially Shuffling the Training Data to Improve Language Models](https://arxiv.org/abs/1903.04167) | [Official](https://github.com/ofirpress/PartialShuffle) |
| AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | 37M | [Direct Output Connection for a High-Rank Language Model](https://arxiv.org/abs/1808.10143) | [Official](https://github.com/nttcslab-nlp/doc_lm) |
| AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | 35M | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) | [Official](https://github.com/zihangdai/mos) |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) |  66.8 | 64.1 | 34M | [Fraternal dropout](https://arxiv.org/pdf/1711.00066.pdf) | [Official](https://github.com/kondiz/fraternal-dropout) |
| AWD-LSTM + ATOI (Kocher et al., 2019) | 67.47 | 64.73 | 33M | [Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes](https://arxiv.org/abs/1909.08700) | [Official](https://github.com/nkcr/overlap-ml) |
| AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | 33M | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) | [Official](https://github.com/salesforce/awd-lstm-lm) |

### WikiText-103

The [WikiText-103](https://arxiv.org/abs/1609.07843) corpus contains 267,735 unique words, and each word occurs at least three times in the training set.

| Model           | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| ------------- | :---:| :---:| :---:| -------- | --- |
| Routing Transformer (Roy et al., 2020)* ***arxiv preprint*** | - | 15.8 | - | [Efficient Content-Based Sparse Attention with Routing Transformers](https://arxiv.org/pdf/2003.05997.pdf) | - |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* ***arxiv preprint*** | 15.8 | 16.4 | 257M | [Dynamic Evaluation of Transformer Language Models](https://arxiv.org/pdf/1904.08378.pdf) | [Official](https://github.com/benkrause/dynamiceval-transformer) |
| Compressive Transformer (Rae et al., 2019)* ***arxiv preprint*** | 16.0 | 17.1 (16.1 with basic dynamic evaluation) | ~257M | [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/pdf/1911.05507.pdf) | - |
| SegaTransformer-XL (Bai et al., 2020) | - | 17.1 | 257M | [Segatron: Segment-Aware Transformer for Language Modeling and Understanding](https://arxiv.org/abs/2004.14996) | [Official](https://github.com/rsvp-ai/segatron_aaai) |
| Transformer-XL Large (Dai et al., 2018) ***under review*** | 17.7 | 18.3 | 257M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Transformer with tied adaptive embeddings (Baevski and Auli, 2018) | 19.8 | 20.5 | 247M | [Adaptive Input Representations for Neural Language Modeling](https://arxiv.org/pdf/1809.10853.pdf) | [Link](https://github.com/AranKomat/adapinp) |
| TaLK Convolutions (Lioutas et al., 2020)| - | 23.3 | 240M | [Time-aware Large Kernel Convolutions](https://arxiv.org/abs/2002.03184) | [Official](https://github.com/lioutasb/TaLKConvolutions) |
| Transformer-XL Standard (Dai et al., 2018) ***under review*** | 23.1 | 24.0 | 151M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| AdvSoft + 4 layer QRNN + dynamic eval (Wang et al., 2019) | 27.2 | 28.0 |  | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) | [Official](https://github.com/ChengyueGongR/advsoft) |
| LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | | [Fast Parametric Learning with Activation Memorization](http://arxiv.org/abs/1803.10049) ||
| Trellis Network (Bai et al., 2019) |   -   | 30.35 | 180M | [Trellis Networks for Sequence Modeling](https://openreview.net/pdf?id=HyeVtoRqtQ) | [Official](https://github.com/locuslab/trellisnet) |
| AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 31.92 | 32.85 | | [Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes](https://arxiv.org/abs/1909.08700) | [Official](https://github.com/nkcr/overlap-ml) |
| LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | | [Fast Parametric Learning with Activation Memorization](http://arxiv.org/abs/1803.10049) ||
| LSTM (Rae et al., 2018) | 36.0 | 36.4 | | [Fast Parametric Learning with Activation Memorization](http://arxiv.org/abs/1803.10049) ||
| Gated CNN (Dauphin et al., 2016) | - | 37.2 | | [Language modeling with gated convolutional networks](https://arxiv.org/abs/1612.08083) ||
| Neural cache model (size = 2,000) (Grave et al., 2017) | - | 40.8 | | [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/pdf/1612.04426.pdf) | [Link](https://github.com/kaishengtai/torch-ntm) |
| Temporal CNN (Bai et al., 2018) | - | 45.2 | | [Convolutional sequence modeling revisited](https://openreview.net/forum?id=BJEX-H1Pf) ||
| LSTM (Grave et al., 2017) | - | 48.7 | | [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/pdf/1612.04426.pdf) | [Link](https://github.com/kaishengtai/torch-ntm) |

### 1B Words / Google Billion Word benchmark

[The One-Billion Word benchmark](https://arxiv.org/pdf/1312.3005.pdf) is a large dataset derived from a news-commentary site.
The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words.
Importantly, sentences in this dataset are shuffled and hence context is limited.

| Model         | Test perplexity | Number of params | Paper / Source | Code |
| ------------- | :-----:| :-----:| --------- | --- |
| Transformer-XL Large (Dai et al., 2018) ***under review*** | 21.8 | 0.8B | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Transformer-XL Base (Dai et al., 2018) ***under review*** | 23.5 | 0.46B | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Transformer with shared adaptive embeddings - Very large (Baevski and Auli, 2018) | 23.7 | 0.8B | [Adaptive Input Representations for Neural Language Modeling](https://arxiv.org/pdf/1809.10853.pdf) | [Link](https://github.com/AranKomat/adapinp) |
| 10 LSTM+CNN inputs + SNM10-SKIP (Jozefowicz et al., 2016) ***ensemble*** | 23.7 | 43B? | [Exploring the Limits of Language Modeling](https://arxiv.org/pdf/1602.02410.pdf) | [Official](https://github.com/rafaljozefowicz/lm) |
| Transformer with shared adaptive embeddings (Baevski and Auli, 2018) | 24.1 | 0.46B | [Adaptive Input Representations for Neural Language Modeling](https://arxiv.org/pdf/1809.10853.pdf) | [Link](https://github.com/AranKomat/adapinp) |
| Big LSTM+CNN inputs (Jozefowicz et al., 2016) | 30.0 | 1.04B | [Exploring the Limits of Language Modeling](https://arxiv.org/pdf/1602.02410.pdf) ||
| Gated CNN-14Bottleneck (Dauphin et al., 2017) | 31.9 | ? | [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083.pdf) ||
| BIGLSTM baseline (Kuchaiev and Ginsburg, 2018) | 35.1 | 0.151B | [Factorization tricks for LSTM networks](https://arxiv.org/pdf/1703.10722.pdf) | [Official](https://github.com/okuchaiev/f-lm) |
| BIG F-LSTM F512 (Kuchaiev and Ginsburg, 2018) | 36.3 | 0.052B | [Factorization tricks for LSTM networks](https://arxiv.org/pdf/1703.10722.pdf) | [Official](https://github.com/okuchaiev/f-lm) |
| BIG G-LSTM G-8 (Kuchaiev and Ginsburg, 2018) | 39.4 | 0.035B | [Factorization tricks for LSTM networks](https://arxiv.org/pdf/1703.10722.pdf) | [Official](https://github.com/okuchaiev/f-lm) |


## Character Level Models

### Hutter Prize

[The Hutter Prize](http://prize.hutter1.net) Wikipedia dataset, also known as enwiki8, is a byte-level dataset consisting of the
first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset.
Within these 100 million bytes are 205 unique tokens.
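
The byte-level framing can be checked directly. The following is a minimal sketch, assuming the raw dump is available as a binary stream; the helper name `count_unique_bytes` and the local path `enwik8` mentioned in the comment are illustrative and not part of the benchmark. It reads at most the first 100 million bytes and counts the distinct byte values.

```python
import io


def count_unique_bytes(stream, limit=100_000_000, chunk_size=1 << 20):
    """Count distinct byte values among the first `limit` bytes of a binary stream."""
    seen = set()
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream reached before the limit
            break
        seen.update(chunk)  # iterating over bytes yields integers in 0..255
        remaining -= len(chunk)
    return len(seen)


# Tiny in-memory demonstration. For the real data one would pass something
# like open("enwik8", "rb"), where "enwik8" is a hypothetical local path.
demo = io.BytesIO(b"<page>abc</page>")
print(count_unique_bytes(demo))
```

Reading in chunks keeps memory use small even though the full file is 100 MB.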

| Model           | Bits per Character (BPC) |  Number of params | Paper / Source | Code |
| ---------------- | :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)                                  | 0.935 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* ***arxiv preprint*** | 0.94 | 277M | [Dynamic Evaluation of Transformer Language Models](https://arxiv.org/pdf/1904.08378.pdf) | [Official](https://github.com/benkrause/dynamiceval-transformer) |
| Compressive Transformer (Rae et al., 2019) ***arxiv preprint*** | 0.97 | - | [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/pdf/1911.05507.pdf) | - |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)            | 0.988 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| 24-layer Transformer-XL (Dai et al., 2018) ***under review*** | 0.99 | 277M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Longformer Large (Beltagy, Peters, and Cohan; 2020) | 0.99 | 102M | [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) | [Official](https://github.com/allenai/longformer) |
| Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.00 | 41M | [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) | [Official](https://github.com/allenai/longformer) |
| 18-layer Transformer-XL (Dai et al., 2018) ***under review*** | 1.03 | 88M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Mogrifier RLSTM (Melis, 2022)                                 | 1.042 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| 12-layer Transformer-XL (Dai et al., 2018) ***under review*** | 1.06 | 41M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.11 | 44M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| Mogrifier LSTM (Melis et al., 2019)            | 1.122 | 96M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| 3-layer AWD-LSTM (Merity et al., 2018)  | 1.232 | 47M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| Large mLSTM +emb +WN +VD (Krause et al., 2017) | 1.24 | 46M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) | [Official](https://github.com/benkrause/mLSTM) |
| Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) | [Official](https://github.com/amujika/Fast-Slow-LSTM) |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | [Recurrent Highway Networks](https://arxiv.org/abs/1607.03474) | [Official](https://github.com/jzilly/RecurrentHighwayNetworks) |
| FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) | [Official](https://github.com/amujika/Fast-Slow-LSTM) |

### Text8
[The text8 dataset](http://mattmahoney.net/dc/textdata.html) is also derived from Wikipedia text, but has all XML removed, and is lower cased to only have 26 characters of English text plus spaces.

| Model           | Bits per Character (BPC) |  Number of params | Paper / Source | Code |
| ---------------- | :-----: | :-----: | -------------- | ---- |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* ***arxiv preprint*** | 1.038 | 277M | [Dynamic Evaluation of Transformer Language Models](https://arxiv.org/pdf/1904.08378.pdf) | [Official](https://github.com/benkrause/dynamiceval-transformer) |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)               | 1.044 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Transformer-XL Large (Dai et al., 2018) ***under review*** | 1.08 | 277M | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf) | [Official](https://github.com/kimiyoung/transformer-xl) |
| Mogrifier RLSTM (Melis, 2022)                              | 1.096 | 96M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.10 | 41M | [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) | [Official](https://github.com/allenai/longformer) |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.13 | 235M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.18 | 44M | [Character-Level Language Modeling with Deeper Self-Attention](https://arxiv.org/abs/1808.04444) ||
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) | [Official](https://github.com/benkrause/dynamic-evaluation) |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) | [Official](https://github.com/benkrause/mLSTM) |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | [Recurrent Highway Networks](https://arxiv.org/abs/1607.03474) | [Official](https://github.com/jzilly/RecurrentHighwayNetworks) |
| LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 |  35M | [Hierarchical Multiscale Recurrent Neural Networks](https://arxiv.org/abs/1609.01704) ||
| BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | [Recurrent Batch Normalization](https://arxiv.org/abs/1603.09025) | [Official](https://github.com/cooijmanstim/recurrent-batch-normalization) |
| Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) | [Official](https://github.com/benkrause/mLSTM) |

### Penn Treebank
The vocabulary of the words in the character-level dataset is limited to 10 000 - the same vocabulary as used in the word level dataset.  This vastly simplifies the task of character-level language modeling as character transitions will be limited to those found within the limited word level vocabulary.

| Model           | Bits per Character (BPC) |  Number of params | Paper / Source | Code |
| ---------------- | :-----: | :-----: | -------------- | ---- |
| Mogrifier RLSTM + dynamic eval (Melis, 2022)      | 1.061 | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.083 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier RLSTM (Melis, 2022)                     | 1.096 | 24M | [Circling Back to Recurrent Models of Language](https://arxiv.org/abs/2211.01848) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)               | 1.120 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| Trellis Network (Bai et al., 2019) | 1.159 | 13.4M | [Trellis Networks for Sequence Modeling](https://openreview.net/pdf?id=HyeVtoRqtQ) | [Official](https://github.com/locuslab/trellisnet) |
| 3-layer AWD-LSTM (Merity et al., 2018)  | 1.175 | 13.8M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| 6-layer QRNN (Merity et al., 2018)  | 1.187 | 13.8M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) | [Official](https://github.com/salesforce/awd-lstm-lm) |
| FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) | [Official](https://github.com/amujika/Fast-Slow-LSTM) |
| FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) | [Official](https://github.com/amujika/Fast-Slow-LSTM) |
| NASCell (Zoph & Le, 2016) | 1.214 |  16.3M | [Neural Architecture Search with Reinforcement Learning](https://arxiv.org/abs/1611.01578) ||
| 2-layer Norm HyperLSTM (Ha et al., 2016) |  1.219 | 14.4M | [HyperNetworks](https://arxiv.org/abs/1609.09106) ||

### Multilingual Wikipedia Corpus

The character-based [MWC](http://k-kawakami.com/research/mwc) dataset is a collection of Wikipedia pages available in a number of languages. Markup and rare characters were removed, but otherwise no preprocessing was applied.

#### MWC English in the single text, large setting.

| Model           | Validation BPC | Test BPC | Number of params |  Paper / Source | Code |
| ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.200 | 1.187 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)               | 1.312 | 1.298 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| HCLM with Cache (Kawakami et al. 2017)            | 1.591 | 1.538 |  8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) |  |
| LSTM (Kawakami et al. 2017)                       | 1.793 | 1.736 |  8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) |  |

#### MWC Finnish in the single text, large setting.

| Model           | Validation BPC | Test BPC | Number of params |  Paper / Source | Code |
| ------------- | :-----:| :-----: | :-----: | -------------- | ---- |
| Mogrifier LSTM + dynamic eval (Melis et al., 2019)| 1.202 | 1.191 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| Mogrifier LSTM (Melis et al., 2019)               | 1.327 | 1.313 | 24M | [Mogrifier LSTM](http://arxiv.org/abs/1909.01792) | [Official](https://github.com/deepmind/lamb) |
| HCLM with Cache (Kawakami et al. 2017)            | 1.754 | 1.711 |  8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) |  |
| LSTM (Kawakami et al. 2017)                       | 1.943 | 1.913 |  8M | [Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling](https://arxiv.org/abs/1704.06986) |  |

[Go back to the README](../README.md)


================================================
FILE: english/lexical_normalization.md
================================================
# Lexical Normalization

Lexical normalization is the task of translating/transforming a non-standard text to a standard register.

Example:

```
new pix comming tomoroe
new pictures coming tomorrow
```

Datasets usually consist of tweets, since these naturally contain a fair amount of 
these phenomena.

For lexical normalization, only replacements on the word level are annotated.
Some corpora include annotation for 1-N and N-1 replacements. However, word
insertion/deletion and reordering are not part of the task.

### LexNorm
The [LexNorm](http://people.eng.unimelb.edu.au/tbaldwin/etc/lexnorm_v1.2.tgz) corpus was originally introduced by [Han and Baldwin (2011)](http://aclweb.org/anthology/P/P11/P11-1038.pdf).
Several mistakes in annotation were resolved by [Yang and Eisenstein (2013)](http://www.aclweb.org/anthology/D13-1007);
on this page, we only report results on the new dataset. For this dataset, the 2,577
tweets from [Li and Liu (2014)](http://www.aclweb.org/anthology/P14-3012) are often
used as training data, because of their similar annotation style.

This dataset is commonly evaluated with accuracy on the non-standard words. This
means that the system knows in advance which words are in need of normalization.
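
As a rough illustration of this metric, the sketch below computes accuracy only over the tokens whose gold form differs from the raw form. The function name `nonstandard_accuracy` and the assumption of exact, case-sensitive string matching over token-aligned sequences are choices made here for illustration; individual papers may handle casing or alignment differently.

```python
def nonstandard_accuracy(raw, gold, pred):
    """Accuracy restricted to tokens that need normalization (raw != gold)."""
    if not (len(raw) == len(gold) == len(pred)):
        raise ValueError("raw, gold and pred must be token-aligned")
    # Keep only positions where the gold annotation changes the raw token.
    targets = [(g, p) for r, g, p in zip(raw, gold, pred) if r != g]
    if not targets:
        return 0.0
    return sum(g == p for g, p in targets) / len(targets)


raw = "new pix comming tomoroe".split()
gold = "new pictures coming tomorrow".split()
pred = "new pictures coming tomoroe".split()
# Two of the three non-standard words are normalized correctly.
print(nonstandard_accuracy(raw, gold, pred))
```

Standard words such as `new` do not enter the score at all, which is why this setting is easier than one where the system must also detect which words to change.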

| Model           | Accuracy  |  Paper / Source | Code | 
| ------------- | :-----:| --- | --- | 
| MoNoise (van der Goot & van Noord, 2017) | 87.63 | [MoNoise: Modeling Noise Using a Modular Normalization System](http://www.let.rug.nl/rob/doc/clin27.paper.pdf) | [Official](https://bitbucket.org/robvanderg/monoise/) | 
| Joint POS + Norm in a Viterbi decoding (Li & Liu, 2015) | 87.58* | [Joint POS Tagging and Text Normalization for Informal Text](http://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/download/10839/10838) | |
| Syllable based (Xu et al., 2015) | 86.08 | [Tweet Normalization with Syllables](http://www.aclweb.org/anthology/P15-1089) | |
| unLOL (Yang & Eisenstein, 2013) | 82.06 | [A Log-Linear Model for Unsupervised Text Normalization](http://www.aclweb.org/anthology/D13-1007) | |

\* used a slightly different version of the data

#### LexNorm2015

The
[LexNorm2015](https://github.com/noisy-text/noisy-text.github.io/blob/master/2015/files/lexnorm2015.tgz)
dataset was introduced for the shared task on lexical normalization, hosted at
WNUT2015 ([Baldwin et al. (2015)](http://aclweb.org/anthology/W15-4319)).  In
this dataset, 1-N and N-1 replacements are included in the annotation. The
evaluation metrics used are precision, recall and F1 score. However, these are
calculated in a somewhat unusual way:

Precision: out of all normalizations made by the system, how many are correct

Recall: out of all necessary replacements, how many were correctly found

This means that if the system replaces a word which is in need of normalization, 
but chooses the wrong normalization, it is penalized twice.
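
The double penalty can be seen in a small sketch. It assumes token-aligned sequences in which a 1-N replacement is stored as a single string, and it takes precision as correct normalizations over all tokens the system changed and recall as correct normalizations over all tokens that needed a change; the function name `normalization_prf` and these conventions are assumptions of the sketch, not the official scorer.

```python
def normalization_prf(raw, gold, pred):
    """Precision, recall and F1 over token-level normalizations."""
    if not (len(raw) == len(gold) == len(pred)):
        raise ValueError("raw, gold and pred must be token-aligned")
    needed = sum(r != g for r, g in zip(raw, gold))    # tokens that require a change
    changed = sum(r != p for r, p in zip(raw, pred))   # tokens the system changed
    correct = sum(r != p and p == g for r, g, p in zip(raw, gold, pred))
    precision = correct / changed if changed else 0.0
    recall = correct / needed if needed else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


raw = "new pix comming tomoroe".split()
gold = "new pictures coming tomorrow".split()
pred = "new pics coming tomoroe".split()  # "pics" is a wrong normalization
print(normalization_prf(raw, gold, pred))
```

In this example the wrong replacement `pics` counts among the system's changes without being correct, and the word `pix` still counts as needing a change that was not found, so it lowers both scores.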

| Model           | F1  | Precision | Recall | Paper / Source | Code | 
| ------------- | :-----:| :-----:| :-----:| --- | --- | 
| MoNoise (van der Goot & van Noord, 2017) | 86.39 | 93.53 | 80.26 | [MoNoise: Modeling Noise Using a Modular Normalization System](http://www.let.rug.nl/rob/doc/clin27.paper.pdf) | [Official](https://bitbucket.org/robvanderg/monoise/) | 
| Random Forest + novel similarity metric (Jin, 2015) | 84.21 | 90.61 | 78.65 | [NCSU-SAS-Ning: Candidate Generation and Feature Engineering for Supervised Lexical Normalization](http://www.aclweb.org/anthology/W15-4313) | |

[Go back to the README](../README.md)


================================================
FILE: english/machine_translation.md
================================================
# Machine translation

Machine translation is the task of translating a sentence in a source language to a different target language. 

Results with a * indicate that the reported number is the mean test score over the best window, where the window is chosen by average dev-set BLEU score over 
21 consecutive evaluations, as in [Chen et al. (2018)](https://arxiv.org/abs/1804.09849).
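
One reading of this reporting protocol is sketched below: slide a window of 21 consecutive evaluations over the dev-set BLEU curve, pick the window with the highest average, and report the mean test BLEU over that same window. The function name `best_window_test_score`, the tie-breaking rule (earliest window wins), and the toy numbers are assumptions made for illustration.

```python
def best_window_test_score(dev_bleu, test_bleu, window=21):
    """Mean test score over the window with the highest average dev score."""
    if len(dev_bleu) != len(test_bleu):
        raise ValueError("dev and test curves must have the same length")
    if window < 1 or len(dev_bleu) < window:
        raise ValueError("need at least `window` evaluations")
    # Since the window size is fixed, comparing sums is equivalent to comparing means.
    running = sum(dev_bleu[:window])
    best_start, best_sum = 0, running
    for start in range(1, len(dev_bleu) - window + 1):
        running += dev_bleu[start + window - 1] - dev_bleu[start - 1]
        if running > best_sum:
            best_start, best_sum = start, running
    chosen = test_bleu[best_start:best_start + window]
    return sum(chosen) / window


# Toy curves with a window of 3 instead of 21 to keep the example short.
dev = [20, 21, 25, 26, 27, 24]
test = [19, 20, 24, 25, 26, 23]
print(best_window_test_score(dev, test, window=3))
```

Selecting the window on dev scores only, and then averaging test scores inside it, avoids picking a single lucky checkpoint on the test set.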

### WMT 2014 EN-DE

Models are evaluated on the English-German dataset of the Ninth Workshop on Statistical Machine Translation (WMT 2014) based
on BLEU.

| Model           | BLEU  |  Paper / Source |
| ------------- | :-----:| --- |
| Transformer Big + BT (Edunov et al., 2018) | 35.0 | [Understanding Back-Translation at Scale](https://arxiv.org/pdf/1808.09381.pdf) |
| DeepL | 33.3 | [DeepL Press release](https://www.deepl.com/press.html) |
| Admin (Liu et al., 2020) | 30.1 | [Very Deep Transformers for Neural Machine Translation](https://arxiv.org/abs/2008.07772) |
| MUSE (Zhao et al., 2019)| 29.9 | [MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning](https://arxiv.org/abs/1911.09483) |
| DynamicConv (Wu et al., 2019)| 29.7 | [Pay Less Attention With Lightweight and Dynamic Convolutions](https://arxiv.org/abs/1901.10430) |
| TaLK Convolutions (Lioutas et al., 2020)| 29.6 | [Time-aware Large Kernel Convolutions](https://arxiv.org/abs/2002.03184) |
| AdvSoft + Transformer Big (Wang et al., 2019)| 29.52 | [Improving Neural Language Modeling via Adversarial Training](http://proceedings.mlr.press/v97/wang19f/wang19f.pdf) |
| Transformer Big (Ott et al., 2018) | 29.3 | [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) |
| RNMT+ (Chen et al., 2018) | 28.5* | [The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849) |
| Transformer Big (Vaswani et al., 2017) | 28.4 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| Transformer Base (Vaswani et al., 2017) | 27.3 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| MoE (Shazeer et al., 2017) | 26.03 | [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538) |
| ConvS2S (Gehring et al., 2017) | 25.16 | [Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122) | 

### WMT 2014 EN-FR

Similarly, models are evaluated on the English-French dataset of the Ninth Workshop on Statistical Machine Translation (WMT 2014) based
on BLEU.

| Model           | BLEU  |  Paper / Source |
| ------------- | :-----:| --- |
| DeepL | 45.9 | [DeepL Press release](https://www.deepl.com/press.html) |
| Transformer Big + BT (Edunov et al., 2018) | 45.6 | [Understanding Back-Translation at Scale](https://arxiv.org/pdf/1808.09381.pdf) |
| Admin (Liu et al., 2020) | 43.8 | [Understand the Difficulty of Training Transformers](https://arxiv.org/abs/2004.08249) |
| MUSE (Zhao et al., 2019)| 43.5 | [MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning](https://arxiv.org/abs/1911.09483) |
| TaLK Convolutions (Lioutas et al., 2020)| 43.2 | [Time-aware Large Kernel Convolutions](https://arxiv.org/abs/2002.03184) |
| DynamicConv (Wu et al., 2019)| 43.2 | [Pay Less Attention With Lightweight and Dynamic Convolutions](https://arxiv.org/abs/1901.10430) |
| Transformer Big (Ott et al., 2018) | 43.2 | [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) |
| RNMT+ (Chen et al., 2018) | 41.0* | [The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849) |
| Transformer Big (Vaswani et al., 2017) | 41.0 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| MoE (Shazeer et al., 2017) | 40.56 | [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538) |
| ConvS2S (Gehring et al., 2017) | 40.46 | [Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122) | 
| Transformer Base (Vaswani et al., 2017) | 38.1 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |

### WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages

| Model           | BLEU  |  Paper / Source |
| ------------- | :-----:| --- |
| vanilla MNMT models| 17.95 | [Tencent’s Multilingual Machine Translation System for WMT22 Large-Scale African Languages](https://arxiv.org/pdf/2210.09644v1.pdf)|


[Go back to the README](../README.md)


================================================
FILE: english/missing_elements.md
================================================
# Missing Elements

Missing elements are a collection of phenomena that deal with things that are meant, but not explicitly mentioned in the text.
There are different kinds of missing elements, which have different aspects and behaviour. 
For example, [Ellipsis](https://en.wikipedia.org/wiki/Ellipsis_(linguistics)), Fused-Head, Bridging Anaphora, etc.


### Numeric Fused-Head (NFH)
Fused-head (FH) constructions are noun phrases (NPs) in which the head noun is missing and is said to be “fused” with its dependent modifier.
This missing information is implicit and is important for sentence understanding.

The Numeric [Fused-Head dataset](https://github.com/yanaiela/num_fh/tree/master/data/resolution/processed)
consists of ~10K crowd-sourced examples, labeled into 7 different categories, from two types.
In the first type, *Reference*, the missing head is referenced explicitly somewhere else in the discourse, either in the
same sentence or in surrounding sentences.
In the second type, *Implicit*, the missing head does not appear in the text and needs to be inferred by the reader or
hearer based on the context or world knowledge. This category was labeled into the 6 most common categories of the dataset.
Models are evaluated based on accuracy.

Annotated Examples:

#### Reference

| I | bought | 5 | apples | but | got | only | 4 | . |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|   |        |   | HEAD   |     |     |      | NFH-REFERENCE | |

#### Implicit

| Let | 's | meet | at | 5 | tomorrow | ? |
| --- | --- | --- | --- | --- | --- | --- |
|     |    |      |    | NFH-TIME |   |   |


| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Bi-LSTM + Scoring (Elazar and Goldberg, 2019) | 60.8 | [Where’s My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution](https://arxiv.org/abs/1905.10886) | [Official](https://github.com/yanaiela/num_fh) |
| Bi-LSTM + Elmo + Scoring (Elazar and Goldberg, 2019) | 74.0 | [Where’s My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution](https://arxiv.org/abs/1905.10886) | [Official](https://github.com/yanaiela/num_fh) |

## PTB Traces and Null Elements

These are evaluated on section 23 of the Penn Treebank, using a metric defined by Johnson (2002).
An implementation of the metric is available with the code from [Kummerfeld and Klein (2017)](https://github.com/jkkummerfeld/1ec-graph-parser/tree/master/evaluation).

| Model           | F-score  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Kato and Matsubara (2016) | 77.8 | [Transition-Based Left-Corner Parsing for Identifying PTB-Style Nonlocal Dependencies](https://www.aclweb.org/anthology/P16-1088) | |
| Kummerfeld and Klein (2017) | 70.6 | [Parsing with Traces: An O(n^4) Algorithm and a Structural Representation](https://aclweb.org/anthology/papers/Q/Q17/Q17-1031/) | [Github](https://github.com/jkkummerfeld/1ec-graph-parser) |
| Johnson (2002) | 68 | [A simple pattern-matching algorithm for recovering empty nodes and their antecedents](https://www.aclweb.org/anthology/P02-1018) | [Code](http://web.science.mq.edu.au/~mjohnson/code/Restorer.tbz) and [README](http://web.science.mq.edu.au/~mjohnson/code/Restorer-README.txt) |


================================================
FILE: english/multi-task_learning.md
================================================
# Multi-task learning

Multi-task learning aims to learn multiple different tasks simultaneously while maximizing
performance on one or all of the tasks. 

### DecaNLP

The [Natural Language Decathlon](https://arxiv.org/abs/1806.08730) (decaNLP) is a benchmark for studying general NLP 
models that can perform a variety of complex, natural language tasks. 
It evaluates performance on ten disparate natural language tasks.

Results can be seen on the [public leaderboard](https://decanlp.com/).

### GLUE

The [General Language Understanding Evaluation benchmark](https://arxiv.org/abs/1804.07461) (GLUE)
is a tool for evaluating and analyzing the performance of models across a diverse
range of existing natural language understanding tasks. Models are evaluated based on their
average accuracy across all tasks.

The state-of-the-art results can be seen on the public [GLUE leaderboard](https://gluebenchmark.com/leaderboard).

[Go back to the README](../README.md)


================================================
FILE: english/multimodal.md
================================================
# Multimodal

`Multimodal` NLP involves the **combination of different types of information, such as text, speech, images, and videos, to enhance natural language processing tasks**. This allows machines to better comprehend human communication by taking into account additional contextual information beyond just text. For instance, multimodal NLP can be used to enhance machine translation by integrating visual data from images or videos to provide better translations. It can also be used to improve sentiment analysis by incorporating non-textual data such as facial expressions or tone of voice. Multimodal NLP is a growing field of study and is expected to become increasingly significant as more data becomes available across multiple modalities.

## Multimodal Emotion Recognition 

### IEMOCAP

The IEMOCAP dataset ([Busso et al., 2008](https://link.springer.com/article/10.1007/s10579-008-9076-6)) contains the acts of 10 speakers in two-way conversations segmented into utterances. The medium of the conversations in all the videos is English. The database contains the following categorical labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other.

**Monologue:**

| Model           | Accuracy  |  Paper / Source |
| ------------- | :-----:| --- |
| CHFusion (Poria et al., 2017) | 76.5%  | [Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling](https://arxiv.org/pdf/1806.06228.pdf) |
| bc-LSTM (Poria et al., 2017) | 74.10%  | [Context-Dependent Sentiment Analysis in User-Generated Videos](http://sentic.net/context-dependent-sentiment-analysis-in-user-generated-videos.pdf) |

**Conversational:**
The conversational setting enables the models to capture emotions expressed by the speakers in a conversation. Inter-speaker dependencies are considered in this setting.

| Model           |  Weighted Accuracy (WAA)  |  Paper / Source |
| ------------- | :-----:| --- |
| CMN (Hazarika et al., 2018) |  77.62%  | [Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos](http://aclweb.org/anthology/N18-1193) |
| Memn2n | 75.08% | [Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos](http://aclweb.org/anthology/N18-1193) |

## Multimodal Metaphor Recognition

[Mohammad et al., 2016](http://www.aclweb.org/anthology/S16-2003) created a dataset of verb-noun pairs from WordNet that had multiple senses. They annotated these pairs for metaphoricity (metaphor or not a metaphor). The dataset is in English.

| Model                                                        |                            F1 Score                             | Paper / Source                                               | Code        |
| ------------------------------------------------------------ | :----------------------------------------------------------: | ------------------------------------------------------------ | ----------- |
| 5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.75 | [Shutova et al., 2016](http://www.aclweb.org/anthology/N16-1020) | Unavailable |

[Tsvetkov et al., 2014](http://www.aclweb.org/anthology/P14-1024) created a dataset of adjective-noun pairs that they then annotated for metaphoricity. The dataset is in English.

| Model                                                        |                            F1 Score                             | Paper / Source                                               | Code        |
| ------------------------------------------------------------ | :----------------------------------------------------------: | ------------------------------------------------------------ | ----------- |
| 5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.79 | [Shutova et al., 2016](http://www.aclweb.org/anthology/N16-1020) | Unavailable |

## Multimodal Sentiment Analysis

### MOSI
The MOSI dataset ([Zadeh et al., 2016](https://arxiv.org/pdf/1606.06259.pdf)) is a dataset rich in sentimental expressions where 93 people review topics in English. The videos are segmented, and each segment's sentiment label is scored between +3 (strong positive) and -3 (strong negative) by 5 annotators.

| Model           | Accuracy  |  Paper / Source |
| ------------- | :-----:| --- |
| bc-LSTM (Poria et al., 2017) | 80.3%  | [Context-Dependent Sentiment Analysis in User-Generated Videos](http://sentic.net/context-dependent-sentiment-analysis-in-user-generated-videos.pdf) |
| MARN (Zadeh et al., 2018) | 77.1%  | [Multi-attention Recurrent Network for Human Communication Comprehension](https://arxiv.org/pdf/1802.00923.pdf) |

## Visual Question Answering

### VQAv2 

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

- [Website](https://visualqa.org)
- [Challenge](https://visualqa.org/challenge.html)

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| UNITER (Chen et al., 2019) | 73.4 | [UNITER: LEARNING UNIVERSAL IMAGE-TEXT REPRESENTATIONS](https://arxiv.org/pdf/1909.11740.pdf) | [Link](https://github.com/ChenRocks/UNITER) |
| LXMERT (Tan et al., 2019) | 72.54 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |

### GQA - Visual Reasoning in the Real World 

GQA focuses on real-world compositional reasoning. 

- [Website](https://cs.stanford.edu/people/dorarad/gqa/)
- [Challenge](https://cs.stanford.edu/people/dorarad/gqa/challenge.html)

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| KaKao Brain | 73.24 | [GQA Challenge](https://drive.google.com/file/d/1CtFk0ldbN5w2qhwvfKrNzAFEj-I9Tjgy/view) | Unavailable |
| LXMERT (Tan et al., 2019) | 60.3 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |

### TextVQA

TextVQA requires models to read and reason about text in an image to answer questions based on them.

- [Website](https://textvqa.org/)
- [Challenge](https://textvqa.org/challenge)

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| M4C (Hu et al., 2020) | 40.46 | [Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/pdf/1911.06258.pdf) | [Link](https://github.com/facebookresearch/pythia/tree/project/m4c/projects/M4C_Captioner) |


### VizWiz dataset

This task focuses on answering visual questions that originate from a real use case where blind people were submitting images with recorded spoken questions in order to learn about their physical surroundings.
- [Website](https://vizwiz.org/tasks-and-datasets/vqa/)
- [Challenge](https://vizwiz.org/tasks-and-datasets/vqa/)

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Pythia | 54.22 | [FB's Pythia repository](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) | [Link](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) |
| BUTD Vizwiz (Gurari et al., 2018) | 46.9 | [VizWiz Grand Challenge: Answering Visual Questions from Blind People](https://arxiv.org/abs/1802.08218) | Unavailable |

## Other multimodal resources

- [awesome-multimodal-ml](https://github.com/pliang279/awesome-multimodal-ml)
- [awesome-vision-and-language-papers](https://github.com/sangminwoo/awesome-vision-and-language-papers)

[Go back to the README](../README.md)


================================================
FILE: english/named_entity_recognition.md
================================================
# Named entity recognition

Named entity recognition (NER) is the task of tagging entities in text with their corresponding type.
Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities.
O is used for non-entity tokens.

Example:

| Mark  | Watney | visited | Mars  |
| ----- | ------ | ------- | ----- |
| B-PER | I-PER  | O       | B-LOC |
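
The BIO scheme above can be decoded back into entity spans. The following is a minimal sketch, not a reference implementation: the function name `bio_to_spans` and the lenient handling of a stray `I-` tag (treated as the start of a new span) are assumptions made for illustration.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (type, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        # Assumption: an I- tag with no open span is treated like a B- tag.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        # Close a span that runs to the end of the sentence.
        spans.append((label, start, len(tags)))
    return spans


tokens = ["Mark", "Watney", "visited", "Mars"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

For the example sentence this yields one `PER` span covering "Mark Watney" and one `LOC` span covering "Mars".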

### CoNLL 2003 (English)

The [CoNLL 2003 NER task](http://www.aclweb.org/anthology/W03-0419.pdf) consists of newswire text from the Reuters RCV1 
corpus tagged with four different entity types (PER, LOC, ORG, MISC). Models are evaluated based on span-based F1 on the test set. ♦ used both the train and development splits for training.
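
As a rough illustration of span-based F1, the sketch below counts a predicted entity as correct only if its sentence, type, and boundaries all match a gold entity exactly, then micro-averages over the corpus. The function name `span_f1` and the `(sentence_id, type, start, end)` tuple layout are assumptions for this example; published results are normally computed with the shared task's own evaluation script.

```python
def span_f1(gold_spans, pred_spans):
    """Micro-averaged exact-match F1 over (sentence_id, type, start, end) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, "PER", 0, 2), (0, "LOC", 3, 4)]
# The second prediction has the right boundaries but the wrong type, so it is an error.
pred = [(0, "PER", 0, 2), (0, "ORG", 3, 4)]
print(span_f1(gold, pred))
```

Because matching is all-or-nothing per span, a partially correct boundary or a wrong type costs both precision and recall.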

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| ACE + document-context (Wang et al., 2021) | 94.6 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| LUKE (Yamada et al., 2020) | 94.3 | [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://www.aclweb.org/anthology/2020.emnlp-main.523/) | [Official](https://github.com/studio-ousia/luke) |
| CL-KL (Wang et al., 2021) | 93.85 | [Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning](https://arxiv.org/abs/2105.03654) | [Official](https://github.com/Alibaba-NLP/CLNER)|
| XLNet-GCN (Tran et al., 2021) | 93.82 | [Named Entity Recognition Architecture Combining Contextual and Global Features](https://link.springer.com/chapter/10.1007/978-3-030-91669-5_21) | [Official](https://github.com/honghanhh/ner-combining-contextual-and-global-features)|
| InferNER (Moemmur et al., 2021) | 93.76| [InferNER: an attentive model leveraging the sentence-level information for Named Entity Recognition in Microblogs](https://journals.flvc.org/FLAIRS/article/view/128538) | |
| ACE (Wang et al., 2021) | 93.6 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| CNN Large + fine-tune (Baevski et al., 2019) | 93.5 | [Cloze-driven Pretraining of Self-attention Networks](https://arxiv.org/pdf/1903.07785.pdf) | |
| RNN-CRF+Flair | 93.47 | [Improved Differentiable Architecture Search for Language Modeling and Named Entity Recognition](https://www.aclweb.org/anthology/D19-1367/) | |
| CrossWeigh + Flair (Wang et al., 2019)♦ | 93.43 | [CrossWeigh: Training Named Entity Tagger from Imperfect Annotations](https://www.aclweb.org/anthology/D19-1519/) | [Official](https://github.com/ZihanWangKi/CrossWeigh) |
| LSTM-CRF+ELMo+BERT+Flair | 93.38 | [Neural Architectures for Nested NER through Linearization](https://www.aclweb.org/anthology/P19-1527/) | [Official](https://github.com/ufal/acl2019_nested_ner) |
| Flair embeddings (Akbik et al., 2018)♦ | 93.09 | [Contextual String Embeddings for Sequence Labeling](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view) | [Flair framework](https://github.com/zalandoresearch/flair) |
| BERT Large (Devlin et al., 2018) | 92.8 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| CVT + Multi-Task (Clark et al., 2018) | 92.61 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | [Official](https://github.com/tensorflow/models/tree/master/research/cvt_text) |
| BERT Base (Devlin et al., 2018) | 92.4 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| BiLSTM-CRF+ELMo (Peters et al., 2018) | 92.22 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | [AllenNLP Project](https://allennlp.org/elmo) [AllenNLP GitHub](https://github.com/allenai/allennlp) |
| Peters et al. (2017) ♦| 91.93 | [Semi-supervised sequence tagging with bidirectional language models](https://arxiv.org/abs/1705.00108) | |
| CRF + AutoEncoder (Wu et al., 2018) | 91.87 | [Evaluating the Utility of Hand-crafted Features in Sequence Labelling](http://aclweb.org/anthology/D18-1310) | [Official](https://github.com/minghao-wu/CRF-AE) | 
| Bi-LSTM-CRF + Lexical Features (Ghaddar and Langlais 2018) | 91.73 | [Robust Lexical Features for Improved Neural Network Named-Entity Recognition](https://arxiv.org/pdf/1806.03489.pdf) | [Official](https://github.com/ghaddarAbs/NER-with-LS) |
| BiLSTM-CRF + IntNet (Xin et al., 2018) | 91.64 | [Learning Better Internal Structure of Words for Sequence Labeling](https://www.aclweb.org/anthology/D18-1279) | |
| Chiu and Nichols (2016) ♦| 91.62 | [Named entity recognition with bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308) | |
| HSCRF (Ye and Ling, 2018)| 91.38 | [Hybrid semi-Markov CRF for Neural Sequence Labeling](http://aclweb.org/anthology/P18-2038) | [HSCRF](https://github.com/ZhixiuYe/HSCRF-pytorch) |
| IXA pipes (Agerri and Rigau 2016) | 91.36 | [Robust multilingual Named Entity Recognition with shallow semi-supervised features](https://doi.org/10.1016/j.artint.2016.05.003)| [Official](https://github.com/ixa-ehu/ixa-pipe-nerc)|
| NCRF++ (Yang and Zhang, 2018)| 91.35 | [NCRF++: An Open-source Neural Sequence Labeling Toolkit](http://www.aclweb.org/anthology/P18-4013) | [NCRF++](https://github.com/jiesutd/NCRFpp) |
| Yang et al. (2017) ♦| 91.26 | [Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks](https://arxiv.org/abs/1703.06345) | |
| LM-LSTM-CRF (Liu et al., 2018)| 91.24 | [Empowering Character-aware Sequence Labeling with Task-Aware Neural Language Model](https://arxiv.org/pdf/1709.04109.pdf) | [LM-LSTM-CRF](https://github.com/LiyuanLucasLiu/LM-LSTM-CRF) |
| Ma and Hovy (2016) | 91.21 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354) | |
| LSTM-CRF (Lample et al., 2016) | 90.94 | [Neural Architectures for Named Entity Recognition](https://arxiv.org/abs/1603.01360) | |

### CoNLL++ 

This is a cleaner version of the CoNLL 2003 NER task, where about 5% of instances in the test set are corrected due to mislabelling. The training set is left untouched.  Models are evaluated based on span-based F1 on the test set. ♦ used both the train and development splits for training. 

Links: [CoNLL++](https://github.com/ZihanWangKi/CrossWeigh) (including direct download links for data)

| Model                                   |  F1   | Paper / Source                           | Code                                     |
| --------------------------------------- | :---: | ---------------------------------------- | ---------------------------------------- |
| CL-KL (Wang et al., 2021) | 94.81 | [Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning](https://arxiv.org/abs/2105.03654) | [Official](https://github.com/Alibaba-NLP/CLNER)|
| CrossWeigh + Flair (Wang et al., 2019)♦ | 94.28 | [CrossWeigh: Training Named Entity Tagger from Imperfect Annotations](https://www.aclweb.org/anthology/D19-1519/) | [Official](https://github.com/ZihanWangKi/CrossWeigh) |
| Flair embeddings (Akbik et al., 2018)♦ | 93.89 | [Contextual String Embeddings for Sequence Labeling](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view) | [Flair framework](https://github.com/zalandoresearch/flair) |
| BiLSTM-CRF+ELMo (Peters et al., 2018) | 93.42 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | [AllenNLP Project](https://allennlp.org/elmo) [AllenNLP GitHub](https://github.com/allenai/allennlp) |
| Ma and Hovy (2016) | 91.87 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354) | |
| LSTM-CRF (Lample et al., 2016) | 91.47 | [Neural Architectures for Named Entity Recognition](https://arxiv.org/abs/1603.01360) | |


### Long-tail emerging entities

The [WNUT 2017 Emerging Entities task](http://aclweb.org/anthology/W17-4418) operates over a wide range of English 
text and focuses on generalisation beyond memorisation in high-variance environments. Scores are given both over
entity chunk instances, and unique entity surface forms, to normalise the biasing impact of entities that occur frequently.
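
One plausible reading of the surface-form score is sketched below: every distinct (surface text, type) pair is counted once, however often it occurs, before precision, recall and F1 are computed. The function name `surface_form_f1` and the input layout are assumptions for illustration, and the example entities are made up; the scoring script linked from the task page remains the authoritative definition.

```python
def surface_form_f1(gold_mentions, pred_mentions):
    """F1 over unique (surface_text, entity_type) pairs instead of individual mentions."""
    # Converting to sets collapses repeated mentions of the same entity.
    gold, pred = set(gold_mentions), set(pred_mentions)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "Mars" is mentioned twice but contributes a single surface form.
gold = [("Mars", "location"), ("Mars", "location"), ("Mark Watney", "person")]
pred = [("Mars", "location"), ("Mars", "location")]
print(surface_form_f1(gold, pred))
```

Under this reading, a system that only finds a frequent entity gets no extra credit for finding it repeatedly.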

| Feature   | Train  | Dev    | Test   |
| --------- | ------ | ------ | ------ |
| Posts     | 3,395  | 1,009  | 1,287  |
| Tokens    | 62,729 | 15,733 | 23,394 |
| NE tokens | 3,160  | 1,250  | 1,589  |

The data is annotated for six classes - person, location, group, creative work, product and corporation.

Links: [WNUT 2017 Emerging Entity task page](https://noisy-text.github.io/2017/emerging-rare-entities.html) (including direct download links for data and scoring script)

| Model         | F1  | F1 (surface form) |  Paper / Source |
| ---           | --- | ---               | --- |
| InferNER (Moemmur et al., 2021)          | 50.52| ---               | [InferNER: an attentive model leveraging the sentence-level information for Named Entity Recognition in Microblogs](https://journals.flvc.org/FLAIRS/article/view/128538) |
| CrossWeigh + Flair (Wang et al., 2019) | 50.03 |                                          | [CrossWeigh: Training Named Entity Tagger from Imperfect Annotations](https://www.aclweb.org/anthology/D19-1519/) / [Official](https://github.com/ZihanWangKi/CrossWeigh) |
| Flair embeddings (Akbik et al., 2018)  | 49.59 |                                          | [Pooled Contextualized Embeddings for Named Entity Recognition](http://alanakbik.github.io/papers/naacl2019_embeddings.pdf) / [Flair framework](https://github.com/zalandoresearch/flair) |
| Aguilar et al. (2018)                  | 45.55 |                                          | [Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media](http://aclweb.org/anthology/N18-1127.pdf) |
| SpinningBytes                          | 40.78 | 39.33                                    | [Transfer Learning and Sentence Level Features for Named Entity Recognition on Tweets](http://aclweb.org/anthology/W17-4422.pdf) |

### Ontonotes v5 (English)

The [Ontonotes corpus v5](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf) is a richly annotated corpus with several layers of annotation, including named entities, coreference, part of speech, word sense, propositions, and syntactic parse trees. These annotations cover a large number of tokens, a broad cross-section of domains, and 3 languages (English, Arabic, and Chinese). The NER dataset (of interest here) includes 18 tags, consisting of 11 _types_ (PERSON, ORGANIZATION, etc.) and 7 _values_ (DATE, PERCENT, etc.), and contains 2 million tokens. The common data split used in NER is defined in [Pradhan et al., 2013](https://www.semanticscholar.org/paper/Towards-Robust-Linguistic-Analysis-using-OntoNotes-Pradhan-Moschitti/a94e4fe6f475e047be5dcc9077f445e496240852) and can be found [here](http://cemantix.org/data/ontonotes.html).

| Model                                    |  F1   | Paper / Source                           | Code                                     |
| ---------------------------------------- | :---: | ---------------------------------------- | ---------------------------------------- |
| BERT+KVMN (Nie et al., 2020)    | 90.32 | [Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information](https://aclanthology.org/2020.findings-emnlp.378/) | [Official](https://github.com/cuhksz-nlp/AESINER) |
| Flair embeddings (Akbik et al., 2018)    | 89.71 | [Contextual String Embeddings for Sequence Labeling](http://aclweb.org/anthology/C18-1139) | [Official](https://github.com/zalandoresearch/flair) |
| CVT + Multi-Task (Clark et al., 2018)    | 88.81 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | [Official](https://github.com/tensorflow/models/tree/master/research/cvt_text) |
| Bi-LSTM-CRF + Lexical Features (Ghaddar and Langlais 2018) | 87.95 | [Robust Lexical Features for Improved Neural Network Named-Entity Recognition](https://arxiv.org/pdf/1806.03489.pdf) | [Official](https://github.com/ghaddarAbs/NER-with-LS) |
| BiLSTM-CRF (Strubell et al, 2017)        | 86.99 | [Fast and Accurate Entity Recognition with Iterated Dilated Convolutions](https://arxiv.org/pdf/1702.02098.pdf) | [Official](https://github.com/iesl/dilated-cnn-ner) |
| Iterated Dilated CNN (Strubell et al, 2017) | 86.84 | [Fast and Accurate Entity Recognition with Iterated Dilated Convolutions](https://arxiv.org/pdf/1702.02098.pdf) | [Official](https://github.com/iesl/dilated-cnn-ner) |
| Chiu and Nichols (2016)                  | 86.28 | [Named entity recognition with bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308) |                                          |
| Joint Model (Durrett and Klein 2014)     | 84.04 | [A Joint Model for Entity Analysis: Coreference, Typing, and Linking](https://pdfs.semanticscholar.org/2eaf/f2205c56378e715d8d12c521d045c0756a76.pdf) |                                          |
| Averaged Perceptron (Ratinov and Roth 2009) | 83.45 | [Design Challenges and Misconceptions in Named Entity Recognition](https://www.semanticscholar.org/paper/Design-Challenges-and-Misconceptions-in-Named-Ratinov-Roth/27496a2ee337db705e7c611dea1fd8e6f41437c2) (These scores reported in ([Durrett and Klein 2014](https://pdfs.semanticscholar.org/2eaf/f2205c56378e715d8d12c521d045c0756a76.pdf))) | [Official](https://github.com/CogComp/cogcomp-nlp/tree/master/ner) |


### Few-NERD

[Few-NERD](https://arxiv.org/abs/2105.07464) is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens. Three benchmark tasks are built:

- Few-NERD (SUP) is a standard NER task;
- Few-NERD (INTRA) is a few-shot NER task across different coarse-grained types;
- Few-NERD (INTER) is a few-shot NER task within coarse-grained types.

Website: [Few-NERD page](https://ningding97.github.io/fewnerd/)

Download & code: https://github.com/thunlp/Few-NERD


#### Results on Few-NERD (SUP)

| Model                           |  F1   | Paper / Source                           | Code                                     |
| ------------------------------- | :---: | ---------------------------------------- | ---------------------------------------- |
| BERT-Tagger (Ding et al., 2021) | 68.88 | [Few-NERD: A Few-shot Named Entity Recognition Dataset](https://www.stingning.cn/assets/pdf/ACL2021-fewnerd.pdf) | [Official](https://github.com/thunlp/Few-NERD) |



[Go back to the README](../README.md)


================================================
FILE: english/natural_language_inference.md
================================================
# Natural language inference

Natural language inference is the task of determining whether a "hypothesis" is 
true (entailment), false (contradiction), or undetermined (neutral) given a "premise".

Example:

| Premise | Label | Hypothesis |
| --- | ---| --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling. | neutral  | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |

### SNLI

The [Stanford Natural Language Inference (SNLI) Corpus](https://arxiv.org/abs/1508.05326)
contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.

State-of-the-art results can be seen on the [SNLI website](https://nlp.stanford.edu/projects/snli/).

### MultiNLI

The [Multi-Genre Natural Language Inference (MultiNLI) corpus](https://arxiv.org/abs/1704.05426)
contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but
covers a range of genres of spoken and written text and supports cross-genre evaluation. The data
can be downloaded from the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) website.

Public leaderboards for [in-genre (matched)](https://www.kaggle.com/c/multinli-matched-open-evaluation/leaderboard) 
and [cross-genre (mismatched)](https://www.kaggle.com/c/multinli-mismatched-open-evaluation/leaderboard)
evaluation are available, but entries do not correspond to published models.
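
The two scores reported for MultiNLI are plain accuracies computed on two separate evaluation sets. A minimal sketch follows, assuming gold and predicted labels are available as parallel lists; the function names and the toy labels are illustrative only.

```python
def accuracy(gold, pred):
    """Percentage of examples whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)


def multinli_scores(matched, mismatched):
    """Each argument is a (gold_labels, predicted_labels) pair for one evaluation set."""
    return {
        "matched": accuracy(*matched),        # genres also seen in training
        "mismatched": accuracy(*mismatched),  # genres held out from training
    }


matched = (["entailment", "neutral", "contradiction", "neutral"],
           ["entailment", "neutral", "neutral", "neutral"])
mismatched = (["entailment", "contradiction"],
              ["entailment", "neutral"])
print(multinli_scores(matched, mismatched))
```

Keeping the two sets apart makes it visible how much accuracy a model loses when the genre changes.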

| Model           | Matched  | Mismatched | Paper / Source | Code | 
| ------------- | :-----:| :-----:| --- | --- |
| RoBERTa (Liu et al., 2019) | 90.8 | 90.2 | [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf) | [Official](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md) |
| XLNet-Large (ensemble) (Yang et al., 2019) | 90.2 | 89.8 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| MT-DNN-ensemble (Liu et al., 2019) | 87.9 | 87.4 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) |
| Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 87.6 | 87.2 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) |
| Finetuned Transformer LM (Radford et al., 2018) | 82.1 | 81.4 | [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) | |
| Multi-task BiLSTM + Attn (Wang et al., 2018) | 72.2 | 72.1 | [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://arxiv.org/abs/1804.07461) | |
| GenSen (Subramanian et al., 2018) | 71.4 | 71.3 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) | |

### SciTail

The [SciTail](http://ai2-website.s3.amazonaws.com/publications/scitail-aaai-2018_cameraready.pdf)
entailment dataset consists of 27k examples. In contrast to SNLI and MultiNLI, it was not crowd-sourced
but created from sentences that already exist "in the wild". Hypotheses were created from
science questions and the corresponding answer candidates, while relevant web sentences from a large
corpus were used as premises. Models are evaluated based on accuracy.

| Model           | Accuracy  |  Paper / Source |
| ------------- | :-----:| --- |
| Finetuned Transformer LM (Radford et al., 2018) | 88.3 | [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) |
| Hierarchical BiLSTM Max Pooling (Talman et al., 2018) | 86.0 | [Natural Language Inference with Hierarchical BiLSTM Max Pooling](https://arxiv.org/abs/1808.08762) |
| CAFE (Tay et al., 2018) | 83.3 | [A Compare-Propagate Architecture with Alignment Factorization for Natural Language Inference](https://arxiv.org/abs/1801.00102) |

[Go back to the README](../README.md)


================================================
FILE: english/paraphrase-generation.md
================================================
# Paraphrase Generation
[Paraphrase generation](https://arxiv.org/abs/1908.07831) is the task of generating an output sentence that preserves the meaning of the input sentence but contains variations in word choice and grammar. See the example given below:

| Input                      | Output                 |
| -------------------------  | ---------------------- |
|The need for investors to earn a commercial return may put upward pressure on prices| The need for profit is likely to push up prices|

### PARANMT-50M
The [PARANMT-50M dataset](https://arxiv.org/pdf/1711.05732v2.pdf) is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.

| Model           | BLEU  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Trigram (baseline)| 47.4| [Wieting and Gimpel, 2018](https://arxiv.org/pdf/1711.05732v2.pdf)| Unavailable|
| Unsupervised BART w/ Dynamic Blocking | 20.9 | [Niu et al., 2020](https://arxiv.org/pdf/2010.12885v1.pdf)| Unavailable|

### QQP-Pos
The [QQP-Pos dataset](https://www.kaggle.com/c/quora-question-pairs/overview) is a paraphrase generation dataset with 400K source-target pairs. Each pair is labelled as negative if the two questions are not duplicates and positive otherwise.

| Model           | BLEU  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| ParafraGPT-UC | 35.9 | [Bui et al., 2020](https://arxiv.org/pdf/2011.14344v1.pdf)| [Code](https://github.com/BH-So/unsupervised-paraphrase-generation)|
| Unsupervised BART w/ Dynamic Blocking | 26.76 | [Niu et al., 2020](https://arxiv.org/pdf/2010.12885v1.pdf)| Unavailable|

### MULTIPIT, MULTIPITCROWD and MULTIPITEXPERT

Past efforts on creating paraphrase corpora only consider one paraphrase criterion without taking into account the fact that the desired “strictness” of semantic equivalence in paraphrases varies from task to task (Bhagat and Hovy, 2013; Liu and Soh, 2022). For example, for the purpose of tracking unfolding events, “A tsunami hit Haiti.” and “303 people died because of the tsunami in Haiti” are sufficiently close to be considered paraphrases, whereas for paraphrase generation, the extra information “303 people dead” in the latter sentence may lead models to learn to hallucinate and generate more unfaithful content. The authors of MULTIPIT present an effective data collection and annotation method to address these issues.

MULTIPIT is a Multi-Topic Paraphrase in Twitter corpus that consists of a total of 130k sentence pairs with crowdsourcing (MULTIPITCROWD) and expert (MULTIPITEXPERT) annotations. MULTIPITCROWD is a large crowdsourced set of 125K sentence pairs that is useful for tracking information on Twitter.

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| DeBERTaV3large | 92.00 |[Improving Large-scale Paraphrase Acquisition and Generation](https://arxiv.org/pdf/2210.03235v2.pdf)| Unavailable|


MULTIPITEXPERT is an expert-annotated set of 5.5K sentence pairs using a stricter definition that is more suitable for acquiring paraphrases for generation purposes.

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| DeBERTaV3large | 83.20 |[Improving Large-scale Paraphrase Acquisition and Generation](https://arxiv.org/pdf/2210.03235v2.pdf)| Unavailable|


================================================
FILE: english/part-of-speech_tagging.md
================================================
# Part-of-speech tagging

Part-of-speech tagging (POS tagging) is the task of tagging a word in a text with its part of speech.
A part of speech is a category of words with similar grammatical properties. Common English
parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.

Example: 

| Vinken | , | 61 | years | old |
| --- | ---| --- | --- | --- |
| NNP | , | CD | NNS | JJ |

### Penn Treebank

A standard dataset for POS tagging is the Wall Street Journal (WSJ) portion of the Penn Treebank, containing 45 
different POS tags. Sections 0-18 are used for training, sections 19-21 for development, and sections 
22-24 for testing. Models are evaluated based on accuracy.
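
The split and the metric described above can be written down compactly. The sketch below assumes WSJ sections are identified by integers 0 to 24 and that gold and predicted tags are aligned lists per sentence; the function names are illustrative and not part of any treebank tooling.

```python
def wsj_split(section):
    """Map a WSJ section number to the split used for POS tagging."""
    if 0 <= section <= 18:
        return "train"
    if 19 <= section <= 21:
        return "dev"
    if 22 <= section <= 24:
        return "test"
    raise ValueError(f"unknown WSJ section: {section}")


def tagging_accuracy(gold_sentences, pred_sentences):
    """Token-level accuracy: the fraction of tokens whose predicted tag is correct."""
    correct = total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted tag sequences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total if total else 0.0


gold = [["NNP", ",", "CD", "NNS", "JJ"]]
pred = [["NNP", ",", "CD", "NN", "JJ"]]  # one of five tags is wrong
print(wsj_split(23), tagging_accuracy(gold, pred))
```

Accuracy is computed over all tokens pooled together, so long sentences weigh more than short ones.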

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Meta BiLSTM (Bohnet et al., 2018) | 97.96 | [Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings](https://arxiv.org/abs/1805.08237) | [Official](https://github.com/google/meta_tagger) |
| Flair embeddings (Akbik et al., 2018) | 97.85 | [Contextual String Embeddings for Sequence Labeling](http://aclweb.org/anthology/C18-1139) | [Flair framework](https://github.com/zalandoresearch/flair) |
| Char Bi-LSTM (Ling et al., 2015) | 97.78 | [Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation](https://www.aclweb.org/anthology/D/D15/D15-1176.pdf) | |
| Adversarial Bi-LSTM (Yasunaga et al., 2018) | 97.59 | [Robust Multilingual Part-of-Speech Tagging via Adversarial Training](https://arxiv.org/abs/1711.04903) | |
| BiLSTM-CRF + IntNet (Xin et al., 2018) | 97.58 | [Learning Better Internal Structure of Words for Sequence Labeling](https://www.aclweb.org/anthology/D18-1279) | |
| Yang et al. (2017) | 97.55 | [Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks](https://arxiv.org/abs/1703.06345) | |
| Ma and Hovy (2016) | 97.55 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354) | |
| LM-LSTM-CRF (Liu et al., 2018)| 97.53 | [Empowering Character-aware Sequence Labeling with Task-Aware Neural Language Model](https://arxiv.org/pdf/1709.04109.pdf) | |
| NCRF++ (Yang and Zhang, 2018)| 97.49 | [NCRF++: An Open-source Neural Sequence Labeling Toolkit](http://www.aclweb.org/anthology/P18-4013) | [NCRF++](https://github.com/jiesutd/NCRFpp) |
| Feed Forward (Vaswani et al., 2016) | 97.4 | [Supertagging with LSTMs](https://aclweb.org/anthology/N/N16/N16-1027.pdf) | |
| Bi-LSTM (Ling et al., 2015) | 97.36 | [Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation](https://www.aclweb.org/anthology/D/D15/D15-1176.pdf) | |
| Bi-LSTM (Plank et al., 2016) | 97.22 | [Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss](https://arxiv.org/abs/1604.05529) | |


### Social media

The [Ritter (2011)](https://www.aclweb.org/anthology/D11-1141) dataset has become the benchmark for social media part-of-speech tagging. It comprises some 50K tokens of English social media sampled in late 2011, and is tagged using an extended version of the PTB tagset.

| Model | Accuracy | Paper / Source | Code |
| --- | --- | ---|---|
| ACE + fine-tune (Wang et al., 2020) | 93.4 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| PretRand (Meftah et al., 2019) | 91.46 | [Joint Learning of Pre-Trained and Random Units for Domain Adaptation in Part-of-Speech Tagging](https://www.aclweb.org/anthology/N19-1416.pdf) | |
| FastText + CNN + CRF | 90.53 | [Twitter word embeddings (Godin et al. 2019 (Chapter 3))](https://fredericgodin.com/research/twitter-word-embeddings/) |  |
| CMU | 90.0 ± 0.5 | [Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters](http://www.cs.cmu.edu/~ark/TweetNLP/owoputi+etal.naacl13.pdf) | |
| GATE  | 88.69 | [Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data](https://www.aclweb.org/anthology/R13-1026) |  |

### UD

[Universal Dependencies (UD)](http://universaldependencies.org/) is a framework for 
cross-linguistic grammatical annotation, which contains more than 100 treebanks in over 60 languages.
Models are typically evaluated based on the average test accuracy across 21 high-resource languages (♦ evaluated on 17 languages).
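
The averaged score can be read as an unweighted mean of per-language test accuracies, so every language counts equally regardless of treebank size. This is a sketch under that assumption; the function name and the per-language numbers are made up for illustration and are not taken from any paper.

```python
def average_accuracy(per_language_accuracy):
    """Unweighted (macro) mean of per-language accuracies."""
    if not per_language_accuracy:
        raise ValueError("at least one language is required")
    return sum(per_language_accuracy.values()) / len(per_language_accuracy)


# Hypothetical per-language test accuracies, for illustration only.
scores = {"en": 95.0, "de": 93.0, "fr": 97.0}
print(average_accuracy(scores))
```

Results averaged over different language sets (for example 21 versus 17 languages) are therefore not directly comparable.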

| Model           | Avg accuracy  |  Paper / Source |
| ------------- | :-----:| --- |
| XLM-R + SUB^2 data augmentation (Shi et al., 2021) | 97.7 | [Substructure Substitution: Structured Data Augmentation for NLP](https://aclanthology.org/2021.findings-acl.307/) / [code](https://github.com/ExplorerFreda/sub2-augmentation) |
| XLM-R (Shi et al., 2021) | 97.7 | [Substructure Substitution: Structured Data Augmentation for NLP](https://aclanthology.org/2021.findings-acl.307/) / [code](https://github.com/ExplorerFreda/sub2-augmentation) |
| Multilingual BERT and BPEmb (Heinzerling and Strube, 2019) | 96.77 | [Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation](https://arxiv.org/abs/1906.01569) |
| Adversarial Bi-LSTM (Yasunaga et al., 2018) | 96.65 | [Robust Multilingual Part-of-Speech Tagging via Adversarial Training](https://arxiv.org/abs/1711.04903) | 
| MultiBPEmb (Heinzerling and Strube, 2019) | 96.62 | [Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation](https://arxiv.org/abs/1906.01569) |
| Bi-LSTM (Plank et al., 2016) | 96.40 | [Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss](https://arxiv.org/abs/1604.05529) |
| Joint Bi-LSTM (Nguyen et al., 2017)♦ | 95.55 | [A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing](https://arxiv.org/abs/1705.05952) |

[Go back to the README](../README.md)


================================================
FILE: english/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [ARC](#arc)
- [ShARC](#sharc)
- [Reading comprehension](#reading-comprehension)
  - [AdversarialQA](#adversarialqa)
  - [CliCR](#clicr)
  - [CNN / Daily Mail](#cnn--daily-mail)
  - [CODAH](#codah)
  - [CoQA](#coqa)
  - [HotpotQA](#hotpotqa)
  - [MS MARCO](#ms-marco)
  - [MultiRC](#multirc)
  - [Natural Questions](#natural-questions)
  - [NewsQA](#newsqa)
  - [QAngaroo](#qangaroo)
  - [QuAC](#quac)
  - [RACE](#race)
  - [SQuAD](#squad)
  - [Story Cloze Test](#story-cloze-test)
  - [SWAG](#swag)
  - [Recipe QA](#recipeqa)
  - [NarrativeQA](#narrativeqa)
  - [DuoRC](#duorc)
  - [DROP](#drop)
  - [Cosmos QA](#cosmos-qa)
  - [ReClor (logical reasoning)](#reclor-logical-reasoning)
- [Open-domain Question Answering](#open-domain-question-answering)
  - [DuReader](#dureader)
  - [Quasar](#quasar)
  - [SearchQA](#searchqa)
- [Knowledge Base Question Answering](#knowledge-base-question-answering)

### ARC

The [AI2 Reasoning Challenge (ARC)](http://ai2-website.s3.amazonaws.com/publications/AI2ReasoningChallenge2018.pdf)
dataset is a question answering dataset that contains 7,787 genuine grade-school level, multiple-choice science questions.
The dataset is partitioned into a Challenge Set and an Easy Set. The Challenge Set contains only questions
answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. Models are evaluated
based on accuracy.

A public leaderboard is available on the [ARC website](http://data.allenai.org/arc/).

### ShARC

[ShARC](https://arxiv.org/abs/1809.01494) is a challenging QA dataset that requires logical reasoning, elements of entailment/NLI and natural language generation.

Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader's background knowledge. The ShARC authors formalise this task and introduce the ShARC dataset with 32k task instances.

The goal is to answer questions by possibly asking follow-up questions first. The task assumes that the question does not provide enough information to be answered directly. However, a model can use the supporting rule text to infer what needs to be asked in order to determine the final answer. Concretely, the model must decide whether to answer with "Yes", "No", "Irrelevant", or to generate a follow-up question given rule text, a user scenario and a conversation history. Performance is measured with Micro and Macro Accuracy for "Yes"/"No"/"Irrelevant"/"More" classifications, and the quality of follow-up questions is measured with BLEU.
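
As a rough illustration of the two classification metrics, the sketch below computes micro and macro accuracy over the four decision classes. It assumes that macro accuracy is the unweighted mean of per-class accuracies over the classes present in the gold labels; the function name and the toy labels are illustrative and are not taken from the official ShARC evaluation script.

```python
from collections import defaultdict


def micro_macro_accuracy(gold, pred):
    """Return (micro, macro) accuracy for parallel lists of class labels.

    Assumption: macro accuracy averages the per-class accuracy over the
    classes that occur in `gold`; the official script may differ in detail.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    micro = sum(correct.values()) / len(gold)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Toy example: the rare "More" class is always missed.
gold = ["Yes", "Yes", "Yes", "Yes", "No", "More"]
pred = ["Yes", "Yes", "Yes", "Yes", "No", "Yes"]
micro, macro = micro_macro_accuracy(gold, pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the micro score stays high because the frequent class dominates, while the macro score drops because every class counts equally.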

The public data, further task details and public leaderboard are available on the [ShARC Website](https://sharc-data.github.io/).

## Reading comprehension

Most current question answering datasets frame the task as reading comprehension where the question is about a paragraph
or document and the answer often is a span in the document. The Machine Reading group
at UCL also provides an [overview of reading comprehension tasks](https://uclnlp.github.io/ai4exams/data.html).

### AdversarialQA

[AdversarialQA](https://adversarialqa.github.io/) provides three Reading Comprehension datasets constructed using an adversarial model-in-the-loop, where the annotator has to ask questions that the model fails to answer successfully. The passages are identical to the ones used in [SQuADv1.1](https://arxiv.org/abs/1606.05250).

AdversarialQA uses three different models, [BiDAF](https://arxiv.org/abs/1611.01603), [BERT-Large](https://arxiv.org/abs/1810.04805), and [RoBERTa-Large](https://arxiv.org/abs/1907.11692), in the annotation loop to create three datasets: D(BiDAF), D(BERT), and D(RoBERTa), each with 10,000 training examples, 1,000 validation examples, and 1,000 test examples.

The adversarial human annotation paradigm ensures that these datasets consist of questions that current state-of-the-art models (at least the ones used as adversaries in the annotation loop) find challenging.

Examples:

| Dataset | Passage  | Question | Answer |
| ------------- | -----:| -----:| -----: |
| D(BiDAF) | Martin Luther married Katharina von Bora, one of 12 nuns he had helped escape from the Nimbschen Cistercian convent in April 1523, when he arranged for them to be smuggled out in herring barrels. "Suddenly, and while I was occupied with far different thoughts," he wrote to Wenceslaus Link, "the Lord has plunged me into marriage." At the time of their marriage, Katharina was 26 years old and Luther was 41 years old. | In a letter who did Luther credit for his union with Katharina? | the Lord |
| D(BERT) | This combination of cancellations and σ and π overlaps results in dioxygen's double bond character and reactivity, and a triplet electronic ground state. An electron configuration with two unpaired electrons as found in dioxygen (see the filled π* orbitals in the diagram), orbitals that are of equal energy—i.e., degenerate—is a configuration termed a spin triplet state. Hence, the ground state of the O2 molecule is referred to as triplet oxygen.[b] The highest energy, partially filled orbitals are antibonding, and so their filling weakens the bond order from three to two. Because of its unpaired electrons, triplet oxygen reacts only slowly with most organic molecules, which have paired electron spins; this prevents spontaneous combustion. | What are in the orbitals of atoms? | electrons |
| D(RoBERTa) | Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres (160,000 to 227,000 sq mi), with most of the lost forest becoming pasture for cattle. Seventy percent of formerly forested land in the Amazon, and 91% of land deforested since 1970, is used for livestock pasture. Currently, Brazil is the second-largest global producer of soybeans after the United States. New research however, conducted by Leydimere Oliveira et al., has shown that the more rainforest is logged in the Amazon, the less precipitation reaches the area and so the lower the yield per hectare becomes. So despite the popular perception, there has been no economical advantage for Brazil from logging rainforest zones and converting these to pastoral fields. | Of the two countries that produce soybeans, which country is clearing rain forest in order to increase production? | Brazil |

More details on the task and the dataset can be found in the [TACL paper](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00338/96474/Beat-the-AI-Investigating-Adversarial-Human). The public leaderboard is available through [Dynabench](https://dynabench.org/tasks/2#overall) as the QA round 1 task.

### CliCR

The [CliCR dataset](http://aclweb.org/anthology/N18-1140) is a gap-filling reading comprehension dataset consisting of around 100,000 queries and their associated documents. The dataset was built from clinical case reports, requiring the reader to answer the query with a medical problem/test/treatment entity. The abilities to perform bridging inferences and track objects have been found to be the most frequently required skills for successful answering.

The instructions for accessing the dataset, the processing scripts, the baselines and the adaptations of some neural models can be found [here](https://github.com/clips/clicr).

Example:

| Document  | Question | Answer |
| ------------- | -----:| -----: |
| We report a case of a 72-year-old Caucasian woman with pl-7 positive antisynthetase syndrome. Clinical presentation included interstitial lung disease, myositis, mechanic’s hands and dysphagia. As lung injury was the main concern, treatment consisted of prednisolone and cyclophosphamide. Complete remission with reversal of pulmonary damage was achieved, as reported by CT scan, pulmonary function tests and functional status. [...] | Therefore, in severe cases an aggressive treatment, combining ________ and glucocorticoids as used in systemic vasculitis, is suggested. | cyclophosphamide |

| Model           | F1 |  Paper |
| ------------- | :-----:| --- |
| Gated-Attention Reader (Dhingra et al., 2017) | 33.9 | [CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension](http://aclweb.org/anthology/N18-1140) |
| Stanford Attentive Reader (Chen et al., 2016) | 27.2| [CliCR: A Dataset of Clinical Case Reports for Machine Reading Comprehension](http://aclweb.org/anthology/N18-1140) |

### CNN / Daily Mail

The [CNN / Daily Mail dataset](https://arxiv.org/abs/1506.03340) is a Cloze-style reading comprehension dataset
created from CNN and Daily Mail news articles using heuristics. [Cloze-style](https://en.wikipedia.org/wiki/Cloze_test)
means that a missing word has to be inferred. In this case, "questions" were created by replacing entities
from bullet points summarizing one or several aspects of the article. Coreferent entities have been replaced with an
entity marker @entityn where n is a distinct index.
The model is tasked to infer the missing entity
in the bullet point based on the content of the corresponding article and models are evaluated based on
their accuracy on the test set.
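
The construction described above can be sketched roughly as follows. This is a simplified illustration, not the dataset's actual pipeline: the `entity_ids` mapping, the function names and the example sentences are hypothetical, and coreference is assumed to have been resolved already, so that every mention of the same entity maps to the same index.

```python
import re


def anonymize(text, entity_ids):
    """Replace each known entity mention with its @entityN marker.

    Longer mentions are replaced first so that a short mention cannot
    clobber part of a longer one.
    """
    for mention in sorted(entity_ids, key=len, reverse=True):
        pattern = r"\b" + re.escape(mention) + r"\b"
        text = re.sub(pattern, f"@entity{entity_ids[mention]}", text)
    return text


def make_cloze(bullet, answer_mention, entity_ids):
    """Turn a summary bullet into a (question, answer) cloze pair."""
    anonymized = anonymize(bullet, entity_ids)
    answer = f"@entity{entity_ids[answer_mention]}"
    # (?!\d) keeps @entity6 from matching the prefix of @entity61.
    question = re.sub(re.escape(answer) + r"(?!\d)", "@placeholder",
                      anonymized, count=1)
    return question, answer


# Hypothetical mention-to-index mapping for one article.
entity_ids = {"Star Wars": 6, "Lucasfilm": 22}
passage = anonymize("Star Wars is owned by Lucasfilm .", entity_ids)
question, answer = make_cloze(
    "characters in Star Wars movies have become more diverse",
    "Star Wars",
    entity_ids,
)
print(passage)
print(question, "->", answer)
```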

|         | CNN | Daily Mail |
| ------------- | -----:| -----: |
| # Train | 380,298 | 879,450 |
| # Dev | 3,924 | 64,835 |
| # Test | 3,198 | 53,182 |

Example:

| Passage  | Question | Answer |
| ------------- | -----:| -----: |
| ( @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel " @entity11 " will feature a capable but flawed @entity13 official named @entity14 who " also happens to be a lesbian . " the character is the first gay figure in the official @entity6 -- the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 -- according to @entity24 , editor of " @entity6 " books at @entity28 imprint @entity26 . | characters in " @placeholder " movies have gradually become more diverse | @entity6 |

| Model           | CNN  | Daily Mail  |  Paper / Source |
| ------------- | :-----:| :-----:|--- |
| GA Reader (Dhingra et al., 2017) | 77.9 | 80.9 | [Gated-Attention Readers for Text Comprehension](http://aclweb.org/anthology/P17-1168) |
| BiDAF (Seo et al., 2017) | 76.9 | 79.6 | [Bidirectional Attention Flow for Machine Comprehension](https://arxiv.org/pdf/1611.01603.pdf) |
| AoA Reader (Cui et al., 2017) | 74.4 | - | [Attention-over-Attention Neural Networks for Reading Comprehension](http://aclweb.org/anthology/P17-1055) |
| Neural net (Chen et al., 2016) | 72.4 | 75.8 | [A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task](https://www.aclweb.org/anthology/P16-1223) |
| Classifier (Chen et al., 2016) | 67.9 | 68.3 | [A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task](https://www.aclweb.org/anthology/P16-1223) |
| Impatient Reader (Hermann et al., 2015) | 63.8 | 68.0 | [Teaching Machines to Read and Comprehend](https://arxiv.org/abs/1506.03340) |

### CODAH
[CODAH](https://arxiv.org/abs/1904.04365) is an adversarially-constructed evaluation dataset with 2.8k questions for testing common sense. CODAH forms a challenging extension to the SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video.

The dataset and more information can be found [here](https://github.com/Websail-NU/CODAH).

### CoQA

[CoQA](https://arxiv.org/abs/1808.07042) is a large-scale dataset for building Conversational Question Answering systems. 
CoQA contains 127,000+ questions with answers collected from 8,000+ conversations.
Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers.

The data and public leaderboard are available [here](https://stanfordnlp.github.io/coqa/).

### HotpotQA

HotpotQA is a dataset with 113k Wikipedia-based question-answer pairs. Questions require 
finding and reasoning over multiple supporting documents and are not constrained to any pre-existing knowledge bases.
Sentence-level supporting facts are available.

The data and public leaderboard are available from the [HotpotQA website](https://hotpotqa.github.io/).

### MS MARCO
[MS MARCO](http://www.msmarco.org/dataset.aspx), also known as the Human Generated MAchine
Reading COmprehension Dataset, was designed and developed by Microsoft AI & Research. [Link to paper](https://arxiv.org/abs/1611.09268)

- The questions are obtained from real anonymized user queries.
- The answers are human generated. The context passages from which the answers are obtained are extracted from real documents using the latest Bing search engine.
- The dataset contains 100,000 queries, a subset of which have multiple answers; the authors aim to release 1M queries in the future.

The leaderboards for multiple tasks are available on the [MS MARCO leaderboard page](http://www.msmarco.org/leaders.aspx).

### MultiRC
MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
The dataset was designed with three key challenges in mind:
 - The number of correct answer-options for each question is not pre-specified. This removes the over-reliance of current approaches on answer-options and forces them to decide on the correctness of each candidate answer independently of others. In other words, unlike previous work, the task here is not to simply identify the best answer-option, but to evaluate the correctness of each answer-option individually.
 - The correct answer(s) is not required to be a span in the text.
 - The paragraphs in the dataset have diverse provenance, being extracted from 7 different domains such as news, fiction and historical text, and hence are expected to be more diverse in their contents than those of single-domain datasets.
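
The first challenge can be illustrated with a small sketch that contrasts judging every answer-option independently with simply picking the single best option. The scores, the 0.5 threshold and the function names are hypothetical; this is not the official MultiRC evaluation code.

```python
def select_options(scores, threshold=0.5):
    """Accept every option whose score reaches the threshold, independently."""
    return [score >= threshold for score in scores]


def select_best_only(scores):
    """Single-answer baseline: mark only the highest-scoring option."""
    best = scores.index(max(scores))
    return [i == best for i in range(len(scores))]


# Hypothetical model scores for one question with two correct options.
scores = [0.91, 0.12, 0.77, 0.40]
gold = [True, False, True, False]

independent = select_options(scores)
best_only = select_best_only(scores)

# A question counts as solved only if every option is judged correctly.
print("independent decisions solve the question:", independent == gold)
print("best-option baseline solves the question:", best_only == gold)
```

Because the number of correct options is not known in advance, the best-option baseline cannot recover the second correct answer here, whereas independent per-option decisions can.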

The leaderboard for the dataset is available on the [MultiRC website](http://cogcomp.org/multirc/).

### Natural Questions

The [Natural Questions](https://research.google/pubs/pub47761/) corpus contains questions from real users issued to the Google search engine. It requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. Questions are presented along with a Wikipedia page and an extracted long answer (typically a paragraph) and short answer (one or more entities) if present on the page, or marked null if no long/short answer was present.

Example:

| Question | Wikipedia Page | Long Answer | Short Answer |
| ------------- | -----:| -----: | -----: |
| who lives in the imperial palace in tokyo | Tokyo_Imperial_Palace | The Tokyo Imperial Palace (皇居, Kōkyo, literally “Imperial Residence”) is the primary residence of the Emperor of Japan. It is a large park-like area located in the Chiyoda ward of Tokyo and contains buildings including the main palace (宮殿, Kyūden), the private residences of the Imperial Family, an archive, museums and administrative offices. | The Imperial Family |

The leaderboard and the dataset are available on Google's [Natural Questions website](https://ai.google.com/research/NaturalQuestions).

### NewsQA

The [NewsQA dataset](https://arxiv.org/pdf/1611.09830.pdf) is a reading comprehension dataset of over 100,000
human-generated question-answer pairs from over 10,000 news articles from CNN, with answers consisting of spans of text
from the corresponding articles.
Some challenging characteristics of this dataset are:

- Answers are spans of arbitrary length;
- Some questions have no answer in the corresponding article;
- There are no candidate answers from which to choose.

Although very similar to the SQuAD dataset, NewsQA offered a greater challenge to existing models at the time of its
introduction (e.g., the paragraphs are longer than those in SQuAD). Models are evaluated based on F1 and Exact Match.
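
For readers unfamiliar with these two metrics, here is a minimal sketch in the style commonly used for span-based QA: answers are normalized (lowercasing, stripping punctuation and English articles), Exact Match compares the normalized strings, and F1 is computed over the bag of tokens. The normalization rules are an assumption modelled on SQuAD-style evaluation and may not match the exact NewsQA evaluation script.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Northern Kazakhstan.", "northern Kazakhstan")
f1 = f1_score("in northern Kazakhstan", "northern Kazakhstan")
print(f"EM={em:.1f} F1={f1:.2f}")
```

When a question has several reference answers, evaluation scripts of this kind typically take the maximum score over the references.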

Example:

| Story  | Question | Answer |
| ------------- | -----:| -----: |
| MOSCOW, Russia (CNN) -- Russian space officials say the crew of the Soyuz space ship is resting after a rough ride back to Earth. A South Korean bioengineer was one of three people on board the Soyuz capsule. The craft carrying South Korea's first astronaut landed in northern Kazakhstan on Saturday, 260 miles (418 kilometers) off its mark, they said. Mission Control spokesman Valery Lyndin said the condition of the crew -- South Korean bioengineer Yi So-yeon, American astronaut Peggy Whitson and Russian flight engineer Yuri Malenchenko -- was satisfactory, though the three had been subjected to severe G-forces during the re-entry. [...] | Where did the Soyuz capsule land? | northern Kazakhstan |

The dataset can be downloaded [here](https://github.com/Maluuba/newsqa).

| Model           | F1 | EM | Paper / Source |
| ------------- | :-----: | :-----: | --- |
| DecaProp (Tay et al., 2018) | 66.3 | 53.1 | [Densely Connected Attention Propagation for Reading Comprehension](https://arxiv.org/abs/1811.04210) |
| AMANDA (Kundu et al., 2018) | 63.7 | 48.4| [A Question-Focused Multi-Factor Attention Network for Question Answering](https://arxiv.org/abs/1801.08290) |
| MINIMAL(Dyn) (Min et al., 2018) | 63.2 | 50.1 | [Efficient and Robust Question Answering from Minimal Context over Documents](https://arxiv.org/abs/1805.08092) |
| FastQAExt (Weissenborn et al., 2017) | 56.1 | 43.7 | [Making Neural QA as Simple as Possible but not Simpler](https://arxiv.org/abs/1703.04816) |

### QAngaroo

[QAngaroo](http://qangaroo.cs.ucl.ac.uk/index.html) is a set of two reading comprehension datasets,
which require multiple steps of inference that combine facts from multiple documents. The first dataset, WikiHop
is open-domain and focuses on Wikipedia articles. The second dataset, MedHop is based on paper abstracts from
PubMed.

The leaderboards for both datasets are available on the [QAngaroo website](http://qangaroo.cs.ucl.ac.uk/leaderboard.html).

### QuAC

Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information seeking dialog.
Data instances consist of an interactive dialog between two crowd workers:
(1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text,
and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.

The leaderboard and data are available on the [QuAC website](http://quac.ai/).

### RACE

The [RACE dataset](https://arxiv.org/abs/1704.04683) is a reading comprehension dataset
collected from English examinations in China, which are designed for middle school and high school students.
The dataset contains more than 28,000 passages and nearly 100,000 questions and can be
downloaded [here](http://www.cs.cmu.edu/~glai1/data/race/). Models are evaluated based on accuracy
on middle school examinations (RACE-m), high school examinations (RACE-h), and on the total dataset (RACE).

The public leaderboard is available on the [RACE leaderboard](http://www.qizhexie.com//data/RACE_leaderboard).

| Model           | RACE-m | RACE-h | RACE | Paper | Code |
| ------------- | :-----:| :-----:| :-----:| --- | --- |
| XLNet (Yang et al., 2019) | 85.45 | 80.21 | 81.75 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| OCN_large (Ran et al., 2019) | 76.7 | 69.6 | 71.7 | [Option Comparison Network for Multiple-choice Reading Comprehension](https://arxiv.org/pdf/1903.03033.pdf) | |
| DCMN_large (Zhang et al., 2019) | 73.4 | 68.1 | 69.7 | [Dual Co-Matching Network for Multi-choice Reading Comprehension](https://arxiv.org/pdf/1901.09381.pdf) | |
| Finetuned Transformer LM (Radford et al., 2018) | 62.9 | 57.4 | 59.0 | [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) | [Official](https://github.com/openai/finetune-transformer-lm) |
| BiAttention MRU (Tay et al., 2018) | 60.2 | 50.3 | 53.3 | [Multi-range Reasoning for Machine Comprehension](https://arxiv.org/abs/1803.09074) | |

### SQuAD

The [Stanford Question Answering Dataset (SQuAD)](https://arxiv.org/abs/1606.05250)
is a reading comprehension dataset, consisting of questions posed by crowdworkers
on a set of Wikipedia articles. The answer to every question is a segment of text (a span)
from the corresponding reading passage. Recently, [SQuAD 2.0](https://arxiv.org/abs/1806.03822)
has been released, which includes unanswerable questions.

The public leaderboard is available on the [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/).

### Story Cloze Test

The [Story Cloze Test](http://aclweb.org/anthology/W17-0906.pdf) is a dataset for
story understanding that provides systems with four-sentence stories and two possible
endings. The systems must then choose the correct ending to the story.

More details are available on the [Story Cloze Test Challenge](https://competitions.codalab.org/competitions/15333).

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Reading Strategies Model (Sun et al., 2018) | 88.3 | [Improving Machine Reading Comprehension by General Reading Strategies](https://arxiv.org/pdf/1810.13441v1.pdf) | |
| Finetuned Transformer LM (Radford et al., 2018) | 86.5 | [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) | [Official](https://github.com/openai/finetune-transformer-lm) |
| Liu et al. (2018) | 78.7 | [Narrative Modeling with Memory Chains and Semantic Supervision](http://aclweb.org/anthology/P18-2045) | [Official](https://github.com/liufly/narrative-modeling) |
| Hidden Coherence Model (Chaturvedi et al., 2017) | 77.6 | [Story Comprehension for Predicting What Happens Next](http://aclweb.org/anthology/D17-1168) | |
| val-LS-skip (Srinivasan et al., 2018) | 76.5 | [A Simple and Effective Approach to the Story Cloze Test](http://aclweb.org/anthology/N18-2015) | |

### SWAG

[SWAG](https://arxiv.org/abs/1808.05326) (Situations With Adversarial Generations) is a large-scale dataset for the task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning. The dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. The correct answer is the (real) video caption for the next event in the video; the three incorrect answers are adversarially generated and human verified, so as to fool machines but not humans.

The public leaderboard is available on the [AI2 website](https://leaderboard.allenai.org/swag/submissions/public).

### RecipeQA

[RecipeQA](https://arxiv.org/abs/1809.00812) is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.

The public leaderboard is available on the [RecipeQA website](https://hucvl.github.io/recipeqa/).


### NarrativeQA
[NarrativeQA](https://arxiv.org/abs/1712.07040) is a dataset built to encourage deeper comprehension of language. It requires reading and reasoning over entire books or movie scripts. The dataset contains approximately 45K question-answer pairs in free-form text. There are two modes of this dataset: (1) reading comprehension over summaries and (2) reading comprehension over entire books/scripts.

The results for the first, summary mode are below.

| Model                        | BLEU-1     | BLEU-4   | METEOR | Rouge-L | Paper / Source | Code |
| -------------                | :-----:   | :-----:|:-----:| :-----:|---            | ---  |
|DecaProp (Tay et al., 2018)	   |44.35    |27.61	 | 21.80 | 44.69   |[Densely Connected Attention Propagation for Reading Comprehension](https://arxiv.org/abs/1811.04210)       |  [official](https://github.com/vanzytay/NIPS2018_DECAPROP)    |
|BiAttention + DCU-LSTM (Tay et al., 2018)	   |36.55    |19.79	 | 17.87 | 41.44  |[Multi-Granular Sequence Encoding via Dilated Compositional Units for Reading Comprehension](http://aclweb.org/anthology/D18-1238)       |      |
|BiDAF (Seo et al., 2017)	   |33.45    |15.69	 | 15.68 | 36.74  |[Bidirectional Attention Flow for Machine Comprehension](https://arxiv.org/abs/1611.01603)       |      |

The results for the second mode (question answering over entire books or movie scripts) are below.

| Model                        | BLEU-1     | BLEU-4   | METEOR | Rouge-L | Paper / Source | Code |
| -------------                | :-----:   | :-----:|:-----:| :-----:|---            | ---  |
|Retriever + Reader (Izacard and Grave, 2020)     |35.3    |7.5  | 11.1 | 32.0   |[Distilling Knowledge from Reader to Retriever for Question Answering](https://openreview.net/forum?id=NTEz-6wysdb)       |  [Official](https://github.com/facebookresearch/FiD)    |
|Summary + Reader (UnifiedQA) (Wu et al., 2021)    |21.82    |3.87  | 10.52 | 21.03  |[Recursively Summarizing Books with Human Feedback](https://arxiv.org/abs/2109.10862)       |      |
|ReadTwice (Zemlyanskiy et al., 2021)     |21.1    |4.0  | 7.0 | 23.2  |[ReadTwice: Reading Very Large Documents with Memories](https://aclanthology.org/2021.naacl-main.408.pdf) | [Official](https://github.com/google-research/google-research/tree/master/readtwice) |

### DuoRC

[DuoRC](https://duorc.github.io) contains 186,089 unique question-answer pairs created from a collection of 7,680 pairs of movie plots where each pair in the collection reflects two versions of the same movie.

DuoRC pushes the NLP community to address challenges on incorporating knowledge and reasoning in neural architectures for reading comprehension. It poses several interesting challenges such as:
  - By using parallel plots, DuoRC is specifically designed to contain a large number of questions with low lexical overlap between questions and their corresponding passages.
  - It requires models to go beyond the content of the given passage itself and incorporate world knowledge, background knowledge, and commonsense knowledge to arrive at the answer.
  - It revolves around narrative passages from movie plots describing complex events and therefore naturally requires complex reasoning (e.g. temporal reasoning, entailment, long-distance anaphora) across multiple sentences to infer the answer to questions.
  - Several of the questions in DuoRC, while seeming relevant, cannot actually be answered from the given passage. This requires the model to detect the unanswerability of questions, an ability that is particularly important in industrial settings.
  
### DROP

[DROP](https://allennlp.org/drop) is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.

### Cosmos QA

[Cosmos QA](https://wilburone.github.io/cosmos/) is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people's everyday narratives, asking questions concerning the likely causes or effects of events that require reasoning beyond the exact text spans in the context.

### ReClor (logical reasoning)

The [ReClor dataset](https://openreview.net/forum?id=HJgJtT4tvB) is a reading comprehension dataset requiring logical reasoning, which is extracted from the standardized exams GMAT (Graduate Management Admission Test) and LSAT (Law School Admission Test). This dataset is very challenging and even graduate students can only achieve 63% accuracy. It covers various logical reasoning types, i.e., Necessary/Sufficient Assumptions, Strengthen/Weaken, Evaluation, Implication, Conclusion/Main Point, Most Strongly Supported, Explain or Resolve, Principle, Dispute, Technique, Role, Identify a Flaw, Match Flaws, Match the Structure and others.

The dataset, public leaderboard, and code are available on the project page [ReClor (logical reasoning)](https://whyu.me/reclor/).

## Open-domain Question Answering

### DuReader
[DuReader](https://ai.baidu.com/broad/subordinate?dataset=dureader) is a large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC. [Link to paper](https://arxiv.org/pdf/1711.05073.pdf) 

DuReader has three advantages over other MRC datasets: 
- (1) data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated. 
- (2) question types: it provides rich annotations for more question types, especially yes-no and opinion questions, which leaves more opportunity for the research community.
- (3) scale: it contains 300K questions, 660K answers and 1.5M documents; it is the largest Chinese MRC dataset so far. 

To help the community make these improvements, both the [dataset](https://ai.baidu.com/broad/download?dataset=dureader) of DuReader and [baseline systems](https://github.com/baidu/DuReader) have been posted online. 

The [leaderboard](https://ai.baidu.com/broad/leaderboard?dataset=dureader) is available on the DuReader page.

### Quasar
[Quasar](https://arxiv.org/abs/1707.03904) is a dataset for open-domain question answering. It includes two parts: (1) The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow. (2) The Quasar-T dataset consists of 43,000 open-domain trivia questions and their answers obtained from various internet sources. 

| Model                        | EM (Quasar-T)     | F1 (Quasar-T)    |Paper / Source | Code |
| -------------                | :-----:| :-----:|---            | ---  |
| Denoising QA (Lin et al., 2018) | 42.2 | 49.3 | [Denoising Distantly Supervised Open-Domain Question Answering](http://aclweb.org/anthology/P18-1161) | [official](https://github.com/thunlp/OpenQA) |
| DecaProp (Tay et al., 2018) | 38.6 | 46.9 | [Densely Connected Attention Propagation for Reading Comprehension](https://arxiv.org/abs/1811.04210) | [official](https://github.com/vanzytay/NIPS2018_DECAPROP) |
| R^3 (Wang et al., 2018) | 35.3 | 41.7 | [R^3: Reinforced Ranker-Reader for Open-Domain Question Answering](https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16712/16165) | [official](https://github.com/shuohangwang/mprc) |
| BiDAF (Seo et al., 2017) | 25.9 | 28.5 | [Bidirectional Attention Flow for Machine Comprehension](https://arxiv.org/abs/1611.01603) | [official](https://github.com/allenai/bi-att-flow) |
| GA (Dhingra et al., 2017) | 26.4 | 26.4 | [Gated-Attention Readers for Text Comprehension](https://arxiv.org/pdf/1606.01549) | |


### SearchQA
[SearchQA](https://arxiv.org/abs/1704.05179) was constructed to reflect a full pipeline of general question-answering. SearchQA consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average. Each question-answer-context tuple of the SearchQA comes with additional meta-data such as the snippet's URL.

| Model                        | Unigram Acc     | N-gram F1   | EM     |  F1   |Paper / Source | Code |
| -------------                | :-----:| :-----:| :-----:| :-----:|---            | ---  |
|DecaProp (Tay et al., 2018)	   |62.2    |70.8	   |56.8 |63.6    |[Densely Connected Attention Propagation for Reading Comprehension](https://arxiv.org/abs/1811.04210)       | [official](https://github.com/vanzytay/NIPS2018_DECAPROP)     |
|Denoising QA (Lin et al. 2018)| - |-    | 58.8| 64.5|[Denoising Distantly Supervised Open-Domain Question Answering](http://aclweb.org/anthology/P18-1161)|[official](https://github.com/thunlp/OpenQA)|
|R^3 (Wang et al., 2018)	     |-	  |-	 | 49.0| 55.3  |[R^3: Reinforced Ranker-Reader for Open-Domain Question Answering](https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16712/16165)|[official](https://github.com/shuohangwang/mprc)|
|Bi-Attention + DCU-LSTM (Tay et al., 2018)	   |49.4    |59.5	   |- |-    |[Multi-Granular Sequence Encoding via Dilated Compositional Units for Reading Comprehension](http://aclweb.org/anthology/D18-1238)       |      |
|AMANDA (Kundu et al., 2018)	     |46.8	  |56.6	   |- |-    |[A Question-Focused Multi-Factor Attention Network for Question Answering](https://arxiv.org/abs/1801.08290)               | [official](https://github.com/nusnlp/amanda)|
|Focused Hierarchical RNN	(Ke et al., 2018)     |46.8	  |53.4	   |- |-    |[Focused Hierarchical RNNs for Conditional Sequence Processing](http://proceedings.mlr.press/v80/ke18a/ke18a.pdf)||
|ASR (Kadlec et al., 2016) |41.3	  |22.8    |- |-    |[Text Understanding with the Attention Sum Reader Network](https://arxiv.org/abs/1603.01547)| |

## Knowledge Base Question Answering

Knowledge Base Question Answering is the task of answering natural language questions using a knowledge base or knowledge graph such as [DBpedia](https://wiki.dbpedia.org/) or [Wikidata](https://www.wikidata.org/).

### QALD-9
[QALD-9](http://ceur-ws.org/Vol-2241/paper-06.pdf) is a manually curated superset of the previous eight editions of the [Question Answering over Linked Data (QALD) challenge](http://2018.nliwod.org/challenge) published in 2018. It is constructed by human experts to cover a wide range of natural language to SPARQL conversions based on the DBpedia 2016-10 knowledge base. Each question-answer pair has additional meta-data. QALD-9 is best evaluated using the [GERBIL QA platform](http://gerbil-qa.aksw.org/gerbil/config) so that the evaluation numbers are repeatable.

| Annotator | Macro P | Macro R | Macro F1 | Error Count | Average Time/Doc ms | Macro F1 QALD | Paper (including links to webservices/source code)|
|------------------------|:-------:|:-------:|:--------:|:-----------:|:-------------------:|:-------------:|----------------------|
| Elon (WS)              |   0.049 |   0.053 |    0.050 |           2 |                 219 |         0.100 ||
| QASystem (WS)          |   0.097 |   0.116 |    0.098 |           0 |                1014 |         0.200 ||
| TeBaQA (WS)            |   0.129 |   0.134 |    0.130 |           0 |                2668 |         0.222 ||
| wdaqua-core1 (DBpedia) |   0.261 |   0.267 |    0.250 |           0 |                 661 |         0.289 | Diefenbach, Dennis, Kamal Singh, and Pierre Maret. "Wdaqua-core1: a question answering service for rdf knowledge bases." Companion of the The Web Conference 2018 on The Web Conference 2018. International World Wide Web Conferences Steering Committee, 2018. |
| gAnswer (WS)           |   0.293 |   0.327 |    0.298 |           1 |                3076 |         0.430 | Zou, Lei, et al. "Natural language question answering over RDF: a graph data driven approach." Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.|

[Go back to the README](../README.md)


================================================
FILE: english/relation_prediction.md
================================================
# Relation Prediction

## Task

Relation Prediction is the task of recognizing a named relation between two named semantic entities. The common test setup is to hide one entity from the relation triplet, asking the system to recover it based on the other entity and the relation type.

For example, given the triple \<*Roman Jakobson*, *born-in-city*, *?*\>, the system is required to replace the question mark with *Moscow*.

Relation Prediction datasets are typically extracted from two types of resources: 
* *Knowledge Bases*: KBs such as [FreeBase](https://developers.google.com/freebase/) contain hundreds or thousands of relation types pertaining to world-knowledge obtained automatically or semi-automatically from various resources on millions of entities. These relations include *born-in*, *nationality*, *is-in* (for geographical entities), *part-of* (for organizations, among others), and more.
* *Semantic Graphs*: SGs such as [WordNet](https://wordnet.princeton.edu/) are often manually-curated resources of semantic concepts, restricted to more "linguistic" relations compared to free real-world knowledge. The most common semantic relation is *hypernym*, also known as the *is-a* relation (example: \<*cat*, *hypernym*, *feline*\>).

## Evaluation

Evaluation in Relation Prediction hinges on a list of ranked candidates given by the system to the test instance. The metrics below are derived from the location of correct candidate(s) in that list.

A common action performed before evaluation on a given list is *filtering*, where the list is cleaned of entities whose corresponding triples exist in the knowledge graph. Unless specified otherwise, results here are from filtered lists.
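The filtering step can be sketched as follows. This is a minimal illustration, not any system's actual evaluation code: the function name, the toy triples, and the convention of keeping the true test entity in the list while removing all other known-correct entities are assumptions.

```python
def filter_candidates(ranked, head, relation, true_tail, known_triples):
    """Drop candidates that already form a known triple, keeping the test answer."""
    return [
        cand for cand in ranked
        if cand == true_tail or (head, relation, cand) not in known_triples
    ]


# Hypothetical toy graph: three cities are already known to be correct.
known = {
    ("Jakobson", "lived-in-city", "Moscow"),
    ("Jakobson", "lived-in-city", "Prague"),
    ("Jakobson", "lived-in-city", "Boston"),
}
ranked = ["Prague", "Moscow", "Paris", "Boston"]  # system output, best first

filtered = filter_candidates(ranked, "Jakobson", "lived-in-city", "Boston", known)
raw_rank = ranked.index("Boston") + 1         # rank before filtering
filtered_rank = filtered.index("Boston") + 1  # rank after filtering
```

Without filtering, the system would be penalized for ranking other correct answers above the one being tested.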

### Metrics

#### Mean Reciprocal Rank (MRR):

The mean of all reciprocal ranks for the true candidates over the test set (1/rank).

#### Hits at k (H@k):

The rate of correct entities appearing in the top *k* entries for each instance list. This number may exceed 1.00 if the average *k*-truncated list contains more than one true entity.
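Both metrics can be computed from the (filtered) rank of the true entity in each test instance. The sketch below assumes exactly one true entity per instance, so Hits at k stays within [0, 1]; the function names and the example ranks are illustrative only.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test instances (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test instances whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the true entity for four test instances.
ranks = [1, 3, 2, 12]

mrr = mean_reciprocal_rank(ranks)  # (1 + 1/3 + 1/2 + 1/12) / 4
h1 = hits_at_k(ranks, 1)
h10 = hits_at_k(ranks, 10)
```

The tables in this file report these values multiplied by 100.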

### Datasets

#### Freebase-15K-237 (FB15K-237)
The FB15K dataset was introduced in [Bordes et al., 2013](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf). It is a subset of Freebase which contains 14,951 entities with 1,345 different relations. [Toutanova et al.](http://aclweb.org/anthology/D15-1174) first found that this dataset suffers from major test leakage through inverse relations: a large number of test triples can be obtained simply by inverting triples in the training set. To create a dataset without this property, [Toutanova et al.](http://aclweb.org/anthology/D15-1174) introduced FB15k-237, a subset of FB15k where inverse relations are removed.
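The inverse-relation leakage described above can be checked with a simple heuristic: flag every test triple whose entity pair appears in reversed order in the training set. This is only a rough sketch with made-up triples and a hypothetical function name; it is not the procedure used to build FB15k-237.

```python
def inverse_leakage(train, test):
    """Return test triples (h, r, t) for which some (t, r', h) is in train."""
    train_pairs = {(h, t) for h, _, t in train}
    return [(h, r, t) for h, r, t in test if (t, h) in train_pairs]


# Hypothetical triples for illustration.
train = [
    ("Paris", "capital_of", "France"),
    ("Seine", "flows_through", "Paris"),
]
test = [
    ("France", "has_capital", "Paris"),    # inverse of a training triple
    ("Germany", "has_capital", "Berlin"),  # not recoverable by inversion
]

leaked = inverse_leakage(train, test)
leak_rate = len(leaked) / len(test)
```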

#### WordNet-18-RR (WN18RR)

The WN18 dataset was also introduced in [Bordes et al., 2013](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf). It included the full 18 relations scraped from WordNet for roughly 41,000 synsets. Similar to FB15K, this dataset was found to suffer from test leakage by [Dettmers et al. (2018)](https://arxiv.org/abs/1707.01476).

As a way to overcome this problem, [Dettmers et al. (2018)](https://arxiv.org/abs/1707.01476) introduced the [WN18RR](https://github.com/villmow/datasets_knowledge_embedding) dataset, derived from WN18, which features only 11 relations, no pair of which is reciprocal (but still includes four internally-symmetric relations like *verb_group*, allowing a rule-based system to reach 35 on all three metrics).

### Experimental Results

#### WN18RR

The test set is composed of triplets, each used to create two test instances, one for each entity to be predicted. Since each instance is associated with a single true entity, the maximum value for all metrics is 1.00.
   
| Model           | H@10 | H@1 | MRR | Paper / Source | Code | 
| ------------- | :-----:| :-----:| :-----:| --- | --- | 
| Max-Margin Markov Graph Models (Pinter & Eisenstein, 2018) | 59.02 | 45.37 | 49.83 | [Predicting Semantic Relations using Global Graph Properties](https://arxiv.org/abs/1808.08644) | [Official](http://www.github.com/yuvalpinter/m3gm) |
| Concepts of Nearest Neighbors (Ferré, 2020) | 51.9 | 44.4 | 46.9 | [Application of Concepts of Neighbors to Knowledge Graph Completion](https://datasciencehub.net/system/files/ds-paper-633.pdf) | [Official](http://www.irisa.fr/LIS/ferre/pub/link_prediction2020/) |
| KBAT (Deepak et al., 2019) | 58.1 | 36.1 | 44 | [Learning Attention Based Embeddings for Relation Prediction](https://arxiv.org/pdf/1906.01195.pdf) | [Official](https://github.com/deepakn97/relationPrediction) |
| TransE (reimplementation by Pinter & Eisenstein, 2018) | 55.55 | 42.26 | 46.59 | [Translating Embeddings for Modeling Multi-relational Data. ](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf) | [OpenKE](https://github.com/thunlp/OpenKE) |
| ConvKB (Nguyen et al., 2018) | 52.50 | - | 24.80 | [A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network](http://www.aclweb.org/anthology/N18-2053) | [Official](https://github.com/daiquocnguyen/ConvKB) |
| ConvE (v6; Dettmers et al., 2018) | 52.00 | 40.00 | 43.00 | [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476) | [Official](https://github.com/TimDettmers/ConvE) |
| ComplEx (Trouillon et al., 2016) | 51.00 | 41.00 | 44.00 | [Complex Embeddings for Simple Link Prediction](http://www.jmlr.org/proceedings/papers/v48/trouillon16.pdf) | [Official](https://github.com/ttrouill/complex) | 
| DistMult (reimplementation by Dettmers et al., 2018) | 49.00 | 40.00 | 43.00 | [Embedding Entities and Relations for Learning and Inference in Knowledge Bases.](https://arxiv.org/pdf/1412.6575) | [Link](https://github.com/thunlp/OpenKE) |

#### FB15K-237

| Model           | H@10 | H@1 | MRR | Paper / Source | Code | 
| ------------- | :-----:| :-----:| :-----:| --- | --- | 
| KBAT (Deepak et al., 2019) | 62.6 | 46 | 51.8 | [Learning Attention Based Embeddings for Relation Prediction](https://arxiv.org/pdf/1906.01195.pdf) | [Official](https://github.com/deepakn97/relationPrediction) |
| Concepts of Nearest Neighbors (Ferré, 2020) | 44.6 | 22.2 | 29.6 | [Application of Concepts of Neighbors to Knowledge Graph Completion](https://datasciencehub.net/system/files/ds-paper-633.pdf) | [Official](http://www.irisa.fr/LIS/ferre/pub/link_prediction2020/) |
| TransE (reimplementation by Han et al., 2018) | 47.09 | 19.87 | 29.04 | [Translating Embeddings for Modeling Multi-relational Data. ](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf) | [OpenKE](https://github.com/thunlp/OpenKE) |
| TransH (reimplementation by Han et al., 2018) | 41.32 | 5.79 | 17.66 | [Knowledge Graph Embedding by Translating on Hyperplanes.](http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8531/8546) | [OpenKE](https://github.com/thunlp/OpenKE) |
| TransR (reimplementation by Han et al., 2018) | 40.67 | 16.35 | 24.44 | [ Learning Entity and Relation Embeddings for Knowledge Graph Completion.](http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9571/9523/) | [OpenKE](https://github.com/thunlp/OpenKE) |
| TransD (reimplementation by Han et al., 2018) | 46.05 | 14.83 | 25.27 | [Knowledge Graph Embedding via Dynamic Mapping Matrix.](http://anthology.aclweb.org/P/P15/P15-1067.pdf) | [OpenKE](https://github.com/thunlp/OpenKE) |
| ConvKB (Nguyen et al., 2018) | 51.70 | - | 39.60 | [A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network](http://www.aclweb.org/anthology/N18-2053) | [Official](https://github.com/daiquocnguyen/ConvKB) |
| ConvE (v6; Dettmers et al., 2018) | 50.10 | 23.70 | 32.50 | [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476) | [Official](https://github.com/TimDettmers/ConvE) |
| ComplEx (reimplementation by Dettmers et al., 2018) | 42.80 | 15.80 | 24.70 | [Complex Embeddings for Simple Link Prediction](http://www.jmlr.org/proceedings/papers/v48/trouillon16.pdf) | [Official](https://github.com/ttrouill/complex) | 
| DistMult (reimplementation by Dettmers et al., 2018) | 41.90 | 15.50 | 24.10 | [Embedding Entities and Relations for Learning and Inference in Knowledge Bases.](https://arxiv.org/pdf/1412.6575) | [Link](https://github.com/thunlp/OpenKE) |

## Resources
[OpenKE](http://aclweb.org/anthology/D18-2024) is an open toolkit for relational learning which provides a standard training and testing framework. Currently, the implemented models in OpenKE include TransE, TransH, TransR, TransD, RESCAL, DistMult, ComplEx and HolE.

[KRLPapers](https://github.com/thunlp/KRLPapers) is a must-read paper list for relational learning.

[datasets-knowledge-embedding](https://github.com/simonepri/datasets-knowledge-embedding) is a collection of common datasets used in knowledge embedding.

[Back to README](../README.md)


================================================
FILE: english/relationship_extraction.md
================================================
# Relationship Extraction

Relationship extraction is the task of extracting semantic relationships from a text. Extracted relationships usually
occur between two or more entities of a certain type (e.g. Person, Organisation, Location) and fall into a number of
semantic categories (e.g. married to, employed by, lives in).

### Capturing discriminative attributes (SemEval 2018 Task 10)

**Capturing discriminative attributes (SemEval 2018 Task 10)** is a binary classification task where participants were asked to identify whether an attribute could help discriminate between two concepts. Unlike other word similarity prediction tasks, this task focuses on the semantic differences between words.

e.g. red (attribute) can be used to discriminate apple (concept1) from banana (concept2) -> label 1

More examples:

| concept1 | concept2 | attribute | label |
| --------- | -------- | -------- | ----- |
| bookcase | fridge | wood | 1 |
| bucket | mug | round | 0 |
| angle | curve | sharp | 1 |
| pelican | turtle | water | 0 |
| wire | coil | metal | 0 |

Task paper: [https://www.aclweb.org/anthology/S18-1117](https://www.aclweb.org/anthology/S18-1117)

Task Codalab: [https://competitions.codalab.org/competitions/17326](https://competitions.codalab.org/competitions/17326)

| Model | Explainability | F1 Score | Paper / Source | Code |
| ----- | -------------- | -------- | -------------- | ---- |
| **SVM** with GloVe                                                                            | **None**           | **0.76** | [SUNNYNLP at SemEval-2018 Task 10: A Support-Vector-Machine-Based Method for Detecting Semantic Difference using Taxonomy and Word Embedding Features](https://aclweb.org/anthology/S18-1118) | [Author's](https://github.com/Yermouth/sunnynlp)                    |
| **SVM** with ConceptNet, Wikipedia articles and WordNet synonyms                              | None               | 0.74     | [Luminoso at SemEval-2018 Task 10: Distinguishing Attributes Using Text Corpora and Relational Knowledge](https://aclweb.org/anthology/S18-1162)                                              | [Author's](https://github.com/LuminosoInsight/semeval-discriminatt) |
| **MLP** combining information from various DSMs, PMI, and ConceptNet                          | None               | 0.73     | [THU NGN at SemEval-2018 Task 10: Capturing Discriminative Attributes with MLP-CNN model](https://aclweb.org/anthology/S18-1157)                                                              |                                                                     |
| **Gradient boosting** with co-occurrence count features and JoBimText features                | None               | 0.73     | [BomJi at SemEval-2018 Task 10: Combining Vector-, Pattern- and Graph-based Information to Identify Discriminative Attributes](https://aclweb.org/anthology/S18-1163)                         |                                                                     |
| LexVec, word co-occurrence, and ConceptNet data combined using **maximum entropy classifier** | None               | 0.72     | [UWB at SemEval-2018 Task 10: Capturing Discriminative Attributes from Word Distributions](https://aclweb.org/anthology/S18-1153)                                                             | [Author's](https://github.com/dpaperno/DiscriminAtt)                |
| Composes explicit **vector spaces** from WordNet Definitions, ConceptNet and Visual Genome    | **Fully Explainable**  | **0.69** | [Identifying and Explaining Discriminative Attributes](https://arxiv.org/abs/1909.05363)                                                                                                      | [Author's](https://github.com/ab-10/Hawk)                           |
| **Word2Vec** cosine similarities of WordNet glosses                                           | Transp. (No expl.) | 0.69     | [Meaning space at SemEval-2018 Task 10: Combining explicitly encoded knowledge with information extracted from word embeddings](https://aclweb.org/anthology/S18-1154)                        | [Author's](https://github.com/cltl/meaning_space)                   |
| Use of Wikipedia and ConceptNet                                                               | Transp. (No expl.) | 0.69     | [ELiRF-UPV at SemEval-2018 Task 10: Capturing Discriminative Attributes with Knowledge Graphs and Wikipedia](https://aclweb.org/anthology/S18-1159)                                           |                                                                     |

### FewRel

The Few-Shot Relation Classification Dataset (FewRel) uses a different setting from the previous datasets. This dataset consists of 70K sentences expressing 100 relations, annotated by crowdworkers on a Wikipedia corpus. The few-shot learning task follows the N-way K-shot meta-learning setting.
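In an N-way K-shot episode, N relations are sampled, each with K labelled support sentences, and the model must classify held-out query sentences into one of those N relations. The sketch below shows one way to sample such an episode; the function name, the data layout, and the toy data are assumptions and do not reflect the FewRel toolkit.

```python
import random


def sample_episode(data, n_way, k_shot, n_query, rng):
    """Sample one episode from `data`, a dict of relation name -> sentences."""
    relations = rng.sample(sorted(data), n_way)
    support, query = [], []
    for rel in relations:
        # Draw support and query sentences together so they never overlap.
        sentences = rng.sample(data[rel], k_shot + n_query)
        support += [(s, rel) for s in sentences[:k_shot]]
        query += [(s, rel) for s in sentences[k_shot:]]
    return support, query


# Hypothetical toy data: 8 relations with 10 sentences each.
data = {f"rel_{i}": [f"sentence {i}-{j}" for j in range(10)] for i in range(8)}

# A 5-way 1-shot episode with 2 query sentences per relation.
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=2,
                                rng=random.Random(0))
```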

The public leaderboard is available on the [FewRel website](http://www.zhuhao.me/fewrel/).

### FewRel 2

FewRel 2 extends FewRel to test (1) adaptability to a new domain with only a handful of instances and (2) the ability to detect none-of-the-above relations. The paper is available at [ACL Web](https://www.aclweb.org/anthology/D19-1649.pdf).

The public leaderboard is available on the [FewRel 2 website](https://thunlp.github.io/2/fewrel2_da.html).

### Multi-Way Classification of Semantic Relations Between Pairs of Nominals (SemEval 2010 Task 8)

[SemEval-2010](http://www.aclweb.org/anthology/S10-1006) introduced 'Task 8 - Multi-Way Classification of Semantic
Relations Between Pairs of Nominals'. The task is, given a sentence and two tagged nominals, to predict the relation
between those nominals *and* the direction of the relation. The dataset contains nine general semantic relations
together with a tenth 'OTHER' relation.

Example:
 > There were apples, **pears** and oranges in the **bowl**.

 `(content-container, pears, bowl)`

The main evaluation metric used is macro-averaged F1, averaged across the nine proper relationships (i.e. excluding the
OTHER relation), taking directionality of the relation into account.
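A simplified sketch of this metric is shown below. It assumes labels of the form `Relation(e1,e2)` with the direction encoded in the label and a single `Other` label; a prediction counts as correct only when both relation and direction match. It is an illustration of the idea, not a replacement for the official task scorer.

```python
from collections import Counter


def macro_f1_excluding_other(gold, pred, other="Other"):
    """Macro-F1 over base relations; direction must match to count as correct."""
    def base(label):
        return label.split("(")[0]

    relations = sorted({base(g) for g in gold if g != other})
    tp, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        n_gold[base(g)] += 1
        n_pred[base(p)] += 1
        if g == p:
            tp[base(g)] += 1

    scores = []
    for rel in relations:
        precision = tp[rel] / n_pred[rel] if n_pred[rel] else 0.0
        recall = tp[rel] / n_gold[rel] if n_gold[rel] else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical gold labels and predictions.
gold = ["Content-Container(e1,e2)", "Content-Container(e2,e1)",
        "Cause-Effect(e1,e2)", "Other"]
pred = ["Content-Container(e1,e2)", "Content-Container(e1,e2)",
        "Cause-Effect(e1,e2)", "Cause-Effect(e1,e2)"]

score = macro_f1_excluding_other(gold, pred)
```

In this toy example the second prediction has the right relation but the wrong direction, so it counts as an error for Content-Container.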

Several papers have used additional data (e.g. pre-trained word embeddings, WordNet) to improve performance. The figures
reported here are the highest achieved by the model using any external resources.

#### End-to-End Models

| Model                                  | F1    | Paper / Source  | Code           |
| -------------------------------------- | ----- | --------------- | -------------- |
| *BERT-based Models* |
| A-GCN (Tian et al., 2021) | **89.85** | [Dependency-driven Relation Extraction with Attentive Graph Convolutional Networks](https://aclanthology.org/2021.acl-long.344/) | [Official](https://github.com/cuhksz-nlp/RE-AGCN) |
| Matching-the-Blanks (Baldini Soares et al., 2019) | 89.5 | [Matching the Blanks: Distributional Similarity for Relation Learning](https://www.aclweb.org/anthology/P19-1279) |
| R-BERT (Wu et al., 2019) | 89.25 | [Enriching Pre-trained Language Model with Entity Information for Relation Classification](https://arxiv.org/abs/1905.08284) | [mickeystroller's Reimplementation](https://github.com/mickeystroller/R-BERT) |
| *CNN-based Models* |
| Multi-Attention CNN (Wang et al. 2016) | **88.0** | [Relation Classification via Multi-Level Attention CNNs](http://aclweb.org/anthology/P16-1123) | [lawlietAi's Reimplementation](https://github.com/lawlietAi/relation-classification-via-attention-model) |
| Attention CNN (Huang and Y Shen, 2016) | 84.3<br>85.9<sup>[\*](#footnote)</sup> | [Attention-Based Convolutional Neural Network for Semantic Relation Extraction](http://www.aclweb.org/anthology/C16-1238) |
| CR-CNN (dos Santos et al., 2015)       | 84.1  | [Classifying Relations by Ranking with Convolutional Neural Network](https://www.aclweb.org/anthology/P15-1061) | [pratapbhanu's Reimplementation](https://github.com/pratapbhanu/CRCNN) |
| CNN (Zeng et al., 2014)                | 82.7  | [Relation Classification via Convolutional Deep Neural Network](http://www.aclweb.org/anthology/C14-1220) | [roomylee's Reimplementation](https://github.com/roomylee/cnn-relation-extraction) |
| *RNN-based Models* |
| Entity Attention Bi-LSTM (Lee et al., 2019) | **85.2** | [Semantic Relation Classification via Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing](https://arxiv.org/abs/1901.08163) | [Official](https://github.com/roomylee/entity-aware-relation-classification) |
| Hierarchical Attention Bi-LSTM (Xiao and C Liu, 2016) | 84.3 | [Semantic Relation Classification via Hierarchical Recurrent Neural Network with Attention](http://www.aclweb.org/anthology/C16-1119) |
| Attention Bi-LSTM (Zhou et al., 2016)  | 84.0 | [Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification](http://www.aclweb.org/anthology/P16-2034) | [SeoSangwoo's Reimplementation](https://github.com/SeoSangwoo/Attention-Based-BiLSTM-relation-extraction) |
| Bi-LSTM (Zhang et al., 2015)           | 82.7<br>84.3<sup>[\*](#footnote)</sup> | [Bidirectional long short-term memory networks for relation classification](http://www.aclweb.org/anthology/Y15-1009) |

<a name="footnote">*</a>: It uses external lexical resources, such as WordNet, part-of-speech tags, dependency tags, and named entity tags.

#### Dependency Models

| Model                               | F1    | Paper / Source  | Code           |
| ----------------------------------- | ----- | --------------- | -------------- |
| BRCNN (Cai et al., 2016)            | **86.3**  | [Bidirectional Recurrent Convolutional Neural Network for Relation Classification](http://www.aclweb.org/anthology/P16-1072) |
| DRNNs (Xu et al., 2016)             | 86.1  | [Improved Relation Classification by Deep Recurrent Neural Networks with Data Augmentation](https://arxiv.org/abs/1601.03651) |
| depLCNN + NS (Xu et al., 2015a)     | 85.6 | [Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling](https://www.aclweb.org/anthology/D/D15/D15-1062.pdf) |
| SDP-LSTM (Xu et al., 2015b)         | 83.7  | [Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Path](https://arxiv.org/abs/1508.03720) | [Sshanu's Reimplementation](https://github.com/Sshanu/Relation-Classification) |
| DepNN (Liu et al., 2015)            | 83.6  | [A Dependency-Based Neural Network for Relation Classification](http://www.aclweb.org/anthology/P15-2047) |
| FCN (Yu et al., 2014)               | 83.0  | [Factor-based compositional embedding models](https://www.cs.cmu.edu/~mgormley/papers/yu+gormley+dredze.nipsw.2014.pdf) |
| MVRNN (Socher et al., 2012)         | 82.4  | [Semantic Compositionality through Recursive Matrix-Vector Spaces](http://aclweb.org/anthology/D12-1110) | [pratapbhanu's Reimplementation](https://github.com/pratapbhanu/MVRNN) |

### New York Times Corpus

The standard corpus for distantly supervised relationship extraction is the New York Times (NYT) corpus, published in
[Riedel et al, 2010](http://www.riedelcastro.org//publications/papers/riedel10modeling.pdf).

This contains text from the [New York Times Annotated Corpus](https://catalog.ldc.upenn.edu/ldc2008t19) with named
entities extracted from the text using the Stanford NER system and automatically linked to entities in the Freebase
knowledge base. Pairs of named entities are labelled with relationship types by aligning them against facts in the
Freebase knowledge base. (The process of using a separate database to provide labels is known as 'distant supervision'.)

Example:
 > **Elevation Partners**, the $1.9 billion private equity group that was founded by **Roger McNamee**

 `(founded_by, Elevation_Partners, Roger_McNamee)`
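The labelling step of distant supervision can be sketched as a lookup of each ordered entity pair in the knowledge base. The function name, the dictionary layout of the knowledge base, and the `NA` label for pairs without a matching fact are assumptions made for illustration.

```python
from itertools import permutations


def distant_labels(sentence_entities, kb_facts, no_relation="NA"):
    """Label every ordered entity pair in a sentence using knowledge base facts.

    sentence_entities: linked entity IDs found in one sentence.
    kb_facts: dict mapping (subject, object) -> relation.
    """
    return [
        (kb_facts.get((subj, obj), no_relation), subj, obj)
        for subj, obj in permutations(sentence_entities, 2)
    ]


# Toy knowledge base containing the fact from the example above.
kb = {("Elevation_Partners", "Roger_McNamee"): "founded_by"}
labels = distant_labels(["Elevation_Partners", "Roger_McNamee"], kb)
```

Because the label comes from the knowledge base rather than from the sentence itself, a sentence that mentions both entities without expressing the relation would still receive the label, which is the main source of noise in this setting.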

Different papers have reported various metrics since the release of the dataset, making it difficult to compare systems
directly. The main metrics used are either precision at N results or precision-recall curves. The range of recall
has increased over the years as systems have improved, with earlier systems having very low precision at 30% recall.
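Precision at a given recall level can be read off the precision-recall curve: rank all predicted facts by confidence and report precision at the first point where recall reaches the target. The sketch below assumes predictions are given as `(confidence, is_correct)` pairs and that the total number of true facts is known; names and data are illustrative.

```python
def precision_at_recall(predictions, n_gold, recall_level):
    """Precision at the first rank where recall reaches `recall_level`.

    predictions: list of (confidence, is_correct) pairs.
    n_gold: total number of true facts in the test set.
    """
    correct = 0
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        if correct / n_gold >= recall_level:
            return correct / i
    return None  # the system never reaches this recall level


# Hypothetical predictions; the test set is assumed to contain 10 true facts.
preds = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.2, False)]

p_at_10 = precision_at_recall(preds, n_gold=10, recall_level=0.1)
p_at_30 = precision_at_recall(preds, n_gold=10, recall_level=0.3)
p_at_50 = precision_at_recall(preds, n_gold=10, recall_level=0.5)
```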


| Model                               | P@10% | P@30% | Paper / Source | Code           |
| ----------------------------------- | ----- | ----- | --------------- | -------------- |
| KGPOOL (Nadgeri et al., 2021) | 92.3 | 86.7 | [KGPool: Dynamic Knowledge Graph Context Selection for Relation Extraction](https://arxiv.org/pdf/2106.00459.pdf) | [KGPOOL](https://github.com/nadgeri14/KGPool) |
| RECON (Bastos et al., 2021) | 87.5 | 74.1 | [RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network](https://arxiv.org/pdf/2009.08694.pdf) | [RECON](https://github.com/ansonb/RECON) |
| HRERE (Xu et al., 2019) | 84.9 | 72.8 | [Connecting Language and Knowledge with Heterogeneous Representations for Neural Relation Extraction](https://arxiv.org/abs/1903.10126) | [HRERE](https://github.com/billy-inn/HRERE) |
| PCNN+noise_convert+cond_opt (Wu et al., 2019)         | 81.7   | 61.8   | [Improving Distantly Supervised Relation Extraction with Neural Noise Converter and Conditional Optimal Selector](https://arxiv.org/pdf/1811.05616.pdf) |  |
| Intra- and Inter-Bag (Ye and Ling, 2019)         | 78.9   | 62.4   | [Distant Supervision Relation Extraction with Intra-Bag and Inter-Bag Attentions](https://arxiv.org/pdf/1904.00143.pdf) | [Code](https://github.com/ZhixiuYe/Intra-Bag-and-Inter-Bag-Attentions) |
| RESIDE (Vashishth et al., 2018)         | 73.6   | 59.5   | [RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information](http://malllabiisc.github.io/publications/papers/reside_emnlp18.pdf) | [RESIDE](https://github.com/malllabiisc/RESIDE) |
| PCNN+ATT (Lin et al., 2016)         | 69.4   | 51.8   | [Neural Relation Extraction with Selective Attention over Instances](http://www.aclweb.org/anthology/P16-1200) | [OpenNRE](https://github.com/thunlp/OpenNRE/) |
| MIML-RE (Surdeanu et al., 2012)     | 60.7+  |   -   | [Multi-instance Multi-label Learning for Relation Extraction](http://www.aclweb.org/anthology/D12-1042) | [Mimlre](https://nlp.stanford.edu/software/mimlre.shtml) |
| MultiR (Hoffmann et al., 2011)      | 60.9+  |   -   | [Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations](http://www.aclweb.org/anthology/P11-1055) | [MultiR](http://aiweb.cs.washington.edu/ai/raphaelh/mr/) |
| (Mintz et al., 2009)                | 39.9+  |   -   | [Distant supervision for relation extraction without labeled data](http://www.aclweb.org/anthology/P09-1113) | |

(+) Obtained from results in the paper "Neural Relation Extraction with Selective Attention over Instances"

#### WikiData dataset for Sentential Relation Extraction

Sentential RE ignores any other occurrences of the given entity pair, so the target relation is predicted at the sentence level ([Sorokin and Gurevych, 2017](https://www.aclweb.org/anthology/D17-1188.pdf)). The paper introduces a dataset built on the Wikidata KG containing 353 relations.

| Model                               | F1    | Paper / Source  | Code           |
| ----------------------------------- | ----- | --------------- | -------------- |
| KGPOOL (Nadgeri et al., 2021)            | **88.60**  | [KGPool: Dynamic Knowledge Graph Context Selection for Relation Extraction](https://arxiv.org/pdf/2106.00459.pdf) |
| RECON (Bastos et al., 2021)            | 87.23  | [RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network](https://arxiv.org/pdf/2009.08694.pdf) |
| GPGNN (Zhu et al., 2019)             | 82.29  | [Graph Neural Networks with Generated Parameters for Relation Extraction](https://www.aclweb.org/anthology/P19-1128.pdf) |
| ContextAware (Sorokin and Gurevych, 2017)     | 72.07 | [Context-Aware Representations for Knowledge Base Relation Extraction](https://www.aclweb.org/anthology/D17-1188.pdf) |

#### Joint Entity and Relation Extraction

In this task, binary relation tuples (two entities and a relation between them) are jointly extracted from sentences. The input to the models is just the sentences and a set of relations; the output is a set of relation tuples. Models should extract all relation tuples present in the sentences with full entity names and overlapping entities. F1 score is used to evaluate the models. An extracted tuple is considered correct if the two entities and the relation match a ground truth tuple.
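The tuple-level F1 described above can be sketched as follows, counting an extracted tuple as correct only on an exact match of both entity names and the relation. The function name and the example tuples are made up for illustration.

```python
def tuple_f1(gold, pred):
    """F1 over relation tuples; gold and pred hold one set of tuples per sentence."""
    n_correct = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical tuples for two sentences: (entity1, relation, entity2).
gold = [
    {("Elevation Partners", "founded_by", "Roger McNamee")},
    {("A", "r", "B"), ("A", "r", "C")},
]
pred = [
    {("Elevation", "founded_by", "Roger McNamee")},  # partial entity name: wrong
    {("A", "r", "B"), ("A", "r", "C")},
]

f1 = tuple_f1(gold, pred)
```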

##### NYT29

This dataset is derived from the New York Times dataset of [Riedel et al., 2010](http://www.riedelcastro.org//publications/papers/riedel10modeling.pdf). It has 29 relations. 

| Model           | F1  |  Paper / Source | Code |
| ------------- | ----- | --- | --- |
| WDec (Nayak and Ng, 2020) | 0.682 | [Effective Modeling of Encoder-Decoder Architecture for Joint Entity and Relation Extraction](https://arxiv.org/pdf/1911.09886.pdf) | [PtrNetDecoding4JERE](https://github.com/nusnlp/PtrNetDecoding4JERE) |
| PNDec (Nayak and Ng, 2020) | 0.673 | [Effective Modeling of Encoder-Decoder Architecture for Joint Entity and Relation Extraction](https://arxiv.org/pdf/1911.09886.pdf) | [PtrNetDecoding4JERE](https://github.com/nusnlp/PtrNetDecoding4JERE) |
| HRLRE (Takanobu et al., 2019) | 0.643 | [A Hierarchical Framework for Relation Extraction with Reinforcement Learning](https://arxiv.org/pdf/1811.03925.pdf) | [HRLRE](https://github.com/truthless11/HRL-RE) |

##### NYT24

This dataset is derived from the New York Times dataset of [Hoffmann et al., 2011](https://www.aclweb.org/anthology/P11-1055.pdf). It has 24 relations.

| Model           | F1  |  Paper / Source | Code |
| ------------- | ----- | --- | --- |
| WDec (Nayak and Ng, 2020) | 0.817 | [Effective Modeling of Encoder-Decoder Architecture for Joint Entity and Relation Extraction](https://arxiv.org/pdf/1911.09886.pdf) | [PtrNetDecoding4JERE](https://github.com/nusnlp/PtrNetDecoding4JERE) |
| PNDec (Nayak and Ng, 2020) | 0.789 | [Effective Modeling of Encoder-Decoder Architecture for Joint Entity and Relation Extraction](https://arxiv.org/pdf/1911.09886.pdf) | [PtrNetDecoding4JERE](https://github.com/nusnlp/PtrNetDecoding4JERE) |
| HRLRE (Takanobu et al., 2019) | 0.776 | [A Hierarchical Framework for Relation Extraction with Reinforcement Learning](https://arxiv.org/pdf/1811.03925.pdf) | [HRLRE](https://github.com/truthless11/HRL-RE) |

### TACRED

[TACRED](https://nlp.stanford.edu/projects/tacred/) is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the [corpus](https://catalog.ldc.upenn.edu/LDC2018T03) used in the yearly [TAC Knowledge Base Population (TAC KBP) challenges](https://tac.nist.gov/2017/KBP/index.html). Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., _per:schools_attended_ and _org:members_) or are labeled as _no_relation_ if no defined relation holds. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing.

Example:
 > *Billy Mays*, the bearded, boisterous pitchman who, as the undisputed king of TV yell and sell, became an unlikely pop culture icon, died at his home in *Tampa*, Fla, on Sunday.

 `(per:city_of_death, Billy Mays, Tampa)`

The main evaluation metric used is micro-averaged F1 over instances with proper relationships (i.e. excluding the
_no_relation_ type).
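A minimal sketch of this metric is given below: correct predictions, predicted positives and gold positives are pooled over all instances, and `no_relation` never counts as a positive. The function name and the toy labels are illustrative; this is not the official TACRED scorer.

```python
def micro_f1(gold, pred, negative="no_relation"):
    """Micro-averaged F1 that ignores the negative class on both sides."""
    n_correct = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    n_pred = sum(1 for p in pred if p != negative)
    n_gold = sum(1 for g in gold if g != negative)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical gold labels and predictions for four instances.
gold = ["per:city_of_death", "no_relation", "org:members", "no_relation"]
pred = ["per:city_of_death", "org:members", "no_relation", "no_relation"]

score = micro_f1(gold, pred)
```

Correctly predicting `no_relation` (the last instance) does not raise the score, while a spurious relation on a `no_relation` instance lowers precision.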

| Model                                  | F1    | Paper / Source  | Code           |
| -------------------------------------- | ----- | --------------- | -------------- |
| LUKE (Yamada et al., 2020) | **72.7** | [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://www.aclweb.org/anthology/2020.emnlp-main.523) | [Official](https://github.com/studio-ousia/luke) |
| Matching-the-Blanks (Baldini Soares et al., 2019) | 71.5 | [Matching the Blanks: Distributional Similarity for Relation Learning](https://www.aclweb.org/anthology/P19-1279) | |
| C-GCN + PA-LSTM (Zhang et al., 2018) | 68.2 | [Graph Convolution over Pruned Dependency Trees Improves Relation Extraction](http://aclweb.org/anthology/D18-1244) | [Official](https://github.com/qipeng/gcn-over-pruned-trees) |
| PA-LSTM (Zhang et al., 2017) | 65.1 | [Position-aware Attention and Supervised Data Improve Slot Filling](http://aclweb.org/anthology/D17-1004) | [Official](https://github.com/yuhaozhang/tacred-relation) |

[Go back to the README](../README.md)


================================================
FILE: english/semantic_parsing.md
================================================
# Semantic parsing

### Table of contents

- [AMR parsing](#amr-parsing)
  - [LDC2014T12](#ldc2014t12)
  - [LDC2015E86](#ldc2015e86)
  - [LDC2017T10 (LDC2016E25)](#ldc2017t10-ldc2016e25)
  - [LDC2020T02](#ldc2020t02)
- [DRS parsing](#drs-parsing)
  - [PMB 2.2.0](#pmb-220)
  - [PMB 3.0.0](#pmb-300)
  - [RST-DT](#rst-dt)
- [UCCA parsing](#ucca-parsing)
  - [SemEval 2019 Task 1](#semeval-2019-task-1)
  - [CoNLL 2019](#conll-2019)
- [Semantic Dependency Parsing](#semantic-dependency-parsing)
  - [SemEval 2015 Task 18](#semeval-2015-task-18)
- [SQL parsing](#sql-parsing)
  - [ATIS](#atis)
  - [Advising](#advising)
  - [GeoQuery](#geoquery)
  - [Scholar](#scholar)
  - [Spider](#spider)
  - [WikiSQL](#wikisql)
  - [Smaller datasets](#smaller-datasets)

Semantic parsing is the task of translating natural language into a formal meaning
representation on which a machine can act. Representations may be an executable language
such as SQL or more abstract representations such as [Abstract Meaning Representation (AMR)](https://en.wikipedia.org/wiki/Abstract_Meaning_Representation)
and [Universal Conceptual Cognitive Annotation (UCCA)](http://www.cs.huji.ac.il/~oabend/ucca.html).

## AMR parsing

Each AMR is a single rooted, directed graph. AMRs include PropBank semantic roles, within-sentence coreference, named entities and types, modality, negation, questions, quantities, and so on. See the [AMR website](https://amr.isi.edu/index.html) for details.
In the following tables, systems marked with &hearts; are pipeline systems that require POS tags as input,
&spades; marks those that require NER,
&diams; marks those that require syntactic parsing,
and &clubs; marks those that require SRL.

### LDC2014T12:
The dataset contains 13,051 AMRs split across training, dev, and test partitions.

Models are evaluated on the newswire section and the full dataset based on [smatch](https://amr.isi.edu/smatch-13.pdf).
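
As a rough illustration of what Smatch measures, the sketch below computes the F1 over matching triples under the best one-to-one variable mapping between a predicted and a gold graph. It is a simplification under stated assumptions: the official tool searches for the mapping with hill-climbing and restarts, whereas this version tries every mapping exhaustively and is only practical for tiny graphs. The triple layout `(relation, source, target)`, the function names, and the toy graphs are all assumptions made for the example.

```python
from itertools import permutations


def f_score(matched, n_pred, n_gold):
    """F1 from a count of matched triples and the two graph sizes."""
    if matched == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_gold
    return 2 * precision * recall / (precision + recall)


def smatch_brute_force(pred, gold, pred_vars, gold_vars):
    """Exhaustive search over variable mappings (tiny graphs only)."""
    gold_set = set(gold)
    # Pad with None so that a predicted variable may stay unmapped.
    targets = list(gold_vars) + [None] * len(pred_vars)
    best = 0
    for assignment in permutations(targets, len(pred_vars)):
        mapping = dict(zip(pred_vars, assignment))
        # Rename predicted variables; constants are left unchanged.
        renamed = {(rel, mapping.get(src, src), mapping.get(tgt, tgt))
                   for rel, src, tgt in pred}
        best = max(best, len(renamed & gold_set))
    return f_score(best, len(set(pred)), len(gold_set))


# Toy graphs for "The boy wants to go" (hypothetical example).
gold = [("instance", "w", "want-01"), ("instance", "b", "boy"),
        ("instance", "g", "go-01"), ("ARG0", "w", "b"),
        ("ARG1", "w", "g"), ("ARG0", "g", "b")]
# The prediction uses different variable names and misses one edge.
pred = [("instance", "x", "want-01"), ("instance", "y", "boy"),
        ("instance", "z", "go-01"), ("ARG0", "x", "y"),
        ("ARG1", "x", "z")]
score = smatch_brute_force(pred, gold, ["x", "y", "z"], ["w", "b", "g"])
```

Here all five predicted triples match under the mapping x→w, y→b, z→g, while one gold triple is missed, so precision is 1 and recall is 5/6.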

| Model           | F1 Newswire  | F1 Full |  Paper / Source |
| ------------- | :-----:| :-----:| --- |
| StructBART (Structure-aware Fine-tuning of BART, Zhou et al., 2021) | -- | 81.7 | [Structure-aware Fine-tuning of Sequence-to-sequence Transformers for Transition-based AMR Parsing](https://aclanthology.org/2021.emnlp-main.507/) |
| APT (Action-Pointer Transformer, Zhou et al., 2021) | -- | 79.8 | [AMR Parsing with Action-Pointer Transformer](https://aclanthology.org/2021.naacl-main.443/) |
| Pushing the Limits of AMR Parsing with Self-Learning (Young-Suk Lee et al., 2020) | -- | 78.2 | [Pushing the Limits of AMR Parsing with Self-Learning](https://arxiv.org/abs/2010.10673) |
| AMR Parsing via Graph-Sequence Iterative Inference (Cai and Lam, 2020)&hearts;&spades; | -- | 75.4 | [AMR Parsing via Graph-Sequence Iterative Inference](https://arxiv.org/pdf/2004.05572.pdf) |
| Broad-Coverage Semantic Parsing as Transduction (Zhang et al., 2019)&hearts; | -- | 71.3 | [Broad-Coverage Semantic Parsing as Transduction](https://www.aclweb.org/anthology/D19-1392.pdf) |
| Two-stage Sequence-to-Graph Transducer (Zhang et al., 2019)&hearts; | -- | 70.2 | [AMR Parsing as Sequence-to-Graph Transduction](https://www.aclweb.org/anthology/P19-1009.pdf) |
| Transition-based+improved aligner+ensemble (Liu et al., 2018)&hearts; | 73.3 | 68.4 | [An AMR Aligner Tuned by Transition-based Parser](http://aclweb.org/anthology/D18-1264) |
| Improved CAMR (Wang and Xue, 2017)&spades;&diams; | --| 68.1| [Getting the Most out of AMR Parsing](http://aclweb.org/anthology/D17-1129) |
| Incremental joint model (Zhou et al., 2016)&hearts;&spades; | 71 | 66 | [AMR Parsing with an Incremental Joint Model](https://aclweb.org/anthology/D16-1065) |
| Transition-based transducer (Wang et al., 2015)&hearts;&diams;&clubs; | 70 | 66 | [Boosting Transition-based AMR Parsing with Refined Actions and Auxiliary Analyzers](http://www.aclweb.org/anthology/P15-2141) |
| Imitation learning (Goodman et al., 2016)&hearts;&spades; | 70 | -- | [Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing](http://www.aclweb.org/anthology/P16-1001) |
| MT-Based (Pust et al., 2015)&spades; | -- | 66 | [Parsing English into Abstract Meaning Representation Using Syntax-Based Machine Translation](http://www.aclweb.org/anthology/D15-1136) |
| Transition-based parser-Stack-LSTM (Ballesteros and Al-Onaizan, 2017)&hearts;&diams; | 69 | 64  | [AMR Parsing using Stack-LSTMs](http://www.aclweb.org/anthology/D17-1130) |
| Transition-based parser-Stack-LSTM (Ballesteros and Al-Onaizan, 2017) | 68 | 63  | [AMR Parsing using Stack-LSTMs](http://www.aclweb.org/anthology/D17-1130) |

### LDC2015E86:
The dataset contains 19,572 AMRs split across training, dev, and test partitions.

Models are evaluated based on [smatch](https://amr.isi.edu/smatch-13.pdf).

| Model           | Smatch  |  Paper / Source |
| ------------- | :-----:| --- |
| Joint model (Lyu and Titov, 2018)&hearts;&spades; | 73.7 | [AMR Parsing as Graph Prediction with Latent Alignment](https://arxiv.org/abs/1805.05286) |
| Mul-BiLSTM (Foland and Martin, 2017)&spades; | 70.7 | [Abstract Meaning Representation Parsing using LSTM Recurrent Neural Networks](http://aclweb.org/anthology/P17-1043) |
| JAMR (Flanigan et al., 2016)&hearts;&diams;&clubs; | 67.0 | [CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss](http://www.aclweb.org/anthology/S16-1186) |
| CAMR (Wang et al., 2016)&hearts;&diams;&clubs; | 66.5 | [CAMR at SemEval-2016 Task 8: An Extended Transition-based AMR Parser](http://www.aclweb.org/anthology/S16-1181) |
| AMREager (Damonte et al., 2017)&hearts;&spades;&diams; | 64.0 | [An Incremental Parser for Abstract Meaning Representation](http://www.aclweb.org/anthology/E17-1051) |
| SEQ2SEQ + 20M (Konstas et al., 2017)&spades; | 62.1 | [Neural AMR: Sequence-to-Sequence Models for Parsing and Generation](https://arxiv.org/abs/1704.08381) |

### LDC2017T10 (LDC2016E25):
The dataset contains 39,260 AMRs split across training, dev, and test partitions.

Models are evaluated based on [smatch](https://amr.isi.edu/smatch-13.pdf).

| Model           | Smatch  |  Paper / Source |
| ------------- | :-----:| --- |
| StructBART (Structure-aware Fine-tuning of BART, Zhou et al., 2021) | 84.9 | [Structure-aware Fine-tuning of Sequence-to-sequence Transformers for Transition-based AMR Parsing](https://aclanthology.org/2021.emnlp-main.507/) |
| One SPRING to Rule Them Both: Symmetric AMR Semantic Parsing and Generation without a Complex Pipeline (Bevilacqua et al., 2020) | 84.5 | [One SPRING to Rule Them Both: Symmetric AMR Semantic Parsing and Generation without a Complex Pipeline](https://ojs.aaai.org/index.php/AAAI/article/view/17489) |
| APT (Action-Pointer Transformer, Zhou et al., 2021) | 83.4 | [AMR Parsing with Action-Pointer Transformer](https://aclanthology.org/2021.naacl-main.443/) |
| AMR Parsing with Sequence-to-Sequence Pre-training (Xu et al., 2020) | 81.4 | [Improving AMR Parsing with Sequence-to-Sequence Pre-training](https://arxiv.org/pdf/2010.01771.pdf) |
| Pushing the Limits of AMR Parsing with Self-Learning (Young-Suk Lee et al., 2020) | 81.3 | [Pushing the Limits of AMR Parsing with Self-Learning](https://arxiv.org/abs/2010.10673) |
| AMR Parsing via Graph-Sequence Iterative Inference (Cai and Lam, 2020)&hearts;&spades; | 80.2 | [AMR Parsing via Graph-Sequence Iterative Inference](https://arxiv.org/pdf/2004.05572.pdf) |
| Broad-Coverage Semantic Parsing as Transduction (Zhang et al., 2019)&hearts; | 77.0 | [Broad-Coverage Semantic Parsing as Transduction](https://www.aclweb.org/anthology/D19-1392.pdf) |
| Two-stage Sequence-to-Graph Transducer (Zhang et al., 2019)&hearts; | 76.3 | [AMR Parsing as Sequence-to-Graph Transduction](https://www.aclweb.org/anthology/P19-1009.pdf) |
| Rewarding Smatch: Transition-Based AMR Parsing with Reinforcement Learning (Naseem et al., 2019)&hearts;&spades;&diams; | 75.5 | [Rewarding Smatch: Transition-Based AMR Parsing with Reinforcement Learning](https://arxiv.org/pdf/1905.13370) |
| Joint model (Lyu and Titov, 2018)&hearts;&spades; | 74.4 | [AMR Parsing as Graph Prediction with Latent Alignment](https://arxiv.org/abs/1805.05286) |
| Rewarding Smatch: Transition-Based AMR Parsing with Reinforcement Learning (Naseem et al., 2019) | 73.4 | [Rewarding Smatch: Transition-Based AMR Parsing with Reinforcement Learning](https://arxiv.org/pdf/1905.13370) |
| Core Semantic First: A Top-down Approach for AMR Parsing (Cai and Lam, 2019)&hearts;&spades; | 73.2 | [Core Semantic First: A Top-down Approach for AMR Parsing](https://www.aclweb.org/anthology/D19-1393.pdf) |
| ChSeq + 100K (van Noord and Bos, 2017)&hearts; | 71.0 | [Neural Semantic Parsing by Character-based Translation: Experiments with Abstract Meaning Representations](https://clinjournal.org/clinj/article/view/72/64) |
| Neural-Pointer (Buys and Blunsom, 2017)&hearts;&spades; | 61.9 | [Oxford at SemEval-2017 Task 9: Neural AMR Parsing with Pointer-Augmented Attention](http://aclweb.org/anthology/S17-2157) |

### LDC2020T02
The dataset contains 59,255 AMRs split across training, dev, and test partitions.

Models are evaluated based on [smatch](https://amr.isi.edu/smatch-13.pdf).

| Model           | Smatch  |  Paper / Source |
| ------------- | :-----:| --- |
| StructBART (Structure-aware Fine-tuning of BART, Zhou et al., 2021) | 83.1 | [Structure-aware Fine-tuning of Sequence-to-sequence Transformers for Transition-based AMR Parsing](https://aclanthology.org/2021.emnlp-main.507/) |
| One SPRING to Rule Them Both: Symmetric AMR Semantic Parsing and Generation without a Complex Pipeline (Bevilacqua et al., 2020) | 83.0 | [One SPRING to Rule Them Both: Symmetric AMR Semantic Parsing and Generation without a Complex Pipeline](https://ojs.aaai.org/index.php/AAAI/article/view/17489) |
| APT (Action-Pointer Transformer, Zhou et al., 2021) | 81.2 | [AMR Parsing with Action-Pointer Transformer](https://aclanthology.org/2021.naacl-main.443/) |


## DRS parsing

Discourse Representation Structures (DRS) are formal meaning representations introduced by [Discourse Representation Theory](https://en.wikipedia.org/wiki/Discourse_representation_theory). DRS parsing is a complex task, comprising other NLP tasks, such as semantic role labeling, word sense disambiguation, co-reference resolution and named entity tagging. Also, DRSs show explicit scope for certain operators, which allows for a more principled and linguistically motivated treatment of negation, modals and quantification, as has been advocated in formal semantics. Moreover, DRSs can be translated to formal logic, which allows for automatic forms of inference by third parties.

### Parallel Meaning Bank (PMB)

The results listed here are from annotated English DRSs released by the [Parallel Meaning Bank](https://pmb.let.rug.nl). The PMB and its annotation process are introduced in [this paper](https://www.aclweb.org/anthology/E17-2039.pdf). A DRS consists of a list of clauses. Each clause contains a number of variables, which are matched during evaluation using the evaluation tool Counter ([paper](https://www.aclweb.org/anthology/L18-1267.pdf), [code](https://github.com/RikVN/DRS_parsing)). Counter calculates an F-score over the matching clauses for each DRS-pair and micro-averages these to calculate a final F-score, similar to the Smatch procedure of AMR parsing.
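
The micro-averaging step can be sketched as below. This is not Counter's actual code: it assumes the variable-matching step has already produced, for each DRS pair, the number of matching clauses together with the number of produced and gold clauses, and the function name and counts are made up for illustration.

```python
def micro_f_score(pairs):
    """Micro-averaged F-score over DRS pairs.

    pairs: iterable of (matched, n_produced, n_gold) clause counts,
    one tuple per produced/gold DRS pair.
    """
    pairs = list(pairs)
    # Micro-averaging pools the counts across all pairs first.
    matched = sum(m for m, _, _ in pairs)
    produced = sum(p for _, p, _ in pairs)
    gold = sum(g for _, _, g in pairs)
    if matched == 0:
        return 0.0
    precision, recall = matched / produced, matched / gold
    return 2 * precision * recall / (precision + recall)


# Two hypothetical DRS pairs: 8 of 10/10 clauses and 3 of 4/6 clauses match.
score = micro_f_score([(8, 10, 10), (3, 4, 6)])
```

Pooling the counts weights longer DRSs more heavily than averaging the per-pair F-scores would: the two pairs above score 0.8 and 0.6 individually (mean 0.7), while the pooled score is 22/30.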

The scores listed here are for PMB releases [2.2.0](https://pmb.let.rug.nl/data.php) and [3.0.0](https://pmb.let.rug.nl/data.php). The development and test sets differ per release, but overlap considerably. The results listed here are on the test set. The data sets can be downloaded from the official [PMB webpage](https://pmb.let.rug.nl/data.php), but note that a more user-friendly format can be obtained by following the steps in the [Neural_DRS repository](https://github.com/RikVN/Neural_DRS).

#### PMB-2.2.0

The gold standard train, dev and test sets contain 4,597, 682 and 650 documents, respectively.

| Model         | Authors | F1  |  Paper / Source |
| ------------- |------- | :-----:| --- |
| Bi-LSTM seq2seq: BERT + characters in 1 encoder | Van Noord et al. (2020) | 88.3 | [Character-level Representations Improve DRS-based Semantic Parsing Even in the Age of BERT](https://www.aclweb.org/anthology/2020.emnlp-main.371.pdf) |
| Transformer seq2seq | Liu et al. (2019) | 87.1 | [Discourse Representation Structure Parsing with Recurrent Neural Networks and the Transformer Model](https://www.aclweb.org/anthology/W19-1203.pdf) |
| Character-level bi-LSTM seq2seq + linguistic features | Van Noord et al. (2019) | 86.8 | [Linguistic Information in Neural Semantic Parsing with Multiple Encoders](https://www.aclweb.org/anthology/W19-0504.pdf)|
| Character-level bi-LSTM seq2seq | Van Noord et al. (2018) | 83.3 | [Exploring Neural Methods for Parsing Discourse Representation Structures](https://www.aclweb.org/anthology/Q18-1043.pdf)|
| Neural graph-based system using DAG-grammars | Fancellu et al. (2019) | 76.4 | [Semantic graph parsing with recurrent neural network DAG grammars](https://www.aclweb.org/anthology/D19-1278.pdf) |
| Transition-based Stack-LSTM | Evang (2019) | 74.4 | [Transition-based DRS Parsing Using Stack-LSTMs](https://www.aclweb.org/anthology/W19-1202.pdf) |

#### PMB-3.0.0

The gold standard train, dev and test sets contain 6,620, 885 and 898 documents, respectively.

| Model         | Authors | F1  |  Paper / Source |
| ------------- |------- | :-----:| --- |
| Bi-LSTM seq2seq: BERT + characters in 1 encoder | Van Noord et al. (2020) | 89.3 | [Character-level Representations Improve DRS-based Semantic Parsing Even in the Age of BERT](https://www.aclweb.org/anthology/2020.emnlp-main.371.pdf) |
| Character-level bi-LSTM seq2seq + linguistic features | Van Noord et al. (2019) | 87.7 | [Linguistic Information in Neural Semantic Parsing with Multiple Encoders](https://www.aclweb.org/anthology/W19-0504.pdf)|
| Character-level bi-LSTM seq2seq | Van Noord et al. (2018) | 84.9 | [Exploring Neural Methods for Parsing Discourse Representation Structures](https://www.aclweb.org/anthology/Q18-1043.pdf)|

### RST-DT

RST-DT [(Carlson et al., 2001)](https://www.aclweb.org/anthology/W01-1605.pdf) contains 385 documents of American English selected from the Penn Treebank (Marcus et al., 1993), annotated in the framework of Rhetorical Structure Theory.
The dataset was officially divided into 347 documents as the training dataset and 38 documents as the test dataset. Note that there is no officially available development dataset.
In the evaluation, micro-averaged F1 scores of unlabeled spans (Span), those of nuclearity labeled spans (Nuclearity), those of rhetorical relation labeled spans (Relation), and those of both nuclearity and rhetorical relation labeled spans (Full) based on RST-Parseval [(Marcu 2000)](https://mitpress.mit.edu/books/theory-and-practice-discourse-parsing-and-summarization) are used. An implementation of the standard evaluation metrics is [here](http://alt.qcri.org/tools/discourse-eval/).

| Model    | Span     | Nuclearity | Relation | Full     | Paper / Source | Code |
| -------- | -------- | -------- | -------- | -------- | -------------- | ---- |
| Span-based Top-down Parser (ensemble) (Kobayashi et al., 2020)  | 87.0 | 74.6 | 60.0 | -- | [Top-Down RST Parsing Utilizing Granularity Levels in Documents](https://doi.org/10.1609/aaai.v34i05.6321) | [Official](https://github.com/nttcslab-nlp/Top-Down-RST-Parser) |
| Two-stage Parsing (Wang et al., 2017)  | 86.0 | 72.4 | 59.7 | -- | [A Two-Stage Parsing Method for Text-Level Discourse Analysis](https://www.aclweb.org/anthology/P17-2029.pdf) | [Official](https://github.com/yizhongw/StageDP)    |
| Bottom-up Linear-chain CRF-based Parser (Feng and Hirst, 2014) | 85.7 | 71.0 | 58.2 | -- | [A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing](https://www.aclweb.org/anthology/P14-1048.pdf) | [Official](http://www.cs.toronto.edu/~weifeng/software.html) |
| Transition-based Parser with Implicit Syntax Features (Yu et al., 2018)  | 85.5 | 73.1 | 60.2 | 59.9 | [Transition-based Neural RST Parsing with Implicit Syntax Features](https://www.aclweb.org/anthology/C18-1047.pdf) | [Official](https://github.com/fajri91/NeuralRST) |
| Two-stage Discourse Parser with a Sliding Window (Joty et al., 2015) | 83.84 | 68.90 | 55.87 | -- | [CODRA: A Novel Discriminative Framework for Rhetorical Analysis](https://www.aclweb.org/anthology/J15-3002.pdf) | [Official](http://alt.qcri.org/tools/discourse-parser/) |
|  HILDA Parser (Hernault et al., 2010) | 83.0 | 68.4 | 55.3 | 54.8 | [HILDA: a discourse parser using support vector machine classification](http://journals.linguisticsociety.org/elanguage/dad/article/download/591/591-2300-1-PB.pdf) |  |
| Greedy Bottom-up Parser with Syntactic Features (Surdeanu et al., 2015) | 82.6* | 67.1* | 55.4* | 54.9* | [Two Practical Rhetorical Structure Theory Parsers](https://www.aclweb.org/anthology/N15-3001.pdf) | [Official](https://github.com/clulab/processors) |
| Re-implemented HILDA RST parser (Hayashi et al., 2016)| 82.6* | 66.6* | 54.6* | 54.3* |[Empirical comparison of dependency conversions for RST discourse trees](https://www.aclweb.org/anthology/W16-3616.pdf) | -- |
| Discourse Parser with Hierarchical Attention (Li et al., 2016) | 82.2* | 66.5* | 51.4* | 50.6* | [Discourse Parsing with Attention-based Hierarchical Neural Networks](https://www.aclweb.org/anthology/D16-1035.pdf) | -- |
| Discourse Parsing from Linear Projection (Ji et al., 2014) | 82.0* | 68.2* | 57.8* | 57.6* | [Representation Learning for Text-level Discourse Parsing](https://www.aclweb.org/anthology/P14-1002.pdf) | [Official](https://github.com/jiyfeng/DPLP) |
| Transition-Based Parser Trained on Cross-Lingual Corpus (Braud et al., 2017) | 81.3* | 68.1* | 56.3* | 56.0* | [Cross-lingual RST Discourse Parsing](https://www.aclweb.org/anthology/E17-1028.pdf) | [Official](https://bitbucket.org/) |
|  LSTM Sequential Discourse Parser (Braud et al., 2016) | 79.7* | 63.6* | 47.7* | 47.5* | [Multi-view and multi-task training of RST discourse parsers](https://www.aclweb.org/anthology/C16-1179.pdf) | [Official](http://bitbucket.org/chloebt/discourse) |

*: The score is reported in [Morey et al., 2017](https://www.aclweb.org/anthology/D17-1136.pdf).

## UCCA parsing

UCCA ([Abend and Rappoport, 2013](https://www.aclweb.org/anthology/P13-1023.pdf))
is a semantic representation whose main design principles
are ease of annotation, cross-linguistic applicability, and a modular architecture. UCCA represents
the semantics of linguistic utterances as directed acyclic graphs (DAGs), where terminal (childless)
nodes correspond to the text tokens, and non-terminal nodes to semantic units that participate in
some super-ordinate relation. Edges are labeled,
indicating the role of a child in the relation the parent represents.
UCCA's foundational layer mostly covers predicate-argument structure,
semantic heads and inter-Scene relations.
UCCA distinguishes primary edges, corresponding to explicit relations, from remote edges
that allow for a unit to participate in several super-ordinate relations.
Primary edges form a tree in each layer, whereas remote edges enable reentrancy, forming a DAG.

Evaluation is done by labeled F1 on the graph edges, matched by child terminal yield.
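
A minimal sketch of this matching criterion follows. It is not the official evaluation script: the graph encoding (a dict from each non-terminal to its outgoing `(label, child)` edges, with terminals given as token positions), the function names, and the toy graphs are assumptions for illustration, and the sketch does not separate primary from remote edges.

```python
def terminal_yield(node, children):
    """Set of terminal positions reachable from a node."""
    if node not in children:  # terminals have no outgoing edges
        return frozenset([node])
    result = set()
    for _, child in children[node]:
        result |= terminal_yield(child, children)
    return frozenset(result)


def edge_keys(children):
    """Identify each edge by its label and its child's terminal yield."""
    return {(label, terminal_yield(child, children))
            for edges in children.values()
            for label, child in edges}


def labeled_f1(pred, gold):
    pred_keys, gold_keys = edge_keys(pred), edge_keys(gold)
    matched = len(pred_keys & gold_keys)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_keys)
    recall = matched / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


# Toy graphs over three tokens at positions 0, 1, 2 (hypothetical).
gold = {"root": [("H", "scene")],
        "scene": [("A", 0), ("P", 1), ("A", 2)]}
# The prediction names its internal node differently and mislabels one edge.
pred = {"root": [("H", "s1")],
        "s1": [("A", 0), ("P", 1), ("D", 2)]}
score = labeled_f1(pred, gold)
```

Because edges are compared through the yield of their child rather than through node identifiers, the differing non-terminal names in the two graphs do not affect the score; only the mislabeled edge over token 2 is counted as an error.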

### [SemEval 2019 Task 1](https://www.aclweb.org/anthology/S19-2001.pdf)

Open and closed tracks on English, French and German [UCCA corpora](https://github.com/UniversalConceptualCognitiveAnnotation) from Wikipedia and *Twenty Thousand Leagues Under the Sea*.
Results for the English open track data are given here, with 5,141 training sentences.

| Model | English-Wiki (open) F1 | English-20K (open) F1 | Paper / Source | Code |
| ------| :--------------------: | :-------------------: | -------------- | ---- |
| Constituent Tree Parsing + BERT (Jiang et al., 2019) | 80.5 | 76.7 | [HLT@SUDA at SemEval-2019 Task 1: UCCA Graph Parsing as Constituent Tree Parsing](https://www.aclweb.org/anthology/S19-2002/) | https://github.com/SUDA-LA/ucca-parser |
| Neural Transducer (Zhang et al., 2019) | 76.6 | -- | [Broad-Coverage Semantic Parsing as Transduction](https://www.aclweb.org/anthology/D19-1392.pdf) | https://github.com/sheng-z/stog |
| Transition-based + MTL (Hershcovich et al., 2018) | 73.5 | 68.4 | [Multitask Parsing Across Semantic Representations](https://www.aclweb.org/anthology/P18-1035.pdf) | https://github.com/danielhers/tupa |
| Transition-based (Hershcovich et al., 2017) | 72.8 | 67.2 | [A Transition-Based Directed Acyclic Graph Parser for UCCA](https://www.aclweb.org/anthology/P17-1104.pdf) | https://github.com/danielhers/tupa |

### [CoNLL 2019](https://www.aclweb.org/anthology/K19-2001.pdf)

The CoNLL 2019 shared task included parsing to AMR, UCCA, DM, PSD, and EDS.
The [UCCA training data](http://svn.nlpl.eu/mrp/2019/public/ucca.tgz) is freely available.

UCCA evaluation is done both by UCCA F1 (as in SemEval 2019) and by the MRP metric, which is similar to smatch.
The training data contains 6,572 sentences from web reviews and Wikipedia.
There are two evaluation sets: one with 1,131 sentences from the same domains (Full), and one with 87 sentences from *The Little Prince* (LPP). Note that due to an error, 535 of the 1,131 Full evaluation sentences were included in the training data, so the Full evaluation scores are an overestimate. The LPP scores are unaffected by this.

| Model | Full UCCA F1 | Full MRP F1 | LPP UCCA F1 | LPP MRP F1 | Paper / Source | Code |
| ------| :----------: | :---------: | :---------: | :--------: | -------------- | ---- |
| Transition-based + BERT + Efficient Training + Effective Encoding (Che et al., 2019) | 66.7 | 81.7 | 64.4 | 82.6 | [HIT-SCIR at MRP 2019: A Unified Pipeline for Meaning Representation Parsing via Efficient Training and Effective Encoding](https://www.aclweb.org/anthology/K19-2007.pdf) | https://github.com/DreamerDeo/HIT-SCIR-CoNLL2019 |
| Transition-based + BERT (Hershcovich and Arviv, 2019) | 57.4 | 77.7 |  65.9 | 82.2 | [TUPA at MRP 2019: A Multi-Task Baseline System](https://www.aclweb.org/anthology/K19-2002.pdf) | https://github.com/danielhers/tupa/tree/mrp |
| Transition-based + BERT + MTL (Hershcovich and Arviv, 2019) | 35.6 | 64.1 |  50.3 | 73.1 | [TUPA at MRP 2019: A Multi-Task Baseline System](https://www.aclweb.org/anthology/K19-2002.pdf) | https://github.com/danielhers/tupa/tree/mrp |

## Semantic Dependency Parsing

Broad-coverage semantic dependency parsing (SDP) [(Oepen et al., 2014)](http://aclweb.org/anthology/S14-2008) is defined as the task of recovering sentence-internal predicate–argument relationships for all content words, i.e. the semantic structure constituting the relational core of sentence meaning. Target representations are thus semantic dependency graphs, as illustrated by the following (Wall Street Journal) sentence:

```
A similar technique is almost impossible to apply to other crops, such as cotton, soybeans, and rice.
```
Here, ‘technique’, for example, is the argument of at least the determiner (as the quantificational locus), the intersective modifier ‘similar’, and the predicate ‘apply’. Conversely, the predicative copula, infinitival ‘to’, and the vacuous preposition marking the deep object of ‘apply’ arguably have no semantic contribution of their own. For general background on the 2014 variant and an overview of participating systems (and results), please see [Oepen et al. (2014)](http://aclweb.org/anthology/S14-2008).

### [SemEval 2015 Task 18](http://alt.qcri.org/semeval2015/task18/)

This task is a re-run, with some extensions, of [Task 8 at SemEval 2014](http://alt.qcri.org/semeval2014/task8/). The task has three distinct target representations, dubbed DM, PAS, and PSD (renamed from what was PCEDT at SemEval 2014), representing different traditions of semantic annotation. More detail on the linguistic ‘pedigree’ of these formats is available in the [summary](http://alt.qcri.org/semeval2015/task18/index.php?id=representations) of target representations, and there is also an [on-line search](http://wesearch.delph-in.net/sdp/) interface available to interactively explore these representations (like the initial release of the training data, this interface in early August 2014 still lacks semantic dependency graphs for languages other than English).

Each dataset has an in-domain (ID) and an out-of-domain (OOD) test set. The evaluation metric is the labeled F1 score.

| Model           | DM ID | DM OOD | PAS ID | PAS OOD | PSD ID | PSD OOD | Paper / Source | Code |
| --------------- | :-----: |  :-----:|:-----: |  :-----:|:-----: |  :-----:| --------------- | ---- |
| ACE + fine-tune (Wang et al., 2020) | 95.6 | 92.6 | 95.8 | 94.6 | 83.8 | 83.4| [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
|Transition-based Pointer Network+Char+Lemma+BERT (Fernández-González & Gómez-Rodríguez, 2020)|94.4|91.0|95.1|93.4|82.6|82.0|[Transition-based Semantic Dependency Parsing with Pointer Networks](https://www.aclweb.org/anthology/2020.acl-main.629/)|[Official](https://github.com/danifg/SemanticPointer)|
|Second-Order+Biaffine+Char+Lemma (Wang et al., 2019)|94.0|89.7|94.1|91.3|81.4|79.6|[Second-Order Semantic Dependency Parsing with End-to-End Neural Networks](https://www.aclweb.org/anthology/P19-1454/)|[Official](https://github.com/wangxinyu0922/Second_Order_SDP)|
|Biaffine+Char+Lemma (Dozat and Manning, 2018)|93.7|88.9|93.9|90.6|81.0|79.4|[Simpler but More Accurate Semantic Dependency Parsing](https://www.aclweb.org/anthology/P18-2077/)|[Official](https://github.com/tdozat/Parser-v3)|


## SQL parsing

### ATIS

5,280 user questions for a flight-booking task:

- Collected and manually annotated with SQL by [Dahl et al. (1994)](http://dl.acm.org/citation.cfm?id=1075823)
- Modified by [Iyer et al. (2017)](http://www.aclweb.org/anthology/P17-1089) to reduce nesting
- Bugfixes and changes to a canonical style by [Finegan-Dollak et al. (2018)](http://arxiv.org/abs/1806.09029)


Example:

| Question | SQL query |
| ------------- |  --- |
| what flights from any city land at MKE | `SELECT DISTINCT FLIGHTalias0.FLIGHT_ID FROM AIRPORT AS AIRPORTalias0 , AIRPORT_SERVICE AS AIRPORT_SERVICEalias0 , CITY AS CITYalias0 , FLIGHT AS FLIGHTalias0 WHERE AIRPORTalias0.AIRPORT_CODE = "MKE" AND CITYalias0.CITY_CODE = AIRPORT_SERVICEalias0.CITY_CODE AND FLIGHTalias0.FROM_AIRPORT = AIRPORT_SERVICEalias0.AIRPORT_CODE AND FLIGHTalias0.TO_AIRPORT = AIRPORTalias0.AIRPORT_CODE ;` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 51 | 32 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Iyer et al. (2017) | 45 | 17 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |
| Template Baseline (Finegan-Dollak et al., 2018) | 45 | 0 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |

### Advising

4,570 user questions about university course advising, with manually annotated SQL ([Finegan-Dollak et al., 2018](http://arxiv.org/abs/1806.09029)).

Example:

| Question | SQL query |
| ------------- |  --- |
| Can undergrads take 550 ? | `SELECT DISTINCT COURSEalias0.ADVISORY_REQUIREMENT , COURSEalias0.ENFORCED_REQUIREMENT , COURSEalias0.NAME FROM COURSE AS COURSEalias0 WHERE COURSEalias0.DEPARTMENT = "department0" AND COURSEalias0.NUMBER = 550 ;` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Template Baseline (Finegan-Dollak et al., 2018) | 80 | 0 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 70 | 0  | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Iyer et al. (2017) | 41 | 1 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |

### GeoQuery

877 user questions about US geography:

- Collected and manually annotated with Prolog by [Zelle and Mooney (1996)](http://dl.acm.org/citation.cfm?id=1864519.1864543)
- Most questions were converted to SQL by [Popescu et al. (2003)](http://doi.acm.org/10.1145/604045.604070)
- The remaining questions were converted to SQL by [Giordani and Moschitti (2012)](https://doi.org/10.1007/978-3-642-45260-4_5), and independently by [Iyer et al. (2017)](http://www.aclweb.org/anthology/P17-1089)
- Bugfixes and changes to a canonical style by [Finegan-Dollak et al. (2018)](http://arxiv.org/abs/1806.09029)

Example:

| Question | SQL query |
| ------------- |  --- |
| what is the biggest city in arizona | `SELECT CITYalias0.CITY_NAME FROM CITY AS CITYalias0 WHERE CITYalias0.POPULATION = ( SELECT MAX( CITYalias1.POPULATION ) FROM CITY AS CITYalias1 WHERE CITYalias1.STATE_NAME = "arizona" ) AND CITYalias0.STATE_NAME = "arizona"` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 71 | 20 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Iyer et al. (2017) | 66 | 40 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |
| Template Baseline (Finegan-Dollak et al., 2018) | 66 | 0  | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |

### Scholar

817 user questions about academic publications, with automatically generated SQL that was checked by asking the user if the output was correct.

- Collected by [Iyer et al. (2017)](http://www.aclweb.org/anthology/P17-1089)
- Bugfixes and changes to a canonical style by [Finegan-Dollak et al. (2018)](http://arxiv.org/abs/1806.09029)

Example:

| Question | SQL query |
| ------------- |  --- |
| What papers has sharon goldwater written ? | `SELECT DISTINCT WRITESalias0.PAPERID FROM AUTHOR AS AUTHORalias0 , WRITES AS WRITESalias0 WHERE AUTHORalias0.AUTHORNAME = "sharon goldwater" AND WRITESalias0.AUTHORID = AUTHORalias0.AUTHORID ;` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 59 | 5 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Template Baseline (Finegan-Dollak et al., 2018) | 52 | 0   | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Iyer et al., (2017) | 44 | 3 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |

### Spider

Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL
dataset. It consists of 10,181 questions and 5,693 unique complex SQL queries on
200 databases with multiple tables covering 138 different domains. In Spider 1.0,
different complex SQL queries and databases appear in train and test sets.

The Spider dataset and leaderboard can be accessed [here](https://yale-lily.github.io/spider).

### WikiSQL

The [WikiSQL dataset](https://arxiv.org/abs/1709.00103) consists of 87,673
examples of questions, SQL queries, and database tables built from 26,521 tables.
Train/dev/test splits are provided so that each table is only in one split.
Models are evaluated based on execution accuracy, i.e. whether executing the predicted query yields the same result as executing the gold query.

Example:

| Question | SQL query |
| ------------- |  --- |
| How many engine types did Val Musetti use? | `SELECT COUNT Engine WHERE Driver = Val Musetti` |

The WikiSQL dataset and leaderboard can be accessed [here](https://github.com/salesforce/WikiSQL).
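Because WikiSQL scores a prediction by its execution result rather than by its query string, two differently written queries can both count as correct. The following is a minimal sketch of that idea using Python's built-in `sqlite3` module. The `races` table, its columns, and the helper names `execution_match` and `execution_accuracy` are hypothetical and chosen for illustration only; this is not the official WikiSQL evaluation script.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries run and return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to execute counts as incorrect
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


def execution_accuracy(conn, query_pairs):
    """Fraction of (predicted, gold) query pairs whose execution results match."""
    matches = sum(execution_match(conn, predicted, gold) for predicted, gold in query_pairs)
    return matches / len(query_pairs)


# Hypothetical toy table; the real WikiSQL tables are not reproduced here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE races (driver TEXT, engine TEXT)")
conn.executemany(
    "INSERT INTO races VALUES (?, ?)",
    [("Val Musetti", "Ford"), ("Val Musetti", "Cosworth"), ("Other Driver", "Ford")],
)

gold = "SELECT COUNT(engine) FROM races WHERE driver = 'Val Musetti'"
query_pairs = [
    ("SELECT COUNT(*) FROM races WHERE driver = 'Val Musetti'", gold),  # different SQL, same result
    ("SELECT COUNT(engine) FROM races", gold),  # runs, but counts all three rows
    ("SELECT COUNT(engine FROM races", gold),  # syntax error, cannot be executed
]
print(execution_accuracy(conn, query_pairs))
```

Only the first predicted query is counted as correct here, even though it is not string-identical to the gold query.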

### Smaller Datasets

Restaurants - 378 questions about restaurants, their cuisine and locations, collected by [Tang and Mooney (2000)](http://www.aclweb.org/anthology/W/W00/W00-1317.pdf), converted to SQL by [Popescu et al., (2003)](http://doi.acm.org/10.1145/604045.604070) and [Giordani and Moschitti (2012)](https://doi.org/10.1007/978-3-642-45260-4_5), improved and converted to canonical style by [Finegan-Dollak et al., (2018)](http://arxiv.org/abs/1806.09029).

Example:

| Question | SQL query |
| ------------- |  --- |
| where is a restaurant in alameda ? | `SELECT LOCATIONalias0.HOUSE_NUMBER , RESTAURANTalias0.NAME FROM LOCATION AS LOCATIONalias0 , RESTAURANT AS RESTAURANTalias0 WHERE LOCATIONalias0.CITY_NAME = "alameda" AND RESTAURANTalias0.ID = LOCATIONalias0.RESTAURANT_ID ;` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Iyer et al., (2017) | 100 | 8 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 100 | 4 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Template Baseline (Finegan-Dollak et al., 2018) | 95 | 0  | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |

Academic - 196 questions about publications generated by enumerating all of the different queries possible with the Microsoft Academic Search interface, then writing questions for each query [Li and Jagadish (2014)](http://dx.doi.org/10.14778/2735461.2735468). Improved and converted to a canonical style by [Finegan-Dollak et al., (2018)](http://arxiv.org/abs/1806.09029).

Example:

| Question | SQL query |
| ------------- |  --- |
| return me the homepage of PVLDB | `SELECT JOURNALalias0.HOMEPAGE FROM JOURNAL AS JOURNALalias0 WHERE JOURNALalias0.NAME = "PVLDB" ;` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 81 | 74 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Iyer et al., (2017) | 76 | 70 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |
| Template Baseline (Finegan-Dollak et al., 2018) | 0 | 0 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |

Yelp - 128 user questions about the Yelp website [Yaghmazadeh et al., 2017](http://doi.org/10.1145/3133887). Improved and converted to a canonical style by [Finegan-Dollak et al., (2018)](http://arxiv.org/abs/1806.09029).

Example:

| Question | SQL query |
| ------------- |  --- |
| List all businesses with rating 3.5 | `SELECT BUSINESSalias0.NAME FROM BUSINESS AS BUSINESSalias0 WHERE BUSINESSalias0.RATING = 3.5 ;` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 12 | 4 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Iyer et al., (2017) | 6 | 6 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |
| Template Baseline (Finegan-Dollak et al., 2018) | 1 | 0 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |

IMDB - 131 user questions about the Internet Movie Database [Yaghmazadeh et al., 2017](http://doi.org/10.1145/3133887). Improved and converted to a canonical style by [Finegan-Dollak et al., (2018)](http://arxiv.org/abs/1806.09029).

Example:

| Question | SQL query |
| ------------- |  --- |
| What year was the movie " The Imitation Game " produced | `SELECT MOVIEalias0.RELEASE_YEAR FROM MOVIE AS MOVIEalias0 WHERE MOVIEalias0.TITLE = "The Imitation Game" ;` |

| Model           | Question Split | Query Split |  Paper / Source | Code |
| --------------- | ----- |  :-----:| --------------- | ---- |
| Seq2Seq with copying (Finegan-Dollak et al., 2018) | 26 | 9 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |
| Iyer et al., (2017) | 10 | 4 | [Learning a neural semantic parser from user feedback](http://www.aclweb.org/anthology/P17-1089) | [System](https://github.com/sriniiyer/nl2sql) |
| Template Baseline (Finegan-Dollak et al., 2018) | 0 | 0 | [Improving Text-to-SQL Evaluation Methodology](http://arxiv.org/abs/1806.09029) | [Data and System](https://github.com/jkkummerfeld/text2sql-data) |

[Go back to the README](../README.md)


================================================
FILE: english/semantic_role_labeling.md
================================================
# Semantic role labeling

Semantic role labeling aims to model the predicate-argument structure of a sentence
and is often described as answering "Who did what to whom". BIO notation is typically
used for semantic role labeling.

Example:

| Housing | starts | are | expected | to | quicken | a | bit | from | August’s | pace | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| B-ARG1 | I-ARG1 | O |  O  |  O  |   V  | B-ARG2 | I-ARG2 | B-ARG3 | I-ARG3 | I-ARG3 |    
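To make the BIO notation concrete, the sketch below groups the tags from the example above into labeled argument spans. The function name `bio_to_spans` is hypothetical. Two choices here are assumptions rather than part of any official scorer: an `I-` tag without a matching `B-` tag opens a new span, and an unprefixed tag such as the predicate marker `V` is treated as a single-token span.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (label, start, end) spans with an exclusive end index."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        if tag.startswith("I-") and tag[2:] == label:
            continue  # the current span keeps growing
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i  # lenient: a stray I- tag also opens a span
        elif tag != "O":
            spans.append((tag, i, i + 1))  # unprefixed tag, e.g. the predicate marker V
    return spans


tokens = ["Housing", "starts", "are", "expected", "to", "quicken",
          "a", "bit", "from", "August's", "pace"]
tags = ["B-ARG1", "I-ARG1", "O", "O", "O", "V",
        "B-ARG2", "I-ARG2", "B-ARG3", "I-ARG3", "I-ARG3"]

for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Working with spans rather than per-token tags makes it straightforward to compare predicted and gold arguments as whole units.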

### OntoNotes

Models are typically evaluated on the [OntoNotes benchmark](http://www.aclweb.org/anthology/W13-3516) based on F1.

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Tian et al., (2022) + XLNet | 87.67 | [Syntax-driven Approach for Semantic Role Labeling](https://aclanthology.org/2022.lrec-1.772/) | [Official](https://github.com/synlp/SRL-MM) |
| He et al., (2018) + ELMo | 85.5 | [Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling](http://aclweb.org/anthology/P18-2058) |
| (He et al., 2017) + ELMo (Peters et al., 2018) | 84.6 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) |
| Tan et al. (2018) | 82.7 | [Deep Semantic Role Labeling with Self-Attention](https://arxiv.org/abs/1712.01586) |
| He et al. (2018) | 82.1 | [Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling](http://aclweb.org/anthology/P18-2058) | 
| He et al. (2017) | 81.7 | [Deep Semantic Role Labeling: What Works and What’s Next](http://aclweb.org/anthology/P17-1044) |

### CoNLL-2005

Models are typically evaluated on the [CoNLL-2005 dataset](https://www.cs.upc.edu/~srlconll/soft.html) based on F1.

| Model           | F1 |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Tian et al., (2022) + XLNet | 89.80 | [Syntax-driven Approach for Semantic Role Labeling](https://aclanthology.org/2022.lrec-1.772/) | [Official](https://github.com/synlp/SRL-MM) |
| He et al., (2018) + ELMo | 87.4 | [Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling](http://aclweb.org/anthology/P18-2058) |
| He et al. (2018) | 83.9 | [Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling](http://aclweb.org/anthology/P18-2058) |
| Tan et al. (2018) | 82.9 | [Deep Semantic Role Labeling with Self-Attention](https://arxiv.org/abs/1712.01586) | 
| He et al. (2017) | 81.5 | [Deep Semantic Role Labeling: What Works and What’s Next](http://aclweb.org/anthology/P17-1044) |

[Go back to the README](../README.md)


================================================
FILE: english/semantic_textual_similarity.md
================================================
# Semantic textual similarity

Semantic textual similarity deals with determining how similar two pieces of text are.
This can take the form of assigning a score from 1 to 5. Related tasks are paraphrase or duplicate identification.

### SentEval

[SentEval](https://arxiv.org/abs/1803.05449) is a toolkit for evaluating sentence
representations. It covers 17 downstream tasks, including common semantic textual similarity
tasks. The semantic textual similarity (STS) benchmark tasks from 2012-2016 (STS12, STS13, STS14, STS15, STS16, STS-B) measure the relatedness
of two sentences based on the cosine similarity of the two representations. The evaluation criterion is Pearson correlation.
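As a rough illustration of this protocol, the sketch below scores each sentence pair by the cosine similarity of its two embeddings and then computes the Pearson correlation between those scores and the gold relatedness scores. The two-dimensional vectors and gold scores are made-up values, and the helper names are hypothetical; SentEval's own implementation is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up sentence embeddings for three sentence pairs.
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold_scores = [5.0, 3.0, 1.0]  # made-up human relatedness judgments

predicted_scores = [cosine_similarity(u, v) for u, v in embedding_pairs]
print(pearson(predicted_scores, gold_scores))
```

Pearson correlation is invariant to shifting and rescaling, so the cosine scores do not need to be mapped onto the range of the gold scores before evaluation.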

The SICK relatedness (SICK-R) task trains a linear model to output a score from 1 to 5 indicating the relatedness of two sentences. The
same dataset can also be treated as a three-class classification problem (SICK-E) using the entailment labels (classes are 'entailment', 'contradiction', and 'neutral').
The evaluation metric is Pearson correlation for SICK-R and classification accuracy for SICK-E.

The Microsoft Research Paraphrase Corpus (MRPC) is a paraphrase identification dataset, where systems
aim to identify whether two sentences are paraphrases of each other. The evaluation metrics are classification accuracy and F1.
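For a binary task such as MRPC, accuracy and F1 can be computed from gold and predicted labels as sketched below. This assumes that F1 is measured on the positive (paraphrase) class, encoded here as `1`; the function name and the toy labels are hypothetical.

```python
def accuracy_and_f1(gold, predicted, positive=1):
    """Return (accuracy, F1 on the positive class) for two equal-length label lists."""
    pairs = list(zip(gold, predicted))
    correct = sum(g == p for g, p in pairs)
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    false_pos = sum(g != positive and p == positive for g, p in pairs)
    false_neg = sum(g == positive and p != positive for g, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(pairs), f1


gold_labels = [1, 1, 1, 0, 0]       # 1 = paraphrase, 0 = not a paraphrase
predicted_labels = [1, 1, 0, 1, 0]  # one missed paraphrase, one false alarm

accuracy, f1 = accuracy_and_f1(gold_labels, predicted_labels)
print(accuracy, f1)
```

Reporting both numbers is useful when the classes are imbalanced, because accuracy alone can look strong for a model that mostly predicts the majority class.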

The data can be downloaded from [here](https://github.com/facebookresearch/SentEval).

| Model           | MRPC | SICK-R | SICK-E | STS | Paper / Source | Code |
| ------------- | :-----:| :-----:| :-----:| :-----:| --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | 93.0/90.7 | - | - | 91.6/91.1* | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| MT-DNN-ensemble (Liu et al., 2019) | 92.7/90.3 | - | - | 91.1/90.7* | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) |
| Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 91.5/88.5 | - | - | 90.1/89.7* | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) |
| GenSen (Subramanian et al., 2018) | 78.6/84.4 | 0.888 | 87.8 | 78.9/78.6 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) | [Official](https://github.com/Maluuba/gensen) |
| InferSent (Conneau et al., 2017) | 76.2/83.1 | 0.884 | 86.3 | 75.8/75.5 | [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364) | [Official](https://github.com/facebookresearch/InferSent) |
| TF-KLD (Ji and Eisenstein, 2013) | 80.4/85.9 | - | - | - | [Discriminative Improvements to Distributional Sentence Similarity](http://www.aclweb.org/anthology/D/D13/D13-1090.pdf) |  |

\* only evaluated on STS-B

## Paraphrase identification

### Quora Question Pairs

The [Quora Question Pairs dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs)
consists of over 400,000 pairs of questions on Quora. Systems must identify whether one question is a
duplicate of the other. Models are evaluated based on accuracy and F1.

| Model           | F1 | Accuracy  |  Paper / Source | Code |
| ------------- | :-----: | :-----:| --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | 74.2 | 90.3 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| MT-DNN-ensemble (Liu et al., 2019) | 73.7 | 89.9 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) |
| Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 73.1 | 89.9 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) |
| MwAN (Tan et al., 2018) | | 89.12 | [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/proceedings/2018/0613.pdf) | |
| DIIN (Gong et al., 2018) | | 89.06 | [Natural Language Inference Over Interaction Space](https://arxiv.org/pdf/1709.04348.pdf) | [Official](https://github.com/YichenGong/Densely-Interactive-Inference-Network) |
| pt-DecAtt (Char) (Tomar et al., 2017) | | 88.40 | [Neural Paraphrase Identification of Questions with Noisy Pretraining](https://arxiv.org/abs/1704.04565) | |
| BiMPM (Wang et al., 2017) | | 88.17 | [Bilateral Multi-Perspective Matching for Natural Language Sentences](https://arxiv.org/abs/1702.03814) | [Official](https://github.com/zhiguowang/BiMPM) |
| GenSen (Subramanian et al., 2018) | | 87.01 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) | [Official](https://github.com/Maluuba/gensen) |

[Go back to the README](../README.md)


================================================
FILE: english/sentiment_analysis.md
================================================
# Sentiment analysis

Sentiment analysis is the task of classifying the polarity of a given text.

### IMDb

The [IMDb dataset](https://ai.stanford.edu/~ang/papers/acl11-WordVectorsSentimentAnalysis.pdf) is a binary
sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or
negative. The dataset contains an equal number of positive and negative reviews. Only highly polarized reviews are considered.
A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are 
included per movie. Models are evaluated based on accuracy.
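The selection rules above can be sketched as a small filter. The `(movie_id, score, text)` tuple layout and the function names are hypothetical, and keeping the first reviews encountered for each movie is an assumption, since the text above does not say how the 30 reviews per movie are chosen.

```python
from collections import defaultdict


def label_from_score(score):
    """Map a rating out of 10 to a sentiment label, or None if the review is excluded."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # ratings of 5 and 6 are not polarized enough


def select_reviews(reviews, max_per_movie=30):
    """Keep polarized reviews, with at most max_per_movie reviews for each movie."""
    kept_per_movie = defaultdict(int)
    selected = []
    for movie_id, score, text in reviews:
        label = label_from_score(score)
        if label is None or kept_per_movie[movie_id] >= max_per_movie:
            continue
        kept_per_movie[movie_id] += 1
        selected.append((text, label))
    return selected


# Made-up reviews as (movie_id, score, text) tuples.
reviews = [
    ("m1", 9, "great"),
    ("m1", 5, "meh"),
    ("m1", 2, "awful"),
    ("m2", 7, "good"),
]
print(select_reviews(reviews))
```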

| Model           | Accuracy  |  Paper / Source |
| ------------- | :-----:| --- |
| XLNet (Yang et al., 2019) | 96.21 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) |
| BERT_large+ITPT (Sun et al., 2019) | 95.79 | [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf) |
| BERT_base+ITPT (Sun et al., 2019) | 95.63 | [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf) |
| ULMFiT (Howard and Ruder, 2018) | 95.4 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) |
| Block-sparse LSTM (Gray et al., 2017) | 94.99 | [GPU Kernels for Block-Sparse Weights](https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf) |
| oh-LSTM (Johnson and Zhang, 2016) | 94.1 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) |
| Virtual adversarial training (Miyato et al., 2016) | 94.1 | [Adversarial Training Methods for Semi-Supervised Text Classification](https://arxiv.org/abs/1605.07725) |
| BCN+Char+CoVe (McCann et al., 2017) | 91.8 | [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) |

### SST

The [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) 
contains 215,154 phrases with fine-grained sentiment labels in the parse trees
of 11,855 sentences in movie reviews. Models are evaluated either on fine-grained
(five-way) or binary classification based on accuracy.

Fine-grained classification (SST-5, 94.2k examples):

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----:| --- |
| BCN+Suffix BiLSTM-Tied+CoVe (Brahma, 2018) | 56.2 | [Improved Sentence Modeling using Suffix Bidirectional LSTM](https://arxiv.org/pdf/1805.07340v2.pdf) |
| BCN+ELMo (Peters et al., 2018) | 54.7 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) |
| BCN+Char+CoVe (McCann et al., 2017) | 53.7 | [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) |

Binary classification (SST-2, 56.4k examples):

| Model           | Accuracy  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| XLNet-Large (ensemble) (Yang et al., 2019) | 96.8 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| MT-DNN-ensemble (Liu et al., 2019) | 96.5 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) |
| Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 96.2 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) |
| MT-DNN (Liu et al., 2019) | 95.6 | [Multi-Task Deep Neural Networks for Natural Language Understanding](https://arxiv.org/abs/1901.11504) | [Official](https://github.com/namisan/mt-dnn/) |
| Bidirectional Encoder Representations from Transformers (Devlin et al., 2018) | 94.9 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | [Official](https://github.com/google-research/bert) |
| Block-sparse LSTM (Gray et al., 2017) | 93.2 | [GPU Kernels for Block-Sparse Weights](https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf) | [Official](https://github.com/openai/blocksparse) |
| bmLSTM (Radford et al., 2017) | 91.8 | [Learning to Generate Reviews and Discovering Sentiment](https://arxiv.org/abs/1704.01444) | [Unofficial](https://github.com/NVIDIA/sentiment-discovery) |
| Single layer bilstm distilled from BERT (Tang et al., 2019)| 90.7 |[Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)|  |
| BCN+Char+CoVe (McCann et al., 2017) | 90.3 | [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) | [Official](https://github.com/salesforce/cove) |
| Neural Semantic Encoder (Munkhdalai and Yu, 2017) | 89.7 | [Neural Semantic Encoders](http://www.aclweb.org/anthology/E17-1038) | |
| BLSTM-2DCNN (Zhou et al., 2017) | 89.5 | [Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling](http://www.aclweb.org/anthology/C16-1329) | |

### Yelp

The [Yelp Review dataset](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf)
consists of more than 500,000 Yelp reviews. There is both a binary and a fine-grained (five-class)
version of the dataset. Models are evaluated based on error (1 - accuracy; lower is better).

Fine-grained classification: 

| Model           | Error  |  Paper / Source |
| ------------- | :-----:| --- |
| XLNet (Yang et al., 2019) | 27.80 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) |
| BERT_large+ITPT (Sun et al., 2019) | 28.62 | [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf) |
| BERT_base+ITPT (Sun et al., 2019) | 29.42 | [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf) |
| ULMFiT (Howard and Ruder, 2018) | 29.98 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) |
| DPCNN (Johnson and Zhang, 2017) | 30.58 | [Deep Pyramid Convolutional Neural Networks for Text Categorization](http://aclweb.org/anthology/P17-1052) |
| CNN (Johnson and Zhang, 2016) | 32.39 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) |
| Char-level CNN (Zhang et al., 2015) | 37.95 | [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) |

Binary classification:

| Model           | Error |  Paper / Source |
| ------------- | :-----:| --- |
| XLNet (Yang et al., 2019) | 1.55 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) |
| BERT_large+ITPT (Sun et al., 2019) | 1.81 | [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf) |
| BERT_base+ITPT (Sun et al., 2019) | 1.92 | [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf) |
| ULMFiT (Howard and Ruder, 2018) | 2.16 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) |
| DPCNN (Johnson and Zhang, 2017) | 2.64 | [Deep Pyramid Convolutional Neural Networks for Text Categorization](http://aclweb.org/anthology/P17-1052) |
| CNN (Johnson and Zhang, 2016) | 2.90 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) |
| Char-level CNN (Zhang et al., 2015) | 4.88 | [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) |


### SemEval

SemEval (International Workshop on Semantic Evaluation) includes a dedicated sentiment analysis task.
An overview of the latest edition of this task (SemEval-2017 Task 4) is available at: http://www.aclweb.org/anthology/S17-2088

SemEval-2017 Task 4 consists of five subtasks, each offered for both Arabic and English:

1. Subtask A: Given a tweet, decide whether it expresses POSITIVE, NEGATIVE or NEUTRAL
sentiment.

2. Subtask B: Given a tweet and a topic, classify the sentiment conveyed towards that
topic on a two-point scale: POSITIVE vs. NEGATIVE.

3. Subtask C: Given a tweet and a topic, classify the sentiment conveyed in the
tweet towards that topic on a five-point scale: STRONGLYPOSITIVE, WEAKLYPOSITIVE,
NEUTRAL, WEAKLYNEGATIVE, and STRONGLYNEGATIVE.

4. Subtask D: Given a set of tweets about a topic, estimate the distribution of tweets
across the POSITIVE and NEGATIVE classes. 

5. Subtask E: Given a set of tweets about a topic, estimate the distribution of tweets
across the five classes: STRONGLYPOSITIVE, WEAKLYPOSITIVE, NEUTRAL, WEAKLYNEGATIVE, and STRONGLYNEGATIVE.
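Subtasks D and E differ from the others in that the output is a distribution over classes for a whole set of tweets rather than one label per tweet. A minimal sketch of one simple baseline, often called classify and count, is shown below: each tweet is labeled individually and the relative label frequencies are reported. The function name and the toy predictions are hypothetical, and this is neither the method of any listed system nor the official evaluation measure.

```python
from collections import Counter


def class_distribution(predicted_labels, classes):
    """Estimate the class distribution as relative frequencies of per-tweet predictions."""
    if not predicted_labels:
        raise ValueError("at least one prediction is required")
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {label: counts[label] / total for label in classes}


# Made-up per-tweet predictions for one topic (Subtask D setting).
predictions = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
distribution = class_distribution(predictions, ["POSITIVE", "NEGATIVE"])
print(distribution)
```

The same function covers the five-class setting of Subtask E when the five class names are passed as `classes`.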

Subtask A  results:

| Model           | F1-score |  Paper / Source |
| ------------- | :-----:| --- |
| LSTMs+CNNs ensemble with multiple conv. ops (Cliche, 2017) | 0.685 | [BB twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs](http://www.aclweb.org/anthology/S17-2094) |
| Deep Bi-LSTM+attention (Baziotis et al., 2017) | 0.677 | [DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis](http://aclweb.org/anthology/S17-2126) |


## Aspect-based sentiment analysis

### Sentihood

[Sentihood](http://www.aclweb.org/anthology/C16-1146) is a dataset for targeted aspect-based sentiment analysis (TABSA), which aims
to identify fine-grained polarity towards a specific aspect. The dataset consists of 5,215 sentences,
3,862 of which contain a single target, and the remainder multiple targets.

Dataset mirror: https://github.com/uclmr/jack/tree/master/data/sentihood

| Model           | Aspect (F1) | Sentiment (acc) |  Paper / Source |  Code |
| ------------- | :-----:| :-----:| --- | --- |
| QACG-BERT (Wu and Ong, 2020) | 89.7 | 93.8 | [Context-Guided BERT for Targeted Aspect-Based Sentiment Analysis](https://arxiv.org/abs/2010.07523) | [Official](https://github.com/frankaging/Quasi-Attention-ABSA)
| Sun et al. (2019) | 87.9 | 93.6 | [Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence](https://arxiv.org/pdf/1903.09588.pdf) | [Official](https://github.com/HSLCY/ABSA-BERT-pair)
| Liu et al. (2018) | 78.5 | 91.0 | [Recurrent Entity Networks with Delayed Memory Update for Targeted Aspect-based Sentiment Analysis](http://aclweb.org/anthology/N18-2045) | [Official](https://github.com/liufly/delayed-memory-update-entnet)
| SenticLSTM (Ma et al., 2018) | 78.2 | 89.3 | [Targeted Aspect-Based Sentiment Analysis via Embedding Commonsense Knowledge into an Attentive LSTM](http://sentic.net/sentic-lstm.pdf) | 
| LSTM-LOC (Saeidi et al., 2016) | 69.3 | 81.9 | [Sentihood: Targeted aspect based sentiment analysis dataset for urban neighbourhoods](http://www.aclweb.org/anthology/C16-1146) |

### SemEval-2014 Task 4

The [SemEval-2014 Task 4](http://alt.qcri.org/semeval2014/task4/) contains two domain-specific datasets for laptops and restaurants, consisting of over 6K sentences with fine-grained aspect-level human annotations.

The task consists of the following subtasks:

- Subtask 1: Aspect term extraction

- Subtask 2: Aspect term polarity

- Subtask 3: Aspect category detection

- Subtask 4: Aspect category polarity

Preprocessed datasets:

- https://github.com/songyouwei/ABSA-PyTorch/tree/master/datasets/semeval14
- https://github.com/howardhsu/BERT-for-RRC-ABSA (with both subtask 1 and subtask 2)

Subtask 1 results (SemEval-2014 Task 4 for Laptop and SemEval-2016 Task 5 for Restaurant):

| Model           | Laptop (F1) | Restaurant (F1) |  Paper / Source |  Code |
| ------------- | :-----:| :-----:| --- | --- |
| ACE + fine-tune (Wang et al., 2020) | 87.4 | 81.3 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| BERT-PT (Hu, Xu, et al., 2019) | 84.26 | 77.97 | [BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis](https://arxiv.org/pdf/1904.02232.pdf) | [official](https://github.com/howardhsu/BERT-for-RRC-ABSA)
| DE-CNN (Hu, Xu, et al., 2018) | 81.59 | 74.37 | [Double Embeddings and CNN-based Sequence Labeling for Aspect Extraction](https://www.aclweb.org/anthology/P18-2094) | [official](https://github.com/howardhsu/DE-CNN)
| MIN (Li, Xin, et al., 2017) | 77.58 | 73.44 | Deep Multi-Task Learning for Aspect Term Extraction with Memory Interaction | 
| RNCRF (Wang, Wenya. et al., 2016) | 78.42 | 69.74 | [Recursive Neural Conditional Random Fields for Aspect-based Sentiment Analysis](https://www.aclweb.org/anthology/papers/D/D16/D16-1059/) | [official](https://github.com/happywwy/Recursive-Neural-Conditional-Random-Field)

Subtask 2 results:

| Model           | Restaurant (acc) | Laptop (acc) |  Paper / Source |  Code |
| ------------- | :-----:| :-----:| --- | --- |
| BERT-ADA (Rietzler, Alexander, et al., 2019) | 87.89 | 80.23 | [Adapt or Get Left Behind: Domain Adaptation through BERT Language Model Finetuning for Aspect-Target Sentiment Classification](https://arxiv.org/pdf/1908.11860.pdf) | [official](https://github.com/deepopinion/domain-adapted-atsc)
| LCF-BERT (Zeng, Yang, et al., 2019) | 87.14 | 82.45 | [LCF: A Local Context Focus Mechanism for Aspect-Based Sentiment Classification](https://www.mdpi.com/2076-3417/9/16/3389/pdf) | [official](https://github.com/yangheng95/LCF-ABSA) / [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/lcf_bert.py)
| BERT-PT (Hu, Xu, et al., 2019) | 84.95 | 78.07 | [BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis](https://arxiv.org/pdf/1904.02232.pdf) | [official](https://github.com/howardhsu/BERT-for-RRC-ABSA)
| AOA (Huang, Binxuan, et al., 2018) | 81.20 | 74.50 | [Aspect Level Sentiment Classification with Attention-over-Attention Neural Networks](https://arxiv.org/pdf/1804.06536.pdf) | [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/aoa.py)
| TNet (Li, Xin, et al., 2018) | 80.79 | 76.01 | [Transformation Networks for Target-Oriented Sentiment Classification](http://aclweb.org/anthology/P18-1087) | [Official](https://github.com/lixin4ever/TNet) / [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/tnet_lf.py)
| RAM (Chen, Peng, et al., 2017) | 80.23 | 74.49 | [Recurrent Attention Network on Memory for Aspect Sentiment Analysis](http://www.aclweb.org/anthology/D17-1047) | [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/ram.py)
| MemNet (Tang, Duyu, et al., 2016) | 80.95 | 72.21 | [Aspect Level Sentiment Classification with Deep Memory Network](https://pdfs.semanticscholar.org/b28f/7e2996b6ee2784dd2dbb8212cfa0c79ba9e7.pdf) | [Official](https://drive.google.com/open?id=1Hc886aivHmIzwlawapzbpRdTfPoTyi1U) / [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/memnet.py)
| IAN (Ma, Dehong, et al., 2017) | 78.60 | 72.10 | [Interactive Attention Networks for Aspect-Level Sentiment Classification](http://www.ijcai.org/proceedings/2017/0568.pdf) | [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/ian.py)
| ATAE-LSTM (Wang, Yequan, et al. 2016) | 77.20 | 68.70 | [Attention-based lstm for aspect-level sentiment classification](https://aclweb.org/anthology/D16-1058) | [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/atae_lstm.py)
| TD-LSTM (Tang, Duyu, et al., 2016) | 75.63 | 68.13 | [Effective LSTMs for Target-Dependent Sentiment Classification](https://www.aclweb.org/anthology/C/C16/C16-1311.pdf) | [Official](https://drive.google.com/open?id=17RF8MZs456ov9MDiUYZp0SCGL6LvBQl6) / [Link](https://github.com/songyouwei/ABSA-PyTorch/blob/master/models/td_lstm.py)

## Sentiment classification with user and product information

This is the same sentiment classification task, where the given text is a review, but we are additionally given (a) the *user* who wrote the text, and (b) the *product* the text is written about. There are three widely used datasets, introduced by [Tang et al. (2015)](http://aclweb.org/anthology/P15-1098): IMDB, Yelp 2013, and Yelp 2014. Evaluation uses both accuracy and RMSE, but for brevity, we only provide the accuracy here. Please refer to the papers for the RMSE values.
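The two metrics capture different things: accuracy only counts exact rating matches, while RMSE also penalizes how far off a wrong rating is. A minimal sketch, assuming ratings are encoded as integers and using made-up values:

```python
import math


def accuracy(gold, predicted):
    """Fraction of reviews whose predicted rating equals the gold rating."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def rmse(gold, predicted):
    """Root mean squared error between gold and predicted ratings."""
    squared_errors = [(g - p) ** 2 for g, p in zip(gold, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


gold_ratings = [5, 3, 1, 4]       # made-up gold ratings
predicted_ratings = [5, 4, 1, 2]  # two exact matches, one off by 1, one off by 2

print(accuracy(gold_ratings, predicted_ratings))
print(rmse(gold_ratings, predicted_ratings))
```

Two systems with the same accuracy can therefore differ in RMSE if one of them makes errors that are further from the gold rating.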

| Model              | IMDB (acc) | Yelp 2013 (acc) | Yelp 2014 (acc) | Paper / Source | Code |
| ------------------ | :--------: | :-------------: | :-------------: | -------------- | ---- |
| MA-BERT (Zhang, et al., 2021) | 57.3 | 70.3 | 71.4 | [MA-BERT: Learning Representation by Incorporating Multi-Attribute Knowledge in Transformers](https://aclanthology.org/2021.findings-acl.206.pdf) | [Link](https://github.com/yoyo-yun/MA-Bert) |
| IUPC (Lyu, et al., 2020) | 53.8 | 70.5 | 71.2 | [Improving Document-Level Sentiment Analysis with User and Product Context](https://aclanthology.org/2020.coling-main.590/) | [Link](https://paperswithcode.com/paper/improving-document-level-sentiment-analysis) |
| BiLSTM+CHIM (Amplayo, 2019) | 56.4 | 67.8 | 69.2 | [Rethinking Attribute Representation and Injection for Sentiment Classification](https://arxiv.org/pdf/1908.09590.pdf) | [Link](https://github.com/rktamplayo/CHIM) |
| BiLSTM + linear-basis-cust (Kim, et al., 2019) | - | 67.1 | - | [Categorical Metadata Representation for Customized Text Classification](https://arxiv.org/pdf/1902.05196.pdf) | [Link](https://github.com/zizi1532/BasisCustomize) |
| CMA (Ma, et al., 2017) | 54.0 | 66.4 | 67.6 | [Cascading Multiway Attention for Document-level Sentiment Classification](http://aclweb.org/anthology/I17-1064) | - |
| DUPMN (Long, et al., 2018) | 53.9 | 66.2 | 67.6 | [Dual Memory Network Model for Biased Product Review Classification](http://aclweb.org/anthology/W18-6220) | - |
| HCSC (Amplayo, et al., 2018) | 54.2 | 65.7 | - | [Cold-Start Aware User and Product Attention for Sentiment Classification](http://aclweb.org/anthology/P18-1236) | [Link](https://github.com/rktamplayo/HCSC) |
| NSC (Chen, et al., 2016) | 53.3 | 65.0 | 66.7 | [Neural Sentiment Classification with User and Product Attention](http://aclweb.org/anthology/D16-1171) | [Link](https://github.com/thunlp/NSC) |
| UPDMN (Dou, 2017) | 46.5 | 63.9 | 61.3 | [Capturing User and Product Information for Document Level Sentiment Analysis with Deep Memory Network](http://aclweb.org/anthology/D17-1054) | - |
| UPNN (Tang, et al., 2015) | 43.5 | 59.6 | 60.8 | [Learning Semantic Representations of Users and Products for Document Level Sentiment Classification](http://aclweb.org/anthology/P15-1098) | [Link](http://ir.hit.edu.cn/~dytang/paper/acl2015/UserTextNN.zip) |

# Subjectivity analysis

A task related to sentiment analysis is subjectivity analysis, whose goal is to label an opinion as either subjective or objective.

### SUBJ

The [Subjectivity dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/) includes 5,000 subjective and 5,000 objective processed sentences.

| Model           | Accuracy |  Paper / Source |
| ------------- | :-----:| --- |
| AdaSent (Zhao et al., 2015) | 95.50 | [Self-Adaptive Hierarchical Sentence Model](https://arxiv.org/pdf/1504.05070.pdf) |
| CNN+MCFA (Amplayo et al., 2018) | 94.80 | [Translations as Additional Contexts for Sentence Classification](https://arxiv.org/abs/1806.05516) |
| Byte mLSTM (Radford et al., 2017) | 94.60 | [Learning to Generate Reviews and Discovering Sentiment](https://arxiv.org/pdf/1704.01444.pdf) |
| USE (Cer et al., 2018) | 93.90 | [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) |
| Fast Dropout (Wang and Manning, 2013) | 93.60 | [Fast Dropout Training](http://proceedings.mlr.press/v28/wang13a.pdf) |

[Go back to the README](../README.md)


================================================
FILE: english/shallow_syntax.md
================================================
# Shallow syntax

Shallow syntactic tasks provide an analysis of a text on the level of the syntactic structure 
of the text.

## Chunking

Chunking, also known as shallow parsing, identifies continuous spans of tokens that form syntactic units such as noun phrases or verb phrases.

Example:

| Vinken | , | 61 | years | old |
| --- | ---| --- | --- | --- |
| B-NP | I-NP | I-NP | I-NP | I-NP |
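
To make the tagging scheme concrete, here is a minimal sketch that turns a BIO tag sequence like the one above into labelled spans. The function name `bio_to_chunks` and the lenient handling of a stray `I-` tag (treated as the start of a new chunk) are choices made for this illustration, not part of any official evaluation script.

```python
def bio_to_chunks(tags):
    """Convert BIO tags into (label, start, end) spans; end is exclusive."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The current chunk continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues:
            if label is not None:
                chunks.append((label, start, i))
            start, label = None, None
            if tag != "O":
                # B- tags (and stray I- tags) open a new chunk.
                start, label = i, tag[2:]
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks


tokens = ["Vinken", ",", "61", "years", "old"]
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "I-NP"]
print(bio_to_chunks(tags))  # [('NP', 0, 5)]
```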

### Penn Treebank

The [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) is typically used for evaluating chunking.
Sections 15-18 are used for training, section 19 for development, and section 20
for testing. Models are evaluated based on F1.
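
Chunking F1 is usually computed over whole chunks rather than individual tags: a predicted chunk counts as correct only if both its label and its boundaries match a gold chunk. The sketch below assumes chunks are already available as `(label, start, end)` tuples; the example spans are invented, and a real evaluation would also need a sentence identifier in each tuple so that spans from different sentences do not collide.

```python
def chunk_f1(gold_chunks, pred_chunks):
    """Span-level F1: a chunk is correct only if label and boundaries match."""
    gold, pred = set(gold_chunks), set(pred_chunks)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted chunks for one five-token sentence
gold = [("NP", 0, 2), ("VP", 2, 3), ("NP", 3, 5)]
pred = [("NP", 0, 2), ("VP", 2, 4), ("NP", 4, 5)]

# Only the first chunk matches exactly, so precision = recall = 1/3.
print(chunk_f1(gold, pred))
```

The tables on this page report this value multiplied by 100.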

| Model           | F1 score  |  Paper | Source |
| ------------- | :-----:| --- | --- |
| ACE + fine-tune (Wang et al., 2020) | 97.30 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| Flair embeddings (Akbik et al., 2018) | 96.72 | [Contextual String Embeddings for Sequence Labeling](http://aclweb.org/anthology/C18-1139) | [Flair](https://github.com/flairNLP/flair) |
| JMT (Hashimoto et al., 2017) | 95.77 | [A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks](https://www.aclweb.org/anthology/D17-1206) | |
| Low supervision (Søgaard and Goldberg, 2016) | 95.57 | [Deep multi-task learning with low level tasks supervised at lower layers](http://anthology.aclweb.org/P16-2038) | |
| Suzuki and Isozaki (2008) | 95.15 | [Semi-Supervised Sequential Labeling and Segmentation using Giga-word Scale Unlabeled Data](https://aclanthology.info/pdf/P/P08/P08-1076.pdf) | |
| NCRF++ (Yang and Zhang, 2018)| 95.06 | [NCRF++: An Open-source Neural Sequence Labeling Toolkit](http://www.aclweb.org/anthology/P18-4013) | [NCRF++](https://github.com/jiesutd/NCRFpp) |

### CoNLL 2003

Though the [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/) datasets are typically used for evaluating NER, they can be used for evaluating chunking as well. The official standard split of the dataset is used. Models are evaluated based on F1.

| Model           | English | German  |  Paper | Source |
| ------------- | :-----:| :-----: | --- | --- |
| ACE + fine-tune (Wang et al., 2020) | 92.5 | 95.0 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [Official](https://github.com/Alibaba-NLP/ACE)|
| Flair + BERT + Word + Char embeddings (Wang et al., 2020) | 92.0 | 94.4 | [More Embeddings, Better Sequence Labelers?](https://arxiv.org/abs/2009.08330) | |
| Word + Char + MFVI (Wang et al., 2020) | 91.71 | 94.03| [AIN: Fast and Accurate Sequence Labeling with Approximate Inference Network](https://arxiv.org/abs/2009.08229) |[Official](https://github.com/Alibaba-NLP/AIN)|

## Resolving the scope and focus of negation

The scope of negation is the part of the meaning that is negated, and the focus is the part of the scope that is most prominently negated (Huddleston and Pullum 2002).

Example:

`[John had] never [said %as much% before].`

Scope is enclosed in square brackets and focus marked between % signs.
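
The bracket and percent notation can be read mechanically. The sketch below parses a sentence written in the notation of the example above into tokens with scope and focus flags. `parse_negation` is a hypothetical helper written for this page's notation only; the CD-SCO and PB-FOC datasets are distributed in their own formats, which this sketch does not attempt to read.

```python
def parse_negation(annotated):
    """Return (token, in_scope, in_focus) triples for a sentence that marks
    scope with [ ] and focus with % signs."""
    tokens = []
    current = ""
    in_scope = in_focus = False
    tok_scope = tok_focus = False

    def flush():
        nonlocal current, tok_scope, tok_focus
        if current:
            tokens.append((current, tok_scope, tok_focus))
        current, tok_scope, tok_focus = "", False, False

    for ch in annotated:
        if ch == "[":
            flush()
            in_scope = True
        elif ch == "]":
            flush()
            in_scope = False
        elif ch == "%":
            flush()
            in_focus = not in_focus  # % both opens and closes the focus
        elif ch.isspace():
            flush()
        else:
            if not current:
                # Flags are fixed when a token starts.
                tok_scope, tok_focus = in_scope, in_focus
            current += ch
    flush()
    return tokens


parsed = parse_negation("[John had] never [said %as much% before].")
print([tok for tok, scope, _ in parsed if scope])
# ['John', 'had', 'said', 'as', 'much', 'before']
print([tok for tok, _, focus in parsed if focus])
# ['as', 'much']
```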

The [CD-SCO (Conan Doyle Scope) dataset](https://www.clips.uantwerpen.be/sem2012-st-neg/data.html) is for scope detection.
The [PB-FOC (PropBank Focus) dataset](https://www.clips.uantwerpen.be/sem2012-st-neg/data.html) is for focus detection.
The public leaderboard is available on the [*SEM Shared Task 2012 website](https://www.clips.uantwerpen.be/sem2012-st-neg/results.html).

[Go back to the README](../README.md)


================================================
FILE: english/simplification.md
================================================
# Simplification

Simplification consists of modifying the content and structure of a text in order to make it easier to read and understand, while preserving its main idea and approximating its original meaning. A simplified version of a text could benefit low literacy readers, English learners, children, and people with aphasia, dyslexia or autism. Also, simplifying a text automatically could improve performance on other NLP tasks, such as parsing, summarisation, information extraction, semantic role labeling, and machine translation.

## Sentence Simplification

Research on automatic simplification has traditionally been limited to executing transformations at the sentence level. What should we expect from a sentence simplification model? Let's take a look at how humans simplify (from [here](http://videolectures.net/esslli2011_lapata_simplification/)):

| Original Sentence   | Simplified   Sentence |
| ---------------------------------- | ---------------------------------- |
| Owls are the order Strigiformes, comprising 200 bird of prey species. | An owl is a bird. There are about 200 kinds of owls. |
| Owls hunt mostly small mammals, insects, and other birds though some species specialize in hunting fish. | Owls’ prey may be birds, large insects (such as crickets), small reptiles (such as lizards) or small mammals (such as mice, rats, and rabbits). |

Notice the simplification transformations performed: 

- Unusual concepts are explained: insects *(such as crickets)*, small reptiles *(such as lizards)* or small mammals *(such as mice, rats, and rabbits)*. 

- Uncommon words are replaced with a more familiar term or phrase: "comprising" &rarr; "There are about".

- Syntactic structures are replaced by simpler patterns. For example, the first sentence is split into two.

- Some *unimportant* information is removed: the clause "though some species specialize in hunting fish" in the second sentence does not appear in its simplified version. 

When the set of transformations is limited to replacing a word or phrase by a simpler synonym, we are dealing with *Lexical Simplification* (an overview of that area can be found [here](https://www.jair.org/index.php/jair/article/view/11091/26278)). In this section, we consider research that attempts to develop models that learn as many text transformations as possible.

### Evaluation

The ideal method for determining the quality of a simplification is through human evaluation. Traditionally, a simplified output is judged in terms of *grammaticality* (or fluency), *meaning preservation* (or adequacy) and *simplicity*, using Likert scales (1-3 or 1-5). **Warning:** Are these criteria (at the sentence level) the most appropriate for assessing a simplified sentence? It has been suggested [(Siddharthan, 2014)](https://www.jbe-platform.com/content/journals/10.1075/itl.165.2.06sid) that a task-oriented evaluation (e.g. through reading comprehension tests [(Angrosh et al., 2014)](http://aclweb.org/anthology/C14-1188)) could be more informative of the usefulness of the generated simplification. However, this is not general practice.

For tuning and comparing models, the most commonly-used automatic metrics are:

- **BLEU** [(Papineni et al., 2002)](https://aclweb.org/anthology/P02-1040), borrowed from Machine Translation. This metric is not without [problems](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213) for different text generation tasks. However, simplification studies ([Stajner et al., 2014](http://aclweb.org/anthology/W14-1201); [Wubben et al., 2012](http://aclweb.org/anthology/P12-1107); [Xu et al., 2016](http://aclweb.org/anthology/Q16-1029)) have shown that it correlates with human judgments of grammaticality and meaning preservation. BLEU is not well suited, though, for assessing simplicity from either a lexical [(Xu et al., 2016)](http://aclweb.org/anthology/Q16-1029) or a structural [(Sulem et al., 2018b)](http://aclweb.org/anthology/D18-1081) point of view.
- **SARI** [(Xu et al., 2016)](http://aclweb.org/anthology/Q16-1029) is a *lexical simplicity* metric that measures "how good" are the words added, deleted and kept by a simplification model. The metric compares the model's output to *multiple simplification references* and the original sentence. SARI has shown high correlation with human judgements of simplicity gain [(Xu et al., 2016)](http://aclweb.org/anthology/Q16-1029). Currently, this is the main metric used for evaluating sentence simplification models.
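
To give an intuition for what SARI rewards, here is a deliberately simplified sketch of the add / keep / delete idea. It is **not** the official metric: it works on unigram sets only, pools all references into one set, and scores an empty denominator as 0.0, all of which are assumptions made for this illustration. The function names are hypothetical. For reported numbers, use the authors' script or EASSE.

```python
def _ratio(hits, total):
    # Sketch convention: an empty denominator is scored as 0.0.
    return len(hits) / len(total) if total else 0.0


def _f1(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def sari_sketch(source, output, references):
    """Unigram-set illustration of SARI's add/keep/delete components."""
    src = set(source.lower().split())
    out = set(output.lower().split())
    ref = {w for r in references for w in r.lower().split()}

    added, ref_added = out - src, ref - src      # words introduced
    kept, ref_kept = out & src, ref & src        # words preserved
    deleted, ref_deleted = src - out, src - ref  # words removed

    good_add = added & ref_added
    good_keep = kept & ref_kept
    good_del = deleted & ref_deleted

    add_f1 = _f1(_ratio(good_add, added), _ratio(good_add, ref_added))
    keep_f1 = _f1(_ratio(good_keep, kept), _ratio(good_keep, ref_kept))
    del_precision = _ratio(good_del, deleted)
    return (add_f1 + keep_f1 + del_precision) / 3


# Invented example sentences
source = "the cat perched on the mat"
references = ["the cat sat on the mat"]

print(sari_sketch(source, "the cat sat on the mat", references))  # 1.0
print(sari_sketch(source, source, references))  # about 0.296
```

Copying the input scores low here because nothing was added or deleted, even though most kept words are correct. This is the behaviour that distinguishes SARI from BLEU, which would reward the copy for its overlap with the reference.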

The previous two metrics will be used to rank the models in the following sections. Despite popular practice, we refrain from using **Flesch Reading Ease** or **Flesch-Kincaid Grade Level**. Because of the way these metrics are [computed](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests), short sentences could get good scores, even if they are ungrammatical or non-meaning preserving [(Wubben et al., 2012)](http://aclweb.org/anthology/P12-1107), resulting in a misleading ranking.
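
The problem with the Flesch metrics is easy to reproduce. The sketch below implements both formulas with a crude syllable counter (one syllable per group of consecutive vowels, which is an assumption of this sketch and not how published tools count). The "simplification" used as input is an invented, ungrammatical word salad.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


original = "Owls are the order Strigiformes, comprising 200 bird of prey species."
word_salad = "Owls bird the of."  # ungrammatical and not meaning preserving

print(flesch_scores(original))
print(flesch_scores(word_salad))
```

Both formulas depend only on sentence length and syllables per word, so the word salad gets a higher Reading Ease and a lower Grade Level than the original sentence.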

A simplification could involve text transformations beyond paraphrasing (which is what SARI intends to assess). For these cases, it could be more suitable to use **SAMSA** [(Sulem et al., 2018a)](http://aclweb.org/anthology/N18-1063), a metric designed to measure *structural simplicity* (i.e. sentence splitting). However, it has not yet been used in papers other than the one where it was introduced.

**EASSE:** [Alva-Manchego et al. (2019)](https://www.aclweb.org/anthology/D19-3009) released a [tool](https://github.com/feralvam/easse) that provides easy access to all of the above metrics (and several others) through the command line and as a Python package. EASSE also contains commonly-used test sets for the task. Its aim is to help standardise automatic evaluation for sentence simplification.

**IMPORTANT NOTE:** In the tables of the following sections, a score with a \* means that it was not reported by the original authors but by future research that re-implemented and/or re-trained and re-tested the model. In these cases, the original reported score (if there is one) is shown in parentheses. 

### Main - Simple English Wikipedia

[Simple English Wikipedia](https://simple.wikipedia.org) is an online encyclopedia aimed at English learners. Its articles are expected to contain fewer words and simpler grammar structures than those in their [Main English Wikipedia](https://en.wikipedia.org) counterpart. Much of the popularity of using Wikipedia for research in Simplification comes from publicly available sentence alignments between “equivalent” articles in Main and Simple English Wikipedia. 

#### PWKP / WikiSmall

[Zhu et al. (2010)](http://aclweb.org/anthology/C10-1152) compiled a parallel corpus with more than 108K sentence pairs from 65,133 Wikipedia articles, allowing **1-to-1 and 1-to-N alignments**. The latter type of alignments represents instances of sentence splitting. The original full corpus can be found [here](https://www.informatik.tu-darmstadt.de/ukp/research_6/data/sentence_simplification/simple_complex_sentence_pairs/index.en.jsp). The test set is composed of 100 instances, with **one simplification reference per original sentence**. [Zhang and Lapata (2017)](http://aclweb.org/anthology/D17-1062) released a more standardised split of this dataset called [*WikiSmall*](https://github.com/XingxingZhang/dress), with 89,042 instances for training, 205 for development and the same original 100 instances for testing. 

We present the models tested in this dataset **ranked by BLEU score** (or SARI if BLEU is not available). SARI cannot be reliably computed in this dataset since it does not contain multiple simplification references per original sentence. In addition, there are instances of more advanced simplification transformations (e.g. splitting) which SARI does not assess by definition.

| Model           | BLEU | SARI | Paper / Source | Code |
| --------------- | :-----: | :-----: | -------------- | ---- |
| TST (Omelianchuk et al., 2021) | | 44.67 | [Text Simplification by Tagging](https://aclanthology.org/2021.bea-1.2.pdf) |   |
| EditNTS (Dong et al., 2019) |  | 32.35 | [EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing](https://www.aclweb.org/anthology/P19-1331) | [Official](https://github.com/yuedongP/EditNTS) |
| SeqLabel (Alva-Manchego et al., 2017) |  | 30.50\* | [Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs](https://www.aclweb.org/anthology/I17-1030) | |
| Hybrid (Narayan and Gardent, 2014) | 53.94\* (53.6) | 30.46\* | [Hybrid Simplification using Deep Semantics and Machine Translation](http://aclweb.org/anthology/P/P14/P14-1041.pdf) | [Official](https://github.com/shashiongithub/Sentence-Simplification-ACL14) |
| NSELSTM-B (Vu et al., 2018) | 53.42 | 17.47 | [Sentence Simplification with Memory-Augmented Neural Networks](http://aclweb.org/anthology/N18-2013) |  |
| PBMT-R (Wubben et al., 2012) | 46.31\* (43.0) | 15.97\* | [Sentence Simplification by Monolingual Machine Translation](http://aclweb.org/anthology/P12-1107) |  |
| RevILP (Woodsend and Lapata, 2011) | 42.0 | | [Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming](http://aclweb.org/anthology/D11-1038) |  |
| UNSUP (Narayan and Gardent, 2016) | 38.47 | | [Unsupervised Sentence Simplification Using Deep Semantics](http://aclweb.org/anthology/W16-6620) |  |
| TSM (Zhu et al., 2010) | 38.0 | | [A Monolingual Tree-based Translation Model for Sentence Simplification](http://aclweb.org/anthology/C10-1152) |  |
| DRESS-LS (Zhang and Lapata, 2017) | 36.32 | 27.24 | [Sentence Simplification with Deep Reinforcement Learning](http://aclweb.org/anthology/D17-1062) | [Official](https://github.com/XingxingZhang/dress) |
| DRESS (Zhang and Lapata, 2017) | 34.53 | 27.48 | [Sentence Simplification with Deep Reinforcement Learning](http://aclweb.org/anthology/D17-1062) | [Official](https://github.com/XingxingZhang/dress) |
| NSELSTM-S (Vu et al., 2018) | 29.72 | 29.75 | [Sentence Simplification with Memory-Augmented Neural Networks](http://aclweb.org/anthology/N18-2013) |  |
| Pointer + Multi-task Entailment and Paraphrase Generation (Guo et al., 2018) | 27.23 | 29.58 | [Dynamic Multi-Level Multi-Task Learning for Sentence Simplification](http://aclweb.org/anthology/C18-1039) | [Official](https://github.com/HanGuo97/MultitaskSimplification) |

#### Coster and Kauchak (2011)

[Coster and Kauchak (2011)](http://aclweb.org/anthology/P11-2117) automatically aligned 137K sentence pairs from 10K Wikipedia articles, considering **1-to-1 and 1-to-N alignments**, with **one simplification reference per original sentence**. The corpus was split into 124K instances for training, 12K for development, and 1.3K for testing. The dataset is available [here](http://www.cs.pomona.edu/~dkauchak/simplification/). As before, models tested in this dataset are **ranked by BLEU score** and not SARI.

| Model           | BLEU | SARI | Paper / Source | Code |
| --------------- | :-----: | :-----: | -------------- | ---- |
| Moses-Del (Coster and Kauchak, 2011b) | 60.46 |      | [Learning to Simplify Sentences Using Wikipedia](http://aclweb.org/anthology/W11-1601) |  |
| Moses (Coster and Kauchak, 2011a) | 59.87 | |[Simple English Wikipedia: A New Text Simplification Task](http://aclweb.org/anthology/P11-2117)  |  |
| SimpleTT (Feblowitz and Kauchak, 2013) | 56.4 | | [Sentence Simplification as Tree Transduction](http://aclweb.org/anthology/W13-2901) |  |
| PBMT-R (Wubben et al., 2012) | 54.3\* |      | [Sentence Simplification by Monolingual Machine Translation](http://aclweb.org/anthology/P12-1107) |  |

#### Turk Corpus

Together with defining SARI, [Xu et al. (2016)](http://aclweb.org/anthology/Q16-1029) released a dataset properly collected to calculate this simplicity metric: **1-to-1 alignments** focused on paraphrasing transformations (extracted from PWKP), and **multiple (8) simplification references per original sentence** (collected through Amazon Mechanical Turk). The [dataset](https://github.com/cocoxu/simplification/) contains 2,350 sentences split into 2,000 instances for tuning and 350 for testing. For training, most models use [*WikiLarge*](https://github.com/XingxingZhang/dress), which was compiled by [Zhang and Lapata (2017)](http://aclweb.org/anthology/D17-1062) using alignments from other Wikipedia-based datasets ([Zhu et al., 2010](http://aclweb.org/anthology/C10-1152); [Woodsend and Lapata, 2011](http://aclweb.org/anthology/D11-1038); [Kauchak, 2013](http://aclweb.org/anthology/P13-1151)), and contains 296K instances that are not limited to 1-to-1 alignments.

We present the models tested in this dataset **ranked by SARI score**.

| Model           | BLEU | SARI | Paper / Source | Code |
| --------------- | :-----: | :-----: | -------------- | ---- |
| MUSS (Martin et al., 2020) | 78.17 | 42.53 | [Multilingual Unsupervised Sentence Simplification](https://arxiv.org/abs/2005.00352v1) |   |
| Trans-SS (Lu et al., 2021) | 73.72 | 41.97 | [An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages](https://arxiv.org/abs/2109.00165) | [Official](https://github.com/luxinyu1/Trans-SS) | 
| ACCESS (Martin et al., 2019) | 72.53 | 41.87  | [Controllable Sentence Simplification](https://arxiv.org/abs/1910.02677) | [Official](https://github.com/facebookresearch/access) |
| TST (Omelianchuk et al., 2021) | | 41.46 | [Text Simplification by Tagging](https://aclanthology.org/2021.bea-1.2.pdf) |   |
| DMASS + DCSS (Zhao et al., 2018) |  | 40.45 | [Integrating Transformer and Paraphrase Rules for Sentence Simplification](http://aclweb.org/anthology/D18-1355) | [Official](https://github.com/Sanqiang/text_simplification) |
| SBSMT + PPDB + SARI (Xu et al., 2016) | 73.08\* (72.36) | 39.96\* (37.91) | [Optimizing Statistical Machine Translation for Text Simplification](http://aclweb.org/anthology/Q16-1029) | [Official](https://github.com/cocoxu/simplification/) |
| PBMT-R (Wubben et al., 2012) | 81.11\* | 38.56\* | [Sentence Simplification by Monolingual Machine Translation](http://aclweb.org/anthology/P12-1107) |  |
| EditNTS (Dong et al., 2019) | 86.69 | 38.22 | [EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing](https://www.aclweb.org/anthology/P19-1331) | [Official](https://github.com/yuedongP/EditNTS) |
| Edit-Unsup-TS (Kumar et al., 2020) | 73.62 | 37.85 | [Iterative Edit-Based Unsupervised Sentence Simplification](https://www.aclweb.org/anthology/2020.acl-main.707.pdf) | [Official](https://github.com/ddhruvkr/Edit-Unsup-TS) |
| Pointer + Multi-task Entailment and Paraphrase Generation (Guo et al., 2018) | 81.49 | 37.45 | [Dynamic Multi-Level Multi-Task Learning for Sentence Simplification](http://aclweb.org/anthology/C18-1039) | [Official](https://github.com/HanGuo97/MultitaskSimplification) |
| DRESS-LS (Zhang and Lapata, 2017) | 80.12 | 37.27 | [Sentence Simplification with Deep Reinforcement Learning](http://aclweb.org/anthology/D17-1062) | [Official](https://github.com/XingxingZhang/dress) |
| NTS + SARI (Nisioi et al., 2017) | 80.69 | 37.25 | [Exploring Neural Text Simplification Models](http://aclweb.org/anthology/P17-2014) | [Official](https://github.com/senisioi/NeuralTextSimplification) |
| UnsupNTS (Surya et al., 2019) | 74.02 | 37.20| [Unsupervised Neural Text Simplification](https://www.aclweb.org/anthology/P19-1198) | [Official](https://github.com/subramanyamdvss/UnsupNTS) |
| DRESS (Zhang and Lapata, 2017) | 77.18 | 37.08 | [Sentence Simplification with Deep Reinforcement Learning](http://aclweb.org/anthology/D17-1062) | [Official](https://github.com/XingxingZhang/dress) |
| SeqLabel (Alva-Manchego et al., 2017) |  | 37.08\* | [Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs](https://www.aclweb.org/anthology/I17-1030) | |
| NSELSTM-S (Vu et al., 2018) | 80.43 | 36.88 | [Sentence Simplification with Memory-Augmented Neural Networks](http://aclweb.org/anthology/N18-2013) |  |
| SEMoses (Sulem et al., 2018) | 74.49 | 36.70 | [Simple and Effective Text Simplification Using Semantic and Neural Methods](http://aclweb.org/anthology/P18-1016) | [Official](https://github.com/eliorsulem/simplification-acl2018) |
| NSELSTM-B (Vu et al., 2018) | 92.02 | 33.43 | [Sentence Simplification with Memory-Augmented Neural Networks](http://aclweb.org/anthology/N18-2013) | |
| Hybrid (Narayan and Gardent, 2014) | 48.97\* | 31.40\* | [Hybrid Simplification using Deep Semantics and Machine Translation](http://aclweb.org/anthology/P/P14/P14-1041.pdf) | [Official](https://github.com/shashiongithub/Sentence-Simplification-ACL14) |

#### ASSET

[Alva-Manchego et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.424/) released a dataset aligned with TurkCorpus that contains the same set of original sentences, but with manual references where multiple simplification operations could have been applied, namely lexical paraphrasing, compression and/or sentence splitting. The authors showed that human judges found this type of simplification simpler than those from TurkCorpus. Due to its multi-operation nature, ASSET contains **1-to-1 and 1-to-N alignments**, with **10 simplification references per original sentence** (collected through Amazon Mechanical Turk). Like TurkCorpus, [ASSET](https://github.com/facebookresearch/asset) contains 2,350 sentences split into 2,000 instances for tuning and 350 for testing.

We present the models tested in this dataset **ranked by SARI score**.

| Model           | BLEU | SARI | Paper / Source | Code |
| --------------- | :-----: | :-----: | -------------- | ---- |
| MUSS (Martin et al., 2020) | 72.98 | 44.15 | [Multilingual Unsupervised Sentence Simplification](https://arxiv.org/abs/2005.00352v1) |   |
| TST (Omelianchuk et al., 2021) | | 43.21 | [Text Simplification by Tagging](https://aclanthology.org/2021.bea-1.2.pdf) |   |
| Trans-SS (Lu et al., 2021) | 71.83 | 42.69 | [An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages](https://arxiv.org/abs/2109.00165) | [Official](https://github.com/luxinyu1/Trans-SS) | 
| ACCESS (Martin et al., 2019) | 75.99\* | 40.13\*  | [Controllable Sentence Simplification](https://arxiv.org/abs/1910.02677) | [Official](https://github.com/facebookresearch/access) |
| DMASS + DCSS (Zhao et al., 2018) | 71.44\* | 38.67\* | [Integrating Transformer and Paraphrase Rules for Sentence Simplification](http://aclweb.org/anthology/D18-1355) | [Official](https://github.com/Sanqiang/text_simplification) |
| DRESS-LS (Zhang and Lapata, 2017) | 86.39\* | 36.59\* | [Sentence Simplification with Deep Reinforcement Learning](http://aclweb.org/anthology/D17-1062) | [Official](https://github.com/XingxingZhang/dress) |
| UnsupNTS (Surya et al., 2019) | 76.14\* | 35.19\* | [Unsupervised Neural Text Simplification](https://www.aclweb.org/anthology/P19-1198) | [Official](https://github.com/subramanyamdvss/UnsupNTS) |
| PBMT-R (Wubben et al., 2012) | 79.39\* | 34.63\* | [Sentence Simplification by Monolingual Machine Translation](http://aclweb.org/anthology/P12-1107) |  |

#### Other Datasets

[Hwang et al. (2015)](http://aclweb.org/anthology/N15-1022) released a [dataset](http://ssli.ee.washington.edu/tial/projects/simplification/) of 392K instances, while [Kajiwara and Komachi (2016)](http://aclweb.org/anthology/C16-1109) collected the [sscorpus](https://github.com/tmu-nlp/sscorpus) of 493K instances, also from Main - Simple English Wikipedia article pairs. Both datasets contain only **1-to-1 alignments** with **one simplification reference per original sentence**. Despite their bigger sizes and the more sophisticated sentence alignment algorithms used to collect them, these datasets are not commonly used in simplification research.

### Newsela

[Xu et al. (2015)](http://aclweb.org/anthology/Q15-1021) introduced the Newsela corpus, which contains 1,130 news articles with four simplification versions each. The original article is considered version 0, and each simplification version goes from 1 to 4 (the highest being the simplest). These simplifications were produced manually by professional editors, considering children of different grade levels as target audience. Through manual evaluation on a subset of the data, [Xu et al. (2015)](http://aclweb.org/anthology/Q15-1021) showed that there is a better presence and distribution of simplification transformations in Newsela than in PWKP. 

The dataset can be requested [here](https://newsela.com/data/). However, researchers are not allowed to publicly share splits of the data. This is not ideal for proper reproducibility and comparison among models.

#### Splits by Zhang and Lapata (2017)

[Xu et al. (2015)](http://aclweb.org/anthology/Q15-1021) generated sentence alignments between all versions of each article in the Newsela corpus. [Zhang and Lapata (2017)](http://aclweb.org/anthology/D17-1062) imply that they used those alignments but removed some sentence pairs that are "too similar". In the end, they used a dataset composed of 94,208 instances for training, 1,129 instances for development, and 1,076 instances for testing. Their test set, in particular, contains only **1-to-1 alignments** with **one simplification reference per original sentence**.

Using their splits, [Zhang and Lapata (2017)](http://aclweb.org/anthology/D17-1062) trained and tested several models, which we include in our ranking. Other research that claims to have used the same dataset splits is also considered. Despite not being the ideal scenario, the models tested in this dataset are commonly **ranked by SARI score**.

| Model           | BLEU | SARI | Paper / Source | Code |
| --------------- | :-----: | :-----: | -------------- | ---- |
| CRF Alignment + Transformer (Jiang et al., 2020) |   | 36.6 | [Neural CRF Model for Sentence Alignment in Text Simplification](https://arxiv.org/abs/2005.02324) | [Official](https://github.com/chaojiang06/wiki-auto) |
| Pointer + Multi-task Entailment and Paraphrase Generation (Guo et al., 2018) | 11.14 | 33.22 | [Dynamic Multi-Level Multi-Task Learning for Sentence Simplification](http://aclweb.org/anthology/C18-1039) | [Official](https://github.com/HanGuo97/MultitaskSimplification) |
| S2S-Cluster-FA (Kriz et al., 2019) | 19.55 | 30.73 | [Complexity-Weighted Loss and Diverse Reranking for Sentence Simplification](https://www.aclweb.org/anthology/N19-1317) | [Official](https://github.com/rekriz11/sockeye-recipes) |
| Edit-Unsup-TS (Kumar et al., 2020) | 17.36 | 30.44 | [Iterative Edit-Based Unsupervised Sentence Simplification](https://www.aclweb.org/anthology/2020.acl-main.707.pdf) | [Official](https://github.com/ddhruvkr/Edit-Unsup-TS) |
| EditNTS (Dong et al., 2019) | 19.85 | 30.27 | [EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing](https://www.aclweb.org/anthology/P19-1331) | [Official](https://github.com/yuedongP/EditNTS) |
| NSELSTM-S (Vu et al., 2018) | 22.62 | 29.58 | [Sentence Simplification with Memory-Augmented Neural Networks](http://aclweb.org/anthology/N18-2013) | |
| SeqLabel (Alva-Manchego et al., 2017) |  | 29.53\* | [Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs](https://www.aclweb.org/anthology/I17-1030) | |
| Hybrid (Narayan and Gardent, 2014) | 14.46\* | 28.61\* | [Hybrid Simplification using Deep Semantics and Machine Translation](http://aclweb.org/anthology/P/P14/P14-1041.pdf) | [Official](https://github.com/shashiongithub/Sentence-Simplification-ACL14) |
| NSELSTM-B (Vu et al., 2018) | 26.31 | 27.42 | [Sentence Simplification with Memory-Augmented Neural Networks](http://aclweb.org/anthology/N18-2013) | |
| DRESS (Zhang and Lapata, 2017) | 23.21 | 27.37 | [Sentence Simplification with Deep Reinforcement Learning](http://aclweb.org/anthology/D17-1062) | [Official](https://github.com/XingxingZhang/dress) |
| DMASS + DCSS (Zhao et al., 2018) |  | 27.28 | [Integrating Transformer and Paraphrase Rules for Sentence Simplification](http://aclweb.org/anthology/D18-1355) | [Official](https://github.com/Sanqiang/text_simplification) |
| DRESS-LS (Zhang and Lapata, 2017) | 24.30 | 26.63 | [Sentence Simplification with Deep Reinforcement Learning](http://aclweb.org/anthology/D17-1062) | [Official](https://github.com/XingxingZhang/dress) |
| PBMT-R (Wubben et al., 2012) | 18.19\* | 15.77\* | [Sentence Simplification by Monolingual Machine Translation](http://aclweb.org/anthology/P12-1107) |  |

As mentioned before, a big disadvantage of the Newsela corpus is that a unique train/dev/test split of the data is not (cannot be made?) publicly available. In addition, due to its characteristics, it is not clear what the best way to generate sentence alignments and split the data is:

- [Zhang and Lapata (2017)](http://aclweb.org/anthology/D17-1062) removed sentences from version pairs 0–1, 1–2, and 2–3 because they are "too similar to each other". This could prevent the model from learning when a sentence should not be simplified. In addition, their test set only considers 1-to-1 sentence alignments, even though it is possible to generate 1-to-N and N-to-1 sentence pairs as shown by other researchers ([Scarton et al., 2018](http://aclweb.org/anthology/L18-1553); [Stajner et al., 2018](http://aclweb.org/anthology/L18-1615)).
- [Alva-Manchego et al. (2017)](http://aclweb.org/anthology/I17-1030), [Scarton et al. (2018)](http://aclweb.org/anthology/L18-1553), and [Stajner and Nisioi (2018)](http://www.aclweb.org/anthology/L18-1479) generate sentence alignments (using different algorithms) only between adjacent article versions (i.e. 0-1, 1-2, 2-3, and 3-4). Meanwhile, [Scarton and Specia (2018)](http://aclweb.org/anthology/P18-2113) generate alignments between all versions (i.e., 0-{1,2,3,4}, 1-{2,3,4}, 2-{3,4}, and 3-4). The assumption behind using only adjacent versions is that, to write an article's simplification, an editor takes the immediately previous simplified version as basis (i.e. 0&rarr;1, 1&rarr;2, etc.). However, since the simplification manual followed by the Newsela editors is not public, it is not possible to corroborate that hypothesis.
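
The two alignment strategies described above differ only in which article-version pairs are considered. A minimal sketch of the enumeration, assuming versions are numbered 0 (original) to 4 (simplest) as on this page:

```python
from itertools import combinations

versions = range(5)  # 0 = original article, 4 = simplest version

# Strategy 1: align only adjacent versions (0-1, 1-2, 2-3, 3-4).
adjacent_pairs = [(v, v + 1) for v in range(4)]

# Strategy 2: align every version with every simpler version.
all_pairs = list(combinations(versions, 2))

print(adjacent_pairs)  # [(0, 1), (1, 2), (2, 3), (3, 4)]
print(len(all_pairs))  # 10
```

Using all pairs yields 10 version pairs per article instead of 4, so the choice directly affects the size and composition of the resulting training data.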


[Go back to the README](../README.md)


================================================
FILE: english/stance_detection.md
================================================
# Stance detection

Stance detection is the extraction of a subject's reaction to a claim made by a primary actor. It is a core part of a set of approaches to fake news assessment.

Example:

* Source: "Apples are the most delicious fruit in existence"
* Reply: "Obviously not, because that is a reuben from Katz's"
* Stance: deny

### RumourEval

The [RumourEval 2017](http://www.aclweb.org/anthology/S/S17/S17-2006.pdf) dataset has been used for stance detection in English (subtask A). It features multiple stories and thousands of reply:response pairs, with train, test and evaluation splits each containing a distinct set of over-arching narratives.

This dataset subsumes the large [PHEME collection of rumors and stance](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0150989), which includes German.

| Model           | Accuracy  |  Paper / Source |
| ------------- | ----- | --- |
| Kochkina et al. 2017 | 0.784 | [Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM](http://www.aclweb.org/anthology/S/S17/S17-2083.pdf)|
| Bahuleyan and Vechtomova 2017| 0.780 | [UWaterloo at SemEval-2017 Task 8: Detecting Stance towards Rumours with Topic Independent Features](http://www.aclweb.org/anthology/S/S17/S17-2080.pdf) |

[Go back to the README](../README.md)


================================================
FILE: english/summarization.md
================================================
# Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the
input's meaning.

### Warning: Evaluation Metrics

For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 
2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as CNN/DailyMail and Gigaword provide only a single reference.

Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.
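
The lexical-overlap limitation (point 2 above) can be illustrated with a simplified ROUGE-N F1 computation. This is a minimal sketch, not the official ROUGE toolkit: it assumes lowercased whitespace tokenization, no stemming, and a single reference, and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"

print(rouge_n_f1(verbatim, reference))    # identical wording
print(rouge_n_f1(paraphrase, reference))  # similar meaning, no shared words
```

The paraphrase shares no tokens with the reference and therefore scores zero, even though a human reader would judge the two sentences to be close in meaning.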


### CNN / Daily Mail

The [CNN / Daily Mail dataset](https://arxiv.org/abs/1506.03340) as processed by 
[Nallapati et al. (2016)](http://www.aclweb.org/anthology/K16-1028) has been used
for evaluating summarization. The dataset contains online news articles (781 tokens 
on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average).
The processed version contains 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs.
Models are evaluated with full-length F1-scores of ROUGE-1, ROUGE-2, ROUGE-L, and METEOR (optional).
A multilingual version of the CNN / Daily Mail dataset is available for five languages ([French](../french/summarization.md#mlsum), [German](../german/summarization.md#mlsum), [Spanish](../spanish/summarization.md#mlsum), [Russian](../russian/summarization.md#mlsum), [Turkish](../turkish/summarization.md#mlsum)).
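
For readers unfamiliar with ROUGE-L, the sketch below computes an F1 score from the longest common subsequence (LCS) of the candidate and reference token sequences. It is an illustration only: it assumes lowercased whitespace tokenization and treats each text as one sequence, whereas published scores are normally produced with the official ROUGE toolkit, whose preprocessing and handling of multi-sentence summaries differ. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("police killed the gunman", "police kill the gunman"))
```

Unlike ROUGE-2, the LCS does not require matching tokens to be adjacent, only to appear in the same order.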

#### Anonymized version

The following models have been evaluated on the entity-anonymized version of the dataset introduced by [Nallapati et al. (2016)](http://www.aclweb.org/anthology/K16-1028).

| Model           | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :----: | -------------- | ---- |
| RNES w/o coherence (Wu and Hu, 2018) | 41.25 | 18.87 | 37.75 | - | [Learning to Extract Coherent Summary via Deep Reinforcement Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16838/16118) | |
| SWAP-NET (Jadhav and Rajan, 2018) | 41.6 | 18.3 | 37.7 | - | [Extractive Summarization with SWAP-NET: Sentences and Words from Alternating Pointer Networks](http://aclweb.org/anthology/P18-1014) | |
| HSSAS (Al-Sabahi et al., 2018) | 42.3 | 17.8 | 37.6 | - | [A Hierarchical Structured Self-Attentive Model for Extractive Document Summarization (HSSAS)](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8344797) | |
| GAN (Liu et al., 2018) | 39.92 | 17.65 | 36.71 | - | [Generative Adversarial Network for Abstractive Text Summarization](https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16238/16492) | |
| KIGN+Prediction-guide (Li et al., 2018) | 38.95| 17.12 | 35.68 | - | [Guiding Generation for Abstractive Text Summarization based on Key Information Guide Network](http://aclweb.org/anthology/N18-2009) | |
| SummaRuNNer (Nallapati et al., 2017) | 39.6 | 16.2 | 35.3 | - | [SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents](https://arxiv.org/abs/1611.04230) | |
| rnn-ext + abs + RL + rerank (Chen and Bansal, 2018) | 39.66 | 15.85 | 37.34 | - | [Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](http://aclweb.org/anthology/P18-1063) | [Official](https://github.com/ChenRocks/fast_abs_rl) |
| ML+RL, with intra-attention (Paulus et al., 2018) | 39.87 | 15.82 | 36.90 | - | [A Deep Reinforced Model for Abstractive Summarization](https://openreview.net/pdf?id=HkAClQgA-) | |
| Lead-3 baseline (Nallapati et al., 2017) | 39.2 | 15.7 | 35.5 | - | [SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents](https://arxiv.org/abs/1611.04230) | |
| ML+RL ROUGE+Novel, with LM (Kryscinski et al., 2018) | 40.02 | 15.53 | 37.44 | - | [Improving Abstraction in Text Summarization](http://aclweb.org/anthology/D18-1207) |  |
| (Tan et al., 2017) | 38.1 | 13.9 | 34.0 | - | [Abstractive Document Summarization with a Graph-Based Attentional Neural Model](http://aclweb.org/anthology/P17-1108) | |
| words-lvt2k-temp-att (Nallapati et al., 2016) | 35.46 | 13.30 | 32.65 | - | [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](http://www.aclweb.org/anthology/K16-1028) | |


#### Non-Anonymized Version: Extractive Models

The following models have been evaluated on the non-anonymized version of the dataset introduced by [See et al. (2017)](http://aclweb.org/anthology/P17-1099).

The first table covers extractive models, while the second covers abstractive and mixed approaches.

| Model           | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :----: | -------------- | ---- |
| MatchSum (Zhong et al., 2020) | 44.41 | 20.86 | 40.55 | - | [Extractive Summarization as Text Matching](https://arxiv.org/abs/2004.08795) | [Official](https://github.com/maszhongming/MatchSum) |
| DiscoBERT w.G_R & G_C (Xu et al. 2019) | 43.77 | 20.85 | 40.67 | - | [A Discourse-Aware Neural Extractive Model for Text Summarization](http://www.cs.utexas.edu/~jcxu/material/ACL20/DiscoBERT_ACL2020.pdf) |[Official](https://github.com/jiacheng-xu/DiscoBERT) |
| BertSumExt (Liu and Lapata 2019) | 43.85 | 20.34 | 39.90 | - | [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) |[Official](https://github.com/nlpyang/PreSumm) | 
| BERT-ext + RL (Bae et al., 2019) | 42.76 | 19.87 | 39.11 | - | [Summary Level Training of Sentence Rewriting for Abstractive Summarization](https://arxiv.org/abs/1909.08752) | |
| PNBERT (Zhong et al., 2019) | 42.69 | 19.60 | 38.85 | - | [Searching for Effective Neural Extractive Summarization: What Works and What's Next](https://arxiv.org/abs/1907.03491) | [Official](https://github.com/maszhongming/Effective_Extractive_Summarization) |
| HIBERT (Zhang et al., 2019) | 42.37 | 19.95 | 38.83 | - | [HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization](https://arxiv.org/abs/1905.06566) | |
| NeuSUM (Zhou et al., 2018) | 41.59 | 19.01 | 37.98 | - | [Neural Document Summarization by Jointly Learning to Score and Select Sentences](http://aclweb.org/anthology/P18-1061) | [Official](https://github.com/magic282/NeuSum) |
| Latent (Zhang et al., 2018) | 41.05 | 18.77 | 37.54 | - | [Neural Latent Extractive Document Summarization](http://aclweb.org/anthology/D18-1088) | | 
| BanditSum (Dong et al., 2018) | 41.5 | 18.7 | 37.6 | - | [BANDITSUM: Extractive Summarization as a Contextual Bandit](https://aclweb.org/anthology/D18-1409) | [Official](https://github.com/yuedongP/BanditSum)|
| REFRESH (Narayan et al., 2018) | 40.0 | 18.2 | 36.6 | - | [Ranking Sentences for Extractive Summarization with Reinforcement Learning](http://aclweb.org/anthology/N18-1158) | [Official](https://github.com/EdinburghNLP/Refresh) |
| Lead-3 baseline (See et al., 2017) | 40.34 | 17.70 | 36.57 | 22.21 | [Get To The Point: Summarization with Pointer-Generator Networks](http://aclweb.org/anthology/P17-1099) | [Official](https://github.com/abisee/pointer-generator) |

#### Non-Anonymized: Abstractive Models & Mixed Models

| Model           | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :----: | -------------- | ---- |
| BRIO (Liu et al., 2022) | 47.78 | 23.55 | 44.57 | - | [BRIO: Bringing Order to Abstractive Summarization](https://arxiv.org/pdf/2203.16804.pdf) | [Official](https://github.com/yixinL7/BRIO) |
| SimCLS (Liu et al., 2021) | 46.67 | 22.15 | 43.54 | - | [SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization](https://arxiv.org/abs/2106.01890.pdf) | [Official](https://github.com/yixinL7/SimCLS) |
| GSum (Dou et al., 2020) | 45.94 | 22.32 | 42.48 | - | [GSum: A General Framework for Guided Neural Abstractive Summarization](https://arxiv.org/pdf/2010.08014.pdf) | [Official](https://github.com/neulab/guided_summarization) |
| ProphetNet (Yan, Qi, Gong, Liu et al., 2020) | 44.20 | 21.17 | 41.30 | - | [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/pdf/2001.04063.pdf) | [Official](https://github.com/microsoft/ProphetNet) |
| PEGASUS (Zhang et al., 2019) | 44.17 | 21.47 | 41.11 | - | [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf) | [Official](https://github.com/google-research/pegasus) |
| BART (Lewis et al., 2019) | 44.16 | 21.28 | 40.90 | - | [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) | [Official](https://github.com/pytorch/fairseq/tree/master/examples/bart) |
| T5 (Raffel et al., 2019) | 43.52 | 21.55 | 40.69 | - | [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) | [Official](https://github.com/google-research/text-to-text-transfer-transformer) |
| UniLM (Dong et al., 2019) | 43.33 | 20.21 | 40.51 | - | [Unified Language Model Pre-training for Natural Language Understanding and Generation](https://arxiv.org/pdf/1905.03197.pdf) | [Official](https://github.com/microsoft/unilm) |
| CNN-2sent-hieco-RBM (Zhang et al., 2019) | 42.04 | 19.77 | 39.42 | - |[Abstract Text Summarization with a Convolutional Seq2Seq Model](https://www.mdpi.com/2076-3417/9/8/1665/pdf) | |
| BertSumExtAbs (Liu and Lapata 2019) | 42.13 | 19.60 | 39.18 | - | [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) |[Official](https://github.com/nlpyang/PreSumm) |
| BERT-ext + abs + RL + rerank (Bae et al., 2019) | 41.90 | 19.08 | 39.64 | - | [Summary Level Training of Sentence Rewriting for Abstractive Summarization](https://arxiv.org/abs/1909.08752) | |
| Two-Stage + RL (Zhang et al., 2019) | 41.71 | 19.49 | 38.79 | - | [Pretraining-Based Natural Language Generation for Text Summarization](https://arxiv.org/abs/1902.09243) | |
| DCA (Celikyilmaz et al., 2018) | 41.69 | 19.47 | 37.92 | - | [Deep Communicating Agents for Abstractive Summarization](http://aclweb.org/anthology/N18-1150) | |
| EditNet (Moroshko et al., 2018) | 41.42 | 19.03 | 38.36 | - | [An Editorial Network for Enhanced Document Summarization](https://arxiv.org/abs/1902.10360) | |
| rnn-ext + RL (Chen and Bansal, 2018) | 41.47 | 18.72 | 37.76 | 22.35 | [Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](http://aclweb.org/anthology/P18-1063) | [Official](https://github.com/chenrocks/fast_abs_rl) |
| Bottom-Up Summarization (Gehrmann et al., 2018) | 41.22 | 18.68 | 38.34 | - | [Bottom-Up Abstractive Summarization](https://arxiv.org/abs/1808.10792) | [Official](https://github.com/sebastianGehrmann/bottom-up-summary) |
| (Li et al., 2018a) | 41.54 | 18.18 | 36.47 | - | [Improving Neural Abstractive Document Summarization with Explicit Information Selection Modeling](http://aclweb.org/anthology/D18-1205) | |
| (Li et al., 2018b) | 40.30 | 18.02 | 37.36 | - | [Improving Neural Abstractive Document Summarization with Structural Regularization](http://aclweb.org/anthology/D18-1441) | |
| ROUGESal+Ent RL (Pasunuru and Bansal, 2018) | 40.43 | 18.00 | 37.10 | 20.02 | [Multi-Reward Reinforced Summarization with Saliency and Entailment](http://aclweb.org/anthology/N18-2102) | |
| end2end w/ inconsistency loss (Hsu et al., 2018) | 40.68 | 17.97 | 37.13 | - | [A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss](http://aclweb.org/anthology/P18-1013) | |
| RL + pg + cbdec (Jiang and Bansal, 2018) | 40.66 | 17.87 | 37.06 | 20.51 | [Closed-Book Training to Improve Summarization Encoder Memory](http://aclweb.org/anthology/D18-1440) | |
| rnn-ext + abs + RL + rerank (Chen and Bansal, 2018) | 40.88 | 17.80 | 38.54 | 20.38 | [Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](http://aclweb.org/anthology/P18-1063) | [Official](https://github.com/chenrocks/fast_abs_rl) |
| Pointer + Coverage + EntailmentGen + QuestionGen (Guo et al., 2018) | 39.81 | 17.64 | 36.54 | 18.54 | [Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation](http://aclweb.org/anthology/P18-1064) | |
| ML+RL ROUGE+Novel, with LM (Kryscinski et al., 2018) | 40.19 | 17.38 | 37.52 | - | [Improving Abstraction in Text Summarization](http://aclweb.org/anthology/D18-1207) | |
| Pointer-generator + coverage (See et al., 2017) | 39.53 | 17.28 | 36.38 | 18.72 | [Get To The Point: Summarization with Pointer-Generator Networks](http://aclweb.org/anthology/P17-1099) | [Official](https://github.com/abisee/pointer-generator) |

### Gigaword

The Gigaword summarization dataset was first used by [Rush et al., 2015](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) and represents a sentence summarization / headline generation task with very short input documents (31.4 tokens) and summaries (8.3 tokens). It contains 3.8M training, 189k development and 1951 test instances. Models are evaluated with ROUGE-1, ROUGE-2 and ROUGE-L using full-length F1-scores.

The results below are ranked by ROUGE-2 score.

| Model | ROUGE-1 | ROUGE-2* | ROUGE-L | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | -------------- | ---- |
| ControlCopying (Song et al., 2020) | 39.08 | 20.47 | 36.69 | [Controlling the Amount of Verbatim Copying in Abstractive Summarization](https://arxiv.org/pdf/1911.10390.pdf) | [Official](https://github.com/ucfnlp/control-over-copying) |
| ProphetNet (Yan, Qi, Gong, Liu et al., 2020) | 39.51 | 20.42 | 36.69 | [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/pdf/2001.04063.pdf) | [Official](https://github.com/microsoft/ProphetNet) |
| UniLM (Dong et al., 2019) | 38.90 | 20.05 | 36.00 | [Unified Language Model Pre-training for Natural Language Understanding and Generation](https://arxiv.org/pdf/1905.03197.pdf) | [Official](https://github.com/microsoft/unilm) |
| PEGASUS (Zhang et al., 2019) | 39.12 | 19.86 | 36.24 | [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf) | [Official](https://github.com/google-research/pegasus) |
| BiSET (Wang et al., 2019) | 39.11 | 19.78 | 36.87 | [BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization](https://www.aclweb.org/anthology/P19-1207) | [Official](https://github.com/InitialBug/BiSET) |
| MASS (Song et al., 2019) | 38.73 | 19.71 | 35.96 | [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://arxiv.org/pdf/1905.02450v5.pdf) | [Official](https://github.com/microsoft/MASS) |
| Re^3 Sum (Cao et al., 2018) | 37.04 | 19.03 | 34.46 | [Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization](http://aclweb.org/anthology/P18-1015) | |
| JointParsing (Song et al., 2020) | 36.61 | 18.85 | 34.33 | [Joint Parsing and Generation for Abstractive Summarization](https://arxiv.org/pdf/1911.10389.pdf) | [Official](https://github.com/KaiQiangSong/joint_parse_summ) |
| CNN-2sent-hieco-RBM (Zhang et al., 2019) | 37.95 | 18.64 | 35.11 | [Abstract Text Summarization with a Convolutional Seq2Seq Model](https://www.mdpi.com/2076-3417/9/8/1665/pdf) | |
| Reinforced-Topic-ConvS2S (Wang et al., 2018) | 36.92 | 18.29 | 34.58 | [A Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model for Abstractive Text Summarization](https://www.ijcai.org/proceedings/2018/0619.pdf) | |
| CGU (Lin et al., 2018) | 36.3 | 18.0 | 33.8 | [Global Encoding for Abstractive Summarization](http://aclweb.org/anthology/P18-2027) | [Official](https://www.github.com/lancopku/Global-Encoding) |
| Pointer + Coverage + EntailmentGen + QuestionGen (Guo et al., 2018) | 35.98 | 17.76 | 33.63 | [Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation](http://aclweb.org/anthology/P18-1064) | |
| Struct+2Way+Word (Song et al., 2018) | 35.47 | 17.66 | 33.52 | [Structure-Infused Copy Mechanisms for Abstractive Summarization](http://aclweb.org/anthology/C18-1146) | [Official](https://github.com/KaiQiangSong/struct_infused_summ)|
| FTSum_g (Cao et al., 2018) | 37.27 | 17.65 | 34.24 | [Faithful to the Original: Fact Aware Neural Abstractive Summarization](https://arxiv.org/pdf/1711.04434.pdf) | |
| DRGD (Li et al., 2017) | 36.27 | 17.57 | 33.62 | [Deep Recurrent Generative Decoder for Abstractive Text Summarization](http://aclweb.org/anthology/D17-1222) | |
| SEASS (Zhou et al., 2017) | 36.15 | 17.54 | 33.63 | [Selective Encoding for Abstractive Sentence Summarization](http://aclweb.org/anthology/P17-1101) | [Official](https://github.com/magic282/SEASS) |
| EndDec+WFE (Suzuki and Nagata, 2017) | 36.30 | 17.31 | 33.88 | [Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization](http://aclweb.org/anthology/E17-2047) | |
| Seq2seq + selective + MTL + ERAM (Li et al., 2018) | 35.33 | 17.27 | 33.19 | [Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization](http://aclweb.org/anthology/C18-1121) | |
| Seq2seq + E2T_cnn (Amplayo et al., 2018) | 37.04 | 16.66 | 34.93 | [Entity Commonsense Representation for Neural Abstractive Summarization](http://aclweb.org/anthology/N18-1064) | |
| RAS-Elman (Chopra et al., 2016) | 33.78 | 15.97 | 31.15 | [Abstractive Sentence Summarization with Attentive Recurrent Neural Networks](http://www.aclweb.org/anthology/N16-1012) | |
| words-lvt5k-1sent (Nallapati et al., 2016) | 32.67 | 15.59 | 30.64 | [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](http://www.aclweb.org/anthology/K16-1028) | |
| ABS+ (Rush et al., 2015) | 29.76 | 11.88 | 26.96 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) * | |
| ABS (Rush et al., 2015) | 29.55 | 11.32 | 26.42 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) * | |

(*) [Rush et al., 2015](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) report ROUGE recall; the table here contains the ROUGE F1-scores for their models as reported by [Chopra et al., 2016](http://www.aclweb.org/anthology/N16-1012).

### X-Sum

X-Sum (standing for _Extreme Summarization_), introduced by [Narayan et al., 2018](https://arxiv.org/pdf/1808.08745.pdf), is a summarization dataset which does not favor extractive strategies and calls for an abstractive modeling approach.  
The task is to create a short, one-sentence news summary.  
Data is collected by harvesting online articles from the BBC.  
The dataset contains **204 045** samples in the training set, **11 332** in the validation set, and **11 334** in the test set. On average, an article is 431 words long (~20 sentences) and a summary is 23 words long. It can be downloaded [here](https://github.com/EdinburghNLP/XSum).  
Evaluation metrics are ROUGE-1, ROUGE-2 and ROUGE-L.

| Model           | ROUGE-1 | ROUGE-2 | ROUGE-L | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | -------------- | ---- |
| BRIO (Liu et al., 2022) | 49.07 | 25.59 | 40.40 | [BRIO: Bringing Order to Abstractive Summarization](https://arxiv.org/pdf/2203.16804.pdf) | [Official](https://github.com/yixinL7/BRIO) |
| PEGASUS (Zhang et al., 2019) | 47.21 | 24.56 | 39.25 | [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf) | [Official](https://github.com/google-research/pegasus) |
| BART (Lewis et al., 2019) | 45.14 | 22.27 | 37.25 | [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) | [Official](https://github.com/pytorch/fairseq/tree/master/examples/bart) |
| BertSumExtAbs (Liu and Lapata, 2019) | 38.81 | 16.50 | 31.27 | [Text Summarization with Pretrained Encoders](https://arxiv.org/pdf/1908.08345.pdf) | [Official](https://github.com/nlpyang/PreSumm) |
| T-ConvS2S | 31.89 | 11.54 | 25.75 | [Don’t Give Me the Details, Just the Summary!](https://arxiv.org/pdf/1808.08745.pdf) | [Official](https://github.com/EdinburghNLP/XSum) |
| PtGen | 29.70 | 9.21 | 23.24 | [Don’t Give Me the Details, Just the Summary!](https://arxiv.org/pdf/1808.08745.pdf) | [Official](https://github.com/EdinburghNLP/XSum) |
| Seq2Seq | 28.42 | 8.77 | 22.48 | [Don’t Give Me the Details, Just the Summary!](https://arxiv.org/pdf/1808.08745.pdf) | [Official](https://github.com/EdinburghNLP/XSum) |
| PtGen-Covg | 28.10 | 8.02 | 21.72 | [Don’t Give Me the Details, Just the Summary!](https://arxiv.org/pdf/1808.08745.pdf) | [Official](https://github.com/EdinburghNLP/XSum) |
| Baseline : Extractive Oracle | 29.79 | 8.81 | 22.66 | [Don’t Give Me the Details, Just the Summary!](https://arxiv.org/pdf/1808.08745.pdf) | [Official](https://github.com/EdinburghNLP/XSum) |
| Baseline : Lead-3 | 16.30 | 1.60 | 11.95 | [Don’t Give Me the Details, Just the Summary!](https://arxiv.org/pdf/1808.08745.pdf) | [Official](https://github.com/EdinburghNLP/XSum) |
| Baseline : Random | 15.16 | 1.78 | 11.27 | [Don’t Give Me the Details, Just the Summary!](https://arxiv.org/pdf/1808.08745.pdf) | [Official](https://github.com/EdinburghNLP/XSum) |

### DUC 2004 Task 1

Similar to Gigaword, task 1 of [DUC 2004](https://duc.nist.gov/duc2004/) is a sentence summarization task. The dataset contains 500 documents averaging 35.6 tokens, with summaries averaging 10.4 tokens. Because the dataset is small, neural models are typically trained on other datasets and only tested on DUC 2004. Evaluation metrics are ROUGE-1, ROUGE-2 and ROUGE-L recall @ 75 bytes.
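
The "recall @ 75 bytes" setting can be sketched as follows: the system output is cut to its first 75 bytes before recall is computed, so that longer outputs cannot gain recall simply by including more words. This is an illustrative approximation, not the official scorer. It assumes UTF-8 text, lowercased whitespace tokenization, and a single reference, and the function names are hypothetical.

```python
from collections import Counter


def truncate_bytes(text, limit=75):
    """Keep at most `limit` UTF-8 bytes, dropping any partial trailing character."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def rouge_1_recall(candidate, reference, limit=75):
    """Unigram recall of the byte-truncated candidate against one reference."""
    cand = Counter(truncate_bytes(candidate, limit).lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped unigram matches
    return overlap / sum(ref.values())


print(rouge_1_recall("a b c", "a b d e"))
```

Note that a byte cut-off can split the last word of the output, in which case that word no longer matches the reference.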

| Model           | ROUGE-1 | ROUGE-2 | ROUGE-L | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | -------------- | ---- |
|Transformer + LRPE + PE + ALONE + RE-ranking (Takase and Kobayashi, 2020)| 32.57 | 11.63 | 28.24 | [All Word Embeddings from One Embedding](https://arxiv.org/abs/2004.12073) | [Official](https://github.com/takase/alone_seq2seq) |
| Transformer + LRPE + PE + Re-ranking (Takase and Okazaki, 2019) | 32.29 | 11.49 | 28.03 | [Positional Encoding to Control Output Sequence Length](https://arxiv.org/abs/1904.07418) | [Official](https://github.com/takase/control-length) |
| DRGD (Li et al., 2017) | 31.79 | 10.75 | 27.48 | [Deep Recurrent Generative Decoder for Abstractive Text Summarization](http://aclweb.org/anthology/D17-1222) | |
| EndDec+WFE (Suzuki and Nagata, 2017) | 32.28 | 10.54 | 27.8 | [Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization](http://aclweb.org/anthology/E17-2047) | |
| Reinforced-Topic-ConvS2S (Wang et al., 2018) | 31.15 | 10.85 | 27.68 | [A Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model for Abstractive Text Summarization](https://www.ijcai.org/proceedings/2018/0619.pdf) | |
| CNN-2sent-hieco-RBM (Zhang et al., 2019) | 29.74 | 9.85 | 25.81 | [Abstract Text Summarization with a Convolutional Seq2Seq Model](https://www.mdpi.com/2076-3417/9/8/1665/pdf) | |
| Seq2seq + selective + MTL + ERAM (Li et al., 2018) | 29.33 | 10.24 | 25.24 | [Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization](http://aclweb.org/anthology/C18-1121) | |
| SEASS (Zhou et al., 2017) | 29.21 | 9.56 | 25.51 | [Selective Encoding for Abstractive Sentence Summarization](http://aclweb.org/anthology/P17-1101) | |
| words-lvt5k-1sent (Nallapati et al., 2016) | 28.61 | 9.42 | 25.24 | [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](http://www.aclweb.org/anthology/K16-1028) | |
| ABS+ (Rush et al., 2015) | 28.18 | 8.49 | 23.81 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) | |
| RAS-Elman (Chopra et al., 2016) | 28.97 | 8.26 | 24.06 | [Abstractive Sentence Summarization with Attentive Recurrent Neural Networks](http://www.aclweb.org/anthology/N16-1012) | |
| ABS (Rush et al., 2015) | 26.55 | 7.06 | 22.05 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) | |

### Webis-TLDR-17 Corpus

This [dataset](https://zenodo.org/record/1168855) contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain. For more details, refer to [TL;DR: Mining Reddit to Learn Automatic Summarization](https://aclweb.org/anthology/W17-4508).

| Model              | ROUGE-1 | ROUGE-2 | ROUGE-L | Paper/Source                                                                                                                     | Code |
|--------------------|---------|---------|---------|----------------------------------------------------------------------------------------------------------------------------------|------|
| Transformer + Copy (Gehrmann et al., 2019) | 22      | 6       | 17      | [Generating Summaries with Finetuned Language Models](https://www.aclweb.org/anthology/W19-8665/)                                |      |
| Unified VAE + PGN (Choi et al., 2019) | 19      | 4       | 15      | [VAE-PGN based Abstractive Model in Multi-stage Architecture for Text Summarization](https://www.aclweb.org/anthology/W19-8664/) |      |

### Webis-Snippet-20 Corpus

This [dataset](https://webis.de/data/webis-snippet-20.html) contains approximately 10 million (webpage content, abstractive snippet) pairs and 3.5 million (query term, webpage content, abstractive snippet) triples for the novel task of (query-biased) abstractive snippet generation of web pages. The corpus is compiled from ClueWeb09, ClueWeb12 and the DMOZ Open Directory Project. For more details, refer to [Abstractive Snippet Generation](https://arxiv.org/abs/2002.10782).

| Model                                             | ROUGE-1 | ROUGE-2 | ROUGE-L | Usefulness | Paper/Source                                                       | Code |
|---------------------------------------------------|---------|---------|---------|------------|-------------------------------------------------------|------|
| Anchor-context + Query biased (Chen et al., 2020) | 25.7    | 5.2     | 20.1    | 66.18 | [Abstractive Snippet Generation](https://arxiv.org/abs/2002.10782) |      |

## Sentence Compression

Sentence compression produces a shorter sentence by removing redundant information
while preserving the grammaticality and the important content of the original sentence.

### Google Dataset

The [Google Dataset](https://github.com/google-research-datasets/sentence-compression) was built by Filippova et al., 2013 ([Overcoming the Lack of Parallel Data in Sentence Compression](https://www.aclweb.org/anthology/D/D13/D13-1155.pdf)). The initial release contained only 10,000 sentence-compression pairs; an additional 200,000 pairs were released later.

Example of a sentence-compression pair:
> Sentence: Floyd Mayweather is open to fighting Amir Khan in the future, despite snubbing the Bolton-born boxer in favour of a May bout with Argentine Marcos Maidana, according to promoters Golden Boy

> Compression: Floyd Mayweather is open to fighting Amir Khan in the future. 

In short, this is a deletion-based task where the compression is a subsequence of the original sentence. Of the 10,000 pairs in the eval portion ([repository](https://github.com/google-research-datasets/sentence-compression/tree/master/data)), the first 1,000 sentences are used for automatic evaluation, and the 200,000 pairs are used for training.
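
The subsequence property can be checked with a few lines of code. The sketch below assumes both texts are already tokenized (here, a shortened and manually tokenized version of the example above); the function name is hypothetical and is not part of the dataset's tooling.

```python
def is_subsequence(compression, sentence):
    """True if the compression's tokens appear in the sentence in the same order."""
    remaining = iter(sentence)
    # `token in remaining` advances the iterator, so order is enforced.
    return all(token in remaining for token in compression)


sentence = ["Floyd", "Mayweather", "is", "open", "to", "fighting", "Amir", "Khan",
            "in", "the", "future", ",", "despite", "snubbing", "the", "Bolton-born", "boxer"]
compression = ["Floyd", "Mayweather", "is", "open", "to", "fighting", "Amir", "Khan",
               "in", "the", "future"]

print(is_subsequence(compression, sentence))
print(is_subsequence(["Khan", "Floyd"], sentence))
```

A deletion-based system can therefore be framed as predicting, for each source token, whether it is kept or deleted.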

Models are evaluated using the following metrics:
* F1 - computed from the precision and recall of the tokens kept in the gold and the generated compressions.
* Compression rate (CR) - the length of the compression in characters divided by the length of the sentence.
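
Both metrics can be sketched as follows. This is one plausible reading of the definitions above rather than the evaluation script used by the listed papers: it assumes each compression is represented by the indices of the source tokens it keeps, and the function names and toy example are hypothetical.

```python
def kept_token_f1(gold_kept, pred_kept):
    """F1 over the source-token indices kept by the gold and predicted compressions."""
    gold, pred = set(gold_kept), set(pred_kept)
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def compression_rate(compression, sentence):
    """Character length of the compression divided by that of the sentence."""
    return len(compression) / len(sentence)


# Toy source sentence: "the old cat sat quietly" (token indices 0..4).
gold_kept = [0, 2, 3]      # "the cat sat"
pred_kept = [0, 1, 2, 3]   # "the old cat sat"

print(kept_token_f1(gold_kept, pred_kept))
print(compression_rate("the old cat sat", "the old cat sat quietly"))
```

The two numbers should be read together: a system can raise its F1 by deleting less, which shows up as a higher compression rate.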

| Model           | F1 | CR |Paper / Source | Code |
| -------------   | :-----:| --- | --- | --- |
| SLAHAN with syntactic information (Kamigaito et al. 2020) | 0.855 | 0.407 | [Syntactically Look-Ahead Attention Network for Sentence Compression](https://arxiv.org/abs/2002.01145) | https://github.com/kamigaito/SLAHAN |
| BiRNN + LM Evaluator (Zhao et al. 2018) | 0.851 | 0.39 | [A Language Model based Evaluator for Sentence Compression](https://aclweb.org/anthology/P18-2028) | https://github.com/code4conference/code4sc |
| LSTM (Filippova et al., 2015) | 0.82 | 0.38 | [Sentence Compression by Deletion with LSTMs](https://research.google.com/pubs/archive/43852.pdf) | |
| BiLSTM (Wang et al., 2017) | 0.8 | 0.43 | [Can Syntax Help? Improving an LSTM-based Sentence Compression Model for New Domains](http://www.aclweb.org/anthology/P17-1127) |  |

[Go back to the README](../README.md)


================================================
FILE: english/taxonomy_learning.md
================================================
# Taxonomy Learning

Taxonomy learning is the task of hierarchically classifying concepts in an automatic manner from text corpora. The process of building taxonomies is usually divided into two main steps: (1) extracting hypernyms for concepts, which may constitute a field of research in itself (see Hypernym Discovery below) and (2) refining the structure into a taxonomy.

## Hypernym Discovery

Given a corpus and a target term (hyponym), the task of hypernym discovery consists of extracting a set of its most appropriate hypernyms from the corpus. For example, for the input word “dog”, some valid hypernyms would be “canine”, “mammal” or “animal”.

### SemEval 2018

The SemEval-2018 hypernym discovery evaluation benchmark ([Camacho-Collados et al. 2018](http://aclweb.org/anthology/S18-1115)), which can be freely downloaded [here](https://competitions.codalab.org/competitions/17119), contains three domains (general, medical and music) and is also available in Italian and Spanish (not in this repository). For each domain a target corpus and vocabulary (i.e. hypernym search space) are provided. The dataset contains both concepts (e.g. dog) and entities (e.g. Manchester United) up to trigrams. The following table lists the number of hyponym-hypernym pairs for each dataset: 

| Partition           | General | Medical | Music |
| ------------- | :-----:|:-----:|:-----:|
|Trial | 200 | 101 | 355 |
|Training |  11779 | 3256 | 5455 |
|Test | 7048 | 4116 | 5233 |

The results for each model and dataset (general, medical and music) are presented below (MFH stands for “Most Frequent Hypernyms” and is used as a baseline).
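
The three ranking metrics reported in the tables can be illustrated with a minimal sketch. It assumes one ranked list of predicted hypernyms per query term and a set of gold hypernyms; the function names and the toy data are hypothetical, and the official task scorer may apply different cut-offs or normalisation.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], gold: Set[str], k: int = 5) -> float:
    """Fraction of the top-k predictions that are gold hypernyms."""
    return sum(1 for h in ranked[:k] if h in gold) / k


def reciprocal_rank(ranked: List[str], gold: Set[str]) -> float:
    """1 / rank of the first correct prediction, or 0.0 if none is correct."""
    for rank, h in enumerate(ranked, start=1):
        if h in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Mean of the precision values at each rank where a gold hypernym occurs."""
    hits, total = 0, 0.0
    for rank, h in enumerate(ranked, start=1):
        if h in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def evaluate(predictions: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average each metric over all query terms (MAP, MRR, P@5)."""
    n = len(gold)
    return {
        "MAP": sum(average_precision(predictions.get(t, []), g) for t, g in gold.items()) / n,
        "MRR": sum(reciprocal_rank(predictions.get(t, []), g) for t, g in gold.items()) / n,
        "P@5": sum(precision_at_k(predictions.get(t, []), g) for t, g in gold.items()) / n,
    }


# Toy example for the query term "dog" (not taken from the benchmark data).
predictions = {"dog": ["canine", "pet", "mammal", "animal", "vehicle"]}
gold = {"dog": {"canine", "mammal", "animal"}}
print(evaluate(predictions, gold))
```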

**General:**

| Model           | MAP | MRR | P@5 |  Paper / Source |
| ------------- | :-----:|:-----:|:-----:| --- |
|CRIM (Bernier-Colborne and Barrière, 2018) | 19.78 | 36.10 | 19.03 | [A Hybrid Approach to Hypernym Discovery](http://aclweb.org/anthology/S18-1116) |
|vTE (Espinosa-Anke et al., 2016) | 10.60 | 23.83 | 9.91 |  [Supervised Distributional Hypernym Discovery via Domain Adaptation](https://aclweb.org/anthology/D16-1041) |
|NLP_HZ (Qiu et al., 2018) | 9.37 | 17.29 | 9.19 |  [A Nearest Neighbor Approach](http://aclweb.org/anthology/S18-1148) |
|300-sparsans (Berend et al., 2018) | 8.95 | 19.44 | 8.63 |  [Hypernymy as interaction of sparse attributes ](http://aclweb.org/anthology/S18-1152) |
|MFH | 8.77 | 21.39 | 7.81 |  -- |
|SJTU BCMI (Zhang et al., 2018) | 5.77 | 10.56 | 5.96 |  [Neural Hypernym Discovery with Term Embeddings](http://aclweb.org/anthology/S18-1147) |
|Apollo (Onofrei et al., 2018) | 2.68 | 6.01 | 2.69 |  [Detecting Hypernymy Relations Using Syntactic Dependencies ](http://aclweb.org/anthology/S18-1146) |
|balAPInc (Shwartz et al., 2017) | 1.36 | 3.18 | 1.30 |  [Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection](http://www.aclweb.org/anthology/E17-1007) |


**Medical domain:**

| Model           | MAP | MRR | P@5 |  Paper / Source |
| ------------- | :-----:|:-----:|:-----:| --- |
|CRIM (Bernier-Colborne and Barrière, 2018) | 34.05 | 54.64 | 36.77 | [A Hybrid Approach to Hypernym Discovery](http://aclweb.org/anthology/S18-1116) |
|MFH | 28.93 | 35.80 | 34.20 | -- |
|300-sparsans (Berend et al., 2018) | 20.75 | 40.60 | 21.43 | [Hypernymy as interaction of sparse attributes ](http://aclweb.org/anthology/S18-1152) |
|vTE (Espinosa-Anke et al., 2016) | 18.84 | 41.07 | 20.71 | [Supervised Distributional Hypernym Discovery via Domain Adaptation](https://aclweb.org/anthology/D16-1041) |
|EXPR (Issa Alaa Aldine et al., 2018) | 13.77 | 40.76 | 12.76 | [A Combined Approach for Hypernym Discovery](http://aclweb.org/anthology/S18-1150) |
|SJTU BCMI (Zhang et al., 2018) | 11.69 | 25.95 | 11.69 | [Neural Hypernym Discovery with Term Embeddings](http://aclweb.org/anthology/S18-1147) |
|ADAPT (Maldonado and Klubička, 2018) | 8.13 | 20.56 | 8.32 | [Skip-Gram Word Embeddings for Unsupervised Hypernym Discovery in Specialised Corpora ](http://aclweb.org/anthology/S18-1151) |
|balAPInc (Shwartz et al., 2017) | 0.91 | 2.10 | 1.08 | [Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection](http://www.aclweb.org/anthology/E17-1007) |


**Music domain:**

| Model           | MAP | MRR | P@5 |  Paper / Source |
| ------------- | :-----:|:-----:|:-----:| --- |
|CRIM (Bernier-Colborne and Barrière, 2018) | 40.97 | 60.93 | 41.31 | [A Hybrid Approach to Hypernym Discovery](http://aclweb.org/anthology/S18-1116) |
|MFH | 33.32 | 51.48 | 35.76 | -- |
|300-sparsans (Berend et al., 2018) | 29.54 | 46.43 | 28.86 | [Hypernymy as interaction of sparse attributes ](http://aclweb.org/anthology/S18-1152) |
|vTE (Espinosa-Anke et al., 2016) | 12.99 | 39.36 | 12.41 | [Supervised Distributional Hypernym Discovery via Domain Adaptation](https://aclweb.org/anthology/D16-1041) |
|SJTU BCMI (Zhang et al., 2018) | 4.71 | 9.15 | 4.91 | [Neural Hypernym Discovery with Term Embeddings](http://aclweb.org/anthology/S18-1147) |
|ADAPT (Maldonado and Klubička, 2018) | 2.63 | 7.46 | 2.64 | [Skip-Gram Word Embeddings for Unsupervised Hypernym Discovery in Specialised Corpora ](http://aclweb.org/anthology/S18-1151) |
|balAPInc (Shwartz et al., 2017) | 1.95 | 5.01 | 2.15 | [Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection](http://www.aclweb.org/anthology/E17-1007) |


## Taxonomy Enrichment

Given words that are not included in a taxonomy, the task is to associate each query word with its appropriate hypernyms. For instance, given an input word “milk” we need to provide a list of the most probable hypernyms the word could be attached to, e.g. “dairy product”, “beverage”. A word may have multiple hypernyms.

### Datasets

#### SemEval 2016 Task 14

SemEval-2016 Task 14 aims to enrich the WordNet taxonomy with new words and word senses. A system's task is to identify the WordNet synset with which the new word sense should be merged (i.e., the term is synonymous with those in the synset) or to which it should be added as a hyponym (i.e., the new word sense is a specialization of an existing word sense).

The following table gives examples of word senses that are not defined in WordNet and their corresponding operations, illustrating the type of data that might be seen in the task.

| OOV word | Definition |	Target synset |	Operation |
| :------------- | :-----: | :-----:|:-----:|
|geoscience#n | any of several sciences that deal with the Earth	| earth_science (any of the sciences that deal with the earth or its parts) |	MERGE |
|mudslide#n | a mixed drink consisting of vodka, Kahlua and Bailey's.	| cocktail (a short mixed drink)	| ATTACH |
| euthanize#v | to submit or animal to euthanasia |	destroy, put down (put (an animal) to death)	| MERGE |

The SemEval-2016 taxonomy enrichment evaluation benchmark ([Jurgens and Pilehvar 2016](https://aclanthology.org/S16-1169/)) can be freely downloaded [here](http://alt.qcri.org/semeval2016/task14/data/uploads/semeval-2016-task-14.zip).

Novel concepts were limited to nouns and verbs, as only these parts of speech have fully-developed taxonomies in WordNet. For each item, in addition to the target synset and the operation (MERGE/ATTACH), the glosses were also provided along with the source URL from which the new word sense was obtained. The dataset consists of a total of 1000 items, split into training and test datasets containing 400 and 600 items, respectively. The following table lists the number of items for each dataset:

| Partition           | Noun | Verb |
| ------------- | :-----:|:-----:|
|Trial | 93 | 34 | 
|Training |  349 | 51|
|Test |  516 | 84 | 

The results for each participating system are presented below.

| Model | Lemma Match | Wu&P | Recall  | F1 | Paper / Source |
| ------------- | :-----:|:-----:|:-----:|:-----:|  --- |
|MSejrKU (Schlichtkrull and Alonso, 2016) |0.428 |0.523| 0.973| 0.680|[MSejrKu at SemEval-2016 Task 14: Taxonomy Enrichment by Evidence Ranking](https://aclanthology.org/S16-1209/)|
|TALN (Anke et al., 2016) |0.360| 0.476| 1.000| 0.645|[Semantic Taxonomy Enrichment Via Sense-Based Embeddings](https://aclanthology.org/S16-1208/)|
|VCU (McInnes, 2016) |0.161 |0.432 |0.997| 0.602|[Evaluating definitional-based similarity measure for semantic taxonomy enrichment](https://aclanthology.org/S16-1212/)|
|Duluth (Pedersen, 2016) |0.043| 0.347| 1.000| 0.515|[Extending Gloss Overlaps to Enrich Semantic Taxonomies](https://aclanthology.org/S16-1207/)|
|Deftor (Tanev and Rotondi, 2016) |0.066| 0.347| 0.987| 0.513|[Taxonomy Enrichment using Definition Vectors](https://aclanthology.org/S16-1210)|
|UMNDuluth (Rusert and Pedersen, 2016) |0.098 |0.340| 0.998 |0.507|[WordNet’s Missing Lemmas](https://aclanthology.org/S16-1211/)|
|Baseline: First word, first sense (Jurgens and Pilehvar, 2016) |0.415| 0.514| 1.000| 0.679|[SemEval-2016 Task 14: Semantic Taxonomy Enrichment](https://aclanthology.org/S16-1169/)|
|Baseline: Random synset (Jurgens and Pilehvar, 2016) | 0.000 |0.227| 1.000| 0.370|[SemEval-2016 Task 14: Semantic Taxonomy Enrichment](https://aclanthology.org/S16-1169/)|

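The F1 values in the table above appear consistent with the harmonic mean of the Wu&P and Recall columns. The following minimal sketch checks this reading against the reported rows; the interpretation is an assumption inferred from the numbers, and small deviations are expected because the table values are rounded to three decimals.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores; 0.0 when both are zero."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


# (Wu&P, Recall, reported F1), copied from the results table above.
REPORTED = {
    "MSejrKU": (0.523, 0.973, 0.680),
    "TALN": (0.476, 1.000, 0.645),
    "VCU": (0.432, 0.997, 0.602),
    "Duluth": (0.347, 1.000, 0.515),
    "Deftor": (0.347, 0.987, 0.513),
    "UMNDuluth": (0.340, 0.998, 0.507),
    "Baseline: First word, first sense": (0.514, 1.000, 0.679),
    "Baseline: Random synset": (0.227, 1.000, 0.370),
}


def max_deviation(rows) -> float:
    """Largest absolute gap between the recomputed and the reported F1."""
    return max(abs(harmonic_mean(wup, rec) - f1) for wup, rec, f1 in rows.values())


for name, (wup, rec, f1) in REPORTED.items():
    print(f"{name}: recomputed={harmonic_mean(wup, rec):.3f} reported={f1:.3f}")
```

Under this reading, a system that answers every item (Recall of 1.000) is ranked purely by its Wu&P score.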

####  Diachronic WordNet Datasets

The SemEval-2016 Task 14 setting implies pre-defined glosses. However, it is possible that new words that should be added to the taxonomy may have no definition in any other sources: they could be very rare (“apparatchik”, “falanga”), relatively new (“selfie”, “hashtag”) or come from a narrow domain (“vermiculite”).

[Nikishina et al., 2020](https://aclanthology.org/2020.coling-main.276/) created multiple datasets for studying the diachronic evolution of wordnets, which can be downloaded from [here](https://doi.org/10.5281/zenodo.4279821). They chose two versions of WordNet and selected the words that appear only in the newer version. For each word, they took its hypernyms from the newer WordNet version and considered them gold standard hypernyms. A word was added to the dataset only if its hypernyms appear in both versions. They skipped one or more WordNet versions between the two, as the dataset would otherwise be too small.
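
The selection procedure described above can be sketched on toy data. This is a minimal illustration and not the authors' code: the two taxonomy versions are hypothetical dictionaries mapping a word to its direct hypernyms, and requiring all hypernyms of a word to be present in both versions is an assumption of this sketch.

```python
from typing import Dict, Set

Taxonomy = Dict[str, Set[str]]  # word -> set of direct hypernyms


def build_diachronic_dataset(older: Taxonomy, newer: Taxonomy) -> Taxonomy:
    """Keep words found only in `newer` whose hypernyms exist in both versions."""
    dataset: Taxonomy = {}
    for word, hypernyms in newer.items():
        if word in older:
            continue  # not a new word
        if hypernyms and all(h in older and h in newer for h in hypernyms):
            dataset[word] = set(hypernyms)  # gold hypernyms come from the newer version
    return dataset


# Toy taxonomies (hypothetical, not real WordNet contents).
OLDER: Taxonomy = {"entity": set(), "photograph": {"entity"}, "label": {"entity"}}
NEWER: Taxonomy = {
    **OLDER,
    "selfie": {"photograph"},
    "hashtag": {"label"},
    "weblog": {"entity"},
    "vlog": {"weblog"},  # excluded: its hypernym is itself new
}

print(build_diachronic_dataset(OLDER, NEWER))
```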

| Dataset | Noun | Verb |
| ------------- | :-----:|:-----:|
|WordNet 1.6 - WordNet 3.0 | 17 043 | 755 | 
|WordNet 1.7 - WordNet 3.0 |  6 161 | 362|
|WordNet 2.0 - WordNet 3.0 | 2 620 | 193 | 

The results for each system on each dataset are presented below.

##### WordNet 1.6 - WordNet 3.0

| Model           | MAP (Nouns) | MAP (Verbs)|  Paper / Source |
| ------------- | :-----:| :-----:| --- |
|DWRank-Meta (Meta-embeddings based on Word and Graph Embeddings) | 0.367 | 0.288 | [Taxonomy enrichment with text and graph vector representations](https://content.iospress.com/articles/semantic-web/sw212955)|
|AAEME triplet loss (Tikhomirov and Loukachevitch, 2021) | 0.345 | 0.289 | [Meta-Embeddings in Taxonomy Enrichment Task](http://www.dialog-21.ru/media/5287/tikhomirovmmplusloukachevitchnv091.pdf)|
|Ranking + Wiki (Nikishina et al., 2020) | 0.337 | 0.270 | [Studying Taxonomy Enrichment on Diachronic WordNet Versions](https://aclanthology.org/2020.coling-main.276/) |
|Ranking + Wiki + node2vec + Poincare (Nikishina et al., 2021) | 0.311 | 0.251 | [Exploring Graph-based Representations for Taxonomy Enrichment](https://aclanthology.org/2021.gwc-1.15/)|


##### WordNet 1.7 - WordNet 3.0

| Model           | MAP (Nouns) | MAP (Verbs)|  Paper / Source |
| ------------- | :-----:| :-----:| --- |
|DWRank-Meta (Meta-embeddings based on Word and Graph Embeddings) | 0.418 | 0.227 | [Taxonomy enrichment with text and graph vector representations](https://content.iospress.com/articles/semantic-web/sw212955)|
|AAEME triplet loss (Tikhomirov and Loukachevitch, 2021) |  0.394 | 0.239 | [Meta-Embeddings in Taxonomy Enrichment Task](http://www.dialog-21.ru/media/5287/tikhomirovmmplusloukachevitchnv091.pdf)|
|Ranking + Wiki (Nikishina et al., 2020) | 0.380 | 0.200 | [Studying Taxonomy Enrichment on Diachronic WordNet Versions](https://aclanthology.org/2020.coling-main.276/) |
|Ranking + Wiki + node2vec + Poincare (Nikishina et al., 2021) | 0.350 |  0.177 | [Exploring Graph-based Representations for Taxonomy Enrichment](https://aclanthology.org/2021.gwc-1.15/)|


##### WordNet 2.0 - WordNet 3.0

| Model           | MAP (Nouns) | MAP (Verbs)|  Paper / Source |
| ------------- | :-----:| :-----:| --- |
|DWRank-Meta (Meta-embeddings based on Word and Graph Embeddings) | 0.480 | 0.280 | [Taxonomy enrichment with text and graph vector representations](https://content.iospress.com/articles/semantic-web/sw212955)|
|AAEME triplet loss (Tikhomirov and Loukachevitch, 2021) | 0.445 | 0.272 | [Meta-Embeddings in Taxonomy Enrichment Task](http://www.dialog-21.ru/media/5287/tikhomirovmmplusloukachevitchnv091.pdf)|
|Ranking + Wiki (Nikishina et al., 2020) | 0.400 | 0.238 | [Studying Taxonomy Enrichment on Diachronic WordNet Versions](https://aclanthology.org/2020.coling-main.276/) |
|Ranking + Wiki + node2vec + Poincare (Nikishina et al., 2021) | 0.300 | 0.248 | [Exploring Graph-based Representations for Taxonomy Enrichment](https://aclanthology.org/2021.gwc-1.15/)|


================================================
FILE: english/temporal_processing.md
================================================
# Temporal Processing

## Document Dating (Time-stamping)

Document dating is the problem of automatically predicting the date of a document based on its content. The date of a document, also referred to as the Document Creation Time (DCT), is at the core of many important tasks, such as information retrieval, temporal reasoning, text summarization, event detection, and analysis of historical text.

For example, in the following document, the correct creation year is 1999. This can be inferred by the presence of terms *1995* and *Four years after*.

*Swiss adopted that form of taxation in 1995. The concession was approved by the govt last September. Four years after, the IOC….*

### Datasets 

|                 Datasets                 | # Docs | Start Year | End Year |
| :--------------------------------------: | :----: | :--------: | :------: |
| [APW](https://drive.google.com/file/d/1tll04ZBooB3Mohm6It-v8MBcjMCC3Y1w/view) |  675k  |    1995    |   2010   |
| [NYT](https://drive.google.com/file/d/1wqQRFeA1ESAOJqrwUNakfa77n_S9cmBi/view?usp=sharing) |  647k  |    1987    |   1996   |

### Comparison on year level granularity:

|                                        | APW Dataset | NYT Dataset | Paper/Source                             |
| -------------------------------------- | :---------: | :---------: | ---------------------------------------- |
| NeuralDater (Vashishth et al., 2018)   |    64.1     |    58.9     | [Document Dating using Graph Convolution Networks](https://github.com/malllabiisc/NeuralDater) |
| Chambers (2012)                        |    52.5     |    42.3     | [Labeling Documents with Timestamps: Learning from their Time Expressions](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) |
| BurstySimDater (Kotsakos et al., 2014) |    45.9     |    38.5     | [A Burstiness-aware Approach for Document Dating](https://www.idi.ntnu.no/~noervaag/papers/SIGIR2014short.pdf) |


## Temporal Information Extraction

Temporal information extraction is the identification of chunks/tokens corresponding to temporal intervals, and the extraction and determination of the temporal relations between those. The entities extracted may be temporal expressions (timexes), eventualities (events), or auxiliary signals that support the interpretation of an entity or relation. Relations may be temporal links (tlinks), describing the order of events and times, or subordinate links (slinks) describing modality and other subordinative activity, or aspectual links (alinks) around the various influences aspectuality has on event structure.

The markup scheme used for temporal information extraction is well-described in the ISO-TimeML standard, and also on [www.timeml.org](http://www.timeml.org).

```
<?xml version="1.0" ?>

<TimeML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://timeml.org/timeMLdocs/TimeML_1.2.1.xsd">
<TEXT>


 PRI20001020.2000.0127 
 NEWS STORY 
 <TIMEX3 tid="t0" type="TIME" value="2000-10-20T20:02:07.85">10/20/2000 20:02:07.85</TIMEX3> 


 The Navy has changed its account of the attack on the USS Cole in Yemen.
 Officials <TIMEX3 tid="t1" type="DATE" value="PRESENT_REF" temporalFunction="true" anchorTimeID="t0">now</TIMEX3> say the ship was hit <TIMEX3 tid="t2" type="DURATION" value="PT2H">nearly two hours </TIMEX3>after it had docked.
 Initially the Navy said the explosion occurred while several boats were helping
 the ship to tie up. The change raises new questions about how the attackers
 were able to get past the Navy security.


 <TIMEX3 tid="t3" type="TIME" value="2000-10-20T20:02:28.05">10/20/2000 20:02:28.05</TIMEX3> 


<TLINK timeID="t2" relatedToTime="t0" relType="BEFORE"/>
</TEXT>
</TimeML>
```
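
Markup of this kind can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the example above (the namespace attributes on the root element are dropped for brevity) and collects the TIMEX3 expressions and the TLINK relation; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the TimeML example above.
SAMPLE = """<TimeML>
<TEXT>
<TIMEX3 tid="t0" type="TIME" value="2000-10-20T20:02:07.85">10/20/2000 20:02:07.85</TIMEX3>
Officials <TIMEX3 tid="t1" type="DATE" value="PRESENT_REF" temporalFunction="true" anchorTimeID="t0">now</TIMEX3> say the ship was hit <TIMEX3 tid="t2" type="DURATION" value="PT2H">nearly two hours </TIMEX3>after it had docked.
<TLINK timeID="t2" relatedToTime="t0" relType="BEFORE"/>
</TEXT>
</TimeML>"""

root = ET.fromstring(SAMPLE)

# Temporal expressions: (id, type, normalised value, surface text).
timexes = [
    (t.get("tid"), t.get("type"), t.get("value"), (t.text or "").strip())
    for t in root.iter("TIMEX3")
]

# Temporal links: (source time, related time, relation type).
tlinks = [
    (link.get("timeID"), link.get("relatedToTime"), link.get("relType"))
    for link in root.iter("TLINK")
]

for timex in timexes:
    print(timex)
print(tlinks)
```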

To avoid leaking knowledge about temporal structure, train, dev and test splits must be made at document level for temporal information extraction.
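
A document-level split can be sketched as follows: the document identifiers are partitioned first, and every annotated instance then follows its document. The function name, the split fractions and the toy instances are assumptions made for illustration.

```python
import random
from typing import Iterable, Set, Tuple


def split_documents(
    doc_ids: Iterable[str], dev_frac: float = 0.1, test_frac: float = 0.1, seed: int = 0
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Partition document ids into disjoint train/dev/test sets."""
    unique = sorted(set(doc_ids))
    random.Random(seed).shuffle(unique)  # seeded, so the split is reproducible
    n_test = max(1, round(len(unique) * test_frac))
    n_dev = max(1, round(len(unique) * dev_frac))
    test = set(unique[:n_test])
    dev = set(unique[n_test:n_test + n_dev])
    train = set(unique[n_test + n_dev:])
    return train, dev, test


# Toy instances: (document id, relation annotation). All relations from one
# document end up in the same partition.
instances = [(f"doc{i % 10}", f"relation-{i}") for i in range(40)]
train_docs, dev_docs, test_docs = split_documents(doc for doc, _ in instances)
train = [x for x in instances if x[0] in train_docs]
dev = [x for x in instances if x[0] in dev_docs]
test = [x for x in instances if x[0] in test_docs]
print(len(train), len(dev), len(test))
```

Splitting at the instance level instead would let relations from the same document appear in both training and test data, which is the leakage the paragraph above warns about.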

### TimeBank

TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, annotated for all aspects of ISO-TimeML, including temporal expressions. TimeBank is freely distributed by the LDC: [TimeBank 1.2](https://catalog.ldc.upenn.edu/LDC2006T08)

Evaluation is for both entity chunking and attribute annotation, as well as temporal relation accuracy, typically measured with F1 -- although this metric is not sensitive to inconsistencies or free wins from interval logic induction over the whole set.

| Model           | F1 score  |  Paper / Source |
| ------------- | :-----:| --- |
| Catena | 0.511 |  [CATENA: CAusal and TEmporal relation extraction from NAtural language texts](http://www.aclweb.org/anthology/C16-1007) |
| CAEVO | 0.507 | [Dense Event Ordering with a Multi-Pass Architecture](https://www.transacl.org/ojs/index.php/tacl/article/download/255/50) | 

### TempEval-3

The TempEval-3 corpus accompanied the shared [TempEval-3](http://www.aclweb.org/anthology/S13-2001) SemEval task in 2013. The task uses a timelines-based metric to assess temporal relation structure. The corpus is newer and somewhat more varied than TimeBank, though markedly smaller. The data is available from the [TempEval-3 data](https://www.cs.york.ac.uk/semeval-2013/task1/index.php%3Fid=data.html) page.

| Model           | Temporal awareness  |  Paper / Source |
| ------------- | :-----:| --- |
| Ning et al. | 67.2 | [A Structured Learning Approach to Temporal Relation Extraction](http://www.aclweb.org/anthology/D17-1108) | 
| ClearTK | 30.98 | [Cleartk-timeml: A minimalist approach to tempeval 2013](http://www.aclweb.org/anthology/S13-2002) |

## Timex normalisation

Temporal expression normalisation is the grounding of a lexicalisation of a time to a calendar date or other formal temporal representation.

Example:
<TIMEX3 tid="t0" type="TIME" value="2000-10-18T21:01:00.65">10/18/2000 21:01:00.65</TIMEX3>
Dozens of Palestinians were wounded in
scattered clashes in the West Bank and Gaza Strip, <TIMEX3 tid="t1" type="DATE" value="2000-10-18" temporalFunction="true" anchorTimeID="t0">Wednesday</TIMEX3>,
despite the Sharm el-Sheikh truce accord. 

Chuck Rich reports on entertainment <TIMEX3 tid="t11" type="SET" value="XXXX-WXX-7">every Saturday</TIMEX3>
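
As a toy illustration of the first example, the sketch below grounds a weekday name against the anchor time (the `t0` timestamp). It assumes the expression refers to the most recent such weekday on or before the anchor date, which is a simplifying heuristic chosen for this sketch; real normalisers handle far more expression types and use context to pick the direction.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_weekday(expression: str, anchor: date) -> str:
    """Return the ISO date of the latest `expression` weekday on or before `anchor`."""
    target = WEEKDAYS.index(expression.strip().lower())  # Monday is 0, as in date.weekday()
    days_back = (anchor.weekday() - target) % 7
    return (anchor - timedelta(days=days_back)).isoformat()


anchor = date(2000, 10, 18)  # date part of the t0 anchor in the example
print(resolve_weekday("Wednesday", anchor))
```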

### TimeBank

TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, annotated for all aspects of ISO-TimeML, including temporal expressions. TimeBank is freely distributed by the LDC: [TimeBank 1.2](https://catalog.ldc.upenn.edu/LDC2006T08)

| Model           | F1 score  |  Paper / Source |
| ------------- | :-----:| --- |
| TIMEN | 0.89 | [TIMEN: An Open Temporal Expression Normalisation Resource](http://aclweb.org/anthology/L12-1015) | 
| HeidelTime | 0.876 | [A baseline temporal tagger for all languages](http://aclweb.org/anthology/D15-1063) |

### PNT

The [Parsing Time Normalizations corpus](https://github.com/bethard/anafora-annotations/releases) in [SCATE](http://www.lrec-conf.org/proceedings/lrec2016/pdf/288_Paper.pdf) format allows the representation of a wider variety of time expressions than previous approaches. This corpus was released with [SemEval 2018 Task 6](http://aclweb.org/anthology/S18-1011).

| Model           | F1 score  |  Paper / Source |
| ------------- | :-----:| --- |
| Laparra et al. 2018 | 0.764 | [From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations](http://aclweb.org/anthology/Q18-1025) |
| HeidelTime | 0.74 | [A baseline temporal tagger for all languages](http://aclweb.org/anthology/D15-1063) |
| Chrono | 0.70 | [Chrono at SemEval-2018 task 6: A system for normalizing temporal expressions](http://aclweb.org/anthology/S18-1012) | 


[Go back to the README](../README.md)


================================================
FILE: english/text_classification.md
================================================
# Text classification

Text classification is the task of assigning a sentence or document an appropriate category.
The categories depend on the chosen dataset and can range from topics to question types.

### AG News

The [AG News corpus](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf)
consists of news articles from the [AG's corpus of news articles on the web](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)
pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
Models are evaluated based on error rate (lower is better).

| Model           | Error  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| XLNet (Yang et al., 2019) | 4.49 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| ULMFiT (Howard and Ruder, 2018) | 5.01 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) | [Official](http://nlp.fast.ai/ulmfit ) |
| CNN (Johnson and Zhang, 2016) * | 6.57 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) | [Official](https://github.com/riejohnson/ConText ) |
| DPCNN (Johnson and Zhang, 2017) | 6.87 | [Deep Pyramid Convolutional Neural Networks for Text Categorization](http://aclweb.org/anthology/P17-1052) | [Official](https://github.com/riejohnson/ConText ) |
| VDCNN (Conneau et al., 2016) | 8.67 | [Very Deep Convolutional Networks for Text Classification](https://arxiv.org/abs/1606.01781) | [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/vdcnn) |
| Char-level CNN (Zhang et al., 2015) | 9.51 | [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) | [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/cnn) |

\* Results reported in Johnson and Zhang, 2017

### DBpedia

The [DBpedia ontology](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) 
dataset contains 560,000 training samples and 70,000 testing samples in total (40,000 and 5,000 per class) across 14 non-overlapping classes from DBpedia.
Models are evaluated based on error rate (lower is better).

| Model           | Error  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| XLNet (Yang et al., 2019) | 0.62 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) |
| Bidirectional Encoder Representations from Transformers (Devlin et al., 2018) | 0.64 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | [Official](https://github.com/google-research/bert) |
| ULMFiT (Howard and Ruder, 2018) | 0.80 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146)  | [Official](http://nlp.fast.ai/ulmfit ) |
| CNN (Johnson and Zhang, 2016) | 0.84 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) | [Official](https://github.com/riejohnson/ConText ) |
| DPCNN (Johnson and Zhang, 2017) | 0.88 | [Deep Pyramid Convolutional Neural Networks for Text Categorization](http://aclweb.org/anthology/P17-1052) | [Official](https://github.com/riejohnson/ConText ) |
| VDCNN (Conneau et al., 2016) | 1.29 | [Very Deep Convolutional Networks for Text Classification](https://arxiv.org/abs/1606.01781) |  [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/vdcnn) |
| Char-level CNN (Zhang et al., 2015) | 1.55 | [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) | [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/cnn) |

### TREC

The [TREC dataset](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.2766&rep=rep1&type=pdf) is a dataset for
question classification consisting of open-domain, fact-based questions divided into broad semantic categories.
It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples,
but TREC-50 has finer-grained labels. Models are evaluated based on error rate (lower is better).

TREC-6:

| Model           | Error  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| USE_T+CNN (Cer et al., 2018) | 1.93 | [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) | [Official](https://tfhub.dev/google/universal-sentence-encoder/1) |
| ULMFiT (Howard and Ruder, 2018) | 3.6 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) | [Official](http://nlp.fast.ai/ulmfit ) |
| LSTM-CNN (Zhou et al., 2016) | 3.9 | [Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling](http://www.aclweb.org/anthology/C16-1329) | |
| CNN+MCFA (Amplayo et al., 2018) | 4 | [Translations as Additional Contexts for Sentence Classification](https://arxiv.org/pdf/1806.05516.pdf) | |
| TBCNN (Mou et al., 2015) | 4 | [Discriminative Neural Sentence Modeling by Tree-Based Convolution](http://aclweb.org/anthology/D15-1279) | |
| CoVe (McCann et al., 2017) | 4.2 | [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) | |

TREC-50:

| Model           | Error  |  Paper / Source | Code |
| ------------- | :-----:| --- | :-----: |
| Rules (Madabushi and Lee, 2016) | 2.8 |[High Accuracy Rule-based Question Classification using Question Syntax and Semantics](http://www.aclweb.org/anthology/C16-1116)| |
| SVM (Van-Tu and Anh-Cuong, 2016) | 8.4 | [Improving Question Classification by Feature Extraction and Selection](https://www.researchgate.net/publication/303553351_Improving_Question_Classification_by_Feature_Extraction_and_Selection) | |

[Go back to the README](../README.md)


================================================
FILE: english/word_sense_disambiguation.md
================================================
# Word Sense Disambiguation

The task of Word Sense Disambiguation (WSD) consists of associating words in context with their most suitable entry in a pre-defined sense inventory. The de-facto sense inventory for English in WSD is [WordNet](https://wordnet.princeton.edu).
For example, given the word “mouse” and the following sentence:

“A mouse consists of an object held in one's hand, with one or more buttons.” 

we would assign “mouse” its electronic device sense ([the 4th sense in the WordNet sense inventory](http://wordnetweb.princeton.edu/perl/webwn?c=8&sub=Change&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&i=-1&h=000000&s=mouse)).


### Fine-grained WSD:

The [Evaluation framework](http://lcl.uniroma1.it/wsdeval/) of [Raganato et al. 2017](http://aclweb.org/anthology/E/E17/E17-1010.pdf) [1] includes two training sets, SemCor (Miller et al., 1993) and OMSTI (Taghipour and Ng, 2015), and five test sets from the Senseval/SemEval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015), standardized to the same format and sense inventory (i.e. WordNet 3.0).

Typically, there are two kinds of approach for WSD: supervised (which make use of sense-annotated training data) and knowledge-based (which make use of the properties of lexical resources).

Supervised: The most widely used training corpus is SemCor, with 226,036 sense annotations from 352 manually annotated documents. All supervised systems in the evaluation table are trained on SemCor. Some supervised methods, particularly neural architectures, employ the SemEval 2007 dataset as a development set (marked by *). The most common baseline is the Most Frequent Sense (MFS) heuristic, which selects for each target word the most frequent sense in the training data.

Knowledge-based: Knowledge-based systems usually exploit WordNet or [BabelNet](https://babelnet.org/) as a semantic network. The first sense given by the underlying sense inventory (i.e. WordNet 3.0) is included as a baseline.

The main evaluation measure is F1-score.
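
The MFS heuristic mentioned above can be sketched in a few lines. The sense labels and the tiny training set are hypothetical placeholders rather than real WordNet sense keys or SemCor data, and returning `None` for unseen lemmas is a choice made for this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def train_mfs(annotations: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each lemma to the sense it is most often annotated with."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for lemma, sense in annotations:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def predict_mfs(model: Dict[str, str], lemma: str) -> Optional[str]:
    """Return the most frequent sense, or None for lemmas unseen in training."""
    return model.get(lemma)


# Toy sense-annotated data: (lemma, sense label). Labels are made up.
training = [
    ("mouse", "mouse%animal"),
    ("mouse", "mouse%animal"),
    ("mouse", "mouse%device"),
    ("bank", "bank%institution"),
]
mfs = train_mfs(training)
print(predict_mfs(mfs, "mouse"), predict_mfs(mfs, "keyboard"))
```

The baseline ignores the context entirely, so it would label the “mouse” in the example sentence above with the animal sense whenever that sense dominates the training data.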


### Supervised:

| Model           | Senseval 2  |Senseval 3  |SemEval 2007 |SemEval 2013 |SemEval 2015 | Paper / Source |
| ------------- | :-----:|:-----:|:-----:|:-----:|:-----:| --- |
|MFS baseline | 65.6 | 66.0 | 54.5 | 63.8 | 67.1 |  [[1]](http://aclweb.org/anthology/E/E17/E17-1010.pdf) |
|Bi-LSTM<sub>att+LEX</sub> |  72.0 | 69.4 |63.7* | 66.4 | 72.4 | [[2]](http://aclweb.org/anthology/D17-1120) |
|Bi-LSTM<sub>att+LEX+POS</sub> |   72.0 | 69.1|64.8* | 66.9 | 71.5 | [[2]](http://aclweb.org/anthology/D17-1120) |
|context2vec | 71.8 | 69.1 |61.3  | 65.6 | 71.9 | [[3]](http://www.aclweb.org/anthology/K16-1006) | 
|ELMo | 71.6 | 69.6 | 62.2 | 66.2 | 71.3 | [[4]](http://aclweb.org/anthology/N18-1202) |
|GAS (Linear) | 72.0  | 70.0 | --* | 66.7 | 71.6 | [[5]](http://aclweb.org/anthology/P18-1230) |
|GAS (Concatenation) | 72.1 | 70.2 | --* | 67 | 71.8 |  [[5]](http://aclweb.org/anthology/P18-1230)  |
|GAS<sub>ext</sub> (Linear) | 72.4 | 70.1 | --* | 67.1 | 72.1 |[[5]](http://aclweb.org/anthology/P18-1230)  |
|GAS<sub>ext</sub> (Concatenation) | 72.2 | 70.5 | --* | 67.2 | 72.6 | [[5]](http://aclweb.org/anthology/P18-1230)  |
|supWSD | 71.3 | 68.8 | 60.2 | 65.8 | 70.0 | [[6]](https://aclanthology.info/pdf/P/P10/P10-4014.pdf) [[11]](http://aclweb.org/anthology/D17-2018) |
|supWSD<sub>emb</sub> | 72.7 | 70.6 | 63.1 | 66.8 | 71.8 | [[7]](http://www.aclweb.org/anthology/P16-1085) [[11]](http://aclweb.org/anthology/D17-2018) |
|BERT (nearest neighbor) | 73.8 | 71.6 | 63.3 | 69.2 | 74.4 | [[13]](https://www.aclweb.org/anthology/D19-1533.pdf) [[code]](https://github.com/nusnlp/contextemb-wsd) |
|BERT (linear projection) | 75.5 | 73.6 | 68.1 | 71.1 | 76.2 | [[13]](https://www.aclweb.org/anthology/D19-1533.pdf) [[code]](https://github.com/nusnlp/contextemb-wsd) |
|GlossBERT | 77.7 | 75.2 | 72.5 | 76.1 | 80.4 | [[14]](https://arxiv.org/pdf/1908.07245.pdf) |
|SemCor+WNGC, hypernyms | 79.7 | 77.8 | 73.4 | 78.7 | 82.6 | [[15]](https://arxiv.org/abs/1905.05677) |
|BEM    | 79.4 | 77.4 | 74.5 | 79.7 | 81.7 | [[17]](https://www.aclweb.org/anthology/2020.acl-main.95/)[[code]](https://github.com/facebookresearch/wsd-biencoders) |
|EWISER | 78.9 | 78.4 | 71.0 | 78.9 | 79.3 | [[18]](https://www.aclweb.org/anthology/2020.acl-main.255/)[[code]](https://github.com/SapienzaNLP/ewiser) |
|EWISER+WNGC | 80.8 | 79.0 | 75.2 | 80.7 | 81.8 | [[18]](https://www.aclweb.org/anthology/2020.acl-main.255/)[[code]](https://github.com/SapienzaNLP/ewiser) |
|SparseLMMS | 77.9 | 77.8 | 68.8 | 76.1 | 77.5 | [[19]](https://www.aclweb.org/anthology/2020.emnlp-main.683/)[[code]](https://github.com/begab/sparsity_makes_sense) |
|SparseLMMS+WNGC | 79.6 | 77.3 | 73.0 | 79.4 | 81.3 | [[19]](https://www.aclweb.org/anthology/2020.emnlp-main.683/)[[code]](https://github.com/begab/sparsity_makes_sense) |
|ARES | 78.0 | 77.1 | 71.0 | 77.3 | 83.2 | [[20]](https://www.aclweb.org/anthology/2020.emnlp-main.285/)|
|ESCHER | 81.7 | 77.8 | 76.3 | 82.2 | 83.2 | [[21]](https://aclanthology.org/2021.naacl-main.371/)|
|ESR | 81.3 | 79.9 | 77.0 | 81.5 | 84.1 | [[22]](https://aclanthology.org/2021.findings-emnlp.365/)[[code]](https://github.com/nusnlp/esr) |
|ESR+WNGC | 82.5 | 80.2 | 78.5 | 82.3 | 85.3 | [[22]](https://aclanthology.org/2021.findings-emnlp.365/)[[code]](https://github.com/nusnlp/esr) |
|ConSeC | 82.3 | 79.9 | 77.4 | 83.2 | 85.2 | [[23]](https://aclanthology.org/2021.emnlp-main.112.pdf)[[code]](https://github.com/SapienzaNLP/consec) |
|ConSeC+WNGC | 82.7 | 81.0 | 78.5 | 85.2 | 87.5 | [[23]](https://aclanthology.org/2021.emnlp-main.112.pdf)[[code]](https://github.com/SapienzaNLP/consec) |

### Knowledge-based:

| Model           | All | Senseval 2 |Senseval 3 |SemEval 2007 |SemEval 2013 |SemEval 2015 |  Paper / Source |
| ------------- | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | --- |
|WN 1st sense baseline | 65.2 | 66.8 | 66.2 | 55.2 | 63.0 | 67.8 | [[1]](http://aclweb.org/anthology/E/E17/E17-1010.pdf) |
|Babelfy | 65.5 | 67.0 | 63.5 | 51.6 | 66.4 | 70.3 | [[8]](http://aclweb.org/anthology/Q14-1019) |
|UKB<sub>ppr_w2w-nf</sub> | 57.5 | 64.2 | 54.8 | 40.0 | 64.5 | 64.5 | [[9]](https://direct.mit.edu/coli/article/40/1/57/1454/Random-Walks-for-Knowledge-Based-Word-Sense) [[12]](http://aclweb.org/anthology/W18-2505) |
|UKB<sub>ppr_w2w</sub> | 67.3 | 68.8 | 66.1 | 53.0 | **68.8** | 70.3 | [[9]](https://direct.mit.edu/coli/article/40/1/57/1454/Random-Walks-for-Knowledge-Based-Word-Sense) [[12]](http://aclweb.org/anthology/W18-2505) |
|WSD-TM | 66.9 | 69.0 | **66.9** | 55.6 | 65.3 | 69.6 | [[10]](https://arxiv.org/pdf/1801.01900.pdf) |
|KEF | **68.0** | **69.6** | 66.1 | **56.9** | 68.4 | **72.3** | [[16]](https://doi.org/10.1016/j.knosys.2019.105030) [[code]](https://github.com/lwmlyy/Knowledge-based-WSD)|

Note: 'All' is the concatenation of all datasets, as described in [10] and [12]. The scores of [6,7] and [9] are not taken from the original papers but from the results of the implementations of [11] and [12], respectively.

[1] [Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison](http://aclweb.org/anthology/E/E17/E17-1010.pdf)

[2] [Neural Sequence Learning Models for Word Sense Disambiguation](http://aclweb.org/anthology/D17-1120)

[3] [context2vec: Learning generic context embedding with bidirectional lstm](http://www.aclweb.org/anthology/K16-1006)

[4] [Deep contextualized word representations](http://aclweb.org/anthology/N18-1202)

[5] [Incorporating Glosses into Neural Word Sense Disambiguation](http://aclweb.org/anthology/P18-1230)

[6] [It makes sense: A wide-coverage word sense disambiguation system for free text](https://aclanthology.info/pdf/P/P10/P10-4014.pdf)

[7] [Embeddings for Word Sense Disambiguation: An Evaluation Study](http://www.aclweb.org/anthology/P16-1085)

[8] [Entity Linking meets Word Sense Disambiguation: A Unified Approach](http://aclweb.org/anthology/Q14-1019)

[9] [Random walks for knowledge-based word sense disambiguation](https://direct.mit.edu/coli/article/40/1/57/1454/Random-Walks-for-Knowledge-Based-Word-Sense)

[10] [Knowledge-based Word Sense Disambiguation using Topic Models](https://arxiv.org/pdf/1801.01900.pdf)

[11] [SupWSD: A Flexible Toolkit for Supervised Word Sense Disambiguation](http://aclweb.org/anthology/D17-2018)

[12] [The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD](http://aclweb.org/anthology/W18-2505)

[13] [Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations](https://www.aclweb.org/anthology/D19-1533.pdf)

[14] [GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge](https://arxiv.org/pdf/1908.07245.pdf)

[15] [Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation](https://arxiv.org/abs/1905.05677)

[16] [Word Sense Disambiguation: A Comprehensive Knowledge Exploitation Framework](https://doi.org/10.1016/j.knosys.2019.105030)

[17] [Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders](https://www.aclweb.org/anthology/2020.acl-main.95/)

[18] [Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information](https://www.aclweb.org/anthology/2020.acl-main.255/)

[19] [Sparsity Makes Sense: Word Sense Disambiguation Using Sparse Contextualized Word Representations](https://www.aclweb.org/anthology/2020.emnlp-main.683/)

[20] [With More Contexts Comes Better Performance: Contextualized Sense Embeddings for All-Round Word Sense Disambiguation](https://www.aclweb.org/anthology/2020.emnlp-main.285/)

[21] [ESC: Redesigning WSD with Extractive Sense Comprehension](https://aclanthology.org/2021.naacl-main.371/)

[22] [Improved Word Sense Disambiguation with Enhanced Sense Representations](https://aclanthology.org/2021.findings-emnlp.365/)

[23] [ConSeC: Word Sense Disambiguation as Continuous Sense Comprehension](https://aclanthology.org/2021.emnlp-main.112.pdf)

## WSD Lexical Sample task:

The task above is called all-words WSD because systems attempt to disambiguate every word in a document. There is also a second task, called the
lexical sample task, in which a fixed set of target words is selected and the system only has to disambiguate the occurrences of those words in a test set.
Iacobacci et al. (2016) report the state-of-the-art results up to 2016 [1]. The main datasets are Senseval 2, Senseval 3, and SemEval 2007. The evaluation metrics are the same as for the all-words task.

### Lexical Sample results:

| Model           | Senseval 2  |Senseval 3  |SemEval 2007 | Paper / Source |
| ------------- | :-----: | :-----: | :-----: | --- |
|IMSE + heuristics | 71.4 | 76.2  | - | [[Preprint]](http://cv.znu.ac.ir/afsharchim/pub/JofIFS2019-2.pdf) [[2]](https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs182868) |
|IMS + Word2vec | 69.9 | 75.2  | 89.4 | [[1]](http://www.aclweb.org/anthology/P16-1085) |
|AutoExtend | 66.5 | 73.6 | − | [[3]](https://arxiv.org/abs/1507.01127) [[4]](https://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00294)|
|Taghipour and Ng | 66.2 | 73.4 | − | [[4]](https://www.aclweb.org/anthology/N15-1035.pdf) |
|IMS | 65.3 | 72.9 | 87.9 | [[6]](https://www.aclweb.org/anthology/P10-4014.pdf) |

## Word Sense Induction

Word sense induction (WSI) is widely known as the "unsupervised version" of WSD. The problem is stated as follows: given a target word (e.g., "cold") and a collection of sentences that use the word (e.g., "I caught a cold", "The weather is cold"), cluster the sentences according to their different senses/meanings. We do not need to know the sense/meaning of each cluster, but all sentences inside a cluster should use the target word in the same sense.

There are two widely used datasets, SemEval 2010 and SemEval 2013, and they use different metrics: V-Measure (V-M) and paired F-Score (F-S) for SemEval 2010, and fuzzy B-Cubed (F-BC) and fuzzy normalized mutual information (F-NMI) for SemEval 2013. For ease of system comparison, the two metrics of each dataset are usually aggregated using a geometric mean (AVG).
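
As a quick illustration of how AVG is formed, the sketch below computes the geometric mean of two scores. The function name `geometric_mean_avg` is an illustrative choice and is not taken from any of the cited papers.

```python
import math


def geometric_mean_avg(score_a: float, score_b: float) -> float:
    """Geometric mean of two metric scores given in percent."""
    return math.sqrt(score_a * score_b)


# F-S = 61.7 and V-M = 9.8, the AutoSense row of the SemEval 2010 table.
print(round(geometric_mean_avg(61.7, 9.8), 2))  # 24.59
```

Because the geometric mean multiplies the two scores, a system that does well on only one metric is penalized more heavily than it would be by an arithmetic mean.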

### SemEval 2010

| Model         | F-S    | V-M   | AVG   | Paper/source | Code |
| ------------- | :-----:|:-----:|:-----:| ------------ | ---- |
| BERT+DP (Amrami and Goldberg, 2019) | 71.3 | 40.4 | 53.6 | [Towards better substitution-based word sense induction](https://arxiv.org/pdf/1905.12598.pdf) | [Code](https://github.com/asafamr/bertwsi) |
| AutoSense (Amplayo et al., 2019) | 61.7 | 9.8 | 24.59 | [AutoSense Model for Word Sense Induction](https://wvvw.aaai.org/ojs/index.php/AAAI/article/view/4580/4458) | [Code](https://github.com/rktamplayo/AutoSense) |
| SE-WSI-fix (Song et al., 2016) | 55.1 | 9.8 | 23.24 | [Sense Embedding Learning for Word Sense Induction](https://aclweb.org/anthology/S16-2009/) |  |
| BNP-HC (Chang et al., 2014) | 23.1 | 21.4 | 22.23 | [Inducing Word Sense with Automatically Learned Hidden Concepts](https://www.aclweb.org/anthology/C14-1035/) |  |
| LDA (Goyal and Hovy, 2014) | 60.7 | 4.4 | 16.34 | [Unsupervised Word Sense Induction using Distributional Statistics](https://www.aclweb.org/anthology/C14-1123/) |  |

### SemEval 2013

| Model         | F-BC    | F-NMI   | AVG   | Paper/source | Code |
| ------------- | :-----:|:-----:|:-----:| ------------ | ---- |
| BERT+DP (Amrami and Goldberg, 2019) | 64.0 | 21.4 | 37.0 | [Towards better substitution-based word sense induction](https://arxiv.org/pdf/1905.12598.pdf) | [Code](https://github.com/asafamr/bertwsi) |
| LSDP (Amrami and Goldberg, 2018) | 57.5 | 11.3 | 25.4 | [Word Sense Induction with Neural biLM and Symmetric Patterns](https://www.aclweb.org/anthology/D18-1523/) | [Code](https://github.com/asafamr/SymPatternWSI) |
| AutoSense (Amplayo et al., 2019) | 61.7 | 7.96 | 22.16 | [AutoSense Model for Word Sense Induction](https://wvvw.aaai.org/ojs/index.php/AAAI/article/view/4580/4458) | [Code](https://github.com/rktamplayo/AutoSense) |
| MCC-S (Komninos and Manandhar, 2016) | 55.6 | 7.62 | 20.58 | [Structured Generative Models of Continuous Features for Word Sense Induction](https://www.aclweb.org/anthology/C16-1337/) |  |
| STM+w2v (Wang et al., 2016) | 55.4 | 7.14 | 19.89 | [A Sense-Topic Model for Word Sense Induction with Unsupervised Data Enrichment](https://www.aclweb.org/anthology/Q15-1005/) |  |
| AI-KU (Baskaya et al., 2013) | 39.0 | 6.5 | 15.92 | [AI-KU: Using Substitute Vectors and Co-Occurrence Modeling For Word Sense Induction and Disambiguation](https://www.aclweb.org/anthology/S13-2050/) |  |


================================================
FILE: french/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [Reading comprehension](#reading-comprehension)
  - [FQuAD](#fquad)
  
## Reading comprehension
  
### FQuAD

The [French Question Answering dataset (FQuAD)](https://arxiv.org/abs/2002.06071) is a 
reading comprehension dataset in the style of SQuAD. It consists of 25k questions on 
Wikipedia articles. The dataset is available [here](https://fquad.illuin.tech/).

Example:

| Document  | Question | Answer |
| ------------- | -----:| -----: |
| Des observations de 2015 par la sonde Dawn ont confirmé qu'elle possède une forme sphérique, à la différence des corps plus petits qui ont une forme irrégulière. [...] |A quand remonte les observations faites par la sonde Dawn ? | 2015 |

| Model           | F1 | EM |  Paper |
| ------------- | :-----:| :-----:| --- |
| Human performance | 92.1 | 78.4 | [FQuAD: French Question Answering Dataset](https://arxiv.org/abs/2002.06071) |
| CamemBERTQA (d'Hoffschmidt et al., 2020)* | 88.0 | 77.9 | [FQuAD: French Question Answering Dataset](https://arxiv.org/abs/2002.06071) |
| CamemBERTQA (d'Hoffschmidt et al., 2020)† | 84.1 | 70.9 | [FQuAD: French Question Answering Dataset](https://arxiv.org/abs/2002.06071) |

*: trained on the FQuAD training set 

†: trained on the SQuAD training set and zero-shot transferred to the FQuAD test set.

================================================
FILE: french/summarization.md
================================================
# Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the
input's meaning.

### Warning: Evaluation Metrics

For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 
2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as MLSUM provide only a single reference.
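
To illustrate the second point, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplification (whitespace tokenization, no stemming), not the official ROUGE implementation, and the function name and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 based on clipped unigram counts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the storm forced the airport to close"
paraphrase = "flights were grounded because of bad weather"

# The paraphrase conveys similar content but shares no words with the reference.
print(unigram_overlap_f1(paraphrase, reference))  # 0.0
print(unigram_overlap_f1(reference, reference))   # 1.0
```

Under this kind of metric, the paraphrase scores zero even though a human reader would likely judge it an acceptable summary of the same event.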

Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.


### MLSUM

We present [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/), the first large-scale MultiLingual SUMmarization dataset. 
Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, [French](../french/summarization.md#mlsum), [German](../german/summarization.md#mlsum), [Spanish](../spanish/summarization.md#mlsum), [Russian](../russian/summarization.md#mlsum), [Turkish](../turkish/summarization.md#mlsum). Together with [English](../english/summarization.md#cnn--daily-mail) newspapers from the popular CNN / Daily Mail dataset, 
the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. 
We report cross-lingual comparative analyses based on state-of-the-art systems. 
These highlight existing biases which motivate the use of a multi-lingual dataset.

The results below are listed in chronological order.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :-----: | -------------- | ---- |
| Lead_3 | 28.74 | 9.84 | 19.7 | 12.6 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Pointer-Generator | 31.08 | 10.12 | 23.6 | 14.1 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| M-BERT (Scialom et al., 2020) | 31.59 | 10.61 | 25.1 | 15.1 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Oracle | 47.32 | 25.95 | 37.7 | 24.7 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| MARGE-NEWS (Train All) (Lewis et al., 2020) | - | - | 25.79 | - | [Pre-training via Paraphrasing](https://arxiv.org/abs/2006.15020) | [Official](https://github.com/lucidrains/marge-pytorch) |

### OrangeSum

The OrangeSum dataset was introduced in ["BARThez: a Skilled Pretrained French Sequence-to-Sequence Model"](https://aclanthology.org/2021.emnlp-main.740/). It was created by scraping the "Orange Actu" website: https://actu.orange.fr/. Orange S.A. is a large French multinational telecommunications corporation, with 266M customers worldwide. Scraped pages cover almost a decade from Feb 2011 to Sep 2020. They belong to five main categories: France, world, politics, automotive, and society. The society category is itself divided into 8 subcategories: health, environment, people, culture, media, high-tech, unusual ("insolite" in French), and miscellaneous.

Each article featured a single-sentence title as well as a very brief abstract, both professionally written by the author of the article. These two fields were extracted from each page, thus creating two summarization tasks: OrangeSum Title and OrangeSum Abstract.

The dataset can be found [here](https://huggingface.co/datasets/orange_sum).
   

#### OrangeSum-abstract

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :-----: | -------------- | ---- |
| BARThez | 31.44 | 12.77 | 22.23 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |
| mBARThez | 32.67 | 13.73 | 23.18 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |
| CamemBERT2CamemBERT (Martin, Louis, et al.) | 29.23 | 9.79 | 19.95 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |
| mBART (Liu, Yinhan, et al.) | 31.85 | 13.10 | 22.35 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |
| PAGnol-S | 24.54 | 8.98 | 18.45 | - | [PAGnol](https://arxiv.org/pdf/2110.08554.pdf) | [Official](https://github.com/lightonai/lairgpt) |
| PAGnol-M | 27.80 | 10.56 | 20.29 | - | [PAGnol](https://arxiv.org/pdf/2110.08554.pdf) | [Official](https://github.com/lightonai/lairgpt) |
| PAGnol-L | 28.25 | 11.05 | 21.03 | - | [PAGnol](https://arxiv.org/pdf/2110.08554.pdf) | [Official](https://github.com/lightonai/lairgpt) |
| PAGnol-XL | 28.72 | 11.08 | 20.89 | - | [PAGnol](https://arxiv.org/pdf/2110.08554.pdf) | [Official](https://github.com/lightonai/lairgpt) |

#### OrangeSum-title

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :-----: | -------------- | ---- |
| BARThez | 40.86 | 23.68 | 36.03 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |
| mBARThez | 41.08 | 24.11 | 36.41 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |
| CamemBERT2CamemBERT (Martin, Louis, et al.) | 34.92 | 18.04 | 30.83 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |
| mBART (Liu, Yinhan, et al.) | 40.74 | 23.70 | 36.04 | - | [BARThez](https://aclanthology.org/2021.emnlp-main.740/) | [Official](https://github.com/moussaKam/BARThez) |


================================================
FILE: german/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.


### Table of contents

- [GermanQuAD](#germanquad)

### GermanQuAD

[GermanQuAD](https://arxiv.org/abs/2104.12741) contains ~14,000 German QA pairs released by deepset in the same format as the original SQuAD dataset. In the same paper, a second dataset called GermanDPR was released, which can be used for training dense passage retrieval models. For each question, it provides a set of relevant and irrelevant documents.
More details are available on the [GermanQuAD website](https://www.deepset.ai/germanquad).


================================================
FILE: german/summarization.md
================================================
# Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the
input's meaning.

### Warning: Evaluation Metrics

For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 
2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as MLSUM provide only a single reference.

Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.


### MLSUM

We present [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/), the first large-scale MultiLingual SUMmarization dataset. 
Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, [French](../french/summarization.md#mlsum), [German](../german/summarization.md#mlsum), [Spanish](../spanish/summarization.md#mlsum), [Russian](../russian/summarization.md#mlsum), [Turkish](../turkish/summarization.md#mlsum). Together with [English](../english/summarization.md#cnn--daily-mail) newspapers from the popular CNN / Daily Mail dataset, 
the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. 
We report cross-lingual comparative analyses based on state-of-the-art systems. 
These highlight existing biases which motivate the use of a multi-lingual dataset.

The results below are listed in chronological order.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :-----: | -------------- | ---- |
| Lead_3 | 38.57 | 25.66 | 33.1 | 23.9 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Pointer-Generator | 39.8 | 25.96 | 35.1 | 24.4 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| M-BERT (Scialom et al., 2020) | 44.78 | 30.75 | 42 | 26.5 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Oracle | 57.23 | 39.72 | 52.3 | 31.7 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| MARGE-NEWS (Train All) (Lewis et al., 2020) | - | - | 42.77 | - | [Pre-training via Paraphrasing](https://arxiv.org/abs/2006.15020) | [Official](https://github.com/lucidrains/marge-pytorch) |


================================================
FILE: hindi/hindi.md
================================================
# Hindi

## Chunking

| Model           | Dev accuracy  | Test F1 | Paper / Source | Code | 
| ------------- | :-----:| :-----:| --- | --- | 
| Dalal et al. (2006) | 87.40 | 82.40 | [Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach](https://www.researchgate.net/publication/241211496_Hindi_Part-of-Speech_Tagging_and_Chunking_A_Maximum_Entropy_Approach) | | 

## Part-of-speech tagging

| Model           | Dev accuracy  | Test F1 | Paper / Source | Code | 
| ------------- | :-----:| :-----:| --- | --- | 
| Jha et al. (2018) | 99.30 | 99.06 | [Multi-Task Deep Morphological Analyzer: Context-Aware Joint Morphological Tagging and Lemma Prediction](https://arxiv.org/ftp/arxiv/papers/1811/1811.08619.pdf) | [mt-dma](https://github.com/Saurav0074/mt-dma) |
| Dalal et al. (2006) | 89.35 | 82.22 | [Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach](https://www.researchgate.net/publication/241211496_Hindi_Part-of-Speech_Tagging_and_Chunking_A_Maximum_Entropy_Approach) | | 

## Machine Translation

The IIT Bombay English-Hindi Parallel Corpus used by Kunchukuttan et al. (2018) can be accessed [here](http://www.cfilt.iitb.ac.in/iitb_parallel/). A live leaderboard covering more translation directions involving Hindi is available on the evaluation website for the [Workshop on Asian Translation](http://lotus.kuee.kyoto-u.ac.jp/WAT/).

### Hindi -> English 

* [WAT:HINDENhi-en](http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=14&o=4)

| Model           | BLEU | Paper / Source | Code | 
| ------------- | :-----:| --- | --- | 
| Philip et al. (2020) | 24.85 | Revisiting Low Resource Status of Indian Languages in MT | [ilmulti](https://github.com/jerinphilip/ilmulti) | 
| Siripragada et al. (2020) | 22.91 | [A Multilingual Parallel Corpora Collection Effort for Indian Languages](https://www.aclweb.org/anthology/2020.lrec-1.462/) | [ilmulti](https://github.com/jerinphilip/ilmulti) | 
| Goyal et al. (2019) | 19.06 | [LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019](https://www.aclweb.org/anthology/D19-5216.pdf) | |

### English -> Hindi 

* [WAT:HINDENen-hi](http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=13&o=7)


| Model           | BLEU | Paper / Source | Code | 
| ------------- | :-----:| --- | --- | 
| Philip et al. (2018)  | 21.57 | [CVIT-MT Systems for WAT-2018](https://www.aclweb.org/anthology/Y18-3010/) || 
| Philip et al. (2020) | 21.20 | Revisiting Low Resource Status of Indian Languages in MT | [ilmulti](https://github.com/jerinphilip/ilmulti) | 
| Saini et al. (2018) | 18.215| [Neural Machine Translation for English to Hindi](https://www.researchgate.net/publication/327717152_Neural_Machine_Translation_for_English_to_Hindi) | | 

## G2P Conversion

### Schwa Deletion

Due to diachronic processes the inherent vowel of Hindi (the *schwa*, automatically applied to consonants that have no other vowel diacritic or vowel-killer diacritic attached) is sometimes dropped in pronunciation despite being present in the orthography. This process is known as schwa deletion. There are no known linguistic rules that can consistently and accurately predict what happens to the inherent vowel in speech. Thus, this is an open problem in the field.

Each paper below has used different datasets. The dataset for Arora et al. (2020) is the largest of all, extracted from the Oxford Hindi-English Dictionary, and future work should ideally compare against that dataset.

| Model | Schwa-level accuracy | Word-level accuracy | Paper / Source | Code |
| ----- | :------------------: | :-----------------: | -------------- | ---- |
| Arora et al. (2020) | 98.00 | 97.78 | [Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in Hindi and Punjabi](https://www.aclweb.org/anthology/2020.acl-main.696.pdf) | [schwa-deletion](https://github.com/aryamanarora/schwa-deletion) |
| Tyson and Nagar (2009) | | 95.00 | [Prosodic rules for schwa-deletion in hindi text-to-speech synthesis](http://www.academia.edu/download/38321628/tyson_nagar_2009.pdf) | |
| Narasimhan et al. (2004) | | 88.97 | [Schwa-Deletion in Hindi Text-to-Speech Synthesis](https://pure.mpg.de/rest/items/item_59025/component/file_59026/content) | | 
| Choudhury et al. (2004) | | 99.89 | [A Diachronic Approach for Schwa Deletion in Indo Aryan Languages](https://www.aclweb.org/anthology/W04-0103.pdf) | |


================================================
FILE: jekyll_instructions.md
================================================
# Instructions for building the site locally

You can build the site locally using Jekyll by following the steps detailed
[here](https://help.github.com/articles/setting-up-your-github-pages-site-locally-with-jekyll/#requirements):

1. Check whether you have Ruby 2.1.0 or higher installed with `ruby --version`, otherwise [install it](https://www.ruby-lang.org/en/downloads/).
On OS X for instance, this can be done with `brew install ruby`. Make sure you also have `ruby-dev` and `zlib1g-dev` installed.
1. Install Bundler: `gem install bundler`. If you run into issues installing Bundler on OS X, have a look
[here](https://bundler.io/v1.16/guides/rubygems_tls_ssl_troubleshooting_guide.html) for troubleshooting tips. Also try refreshing
the terminal.
1. Clone the repo locally: `git clone https://github.com/sebastianruder/NLP-progress`
1. Navigate to the repo with `cd NLP-progress`
1. Install Jekyll: `bundle install`
1. Run the Jekyll site locally: `bundle exec jekyll serve`
1. You can now preview the local Jekyll site in your browser at `http://localhost:4000`.


================================================
FILE: korean/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [Reading comprehension](#reading-comprehension)
  - [KorQuAD](#korquad)
  
## Reading comprehension
  
### KorQuAD

The [Korean Question Answering Dataset (KorQuAD)](https://arxiv.org/abs/1909.07005) is a large-scale reading comprehension
dataset in the style of SQuAD that consists of 70,000+ human-generated question answer pairs on Wikipedia articles. The
data and public leaderboard are available [here](https://korquad.github.io/).


================================================
FILE: nepali/nepali.md
================================================
# Nepali

## Machine Translation

| Model           | BLEU  | Paper / Source | Code | 
| ------------- | :-----:|  --- | --- | 
| Guzman et al. (2019) | 21.5 (NE-EN) & 8.8 (EN-NE) | [The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English](https://www.aclweb.org/anthology/D19-1632/) | [Official](https://github.com/facebookresearch/flores/) | 


================================================
FILE: persian/named_entity_recognition.md
================================================
# Named entity recognition

Named entity recognition (NER) is the task of tagging entities in text with their corresponding type.
Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities.
O is used for non-entity tokens.

Example:

| Mark | Watney | visited | Mars |
| --- | ---| --- | --- |
| B-PER | I-PER | O | B-LOC |
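
The BIO scheme above can be decoded back into entity spans with a short loop. The following is a minimal sketch; the function name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new entity is an assumption of this sketch rather than part of any official evaluation script.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        # An I- tag continues the open entity only if the type matches.
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # B- starts a new entity; a stray I- is treated as a start as well.
        if tag != "O":
            current_tokens, current_type = [token], tag[2:]
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Mark", "Watney", "visited", "Mars"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('Mark Watney', 'PER'), ('Mars', 'LOC')]
```

Span-level F1, as reported in the tables below, is typically computed over such decoded spans rather than over individual token tags.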

### ArmanPersoNERCorpus

The [ArmanPersoNERCorpus](https://www.aclweb.org/anthology/C16-1319/) dataset contains 7,682 sentences with 250,015 tokens tagged in IOB format with six different classes: Organization, Person, Location, Facility, Event, and Product.

Download Links: [ARMAN](https://github.com/HaniehP/PersianNER/blob/master/ArmanPersoNERCorpus.zip) 

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| ParsBERT (Farahani et al., 2020) | 99.84  | [ParsBERT: Transformer-based Model for Persian Language Understanding](https://arxiv.org/abs/2005.12515) | [Official](https://github.com/hooshvare/parsbert) |
| LSTM-CRF (Hafezi, Rezaeian, 2018) | 86.55 | [Neural Architecture for Persian Named Entity Recognition](https://ieeexplore.ieee.org/abstract/document/8700549) | - |
| mBERT (Taher et al., 2020) | 84.03  | [Beheshti-NER: Persian Named Entity Recognition Using BERT](https://arxiv.org/abs/2003.08875) | [Official](https://github.com/sEhsanTaher/Beheshti-NER) |
| Deep-CRF (Bokaei, Mahmoudi, 2018) | 81.50 | [Improved Deep Persian Named Entity Recognition](https://ieeexplore.ieee.org/abstract/document/8661067) | - |
| Deep-Local (Bokaei, Mahmoudi, 2018) | 79.19 | [Improved Deep Persian Named Entity Recognition](https://ieeexplore.ieee.org/abstract/document/8661067) | - |
| BiLSTM-CRF (Poostchi et al., 2018) | 77.45 | [BiLSTM-CRF for Persian Named-Entity Recognition](https://www.aclweb.org/anthology/L18-1701/) | - |
| SVM-HMM (Poostchi et al., 2016) | 72.59 | [PersoNER: Persian Named-Entity Recognition](https://www.aclweb.org/anthology/C16-1319/) | - |

### PEYMA

The [PEYMA](https://arxiv.org/abs/1801.09936) dataset includes 7,145 sentences with 302,530 tokens, of which 41,148 tokens are tagged in IOB format with seven different classes: Organization, Percent, Money, Location, Date, Time, and Person.

Download Links: [PEYMA](http://en.itrc.ac.ir/sites/default/files/pictures/NER.rar) 

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| ParsBERT (Farahani et al., 2020) | 93.40  | [ParsBERT: Transformer-based Model for Persian Language Understanding](https://arxiv.org/abs/2005.12515) | [Official](https://github.com/hooshvare/parsbert) |
| mBERT (Taher et al., 2020) | 90.59  | [Beheshti-NER: Persian Named Entity Recognition Using BERT](https://arxiv.org/abs/2003.08875) | [Official](https://github.com/sEhsanTaher/Beheshti-NER) |
| Rule-Based-CRF (Shahshahani et al., 2018) | 84.00 | [PEYMA: A Tagged Corpus for Persian Named Entities](https://arxiv.org/abs/1801.09936) | - |


================================================
FILE: persian/natural_language_inference.md
================================================
# Natural Language Inference

Natural Language Inference (NLI) is the task of determining the inference relationship between a premise and a hypothesis. It is a three-class problem assigning each input pair to one of the classes {entailment, contradiction, neutral}. 

## Table of contents

- [FarsTail](#farstail)
  
### FarsTail

[FarsTail](https://arxiv.org/abs/2009.08820) is a Persian NLI dataset including an indexed version for non-Persian research.  
The dataset is available [here](https://github.com/dml-qom/FarsTail).

#### Example

| Premise | Hypothesis | Label |
| --- | ---| --- |
| منشور سازمان ملل متحد ۲۶ ژوئن ۱۹۴۵، در شهر سانفرانسیسکو، ایالات متحده امریکا به وسیله ۵۰ دولت از ۵۱ دولت مؤسس سازمان ملل متحد به امضا رسید. | منشور سازمان ملل متحد در سانفرانسیسکو به امضا رسید. | Entailment |
| منشور سازمان ملل متحد ۲۶ ژوئن ۱۹۴۵، در شهر سانفرانسیسکو، ایالات متحده امریکا به وسیله ۵۰ دولت از ۵۱ دولت مؤسس سازمان ملل متحد به امضا رسید. | منشور سازمان ملل متحد در نیویورک تاسیس شد. | Contradiction |
| منشور سازمان ملل متحد ۲۶ ژوئن ۱۹۴۵، در شهر سانفرانسیسکو، ایالات متحده امریکا به وسیله ۵۰ دولت از ۵۱ دولت مؤسس سازمان ملل متحد به امضا رسید. | ایران از جمله دولت‌های عضو مؤسس سازمان ملل متحد است. | Neutral |

#### Results

| Model           | Accuracy | Paper / Source |
| ------------- | :-----:| :-----:|
| Translate-Source + fastText| 78.1 | [FarsTail: A Persian Natural Language Inference Dataset](https://arxiv.org/abs/2009.08820) |
| LSTM + BERT (FarsTail) | 75.8 | [FarsTail: A Persian Natural Language Inference Dataset](https://arxiv.org/abs/2009.08820) |
| ESIM + BERT (FarsTail+MultiNLI) | 74.6 | [FarsTail: A Persian Natural Language Inference Dataset](https://arxiv.org/abs/2009.08820) |


================================================
FILE: persian/summarization.md
================================================
# Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the input's meaning.

### Warning: Evaluation Metrics

For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 
2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as `pn_summary` provide only a single reference.

Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.
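
To make the second limitation concrete, here is a minimal sketch of a unigram-overlap F1 in the spirit of ROUGE-1. It is an illustration only, not the official ROUGE implementation: the function name `unigram_f1`, the whitespace tokenization, and the English example sentences are assumptions made for this sketch.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a single reference summary."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the charter was signed in san francisco"
copy_like = "the charter was signed in san francisco in 1945"
paraphrase = "delegates ratified this treaty within california"

print(unigram_f1(copy_like, reference))   # high: most words are shared
print(unigram_f1(paraphrase, reference))  # 0.0: no shared words
```

The paraphrase scores zero because it shares no words with the reference, however close its meaning may be, which is the failure mode described in point 2.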

There are few resources for abstractive and extractive summarization in Persian, and some of them are not available online or are no longer maintained. One dataset that appears in academic papers is **Pasokh**. More recently, a dataset called Persian News Summarization (known as `pn_summary`) has been prepared for both Persian summarization tasks and made available online.


### Persian News Summary (known as pn_summary)

The [Persian News Summary (known as pn_summary)](https://github.com/hooshvare/pn-summary) is a well-structured summarization dataset for the Persian language that consists of 93,207 online news articles (out of 200,000 crawled articles) from 6 different news agencies in 18 different news categories, from economy to tourism. Each document (article) includes the long original text as well as a human-generated summary. Models are evaluated with full-length F1-scores of ROUGE-1, ROUGE-2, ROUGE-L, and, optionally, METEOR.

#### Abstractive Models & Mixed Models

| Model           | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :----: | -------------- | ---- |
| BERT2BERT (ParsBERT) + mT5 (Farahani et al., 2020) |  44.01  |  25.07  |  37.76  |    -   | [Leveraging ParsBERT and Pretrained mT5 for Persian Abstractive Text Summarization](https://arxiv.org/abs/2012.11204) | [Official](https://github.com/hooshvare/pn-summary) |


### Pasokh
[Pasokh](https://ieeexplore.ieee.org/document/6682873/) is a summarization dataset covering 6 news categories from 7 news agencies in two forms, **Single-Document (SD)** and **Multi-Document (MD)**, with 100 and 1,000 records, respectively. Each document has 5 sample summaries for the extractive and abstractive settings.

#### Extractive Models & Mixed Models

| Model           | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :----: | -------------- | ---- |
| Based on NER (SD) (Khademi, Fakhredanesh, 2020) | 47.20 | 33.40 | - | - | [Persian Automatic Text Summarization Based on Named Entity Recognition](https://link.springer.com/article/10.1007%2Fs40998-020-00352-2) | - |
| Based on NER (SD) (Khademi et al., 2020) | 45.40 | 30.10 | - | - | [Conceptual Text Summarizer: A new model in continuous vector space](http://iajit.org/index.php?option=com_content&task=view&id=1935&Itemid=488) | - |
| Feature Extraction (SD) (Rezaei et al., 2019) | 78.00 | 71.00 | 74.00 | - | [Features in Extractive Supervised Single-document Summarization: Case of Persian News](https://arxiv.org/abs/1909.02776) | [Official](https://github.com/Hrezaei/SummBot) |
| Multi-Feature Extraction (SD) (Kermani, Ghanbari, 2019) | 48.70 | 42.60 | - | - | [Extractive Persian Summarizer for News Websites](https://ieeexplore.ieee.org/document/8765279) | - |


================================================
FILE: portuguese/question_answering.md
================================================
# Question Answering

See [here](../english/question_answering.md) for more information about the task.

### Datasets

#### Chave (2008)

This collection contains more than 4000 questions in Portuguese provided by [Linguateca](https://www.linguateca.pt/CHAVE/), a resource center for the computational processing of Portuguese. Each question includes a category and type as well as other information such as an identification code and year of creation.


|  Model | Accuracy | Paper / Source | Code | 
| :-------------: | :-----:| :----: | :----: |
|  Priberam's (2008) | 63.5 | [Priberam’s Question Answering System in QA@CLEF 2008](https://link.springer.com/chapter/10.1007/978-3-642-04447-2_39) | |
|  Senso (2008) | 46.5 | [The Senso Question Answering System at QA@CLEF 2008](http://dspace.uevora.pt/rdpc/handle/10174/1562) | |
|  Esfinge (2008) | 23.5 | [Esfinge at CLEF 2008: Experimenting with answer retrieval patterns. Can they help?](https://comum.rcaap.pt/handle/10400.26/300) | |

[Go back to the README](../README.md)


================================================
FILE: russian/question_answering.md
================================================
# Question answering

Question answering is the task of answering a question.

### Table of contents

- [Reading comprehension](#reading-comprehension)
  - [SberQuAD](#sberquad)
  
  
## Reading comprehension
  
### SberQuAD

The [Sberbank Question Answering dataset (SberQuAD)](https://arxiv.org/abs/1912.09723) is a reading comprehension dataset
in the style of SQuAD, which was created as part of a competition in 2017 by Sberbank. The data consists of around 50k
questions on Wikipedia.

Because the original SberQuAD development set is not available, the DeepPavlov team partitioned the original
SberQuAD training set into new training (45,328) and test (5,036) sets.

| Model           | F1 | EM |  Paper |
| ------------- | :-----:| :-----:| --- |
| BERT (Efimov et al., 2019) | 84.8 | 66.6 | [SberQuAD - Russian Reading Comprehension Dataset: Description and Analysis](https://arxiv.org/abs/1912.09723) |
| DocQA (Efimov et al., 2019) | 79.5 | 59.6 | [SberQuAD - Russian Reading Comprehension Dataset: Description and Analysis](https://arxiv.org/abs/1912.09723) |


================================================
FILE: russian/sentiment-analysis.md
================================================
# Sentiment Analysis

Sentiment analysis is the task of classifying the polarity of a given text.

## RuSentRel

The [RuSentRel](https://github.com/nicolay-r/RuSentRel) dataset
consists of analytical articles from the internet portal inosmi.ru. These are texts on international politics, obtained from authoritative foreign sources and translated into Russian.
The collected articles contain both the author's opinion on the subject matter of the article and a large number of references between the participants of the described situations. 
The dataset contains 73 large analytical texts, labeled with about 2000 relations.
 
**Sentiment Attitude Extraction task:** The input is a set of documents, each of which includes (1) a text and (2) a list of the named entities mentioned in it. 
For each document, the goal is to produce a list of labeled entity pairs [(s<sub>1</sub>->o<sub>1</sub>, l<sub>1</sub>), ... (s<sub>n</sub>->o<sub>n</sub>, l<sub>n</sub>)] 
for which the text conveys a sentiment relation from the subject towards the object (s<sub>i</sub>->o<sub>i</sub>) with sentiment l<sub>i</sub> ∈ {neg, pos} (see the example below).

Task paper: https://arxiv.org/pdf/1808.08932.pdf

The public leaderboard is available in the [GitHub repository](https://github.com/nicolay-r/RuSentRel-Leaderboard).

| Example                                                                                                                                                                                                                                                     |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ... Meanwhile <ins>Moscow</ins> has repeatedly emphasized that its activity in the <ins>Baltic Sea</ins> is a response precisely to actions of **<ins>NATO</ins>** and the escalation of the hostile approach to **<ins>Russia</ins>** near its eastern borders ...
| (NATO->Russia, neg), (Russia->NATO, neg)                                                                                                                                                                                                                    |
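
The expected output of the task can be modelled as a small data structure. The sketch below reproduces the pairs from the example above. The `Attitude` class, its field names, and `format_attitudes` are hypothetical names chosen for illustration and are not part of the RuSentRel tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attitude:
    """One directed sentiment relation: subject -> object with a label."""
    subject: str
    obj: str
    label: str  # "neg" or "pos"

    def __post_init__(self):
        if self.label not in {"neg", "pos"}:
            raise ValueError(f"unknown sentiment label: {self.label}")


def format_attitudes(attitudes: List[Attitude]) -> str:
    """Render attitudes in the (s->o, l) notation used in this section."""
    return ", ".join(f"({a.subject}->{a.obj}, {a.label})" for a in attitudes)


# Relations are directed, so both directions are listed separately.
document_attitudes = [
    Attitude("NATO", "Russia", "neg"),
    Attitude("Russia", "NATO", "neg"),
]
print(format_attitudes(document_attitudes))
```

Because the relation is directed, `NATO->Russia` and `Russia->NATO` are two separate items that a system has to extract and label independently.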

## RuSentNE-2023

The dataset for the RuSentNE-2023 evaluation is based on the Russian news corpus RuSentNE, which has rich sentiment-related annotation. The corpus
is annotated with named entities and sentiments towards these entities, along with related effects and emotional states. The dataset contains over 11,000 sentences from 400+ large texts.

**Entity-Oriented Sentiment Analysis task:** The input is a set of sentences, each containing one or more named entities. Within the context of a single sentence, every named entity should be classified into one of three sentiment classes: positive, negative, or neutral.

Task paper: https://arxiv.org/pdf/2305.17679.pdf

The public leaderboard is available in the [GitHub repository](https://github.com/dialogue-evaluation/RuSentNE-evaluation).

| Example                                                                                                                                                                                                                                                     |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Notorious figure in the country - <ins>Berlusconi</ins> has been repeatedly accused of financial fraud |
| (Berlusconi-> neg) |


================================================
FILE: russian/summarization.md
================================================
# Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the
input's meaning.

### Warning: Evaluation Metrics

For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 
2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as MLSUM provide only a single reference.

Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.
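
The third limitation can also be sketched in code. The example below computes an LCS-based F1 in the spirit of ROUGE-L and scores a candidate against several references by keeping the best match. It is a simplified illustration, not the official ROUGE implementation: whitespace tokenization, the plain F1 formula, the English example sentences, and the names `lcs_length`, `lcs_f1`, and `best_over_references` are all assumptions of this sketch.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        cur = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 between a candidate and one reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_over_references(candidate: str, references: List[str]) -> float:
    """Score against every reference and keep the best match."""
    return max(lcs_f1(candidate, ref) for ref in references)


candidate = "the new bridge opened on monday"
references = [
    "officials inaugurated a river crossing this week",
    "the new bridge opened to traffic on monday",
]
print(best_over_references(candidate, references[:1]))  # single reference
print(best_over_references(candidate, references))      # two references
```

With only the first reference the candidate scores zero; adding a second, differently worded reference raises the score considerably. A single-reference dataset cannot offer this kind of correction.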


### MLSUM

We present [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/), the first large-scale MultiLingual SUMmarization dataset. 
Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, [French](../french/summarization.md#mlsum), [German](../german/summarization.md#mlsum), [Spanish](../spanish/summarization.md#mlsum), [Russian](../russian/summarization.md#mlsum), [Turkish](../turkish/summarization.md#mlsum). Together with [English](../english/summarization.md#cnn--daily-mail) newspapers from the popular CNN / Daily Mail dataset, 
the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. 
We report cross-lingual comparative analyses based on state-of-the-art systems. 
These highlight existing biases which motivate the use of a multi-lingual dataset.

The results below are listed in chronological order.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :-----: | -------------- | ---- |
| Lead_3 | 9.29 | 1.54 | 5.9 | 5.8 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Pointer-Generator | 9.19 | 1.18 | 5.7 | 5.7 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| M-BERT (Scialom et al., 2020) | 10.94 | 1.75 | 9.5 | 6.8 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Oracle | 36.14 | 19.88 | 29.8 | 20.3 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| MARGE-NEWS (Train All) (Lewis et al., 2020) | - | - | 11.03 | - | [Pre-training via Paraphrasing](https://arxiv.org/abs/2006.15020) | [Official](https://github.com/lucidrains/marge-pytorch) |


================================================
FILE: spanish/entity_linking.md
================================================
# Entity Linking

See [here](../english/entity_linking.md) for more information about the task.

### Datasets

#### AIDA CoNLL-YAGO Dataset

##### Disambiguation-Only Models

|  Model | Micro-Precision | Paper / Source | Code | 
| :-------------: | :-----:| :----: | :----: |
| Sil et al. (2018) | 82.3 | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | |
| Tsai & Roth (2016) | 80.9 | [Cross-lingual wikification using multilingual embeddings](http://cogcomp.org/papers/TsaiRo16b.pdf) | |

[Go back to the README](../README.md)


================================================
FILE: spanish/named_entity_recognition.md
================================================
# Named entity recognition

Named entity recognition (NER) is the task of tagging entities in text with their corresponding type.
Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities.
O is used for non-entity tokens.

Example:

| Mark | Watney | visited | Mars |
| --- | ---| --- | --- |
| B-PER | I-PER | O | B-LOC |

(NER definition taken from english/named_entity_recognition.md)
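
BIO tags are usually decoded back into entity spans before evaluation. The following is a minimal sketch of that decoding, using the example above. The function name `bio_to_spans` and the choice to treat a stray `I-` tag as the start of a new entity are assumptions of this sketch; shared tasks may handle malformed sequences differently.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end token index exclusive, type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into a list of entity spans."""
    spans: List[Span] = []
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            spans.append((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray I- tag without a matching open entity starts a new one.
            start, ent_type = i, tag[2:]
    return spans


tokens = ["Mark", "Watney", "visited", "Mars"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
for start, end, ent_type in bio_to_spans(tags):
    print(ent_type, " ".join(tokens[start:end]))
```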

### CANTEMIST 2020

The [CANTEMIST-NER 2020](https://temu.bsc.es/cantemist/) task uses a corpus of Spanish oncology clinical reports tagged with one entity type (MORFOLOGIA_NEOPLASIA). 
Models are evaluated based on span-based F1 on the test set: see [evaluation scripts](https://github.com/TeMU-BSC/cantemist-evaluation-library).
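
Span-based F1 counts a prediction as correct only if it exactly matches a gold annotation. Below is a minimal sketch, assuming entities are given as `(start, end, type)` tuples. It is not the official evaluation library linked above, and the offsets in the example are made up for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical offsets for illustration only.
gold = {(10, 24, "MORFOLOGIA_NEOPLASIA"), (40, 52, "MORFOLOGIA_NEOPLASIA")}
predicted = {(10, 24, "MORFOLOGIA_NEOPLASIA"), (40, 50, "MORFOLOGIA_NEOPLASIA")}

precision, recall, f1 = span_f1(gold, predicted)
print(precision, recall, f1)  # the partially overlapping span earns no credit
```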

The CANTEMIST shared task also contains an entity linking subtrack (CANTEMIST-NORM) and a document indexing subtrack (CANTEMIST-CODING).

Data link: [Zenodo](https://doi.org/10.5281/zenodo.3773228)

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| MRC mBERT-MLP (Xiong et al., 2020) | 87.0 | [A Joint Model for Medical Named Entity Recognition and Normalization](http://ceur-ws.org/Vol-2664/cantemist_paper18.pdf) | [Official](https://github.com/xy-always/2020Iberlef) |
| BETO-SciBERT (Garcia-Pablos et al., 2020) | 86.9 | [Vicomtech at CANTEMIST 2020](http://ceur-ws.org/Vol-2664/cantemist_paper17.pdf) | |
| BiLSTM-CRF+GloVe+SME+CWE (López-Úbeda et al., 2020) | 85.5 | [Extracting Neoplasms Morphology Mentions in Spanish Clinical Cases through Word Embeddings](http://ceur-ws.org/Vol-2664/cantemist_paper1.pdf) | |
| Biaffine Classifier (Lange et al., 2020) | 85.3 | [NLNDE at CANTEMIST: Neural Sequence Labeling and Parsing Approaches for Clinical Concept Extraction](http://ceur-ws.org/Vol-2664/cantemist_paper2.pdf) | |
| BETO (Han et al., 2020) | 85.0 | [Pre-trained Language Model for CANTEMIST Named Entity Recognition](http://ceur-ws.org/Vol-2664/cantemist_paper3.pdf) | |
| BiLSTM-CRF+FastText+Char (Carreto Fidalgo et al., 2020) | 84.5 | [Recognai’s Working Notes for CANTEMIST-NER Track](http://ceur-ws.org/Vol-2664/cantemist_paper4.pdf) | [Official](https://github.com/recognai/cantemist-ner) |
| BiLSTM-BiLSTM-CRF+FastText+PoS+Char (Santamaria Carrasco et al., 2020) | 83.4 | [Using Embeddings and Bi-LSTM+CRF Model to Detect Tumor Morphology Entities in Spanish Clinical Cases](http://ceur-ws.org/Vol-2664/cantemist_paper6.pdf) | [Official](https://github.com/ssantamaria94/CANTEMIST-Participation) |


### ProfNER 2021

The [ProfNER-NER 2021](https://temu.bsc.es/smm4h-spanish/) task uses a corpus of Spanish COVID-19-related tweets tagged with four entity types (PROFESION, SITUACION_LABORAL, ACTIVIDAD, FIGURATIVA). 
Models are evaluated based on span AND label-based F1 on the test set: see [Task 7 of Codalab SMM4H competition](https://competitions.codalab.org/competitions/28766).

The ProfNER shared task also contains a tweet classification subtrack (ProfNER-Track A).

Data link: [Zenodo](https://doi.org/10.5281/zenodo.4309356)

| Model           | F1  |  Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| BETO-Linear-CRF (David Carreto Fidalgo et al., 2021) | 83.9 | [Recognai](https://www.aclweb.org/anthology/2021.smm4h-1.11.pdf) | [Official](https://github.com/recognai/profner) |
| 3xBiLSTM-CRF+BPE+FastText+BETOemb (Usama Yaseen et al., 2021) | 82.4 | [MIC-NLP](https://www.aclweb.org/anthology/2021.smm4h-1.14.pdf) | |
| BiLSTM-LSTM-CRF+Char+STE+SME+BETO+Syllabes+POS (Sergio Santamaría Carrasco et al., 2021) | 82.3 | [Troy](https://www.aclweb.org/anthology/2021.smm4h-1.12.pdf) | [Official](https://github.com/ssantamaria94/ProfNER-SMM4H) |
| BiGRU-BiLSTM-TokenClassification-CRF+STE+Char (David Carreto Fidalgo et al., 2021) | 76.4 | [Recognai](https://www.aclweb.org/anthology/2021.smm4h-1.11.pdf) | [Official](https://github.com/recognai/profner) |
| BiLSTM-CRF+Char+STE+SME+WikiFastText (Vasile Pais et al., 2021) | 75.7 | [RACAI](https://www.aclweb.org/anthology/2021.smm4h-1.27.pdf) | |
| 30xBETO-BiLSTM (Tong Zhou et al., 2021) | 73.3 | [CASIA_Unisound](https://www.aclweb.org/anthology/2021.smm4h-1.13.pdf) | [Official](https://github.com/recognai/cantemist-ner) |
| Dictionaries-CRF (Alberto Mesa Murgado et al., 2021) | 72.8 | [SINAI](https://www.aclweb.org/anthology/2021.smm4h-1.31.pdf) | [Official](https://github.com/ssantamaria94/CANTEMIST-Participation) |
| BiLSTM-CRF+FLAIR+FastText (Pedro Ruas et al., 2021) | 72.7 | [Lasige-BioTM](https://www.aclweb.org/anthology/2021.smm4h-1.21.pdf) | [Official](https://github.com/lasigeBioTM/LASIGE-participation-in-ProfNER) |


[Go back to the README](../README.md)


================================================
FILE: spanish/summarization.md
================================================
# Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the
input's meaning.

### Warning: Evaluation Metrics

For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 
2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as MLSUM provide only a single reference.

Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.


### MLSUM

We present [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/), the first large-scale MultiLingual SUMmarization dataset. 
Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, [French](../french/summarization.md#mlsum), [German](../german/summarization.md#mlsum), [Spanish](../spanish/summarization.md#mlsum), [Russian](../russian/summarization.md#mlsum), [Turkish](../turkish/summarization.md#mlsum). Together with [English](../english/summarization.md#cnn--daily-mail) newspapers from the popular CNN / Daily Mail dataset, 
the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. 
We report cross-lingual comparative analyses based on state-of-the-art systems. 
These highlight existing biases which motivate the use of a multi-lingual dataset.

The results below are listed in chronological order.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :-----: | -------------- | ---- |
| Lead_3 | 21.87 | 6.25 | 13.7 | 10.3 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Pointer-Generator | 24.63 | 6.54 | 17.7 | 13.2 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| M-BERT (Scialom et al., 2020) | 25.58 | 8.61 | 20.4 | 14.9 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Oracle | 45.23 | 26.21 | 35.8 | 26.5 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| MARGE-NEWS (Train All) (Lewis et al., 2020) | - | - | 22.72 | - | [Pre-training via Paraphrasing](https://arxiv.org/abs/2006.15020) | [Official](https://github.com/lucidrains/marge-pytorch) |


================================================
FILE: structured/README.md
================================================
# Exporting NLP-progress into a structured format

Parse and export the unstructured information from Markdown into a structured JSON format. 

## Installation

Requires Python 3.6+.

Create a virtualenv and install requirements (you can also use conda):

```shell
virtualenv -p python3 venv
source venv/bin/activate

pip install -r requirements.txt
```

## Running

From the NLP-progress root directory (where the LICENSE file is), run:

```shell
python structured/export.py <one or more directories or files>
```

For example, to export all the data in the `english/` directory:

```shell
python structured/export.py english
```

By default the output will be written into `structured.json`, but you can override this with the `--output` parameter. 


================================================
FILE: structured/export.py
================================================
import argparse
import os
import pprint
from typing import Dict, Tuple, List
import re
import sys
import json


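def extract_dataset_desc_links(desc:List[str]) -> List:
    """
    Extract all the links from the description of datasets

    :param desc: Lines of the description of the dataset
    :return: a list of dicts with the keys "title" and "url", one per markdown link
    """

    out = []
    md = "".join(desc)

    # non-greedy, so that several links on the same line are matched separately
    md_links = re.findall(r"\[.*?\]\(.*?\)", md)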

    for md_link in md_links:
        title, link = extract_title_and_link(md_link)
        out.append({
            "title": title,
            "url": link,
        })

    return out


def sanitize_subdataset_name(name:str):
    """
    Do some sanitization on automatically extracted subdataset name

    :param name: raw subdataset name line
    :return:
    """

    name = name.replace("**", "")
    if name.endswith(":"):
        name = name[:-1]

    return name.strip()


def extract_lines_before_tables(lines:List[str]):
    """
    Extract the last non-empty line preceding each table

    :param lines: a list of lines
    :return: a list with one line per table (tables with no preceding line are skipped)
    """

    out = []

    before = None
    in_table = False
    for l in lines:
        if l.startswith("|") and not in_table:
            if before is not None:
                out.append(before)
            in_table = True
        elif in_table and not l.startswith("|"):
            in_table = False
            before = None
            if l.strip() != "":
                before = l.strip()
        elif l.strip() != "":
            before = l.strip()

    return out


def handle_multiple_sota_table_exceptions(section:List[str], sota_tables:List[List[str]]):
    """
    Manually handle the edge cases with dataset partitions

    These are not captured in a consistent format, so no unified approach is possible atm.

    :param section: The lines in this section
    :param sota_tables: The list of sota table lines
    :return:
    """

    section_full = "".join(section)
    out = []

    # Use the line before the table
    subdatasets = extract_lines_before_tables(section)
    subdatasets = [sanitize_subdataset_name(s) for s in subdatasets]

    # exceptions:
    if "hypernym discovery evaluation benchmark" in section_full:
        subdatasets = subdatasets[1:]

    if len(subdatasets) != len(sota_tables):
        print("ERROR parsing the subdataset SOTA tables", file=sys.stderr)
        print(sota_tables, file=sys.stderr)
    else:
        for i in range(len(subdatasets)):
            out.append({
                "subdataset": subdatasets[i],
                "sota": extract_sota_table(sota_tables[i])
            })

    return out


def extract_title_and_link(md_link:str) -> Tuple:
    """
    Extract the anchor text and URL from a markdown link

    :param md_link: a string of ONLY the markdown link, e.g. "[google](http://google.com)"
    :return: e.g. the tuple (google, http://google.com)
    """
    title = re.findall("^\\[(.*)\\]", md_link)[0].strip()
    link = re.findall("\\((.*)\\)$", md_link)[0].strip()

    return title, link


def extract_model_name_and_author(md_name:str) -> Tuple:
    """
    Extract the model name and author, if provided

    :param md_name: a string with the model name from the sota table
    :return: tuple (model_name, author_names)
    """

    if ' (' in md_name and ')' in md_name:
        model_name = md_name.split(' (')[0]
        model_authors = md_name.split(' (')[1].split(')')[0]
    elif '(' in md_name and ')' in md_name: # only has author name
        model_name = None
        model_authors = md_name
    else:
        model_name = md_name
        model_authors = None

    return model_name, model_authors


def extract_paper_title_and_link(paper_md:str) -> Tuple:
    """
    Extract the title and link to the paper

    :param paper_md: markdown for the paper link
    :return: tuple (paper_title, paper_link)
    """

    # non-greedy, so that several links in the same cell are matched separately
    md_links = re.findall(r"\[.*?\]\(.*?\)", paper_md)

    if len(md_links) > 1:
        print("WARNING: Found multiple paper references: `%s`, using only the first..." % paper_md, file=sys.stderr)
    if len(md_links) == 0:
        return None, None

    md_link = md_links[0]

    paper_title, paper_link = extract_title_and_link(md_link)
    return paper_title, paper_link


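def extract_code_links(code_md:str) -> List[Dict]:
    """
    Extract the links to all code implementations

    :param code_md: markdown of the code cell in the sota table
    :return: a list of dicts with the keys "title" and "url", one per markdown link
    """

    # non-greedy, so that several links in the same cell are matched separately
    md_links = re.findall(r"\[.*?\]\(.*?\)", code_md)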

    links = []
    for md_link in md_links:
        t, l = extract_title_and_link(md_link)
        links.append({
            "title": t,
            "url": l,
        })

    return links


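def extract_sota_table(table_lines:List[str]) -> Dict:
    """
    Parse a SOTA table out of lines in markdown

    :param table_lines: lines in the SOTA table
    :return: a dict with "metrics" (metric names) and "rows" (one dict per model),
        or an empty dict if the table cannot be parsed
    """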

    sota = {}

    header = table_lines[0]
    header_cols = [h.strip() for h in header.split("|") if h.strip()]
    cols_sanitized = [h.lower() for h in header_cols]
    cols_sanitized = [re.sub(" +", "", h).replace("**","") for h in cols_sanitized]

    # find the model name column (usually the first one)
    if "model" in cols_sanitized:
        model_inx = cols_sanitized.index("model")
    else:
        print("ERROR: Model name not found in this SOTA table, skipping...\n", file=sys.stderr)
        print("".join(table_lines), file=sys.stderr)
        return {}

    if "paper/source" in cols_sanitized:
        paper_inx = cols_sanitized.index("paper/source")
    elif "paper" in cols_sanitized:
        paper_inx = cols_sanitized.index("paper")
    else:
        print("ERROR: Paper reference not found in this SOTA table, skipping...\n", file=sys.stderr)
        print("".join(table_lines), file=sys.stderr)
        return {}

    if "code" in cols_sanitized:
        code_inx = cols_sanitized.index("code")
    else:
        code_inx = None

    # every column that is not the model, paper or code column is a metric
    metrics_inx = set(range(len(header_cols))) - {model_inx, paper_inx, code_inx}
    metrics_inx = sorted(metrics_inx)

    metrics_names = [header_cols[i] for i in metrics_inx]

    sota["metrics"] = metrics_names
    sota["rows"] = []

    min_cols = len(header_cols)

    # now parse the table rows
    rows = table_lines[2:]
    for row in rows:
        row_cols = [h.strip() for h in row.split("|")][1:]

        if len(row_cols) < min_cols:
            print("This row doesn't have enough columns, skipping: %s" % row, file=sys.stderr)
            continue

        # extract all the metrics
        metrics = {}
        for i in range(len(metrics_inx)):
            metrics[metrics_names[i]] = row_cols[metrics_inx[i]]

        # extract paper references
        paper_title, paper_link = extract_paper_title_and_link(row_cols[paper_inx])

        # extract model_name and author
        model_name, model_author = extract_model_name_and_author(row_cols[model_inx])

        sota_row = {
            "model_name": model_name,
            "metrics": metrics,
        }

        if paper_title is not None and paper_link is not None:
            sota_row["paper_title"] = paper_title
            sota_row["paper_url"] = paper_link

        # and code links if they exist
        if code_inx is not None:
            sota_row["code_links"] = extract_code_links(row_cols[code_inx])

        sota["rows"].append(sota_row)

    return sota


def get_line_no(sections:List[List[str]], section_index:int, section_line=0) -> int:
    """
    Get the (1-based) line number of a line within a section

    :param sections: A list of sections, each of which is a list of lines
    :param section_index: Index of the current section
    :param section_line: Index of the line within the section (0 is the heading)
    :return: line number in the original file
    """
    # the preceding sections are contiguous, so their lengths add up to the offset
    lens = [len(s) for s in sections[:section_index]]
    return sum(lens)+1+section_line


def extract_dataset_desc_and_sota_table(md_lines:List[str]) -> Tuple:
    """
    Extract the lines that are the description and lines that are the sota table(s)

    :param md_lines: a list of lines in this section
    :return: A tuple (desc, tables): the description lines and a list of tables, each a list of lines
    """

    # Main assumption is that the Sota table will minimally have a "Model" column
    desc = []
    tables = []
    t = None
    in_table = False
    for l in md_lines:
        if l.startswith("|") and "model" in l.lower() and not in_table:
            t = [l]
            in_table = True
        elif in_table and l.startswith("|"):
            t.append(l)
        elif in_table and not l.startswith("|"):
            if t is not None:
                tables.append(t)
            t = None
            desc.append(l)
            in_table = False
        else:
            desc.append(l)

    if t is not None:
        tables.append(t)

    return desc, tables


def parse_markdown_file(md_file:str) -> List:
    """
    Parse a single markdown file

    :param md_file: path to the markdown file
    :return: A list of parsed task dictionaries
    """

    # The markdown files contain non-ASCII text, so read them explicitly as UTF-8
    with open(md_file, "r", encoding="utf-8") as f:
        md_lines = f.readlines()

    # Assumptions:
    # 1) H1 are tasks
    # 2) Everything until the next heading is the task description
    # 3) H2 are subtasks, H3 are datasets, H4 are subdatasets

    # Algorithm:
    # 1) Split the document by headings

    sections = []
    cur = []
    for line in md_lines:
        if line.startswith("#"):
            if cur:
                sections.append(cur)
                cur = [line]
            else:
                cur = [line]
        else:
            cur.append(line)

    if cur:
        sections.append(cur)

    # 2) Parse each heading section one-by-one
    parsed_out = []  # whole parsed output
    t = {}  # current task element being parsed
    st = None  # current subtask being parsed
    ds = None # current dataset being parsed
    for section_index in range(len(sections)):
        section = sections[section_index]
        header = section[0]

        # Task definition
        if header.startswith("#") and not header.startswith("##"):
            if "task" in t:
                parsed_out.append(t)
                t = {}
            t["task"] = header[1:].strip()
            t["description"] = "".join(section[1:]).strip()

            # reset subtasks and datasets
            st = None
            ds = None

        ## Subtask definition
        if header.startswith("##") and not header.startswith("###"):
            if "task" not in t:
                print("ERROR: Unexpected subtask without a parent task at %s:#%d" %
                      (md_file, get_line_no(sections, section_index)), file=sys.stderr)

            if "subtasks" not in t:
                t["subtasks"] = []

            # new subtask
            st = {}
            t["subtasks"].append(st)

            st["task"] = header[2:].strip()
            st["description"] = "".join(section[1:]).strip()
            st["source_link"] = {
                "title": "NLP-progress",
                "url": "https://github.com/sebastianruder/NLP-progress"
            }

            # reset the last dataset
            ds = None

        ### Dataset definition
        if header.startswith("###") and not header.startswith("####") and "Table of content" not in header:
            if "task" not in t:
                print("ERROR: Unexpected dataset without a parent task at %s:#%d" %
                      (md_file, get_line_no(sections, section_index)), file=sys.stderr)

            if st is not None:
                # we are in a subtask, add everything here
                if "datasets" not in st:
                    st["datasets"] = []

                # new dataset and add
                ds = {}
                st["datasets"].append(ds)
            else:
                # we are in a task, add here
                if "datasets" not in t:
                    t["datasets"] = []

                ds = {}
                t["datasets"].append(ds)

            ds["dataset"] = header[3:].strip()
            # dataset description is everything that's not a table
            desc, tables = extract_dataset_desc_and_sota_table(section[1:])
            ds["description"] = "".join(desc).strip()

            # see if there is an arxiv link in the first paragraph of the description
            dataset_links = extract_dataset_desc_links(desc)
            if dataset_links:
                ds["dataset_links"] = dataset_links

            if tables:
                if len(tables) > 1:
                    ds["subdatasets"] = handle_multiple_sota_table_exceptions(section, tables)
                else:
                    ds["sota"] = extract_sota_table(tables[0])

    if t:
        t["source_link"] = {
            "title": "NLP-progress",
            "url": "https://github.com/sebastianruder/NLP-progress"
        }
        parsed_out.append(t)

    return parsed_out


def parse_markdown_directory(path:str):
    """
    Parse all markdown files in a directory

    :param path: Path to the directory
    :return: A list of parsed task dictionaries from all markdown files in the directory
    """
    all_files = os.listdir(path)
    md_files = [f for f in all_files if f.endswith(".md")]

    out = []
    for md_file in md_files:
        print("Processing `%s`..." % md_file)
        out.extend(parse_markdown_file(os.path.join(path, md_file)))

    return out


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("paths", nargs="+", type=str, help="Files or directories to convert")
    parser.add_argument("--output", default="structured.json", type=str, help="Output JSON file name")

    args = parser.parse_args()

    out = []
    for path in args.paths:
        if os.path.isdir(path):
            out.extend(parse_markdown_directory(path))
        else:
            out.extend(parse_markdown_file(path))

    with open(args.output, "w") as f:
        f.write(json.dumps(out, indent=2))

================================================
FILE: structured/requirements.txt
================================================


================================================
FILE: turkish/summarization.md
================================================
# Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the
input's meaning.

### Warning: Evaluation Metrics

For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 
2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as MLSUM provide only a single reference.

Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.
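
To make the second limitation concrete, here is a minimal sketch of a ROUGE-1-style unigram-overlap F1. It is a simplification for illustration only: it assumes lowercased whitespace tokenization and a single reference, and it is not the official ROUGE toolkit (no stemming or stopword handling). The function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a single reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the company reported higher profits this quarter"
extractive = "the company reported higher profits"
paraphrase = "earnings rose at that firm during recent months"

# The extractive candidate copies reference words and scores high.
print(rouge1_f1(extractive, reference))
# The paraphrase conveys similar content but shares no tokens with the reference.
print(rouge1_f1(paraphrase, reference))
```

Under these assumptions the paraphrase receives a score of zero even though a human reader might judge it an acceptable summary, which is why scores based on lexical overlap should be read with caution.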


### MLSUM

We present [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/), the first large-scale MultiLingual SUMmarization dataset. 
Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, [French](../french/summarization.md#mlsum), [German](../german/summarization.md#mlsum), [Spanish](../spanish/summarization.md#mlsum), [Russian](../russian/summarization.md#mlsum), [Turkish](../turkish/summarization.md#mlsum). Together with [English](../english/summarization.md#cnn--daily-mail) newspapers from the popular CNN / Daily Mail dataset, 
the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. 
We report cross-lingual comparative analyses based on state-of-the-art systems. 
These highlight existing biases which motivate the use of a multi-lingual dataset.

The results below are listed in chronological order.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code |
| --------------- | :-----: | :-----: | :-----: | :-----: | -------------- | ---- |
| Lead_3 | 34.79 | 20.0 | 28.9 | 20.2 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Pointer-Generator | 36.9 | 21.77 | 32.6 | 19.8 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| M-BERT (Scialom et al., 2020) | 36.63 | 20.15 | 32.9 | 26.3 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| Oracle | 50.61 | 33.55 | 45.8 | 26.4 | [MLSUM](https://www.aclweb.org/anthology/2020.emnlp-main.647/) | [Official](https://github.com/recitalAI/MLSUM) |
| MARGE-NEWS (Train All) (Lewis et al., 2020) | - | - | 35.90  | - | [Pre-training via Paraphrasing](https://arxiv.org/abs/2006.15020) | [Official](https://github.com/lucidrains/marge-pytorch) |


================================================
FILE: vietnamese/vietnamese.md
================================================
# Vietnamese NLP tasks

## Dependency parsing

* Experiments employ the [benchmark Vietnamese dependency treebank VnDT](http://vndp.sourceforge.net) of 10K+ sentences, using 1,020 sentences for testing, 200 sentences for development and the remaining sentences for training. LAS and UAS scores are computed on all tokens (i.e. including punctuation).
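
As a rough illustration of the scoring described above, the following sketch computes UAS and LAS for one sentence over all tokens, punctuation included. The `(head, label)` representation, the function name `attachment_scores` and the dependency labels in the example are assumptions made for this sketch. They are not taken from the VnDT tooling, and reported scores are normally aggregated over every token in the test set, not per sentence.

```python
from typing import List, Tuple

# Each token is (head_index, dependency_label); head 0 denotes the root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent over all tokens, punctuation included."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens whose predicted head is correct.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * labeled_hits / n


# Hypothetical four-token sentence; the last token is punctuation and is scored too.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

In this example the wrong head on the punctuation token lowers UAS, which would not happen under an evaluation that excludes punctuation.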

#### VnDT v1.1:

| | Model           | LAS |  UAS  |  Paper | Code | 
| ----- | ------------- | :-----:| --- | --- | --- |
| **Predicted POS** | PhoNLP (2021) | 79.11 | 85.47 | [PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing](https://aclanthology.org/2021.naacl-demos.1.pdf) | [Official](https://github.com/VinAIResearch/PhoNLP) |
| **Predicted POS** | PhoBERT-base (2020) | 78.77 | 85.22 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) | 
| **Predicted POS** | PhoBERT-large (2020) | 77.85 | 84.32 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) |  [Official](https://github.com/VinAIResearch/PhoBERT) | 
| **Predicted POS** | Biaffine (2017) | 74.99 | 81.19 | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734) |  | 
| **Predicted POS** | jointWPD (2018) | 73.90 | 80.12 | [A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459)  |  | 
| **Predicted POS** | jPTDP-v2 (2018) |  73.12 | 79.63 | [An improved neural network model for joint POS tagging and dependency parsing](http://aclweb.org/anthology/K18-2008) |  | 
| **Predicted POS** | VnCoreNLP (2018) | 71.38 | 77.35 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | 

* Results on the VnDT v1.1 for Biaffine, jPTDP-v2 and VnCoreNLP are reported in the jointWPD paper "[A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459)."

#### VnDT v1.0:

| | Model           | LAS |  UAS  |  Paper | Code | 
| ----- | ------------- | :-----:| --- | --- | --- |
| **Predicted POS** | VnCoreNLP (2018) | 70.23 | 76.93 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | 
| Gold POS | VnCoreNLP (2018) |73.39 |79.02 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | 
| Gold POS | BIST BiLSTM graph-based parser (2016) | 73.17|79.39 | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023) | [Official](https://github.com/elikip/bist-parser/tree/master/bmstparser/src) | 
| Gold POS | BIST BiLSTM transition-based parser (2016) | 72.53| 79.33 | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023) | [Official](https://github.com/elikip/bist-parser/tree/master/barchybrid/src) | 
| Gold POS | MSTparser (2006) | 70.29 | 76.47 | [Online large-margin training of dependency parsers](http://www.aclweb.org/anthology/P05-1012) | | 
| Gold POS | MaltParser (2007) | 69.10 | 74.91 | [MaltParser: A language-independent system for datadriven dependency parsing](https://stp.lingfil.uu.se/~nivre/docs/nle07.pdf) | | 


* Results for the BIST graph/transition-based parsers, MSTparser and MaltParser are reported in "[An empirical study for Vietnamese dependency parsing](http://www.aclweb.org/anthology/U16-1017)."

## Intent detection and Slot filling
### [PhoATIS](https://github.com/VinAIResearch/JointIDSF)
* The first dataset for intent detection and slot filling for Vietnamese, based on the common ATIS benchmark in the flight booking domain. Data is localized (e.g. replacing slot values with Vietnamese-specific entities) to fit the context of flight booking in Vietnam.
* Training set: 4478 sentences
* Development set: 500 sentences
* Test set: 893 sentences

| Model           | Intent Acc. | Slot F1 | Sentence Acc.  |  Paper | Code | Note |
| ------------- | :-----:| --- |--- |--- | --- | --- |
| JointIDSF (2021) | 97.62 | 94.98 | 86.25 | [Intent Detection and Slot Filling for Vietnamese](https://arxiv.org/abs/2104.02021) | [Official](https://github.com/VinAIResearch/JointIDSF) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |
| JointBERT (2019) with PhoBERT encoder | 97.40 | 94.75 | 85.55 | [Intent Detection and Slot Filling for Vietnamese](https://arxiv.org/abs/2104.02021) | [Official](https://github.com/VinAIResearch/JointIDSF) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |

## Machine translation

### [PhoMT Dataset](https://aclanthology.org/2021.emnlp-main.369/)
* A large-scale and high-quality dataset for Vietnamese-English Machine Translation with 3.02M sentence pairs, available at [https://github.com/VinAIResearch/PhoMT](https://github.com/VinAIResearch/PhoMT).
  * Consists of 6 domains: TED Talks, WikiHow, MediaWiki, OpenSubtitles, News and Blog.
  * Training set: 2.9M sentence pairs
  * Validation set: 18719 sentence pairs
  * Test set: 19151 sentence pairs

| Model           | EN-VI (BLEU) | VI-EN (BLEU) |  Paper | Code | 
| ------------- | :-----:| :-----:| --- | --- | 
| mBART (2020) | 43.46 | 39.78 | [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) | [Link](https://github.com/pytorch/fairseq/tree/main/examples/mbart) | 
| Transformer-big (2017) | 42.94 | 37.83 | [Attention is all you need](https://arxiv.org/abs/1706.03762) | [Link](https://github.com/pytorch/fairseq/tree/main/examples/translation) | 
| Transformer-base (2017) | 42.12 | 37.19 | [Attention is all you need](https://arxiv.org/abs/1706.03762) | [Link](https://github.com/pytorch/fairseq/tree/main/examples/translation) |

### IWSLT2015 Dataset
* The dataset is from [The IWSLT 2015 Evaluation Campaign](http://workshop2015.iwslt.org/downloads/proceeding.pdf) and contains 150K sentence pairs. It can also be obtained from [https://github.com/tensorflow/nmt](https://github.com/tensorflow/nmt).

#### English-to-Vietnamese
`tst2015` is used for test

| Model           | BLEU  |  Paper | Code | 
| ------------- | :-----:| --- | --- | 
| Stanford (2015) | 26.4 | [Stanford Neural Machine Translation Systems for Spoken Language Domains](https://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf) | | 

---
`tst2013` is used for test

| Model           | BLEU  |  Paper | Code | 
| ------------- | :-----:| --- | --- | 
| Nguyen and Salazar (2019) | 32.8 | [Transformers without Tears: Improving the Normalization of Self-Attention](https://arxiv.org/abs/1910.05895) | [Official](https://github.com/tnq177/transformers_without_tears) | 
| Provilkov et al. (2019) | 33.27 (uncased) | [BPE-Dropout: Simple and Effective Subword Regularization](https://arxiv.org/abs/1910.13267) | |
| Xu et al. (2019) | 31.4 | [Understanding and Improving Layer Normalization](https://papers.nips.cc/paper/8689-understanding-and-improving-layer-normalization.pdf) | [Official](https://github.com/lancopku/AdaNorm) |
| CVT (2018) | 29.6 (SST) | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | |
| ELMo (2018) | 29.3 (SST) | [Deep contextualized word representations](http://aclweb.org/anthology/N18-1202)| | 
| Transformer (2017) | 28.9 | [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need) | [Link](https://github.com/duyvuleo/Transformer-DyNet) |
| Kudo (2018) | 28.5 | [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) | | 
| Google (2017) | 26.1 | [Neural machine translation (seq2seq) tutorial](https://github.com/tensorflow/nmt)  | [Official](https://github.com/tensorflow/nmt) | 
| Stanford (2015) |23.3 | [Stanford Neural Machine Translation Systems for Spoken Language Domains](https://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf) | | 

* The ELMo score is reported in [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370). The Transformer score is available at  [https://github.com/duyvuleo/Transformer-DyNet](https://github.com/duyvuleo/Transformer-DyNet).

#### Vietnamese-to-English

`tst2013` is used for test

| Model           | BLEU  |  Paper | Code | 
| ------------- | :-----:| --- | --- | 
| Provilkov et al. (2019) | 32.99 (uncased) | [BPE-Dropout: Simple and Effective Subword Regularization](https://arxiv.org/abs/1910.13267) | |
| Kudo (2018) | 26.31 | [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) | | 

## Named entity recognition
### [PhoNER_COVID19](https://github.com/VinAIResearch/PhoNER_COVID19)
* A named entity recognition dataset for Vietnamese with 10 newly-defined entity types in the context of the COVID-19 pandemic. Data is extracted from news articles and manually annotated. In total, there are 34,984 entities over 10,027 sentences.
* Training set: 5027 sentences
* Development set: 2000 sentences
* Test set: 3000 sentences

| Model           | F1  |  Paper | Code | Note | 
| ------------- | :-----:| --- | --- | --- | 
| PhoBERT-large (2020) | 94.5 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) | 
| PhoBERT-base (2020) | 94.2 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) | 
| XLM-R-large (2019) | 93.8 | [Unsupervised Cross-lingual Representation Learning at Scale](https://aclanthology.org/2020.acl-main.747/) | [Official](https://github.com/facebookresearch/XLM) | 
| XLM-R-base (2019) | 92.5 | [Unsupervised Cross-lingual Representation Learning at Scale](https://aclanthology.org/2020.acl-main.747/) | [Official](https://github.com/facebookresearch/XLM) | 
| BiLSTM-CRF + CNN-char (2016) + Word Segmentation | 91 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Text is automatically word-segmented using [RDRSegmenter](https://github.com/vncorenlp/VnCoreNLP) |
| BiLSTM-CRF + CNN-char  (2016) | 90.6 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | No word segmentation |

### VLSP
* 16,861 sentences for training and development from the VLSP 2016 NER shared task:
  *  14,861 sentences are used for training.
  *  2k sentences are used for development.
* Test data: 2,831 test sentences from the VLSP 2016 NER  shared task.
* **NOTE** that in the VLSP 2016 NER data, each word representing a full personal name is separated into the syllables that constitute the word. The VLSP 2016 NER data also includes gold POS and chunking tags, as [reconfirmed by VLSP 2016 organizers](https://drive.google.com/file/d/1XzrgPw13N4C_B6yrQy_7qIxl8Bqf7Uqi/view?usp=sharing). This scheme results in an unrealistic scenario for a pipeline evaluation:
  * The standard annotation for Vietnamese word segmentation and POS tagging treats each full name as a single word token, so all word segmenters have been trained to output a full name as one word and all POS taggers have been trained to assign a POS label to the entire full name.
  * Gold POS and chunking tags are NOT available in a real-world application.
* For a realistic scenario, contiguous syllables constituting a full name are merged to form a word. POS/chunking tags--if used--have to be automatically predicted! 
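
The merging step described above could look roughly like the sketch below. It assumes BIO-style tags (`B-PER`, `I-PER`, `O`, ...) on `(token, tag)` pairs and joins syllables with underscores, as word-segmented Vietnamese text commonly does. Both the tag scheme and the function name `merge_person_syllables` are assumptions for illustration and are not taken from the VLSP 2016 data release or from any of the listed systems.

```python
from typing import List, Tuple


def merge_person_syllables(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge contiguous B-PER/I-PER syllables into one word token tagged B-PER."""
    merged: List[Tuple[str, str]] = []
    name_parts: List[str] = []

    def flush() -> None:
        # Emit the name collected so far as a single underscore-joined word.
        if name_parts:
            merged.append(("_".join(name_parts), "B-PER"))
            name_parts.clear()

    for syllable, tag in tokens:
        if tag == "B-PER":
            flush()  # a new name starts; close any previous one first
            name_parts.append(syllable)
        elif tag == "I-PER" and name_parts:
            name_parts.append(syllable)
        else:
            flush()
            merged.append((syllable, tag))
    flush()
    return merged


# Hypothetical input in which a full name is split into three syllables.
sentence = [
    ("Ông", "O"),
    ("Nguyễn", "B-PER"),
    ("Văn", "I-PER"),
    ("A", "I-PER"),
    ("đến", "O"),
    ("Hà_Nội", "B-LOC"),
]
print(merge_person_syllables(sentence))
```

After merging, the name is a single word token, which matches the input that word segmenters and POS taggers trained on the standard annotation expect.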

| Model           | F1  |  Paper | Code | Note | 
| ------------- | :-----:| --- | --- | --- | 
| PhoBERT-large (2020) | 94.7 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) | 
| PhoNLP (2021) | 94.41 | [PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing](https://aclanthology.org/2021.naacl-demos.1.pdf) | [Official](https://github.com/VinAIResearch/PhoNLP) |
| vELECTRA (2020) | 94.07 | [Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models](https://arxiv.org/abs/2006.15994) | [Official](https://github.com/fpt-corp/viBERT) |
| PhoBERT-base (2020) | 93.6 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) |  [Official](https://github.com/VinAIResearch/PhoBERT) | 
| VnCoreNLP (2018) [1] | 91.30 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | Used ETNLP embeddings |
| BiLSTM-CRF + CNN-char  (2016) [1] | 91.09 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Official](https://github.com/XuezheMax/LasagneNLP) / [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/)  | Used ETNLP embeddings | 
| VNER (2019) | 89.58 | [Attentive Neural Network for Named Entity Recognition in Vietnamese](https://arxiv.org/abs/1810.13097) | | 
| VnCoreNLP (2018) | 88.55 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + CNN-char  (2016) [2] | 88.28 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Official](https://github.com/XuezheMax/LasagneNLP) / [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + LSTM-char (2016) [2] | 87.71 | [Neural Architectures for Named Entity Recognition](http://www.aclweb.org/anthology/N16-1030) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF (2015) [2] | 86.48 | [Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/abs/1508.01991) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Pre-trained embeddings learned from Baomoi corpus |

* [1] denotes that scores are reported in  "[ETNLP: a visual-aided systematic approach to select pre-trained embeddings for a downstream task](https://arxiv.org/abs/1903.04433)"
* [2] denotes that BiLSTM-CRF-based scores are reported in  "[VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012)"


## Part-of-speech tagging 

* 27,870 sentences for training and development from the VLSP 2013 POS tagging shared task:
  *  27k sentences are used for training.
  *  870 sentences are used for development.
* Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.

| Model           | Accuracy  |  Paper | Code | 
| ------------- | :-----:| --- | --- | 
| PhoBERT-large (2020) | 96.8 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) | [Official](https://github.com/VinAIResearch/PhoBERT) |
| vELECTRA (2020) | 96.77 | [Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models](https://arxiv.org/abs/2006.15994) | [Official](https://github.com/fpt-corp/viBERT) |
| PhoNLP (2021) | 96.76 | [PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing](https://aclanthology.org/2021.naacl-demos.1.pdf) | [Official](https://github.com/VinAIResearch/PhoNLP) |
| PhoBERT-base (2020) | 96.7 | [PhoBERT: Pre-trained language models for Vietnamese](https://arxiv.org/abs/2003.00744) |  [Official](https://github.com/VinAIResearch/PhoBERT) | 
| jointWPD (2018) | 95.97 | [A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459) | | 
| VnCoreNLP-VnMarMoT (2017) | 95.88 | [From Word Segmentation to POS Tagging for Vietnamese](http://aclweb.org/anthology/U17-1013) | [Official](https://github.com/datquocnguyen/vnmarmot) | 
| jPTDP-v2 (2018) | 95.70 | [An improved neural network model for joint POS tagging and dependency parsing](http://aclweb.org/anthology/K18-2008) | | 
| BiLSTM-CRF + CNN-char  (2016) | 95.40 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Official](https://github.com/XuezheMax/LasagneNLP) /  [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | 
| BiLSTM-CRF + LSTM-char (2016) | 95.31 | [Neural Architectures for Named Entity Recognition](http://www.aclweb.org/anthology/N16-1030) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | 
| BiLSTM-CRF (2015) | 95.06 | [Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/abs/1508.01991) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | 
| RDRPOSTagger (2014) | 95.11 |  [RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger](http://www.aclweb.org/anthology/E14-2005) | [Official](https://github.com/datquocnguyen/rdrpostagger) | 

* Result for jPTDP-v2 is reported in "[A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459)." 
* Results for BiLSTM-CRF-based models and RDRPOSTagger are reported in  "[From Word Segmentation to POS Tagging for Vietnamese](http://aclweb.org/anthology/U17-1013)."

## Semantic parsing
### [ViText2SQL](https://github.com/VinAIResearch/ViText2SQL)
* The first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese, consisting of about 10K question and SQL query pairs.
* Training set:  6831 question and query pairs 
* Development set: 954 question and query pairs 
* Test set: 1906 question and query pairs 


| Model           | Exact Match Accuracy  |  Paper | Code | Note |
| ------------- | :-----:| --- | --- | --- |
| IRNet (2019) | 53.2 | [A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese](https://aclanthology.org/2020.findings-emnlp.364/) | [Link](https://github.com/microsoft/IRNet) | Using [PhoBERT](https://aclanthology.org/2020.findings-emnlp.92/) as encoder |
| EditSQL (2019) |  52.6 | [A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese](https://aclanthology.org/2020.findings-emnlp.364/) | [Link](https://github.com/ryanzhumich/editsql) | Using [PhoBERT](https://aclanthology.org/2020.findings-emnlp.92/) as encoder |


## Word segmentation 

* Training & development data: 75k manually word-segmented training sentences from the [VLSP](http://vlsp.org.vn/) 2013 word segmentation shared task.
* Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.

| Model           | F1  |  Paper | Code | 
| ------------- | :-----:| --- | --- | 
| UITws-v1 (2019) | 98.06 | [Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture](https://arxiv.org/abs/2006.07804) | [Official](https://github.com/ngannlt/UITws-v1) |
| VnCoreNLP-RDRsegmenter (2018) | 97.90 | [A Fast and Accurate Vietnamese Word Segmenter](http://www.lrec-conf.org/proceedings/lrec2018/pdf/55.pdf) | [Official](https://github.com/datquocnguyen/RDRsegmenter) | 
| UETsegmenter (2016) | 97.87 | [A hybrid approach to Vietnamese word segmentation](http://doi.org/10.1109/RIVF.2016.7800279) | [Official](https://github.com/phongnt570/UETsegmenter) |
| jointWPD (2018) | 97.81 | [A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459) | | 
| vnTokenizer (2008) | 97.33 | [A Hybrid Approach to Word Segmentation of Vietnamese Texts](https://link.springer.com/chapter/10.1007/978-3-540-88282-4_23) |  |
| JVnSegmenter (2006) | 97.06 | [Vietnamese Word Segmentation with CRFs and SVMs: An Investigation](http://www.aclweb.org/anthology/Y06-1028) |  |
| DongDu (2012) | 96.90 |  [Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt](https://tiengvietmenyeu.wordpress.com/2013/02/16/ung%C2%B7dung-phuong%C2%B7phap-pointwise-vao-bai%C2%B7toan-tach-tu-cho-tieng%C2%B7viet/) |  |

* Results for vnTokenizer, JVnSegmenter and DongDu are reported in "[A hybrid approach to Vietnamese word segmentation](http://doi.org/10.1109/RIVF.2016.7800279)."