Repository: danqi/thesis
Branch: master
Commit: 897881b16c98
Files: 43
Total size: 414.9 KB

Directory structure:
gitextract_ge445pyb/

├── .gitignore
├── Makefile
├── README.md
├── ack.tex
├── acl_natbib_nourl.bst
├── chapters/
│   ├── coqa/
│   │   ├── dataset.tex
│   │   ├── discussions.tex
│   │   ├── experiments.tex
│   │   ├── intro.tex
│   │   ├── models.tex
│   │   └── related_work.tex
│   ├── openqa/
│   │   ├── evaluation.tex
│   │   ├── future.tex
│   │   ├── intro.tex
│   │   ├── related_work.tex
│   │   └── system.tex
│   ├── rc_future/
│   │   ├── datasets.tex
│   │   ├── models.tex
│   │   ├── overview.tex
│   │   └── questions.tex
│   ├── rc_models/
│   │   ├── advances.tex
│   │   ├── experiments.tex
│   │   ├── feature_classifier.tex
│   │   ├── intro.tex
│   │   └── sar.tex
│   └── rc_overview/
│       ├── discussions.tex
│       ├── history.tex
│       ├── intro.tex
│       └── task.tex
├── conclude.tex
├── fitch.sty
├── img/
│   └── scripts/
│       ├── gen_cnn_analysis.py
│       ├── gen_qa_stat.py
│       ├── gen_squad_progress.py
│       ├── gen_timeline.py
│       └── squad_leaderboard.txt
├── intro.tex
├── macros.tex
├── preface.tex
├── ref.bib
├── std-macros.tex
├── suthesis.sty
└── thesis.tex

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.DS_Store

pages/
*.fdb_latexmk
*.bbl
*.aux
*.out
*.toc
*.fls
*.blg
*.log
*.lot
*.lof
*.synctex.gz


================================================
FILE: Makefile
================================================
thesis.pdf: $(wildcard *.tex) $(wildcard chapters/natlog/*.tex)  $(wildcard chapters/naturalli/*.tex) $(wildcard chapters/openie/*.tex) $(wildcard chapters/qa/*.tex)  Makefile macros.tex std-macros.tex ref.bib
	@pdflatex thesis
	@bibtex thesis
	@pdflatex thesis
	@pdflatex thesis

clean:
	rm -f *.aux *.log *.bbl *.blg present.pdf *.bak *.ps *.dvi *.lot *.bcf thesis.pdf

dist: thesis.pdf
	@pdflatex --file-line-errors thesis

default: thesis.pdf


================================================
FILE: README.md
================================================
## Danqi Chen's Thesis

### Reference

```
@phdthesis{chen2018neural,
  title={Neural Reading Comprehension and Beyond},
  author={Chen, Danqi},
  year={2018},
  school={Stanford University}
}
```

### Acknowledgement

This thesis is built on top of [Gabor Angeli's thesis template](https://github.com/gangeli/thesis).

### Contact

If you have any comments or questions about the thesis, please use pull requests or email <danqi@cs.stanford.edu>.


================================================
FILE: ack.tex
================================================
%!TEX root = thesis.tex

\prefacesection{Acknowledgments}

The past six years at Stanford have been an unforgettable and invaluable experience to me. When I first started my PhD in 2012, I could barely speak fluent English (I was required to take five English courses at Stanford), knew little about this country and had never heard of the term ``natural language processing''.  It is unbelievable that over the following years I have actually been doing research about language and training computer systems to understand human languages (English in most cases), as well as training myself to speak and write in English. At the same time, 2012 is the year that deep neural networks (also called deep learning) started to take off and dominate almost all the AI applications we are seeing today. I witnessed how fast Artificial Intelligence has been developing from the beginning of the journey and feel quite excited —-- and occasionally panicked —-- to be a part of this trend. I would not have been able to make this journey without the help and support of many, many people and I feel deeply indebted to them.

First and foremost, my greatest thanks go to my advisor Christopher Manning. I really didn't know Chris when I first came to Stanford --- only after a couple of years that I worked with him and learned about NLP, did I realize how privileged I am to get to work with one of the most brilliant minds in our field. He always has a very insightful, high-level view about the field while he is also uncommonly detail oriented and understands the nature of the problems very well. More importantly, Chris is an extremely kind, caring and supportive advisor that I could not have asked for more. He is like an older friend of mine (if he doesn't mind me saying so) and I can talk with him about everything. He always believes in me even though I am not always that confident about myself. I am forever grateful to him and I have already started to miss him.

I would like to thank Dan Jurafsky and Percy Liang --- the other two giants of the Stanford NLP group --- for being on my thesis committee and for a lot of guidance and help throughout my PhD studies. Dan is an extremely charming, enthusiastic and knowledgeable person and I always feel my passion getting ignited after talking to him. Percy is a superman and a role model for all the NLP PhD students (at least myself). I never understand how one can accomplish so many things at the same time and a big part of this dissertation is built on top of his research. I want to thank Chris, Dan and Percy, for setting up the Stanford NLP Group, my home at Stanford, and I will always be proud to be a part of this family.

It is also my great honor to have Luke Zettlemoyer on my thesis committee. The work presented in this dissertation is very relevant to his research and I learned a lot from his papers. I look forward to working with him in the near future. I also would like to thank Yinyu Ye for his time chairing my thesis defense.

During my PhD, I have done two wonderful internships at Microsoft Research and Facebook AI Research. I thank my mentors Kristina Toutanova, Antoine Bordes and Jason Weston when I worked at these places.  My internship project at Facebook eventually leads to the \sys{DrQA} project and a part of this dissertation. I also would like to thank Microsoft and Facebook for providing me with fellowships.

Collaboration is a big lesson that I learned, and also a fun part of graduate school. I thank my fellow collaborators: Gabor Angeli, Jason Bolton, Arun Chaganty, Adam Fisch, Jon Gauthier, Shayne Longpre, Jesse Mu, Siva Reddy, Richard Socher, Yuhao Zhang, Victor Zhong, and others. In particular, Richard --- with him I finished my first paper in graduate school. He had very clear sense about how to define an impactful research project while I had little experience at the time. Adam and Siva --- with them I finished the \sys{DrQA} and \sys{CoQA} projects respectively. Not only am I proud of these two projects, but also I greatly enjoyed the collaborations. We have become good friends afterwards. The KBP team, especially Yuhao, Gabor and Arun --- I enjoyed the teamwork during those two summers. Jon, Victor, Shayne and Jesse, the younger people that I got to work with, although I wish I could have done a better job. I also want to thank the two teaching teams (7 and 25 people respectively) for the NLP class that I've worked on and that was a very unique and rewarding experience for me.

I thank the whole Stanford NLP Group, especially Sida Wang, Will Monroe, Angel Chang, Gabor Angeli, Siva Reddy, Arun Chaganty, Yuhao Zhang, Peng Qi, Jacob Steinhardt, Jiwei Li, He He, Robin Jia and Ziang Xie, who gave me a lot of support at various times. I am even not sure if there could be another research group in the world better than our group (I hope I can create a similar one in the future). The NLP retreat, NLP BBQ and those paper swap nights were among my most vivid memories in graduate school.

Outside of the NLP group, I have been extremely lucky to be surrounded by many great friends. Just to name a few (and forgive me for not being able to list all of them): Yanting Zhao, my close friend for many years, who keeps pulling me out from my stressful PhD life, and I share a lot of joyous moments with her. Xueqing Liu, my classmate and roommate in college who started her PhD at UIUC in the same year and she is the person that I can keep talking to and exchanging my feelings and thoughts with, especially on those bad days. Tao Lei, a brilliant NLP PhD and my algorithms ``teacher'' in high school and I keep learning from him and getting inspired from every discussion. Thanh-Vy Hua, my mentor and ``elder sister'' who always makes sure that I am still on the right track of my life and taught me many meta-skills to survive this journey (even though we have only met 3 times in the real world). Everyone in the ``\pinyin{cao3yu2}'' group, I am so happy that I have spent many Friday evenings with you.

During the past year, I visited a great number of U.S. universities seeking an academic job position. There are so many people I want to thank for assistance along the way —-- I either received great help and advice from them, or I felt extremely welcomed during my visit —-- including Sanjeev Arora, Yoav Artzi, Regina Barzilay, Chris Callison-Burch, Kai-Wei Chang, Kyunghyun Cho, William Cohen, Michael Collins, Chris Dyer, Jacob Eisenstein, Julia Hirschberg, Julia Hockenmaier, Tengyu Ma, Andrew McCallum, Kathy McKeown, Rada Mihalcea, Tom Mitchell, Ray Mooney, Karthik Narasimhan, Graham Neubig, Christos Papadimitriou, Nanyun Peng, Drago Radev, Sasha Rush, Fei Sha, Yulia Tsvetkov, Luke Zettlemoyer and many others. These people are really a big part of the reasons that I love our research community so much, therefore I want to follow their paths and dedicate myself to an academic career. I hope to continue to contribute to our research community in the future.

A special thanks to Andrew Chi-Chih Yao for creating the Special Pilot CS Class where I did my undergraduate studies. I am super proud of being a part of the ``Yao class'' family. I also thank Weizhu Chen, Qiang Yang and Haixun Wang, with them I received my very first research experience. With their support, I was very fortunate to have the opportunity to come to Stanford for my PhD.

I thank my parents: Zhi Chen and Hongmei Wang. Like most Chinese students in my generation, I am the only child of my family and I have a very close relationship with them --- even if they are living 16 (or 15) hours ahead of me and I can only spare 2--3 weeks staying with them every year. My parents made me who I am today and I never know how to pay them back. I hope that they are at least a little proud of me for what I have been through so far.

Lastly, I would like to thank Huacheng for his love and support (we got married 4 months before this dissertation was submitted). I was fifteen when I first met Huacheng and we have been experiencing almost everything together since then: from high-school programming competitions, to our wonderful college time at Tsinghua University and we both made it to the Stanford CS PhD program in 2012. For over ten years in the past, he is not only my partner, my classmate, my best friend, but also the person I admire most, for his modesty, intelligence, concentration and hard work.  Without him, I would not have come to Stanford. Without him, I would also not have taken the job at Princeton. I thank him for everything he has done for me.

\newpage

\begin{flushright}
To my parents and Huacheng, for their unconditional love.
\end{flushright}


================================================
FILE: acl_natbib_nourl.bst
================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% BibTeX style file acl_natbib_nourl.bst
%
% intended as input to urlbst script
%
% adapted from compling.bst 
% in order to mimic the style files for ACL conferences prior to 2017
% by making the following three changes:
% - for @incollection, page numbers now follow volume title.
% - for @inproceedings, address now follows conference name.
%	(address is intended as location of conference,
%	 not address of publisher.)
% - for papers with three authors, use et al. in citation
% Dan Gildea 2017/06/08

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% BibTeX style file compling.bst
%
% Intended for the journal Computational Linguistics (ACL/MIT Press)
% Created by Ron Artstein on 2005/08/22 
% For use with <natbib.sty> for author-year citations.
%
% I created this file in order to allow submissions to the journal 
% Computational Linguistics using the <natbib> package for author-year
% citations, which offers a lot more flexibility than <fullname>, CL's 
% official citation package. This file adheres strictly to the official
% style guide available from the MIT Press:
%
% http://mitpress.mit.edu/journals/coli/compling_style.pdf
%
% This includes all the various quirks of the style guide, for example: 
% - a chapter from a monograph (@inbook) has no page numbers.
% - an article from an edited volume (@incollection) has page numbers
%   after the publisher and address.
% - an article from a proceedings volume (@inproceedings) has page 
%   numbers before the publisher and address.
%
% Where the style guide was inconsistent or not specific enough I 
% looked at actual published articles and exercised my own judgment. 
% I noticed two inconsistencies in the style guide:
%
% - The style guide gives one example of an article from an edited 
%   volume with the editor's name spelled out in full, and another 
%   with the editors' names abbreviated. I chose to accept the first 
%   one as correct, since the style guide generally shuns abbreviations, 
%   and editors' names are also spelled out in some recently published 
%   articles.
%
% - The style guide gives one example of a reference where the word 
%   "and" between two authors is preceded by a comma. This is most 
%   likely a typo, since in all other cases with just two authors or 
%   editors there is no comma before the word "and".
%
% One case where the style guide is not being specific is the placement
% of the edition number, for which no example is given. I chose to put 
% it immediately after the title, which I (subjectively) find natural,
% and is also the place of the edition in a few recently published
% articles.
%
% This file correctly reproduces all of the examples in the official
% style guide, except for the two inconsistencies noted above. I even
% managed to get it to correctly format the proceedings example which 
% has an organization, a publisher, and two addresses (the conference 
% location and the publisher's address), though I cheated a bit by 
% putting the conference location and month as part of the title field; 
% I feel that in this case the conference location and month can be
% considered as part of the title, and that adding a location field 
% is not justified. Note also that a location field is not standard, 
% so entries made with this field would not port nicely to other styles. 
% However, if authors feel that there's a need for a location field 
% then tell me and I'll see what I can do.
%
% The file also produces to my satisfaction all the bibliographical 
% entries in my recent (joint) submission to CL (this was the original 
% motivation for creating the file). I also tested it by running it
% on a larger set of entries and eyeballing the results. There may of
% course still be errors, especially with combinations of fields that
% are not that common, or with cross-references (which I seldom use). 
% If you find such errors please write to me. 
% 
% I hope people find this file useful. Please email me with comments 
% and suggestions.
% 
% Ron Artstein
% artstein [at] essex.ac.uk
% August 22, 2005.
%
% Some technical notes.
%
% This file is based on a file generated with the package <custom-bib> 
% by Patrick W. Daly (see selected options below), which was then 
% manually customized to conform with certain CL requirements which
% cannot be met by <custom-bib>. Departures from the generated file 
% include:
%
% Function inbook: moved publisher and address to the end; moved 
% edition after title; replaced function format.chapter.pages by 
% new function format.chapter to output chapter without pages.
% 
% Function inproceedings: moved publisher and address to the end;
% replaced function format.in.ed.booktitle by new function 
% format.in.booktitle to output the proceedings title without 
% the editor.
% 
% Functions book, incollection, manual: moved edition after title.
%
% Function mastersthesis: formatted title as for articles (unlike 
% phdthesis which is formatted as book) and added month.
% 
% Function proceedings: added new.sentence between organization and 
% publisher when both are present.
% 
% Function format.lab.names: modified so that it gives all the
% authors' surnames for in-text citations for one, two and three
% authors and only uses "et. al" for works with four authors or more
% (thanks to Ken Shan for convincing me to go through the trouble of 
% modifying this function rather than using unreliable hacks).
%
% Changes: 
%
% 2006-10-27: Changed function reverse.pass so that the extra label is
% enclosed in parentheses when the year field ends in an uppercase or
% lowercase letter (change modeled after Uli Sauerland's modification
% of nals.bst). RA.
%
%
% The preamble of the generated file begins below:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% This is file `compling.bst',
%% generated with the docstrip utility.
%%
%% The original source files were:
%%
%% merlin.mbs  (with options: `ay,nat,vonx,nm-revv1,jnrlst,keyxyr,blkyear,dt-beg,yr-per,note-yr,num-xser,pre-pub,xedn,nfss')
%% ----------------------------------------
%% *** Intended for the journal Computational Linguistics ***
%% 
%% Copyright 1994-2002 Patrick W Daly
 % ===============================================================
 % IMPORTANT NOTICE:
 % This bibliographic style (bst) file has been generated from one or
 % more master bibliographic style (mbs) files, listed above.
 %
 % This generated file can be redistributed and/or modified under the terms
 % of the LaTeX Project Public License Distributed from CTAN
 % archives in directory macros/latex/base/lppl.txt; either
 % version 1 of the License, or any later version.
 % ===============================================================
 % Name and version information of the main mbs file:
 % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)]
 %   For use with BibTeX version 0.99a or later
 %-------------------------------------------------------------------
 % This bibliography style file is intended for texts in ENGLISH
 % This is an author-year citation style bibliography. As such, it is
 % non-standard LaTeX, and requires a special package file to function properly.
 % Such a package is    natbib.sty   by Patrick W. Daly
 % The form of the \bibitem entries is
 %   \bibitem[Jones et al.(1990)]{key}...
 %   \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}...
 % The essential feature is that the label (the part in brackets) consists
 % of the author names, as they should appear in the citation, with the year
 % in parentheses following. There must be no space before the opening
 % parenthesis!
 % With natbib v5.3, a full list of authors may also follow the year.
 % In natbib.sty, it is possible to define the type of enclosures that is
 % really wanted (brackets or parentheses), but in either case, there must
 % be parentheses in the label.
 % The \cite command functions as follows:
 %   \citet{key} ==>>                Jones et al. (1990)
 %   \citet*{key} ==>>               Jones, Baker, and Smith (1990)
 %   \citep{key} ==>>                (Jones et al., 1990)
 %   \citep*{key} ==>>               (Jones, Baker, and Smith, 1990)
 %   \citep[chap. 2]{key} ==>>       (Jones et al., 1990, chap. 2)
 %   \citep[e.g.][]{key} ==>>        (e.g. Jones et al., 1990)
 %   \citep[e.g.][p. 32]{key} ==>>   (e.g. Jones et al., p. 32)
 %   \citeauthor{key} ==>>           Jones et al.
 %   \citeauthor*{key} ==>>          Jones, Baker, and Smith
 %   \citeyear{key} ==>>             1990
 %---------------------------------------------------------------------

ENTRY
  { address
    author
    booktitle
    chapter
    edition
    editor
    howpublished
    institution
    journal
    key
    month
    note
    number
    organization
    pages
    publisher
    school
    series
    title
    type
    volume
    year
  }
  {}
  { label extra.label sort.label short.list }
INTEGERS { output.state before.all mid.sentence after.sentence after.block }
FUNCTION {init.state.consts}
{ #0 'before.all :=
  #1 'mid.sentence :=
  #2 'after.sentence :=
  #3 'after.block :=
}
STRINGS { s t}
FUNCTION {output.nonnull}
{ 's :=
  output.state mid.sentence =
    { ", " * write$ }
    { output.state after.block =
        { add.period$ write$
          newline$
          "\newblock " write$
        }
        { output.state before.all =
            'write$
            { add.period$ " " * write$ }
          if$
        }
      if$
      mid.sentence 'output.state :=
    }
  if$
  s
}
FUNCTION {output}
{ duplicate$ empty$
    'pop$
    'output.nonnull
  if$
}
FUNCTION {output.check}
{ 't :=
  duplicate$ empty$
    { pop$ "empty " t * " in " * cite$ * warning$ }
    'output.nonnull
  if$
}
FUNCTION {fin.entry}
{ add.period$
  write$
  newline$
}

FUNCTION {new.block}
{ output.state before.all =
    'skip$
    { after.block 'output.state := }
  if$
}
FUNCTION {new.sentence}
{ output.state after.block =
    'skip$
    { output.state before.all =
        'skip$
        { after.sentence 'output.state := }
      if$
    }
  if$
}
FUNCTION {add.blank}
{  " " * before.all 'output.state :=
}

FUNCTION {date.block}
{
  new.block
}

FUNCTION {not}
{   { #0 }
    { #1 }
  if$
}
FUNCTION {and}
{   'skip$
    { pop$ #0 }
  if$
}
FUNCTION {or}
{   { pop$ #1 }
    'skip$
  if$
}
FUNCTION {new.block.checkb}
{ empty$
  swap$ empty$
  and
    'skip$
    'new.block
  if$
}
FUNCTION {field.or.null}
{ duplicate$ empty$
    { pop$ "" }
    'skip$
  if$
}
FUNCTION {emphasize}
{ duplicate$ empty$
    { pop$ "" }
    { "\emph{" swap$ * "}" * }
  if$
}
FUNCTION {tie.or.space.prefix}
{ duplicate$ text.length$ #3 <
    { "~" }
    { " " }
  if$
  swap$
}

FUNCTION {capitalize}
{ "u" change.case$ "t" change.case$ }

FUNCTION {space.word}
{ " " swap$ * " " * }
 % Here are the language-specific definitions for explicit words.
 % Each function has a name bbl.xxx where xxx is the English word.
 % The language selected here is ENGLISH
FUNCTION {bbl.and}
{ "and"}

FUNCTION {bbl.etal}
{ "et~al." }

FUNCTION {bbl.editors}
{ "editors" }

FUNCTION {bbl.editor}
{ "editor" }

FUNCTION {bbl.edby}
{ "edited by" }

FUNCTION {bbl.edition}
{ "edition" }

FUNCTION {bbl.volume}
{ "volume" }

FUNCTION {bbl.of}
{ "of" }

FUNCTION {bbl.number}
{ "number" }

FUNCTION {bbl.nr}
{ "no." }

FUNCTION {bbl.in}
{ "in" }

FUNCTION {bbl.pages}
{ "pages" }

FUNCTION {bbl.page}
{ "page" }

FUNCTION {bbl.chapter}
{ "chapter" }

FUNCTION {bbl.techrep}
{ "Technical Report" }

FUNCTION {bbl.mthesis}
{ "Master's thesis" }

FUNCTION {bbl.phdthesis}
{ "Ph.D. thesis" }

MACRO {jan} {"January"}

MACRO {feb} {"February"}

MACRO {mar} {"March"}

MACRO {apr} {"April"}

MACRO {may} {"May"}

MACRO {jun} {"June"}

MACRO {jul} {"July"}

MACRO {aug} {"August"}

MACRO {sep} {"September"}

MACRO {oct} {"October"}

MACRO {nov} {"November"}

MACRO {dec} {"December"}

MACRO {acmcs} {"ACM Computing Surveys"}

MACRO {acta} {"Acta Informatica"}

MACRO {cacm} {"Communications of the ACM"}

MACRO {ibmjrd} {"IBM Journal of Research and Development"}

MACRO {ibmsj} {"IBM Systems Journal"}

MACRO {ieeese} {"IEEE Transactions on Software Engineering"}

MACRO {ieeetc} {"IEEE Transactions on Computers"}

MACRO {ieeetcad}
 {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"}

MACRO {ipl} {"Information Processing Letters"}

MACRO {jacm} {"Journal of the ACM"}

MACRO {jcss} {"Journal of Computer and System Sciences"}

MACRO {scp} {"Science of Computer Programming"}

MACRO {sicomp} {"SIAM Journal on Computing"}

MACRO {tocs} {"ACM Transactions on Computer Systems"}

MACRO {tods} {"ACM Transactions on Database Systems"}

MACRO {tog} {"ACM Transactions on Graphics"}

MACRO {toms} {"ACM Transactions on Mathematical Software"}

MACRO {toois} {"ACM Transactions on Office Information Systems"}

MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"}

MACRO {tcs} {"Theoretical Computer Science"}
FUNCTION {bibinfo.check}
{ swap$
  duplicate$ missing$
    {
      pop$ pop$
      ""
    }
    { duplicate$ empty$
        {
          swap$ pop$
        }
        { swap$
          pop$
        }
      if$
    }
  if$
}
FUNCTION {bibinfo.warn}
{ swap$
  duplicate$ missing$
    {
      swap$ "missing " swap$ * " in " * cite$ * warning$ pop$
      ""
    }
    { duplicate$ empty$
        {
          swap$ "empty " swap$ * " in " * cite$ * warning$
        }
        { swap$
          pop$
        }
      if$
    }
  if$
}
STRINGS  { bibinfo}
INTEGERS { nameptr namesleft numnames }

FUNCTION {format.names}
{ 'bibinfo :=
  duplicate$ empty$ 'skip$ {
  's :=
  "" 't :=
  #1 'nameptr :=
  s num.names$ 'numnames :=
  numnames 'namesleft :=
    { namesleft #0 > }
    { s nameptr
      duplicate$ #1 >
        { "{ff~}{vv~}{ll}{, jj}" }
        { "{ff~}{vv~}{ll}{, jj}" }	% first name first for first author 
%        { "{vv~}{ll}{, ff}{, jj}" }	% last name first for first author
      if$
      format.name$
      bibinfo bibinfo.check
      't :=
      nameptr #1 >
        {
          namesleft #1 >
            { ", " * t * }
            {
              numnames #2 >
                { "," * }
                'skip$
              if$
              s nameptr "{ll}" format.name$ duplicate$ "others" =
                { 't := }
                { pop$ }
              if$
              t "others" =
                {
                  " " * bbl.etal *
                }
                {
                  bbl.and
                  space.word * t *
                }
              if$
            }
          if$
        }
        't
      if$
      nameptr #1 + 'nameptr :=
      namesleft #1 - 'namesleft :=
    }
  while$
  } if$
}
FUNCTION {format.names.ed}
{
  'bibinfo :=
  duplicate$ empty$ 'skip$ {
  's :=
  "" 't :=
  #1 'nameptr :=
  s num.names$ 'numnames :=
  numnames 'namesleft :=
    { namesleft #0 > }
    { s nameptr
      "{ff~}{vv~}{ll}{, jj}"
      format.name$
      bibinfo bibinfo.check
      't :=
      nameptr #1 >
        {
          namesleft #1 >
            { ", " * t * }
            {
              numnames #2 >
                { "," * }
                'skip$
              if$
              s nameptr "{ll}" format.name$ duplicate$ "others" =
                { 't := }
                { pop$ }
              if$
              t "others" =
                {

                  " " * bbl.etal *
                }
                {
                  bbl.and
                  space.word * t *
                }
              if$
            }
          if$
        }
        't
      if$
      nameptr #1 + 'nameptr :=
      namesleft #1 - 'namesleft :=
    }
  while$
  } if$
}
FUNCTION {format.key}
{ empty$
    { key field.or.null }
    { "" }
  if$
}

FUNCTION {format.authors}
{ author "author" format.names
}
FUNCTION {get.bbl.editor}
{ editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ }

FUNCTION {format.editors}
{ editor "editor" format.names duplicate$ empty$ 'skip$
    {
      "," *
      " " *
      get.bbl.editor
      *
    }
  if$
}
FUNCTION {format.note}
{
 note empty$
    { "" }
    { note #1 #1 substring$
      duplicate$ "{" =
        'skip$
        { output.state mid.sentence =
          { "l" }
          { "u" }
        if$
        change.case$
        }
      if$
      note #2 global.max$ substring$ * "note" bibinfo.check
    }
  if$
}

FUNCTION {format.title}
{ title
  duplicate$ empty$ 'skip$
    { "t" change.case$ }
  if$
  "title" bibinfo.check
}
FUNCTION {format.full.names}
{'s :=
 "" 't :=
  #1 'nameptr :=
  s num.names$ 'numnames :=
  numnames 'namesleft :=
    { namesleft #0 > }
    { s nameptr
      "{vv~}{ll}" format.name$
      't :=
      nameptr #1 >
        {
          namesleft #1 >
            { ", " * t * }
            {
              s nameptr "{ll}" format.name$ duplicate$ "others" =
                { 't := }
                { pop$ }
              if$
              t "others" =
                {
                  " " * bbl.etal *
                }
                {
                  numnames #2 >
                    { "," * }
                    'skip$
                  if$
                  bbl.and
                  space.word * t *
                }
              if$
            }
          if$
        }
        't
      if$
      nameptr #1 + 'nameptr :=
      namesleft #1 - 'namesleft :=
    }
  while$
}

FUNCTION {author.editor.key.full}
{ author empty$
    { editor empty$
        { key empty$
            { cite$ #1 #3 substring$ }
            'key
          if$
        }
        { editor format.full.names }
      if$
    }
    { author format.full.names }
  if$
}

FUNCTION {author.key.full}
{ author empty$
    { key empty$
         { cite$ #1 #3 substring$ }
          'key
      if$
    }
    { author format.full.names }
  if$
}

FUNCTION {editor.key.full}
{ editor empty$
    { key empty$
         { cite$ #1 #3 substring$ }
          'key
      if$
    }
    { editor format.full.names }
  if$
}

FUNCTION {make.full.names}
{ type$ "book" =
  type$ "inbook" =
  or
    'author.editor.key.full
    { type$ "proceedings" =
        'editor.key.full
        'author.key.full
      if$
    }
  if$
}

FUNCTION {output.bibitem}
{ newline$
  "\bibitem[{" write$
  label write$
  ")" make.full.names duplicate$ short.list =
     { pop$ }
     { * }
   if$
  "}]{" * write$
  cite$ write$
  "}" write$
  newline$
  ""
  before.all 'output.state :=
}

FUNCTION {n.dashify}
{
  't :=
  ""
    { t empty$ not }
    { t #1 #1 substring$ "-" =
        { t #1 #2 substring$ "--" = not
            { "--" *
              t #2 global.max$ substring$ 't :=
            }
            {   { t #1 #1 substring$ "-" = }
                { "-" *
                  t #2 global.max$ substring$ 't :=
                }
              while$
            }
          if$
        }
        { t #1 #1 substring$ *
          t #2 global.max$ substring$ 't :=
        }
      if$
    }
  while$
}

FUNCTION {word.in}
{ bbl.in capitalize
  " " * }

FUNCTION {format.date}
{ year "year" bibinfo.check duplicate$ empty$
    {
    }
    'skip$
  if$
  extra.label *
  before.all 'output.state :=
  after.sentence 'output.state :=
}
FUNCTION {format.btitle}
{ title "title" bibinfo.check
  duplicate$ empty$ 'skip$
    {
      emphasize
    }
  if$
}
FUNCTION {either.or.check}
{ empty$
    'pop$
    { "can't use both " swap$ * " fields in " * cite$ * warning$ }
  if$
}
FUNCTION {format.bvolume}
{ volume empty$
    { "" }
    { bbl.volume volume tie.or.space.prefix
      "volume" bibinfo.check * *
      series "series" bibinfo.check
      duplicate$ empty$ 'pop$
        { swap$ bbl.of space.word * swap$
          emphasize * }
      if$
      "volume and number" number either.or.check
    }
  if$
}
FUNCTION {format.number.series}
{ volume empty$
    { number empty$
        { series field.or.null }
        { series empty$
            { number "number" bibinfo.check }
        { output.state mid.sentence =
            { bbl.number }
            { bbl.number capitalize }
          if$
          number tie.or.space.prefix "number" bibinfo.check * *
          bbl.in space.word *
          series "series" bibinfo.check *
        }
      if$
    }
      if$
    }
    { "" }
  if$
}

FUNCTION {format.edition}
{ edition duplicate$ empty$ 'skip$
    {
      output.state mid.sentence =
        { "l" }
        { "t" }
      if$ change.case$
      "edition" bibinfo.check
      " " * bbl.edition *
    }
  if$
}
INTEGERS { multiresult }
FUNCTION {multi.page.check}
{ 't :=
  #0 'multiresult :=
    { multiresult not
      t empty$ not
      and
    }
    { t #1 #1 substring$
      duplicate$ "-" =
      swap$ duplicate$ "," =
      swap$ "+" =
      or or
        { #1 'multiresult := }
        { t #2 global.max$ substring$ 't := }
      if$
    }
  while$
  multiresult
}
FUNCTION {format.pages}
{ pages duplicate$ empty$ 'skip$
    { duplicate$ multi.page.check
        {
          bbl.pages swap$
          n.dashify
        }
        {
          bbl.page swap$
        }
      if$
      tie.or.space.prefix
      "pages" bibinfo.check
      * *
    }
  if$
}
FUNCTION {format.journal.pages}
{ pages duplicate$ empty$ 'pop$
    { swap$ duplicate$ empty$
        { pop$ pop$ format.pages }
        {
          ":" *
          swap$
          n.dashify
          "pages" bibinfo.check
          *
        }
      if$
    }
  if$
}
FUNCTION {format.vol.num.pages}
{ volume field.or.null
  duplicate$ empty$ 'skip$
    {
      "volume" bibinfo.check
    }
  if$
  number "number" bibinfo.check duplicate$ empty$ 'skip$
    {
      swap$ duplicate$ empty$
        { "there's a number but no volume in " cite$ * warning$ }
        'skip$
      if$
      swap$
      "(" swap$ * ")" *
    }
  if$ *
  format.journal.pages
}

FUNCTION {format.chapter}
{ chapter empty$
    'skip$
    { type empty$
        { bbl.chapter }
        { type "l" change.case$
          "type" bibinfo.check
        }
      if$
      chapter tie.or.space.prefix
      "chapter" bibinfo.check
      * *
    }
  if$
}

FUNCTION {format.chapter.pages}
{ chapter empty$
    'format.pages
    { type empty$
        { bbl.chapter }
        { type "l" change.case$
          "type" bibinfo.check
        }
      if$
      chapter tie.or.space.prefix
      "chapter" bibinfo.check
      * *
      pages empty$
        'skip$
        { ", " * format.pages * }
      if$
    }
  if$
}

FUNCTION {format.booktitle}
{
  booktitle "booktitle" bibinfo.check
  emphasize
}
FUNCTION {format.in.booktitle}
{ format.booktitle duplicate$ empty$ 'skip$
    {
      word.in swap$ *
    }
  if$
}
FUNCTION {format.in.ed.booktitle}
{ format.booktitle duplicate$ empty$ 'skip$
    {
      editor "editor" format.names.ed duplicate$ empty$ 'pop$
        {
          "," *
          " " *
          get.bbl.editor
          ", " *
          * swap$
          * }
      if$
      word.in swap$ *
    }
  if$
}
FUNCTION {format.thesis.type}
{ type duplicate$ empty$
    'pop$
    { swap$ pop$
      "t" change.case$ "type" bibinfo.check
    }
  if$
}
FUNCTION {format.tr.number}
{ number "number" bibinfo.check
  type duplicate$ empty$
    { pop$ bbl.techrep }
    'skip$
  if$
  "type" bibinfo.check
  swap$ duplicate$ empty$
    { pop$ "t" change.case$ }
    { tie.or.space.prefix * * }
  if$
}
FUNCTION {format.article.crossref}
{
  word.in
  " \cite{" * crossref * "}" *
}
FUNCTION {format.book.crossref}
{ volume duplicate$ empty$
    { "empty volume in " cite$ * "'s crossref of " * crossref * warning$
      pop$ word.in
    }
    { bbl.volume
      capitalize
      swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word *
    }
  if$
  " \cite{" * crossref * "}" *
}
FUNCTION {format.incoll.inproc.crossref}
{
  word.in
  " \cite{" * crossref * "}" *
}
FUNCTION {format.org.or.pub}
{ 't :=
  ""
  address empty$ t empty$ and
    'skip$
    {
      t empty$
        { address "address" bibinfo.check *
        }
        { t *
          address empty$
            'skip$
            { ", " * address "address" bibinfo.check * }
          if$
        }
      if$
    }
  if$
}
FUNCTION {format.publisher.address}
{ publisher "publisher" bibinfo.warn format.org.or.pub
}

FUNCTION {format.organization.address}
{ organization "organization" bibinfo.check format.org.or.pub
}

FUNCTION {article}
{ output.bibitem
  format.authors "author" output.check
  author format.key output
  format.date "year" output.check
  date.block
  format.title "title" output.check
  new.block
  crossref missing$
    {
      journal
      "journal" bibinfo.check
      emphasize
      "journal" output.check
      format.vol.num.pages output
    }
    { format.article.crossref output.nonnull
      format.pages output
    }
  if$
  new.block
  format.note output
  fin.entry
}
FUNCTION {book}
{ output.bibitem
  author empty$
    { format.editors "author and editor" output.check
      editor format.key output
    }
    { format.authors output.nonnull
      crossref missing$
        { "author and editor" editor either.or.check }
        'skip$
      if$
    }
  if$
  format.date "year" output.check
  date.block
  format.btitle "title" output.check
  format.edition output
  crossref missing$
    { format.bvolume output
      new.block
      format.number.series output
      new.sentence
      format.publisher.address output
    }
    {
      new.block
      format.book.crossref output.nonnull
    }
  if$
  new.block
  format.note output
  fin.entry
}
FUNCTION {booklet}
{ output.bibitem
  format.authors output
  author format.key output
  format.date "year" output.check
  date.block
  format.title "title" output.check
  new.block
  howpublished "howpublished" bibinfo.check output
  address "address" bibinfo.check output
  new.block
  format.note output
  fin.entry
}

FUNCTION {inbook}
{ output.bibitem
  author empty$
    { format.editors "author and editor" output.check
      editor format.key output
    }
    { format.authors output.nonnull
      crossref missing$
        { "author and editor" editor either.or.check }
        'skip$
      if$
    }
  if$
  format.date "year" output.check
  date.block
  format.btitle "title" output.check
  format.edition output
  crossref missing$
    {
      format.bvolume output
      format.number.series output
      format.chapter "chapter" output.check
      new.sentence
      format.publisher.address output
      new.block
    }
    {
      format.chapter "chapter" output.check
      new.block
      format.book.crossref output.nonnull
    }
  if$
  new.block
  format.note output
  fin.entry
}

FUNCTION {incollection}
{ output.bibitem
  format.authors "author" output.check
  author format.key output
  format.date "year" output.check
  date.block
  format.title "title" output.check
  new.block
  crossref missing$
    { format.in.ed.booktitle "booktitle" output.check
      format.edition output
      format.bvolume output
      format.number.series output
      format.chapter.pages output
      new.sentence
      format.publisher.address output
    }
    { format.incoll.inproc.crossref output.nonnull
      format.chapter.pages output
    }
  if$
  new.block
  format.note output
  fin.entry
}
FUNCTION {inproceedings}
{ output.bibitem
  format.authors "author" output.check
  author format.key output
  format.date "year" output.check
  date.block
  format.title "title" output.check
  new.block
  crossref missing$
    { format.in.booktitle "booktitle" output.check
      format.bvolume output
      format.number.series output
      format.pages output
      address "address" bibinfo.check output
      new.sentence
      organization "organization" bibinfo.check output
      publisher "publisher" bibinfo.check output
    }
    { format.incoll.inproc.crossref output.nonnull
      format.pages output
    }
  if$
  new.block
  format.note output
  fin.entry
}
FUNCTION {conference} { inproceedings }
FUNCTION {manual}
{ output.bibitem
  format.authors output
  author format.key output
  format.date "year" output.check
  date.block
  format.btitle "title" output.check
  format.edition output
  organization address new.block.checkb
  organization "organization" bibinfo.check output
  address "address" bibinfo.check output
  new.block
  format.note output
  fin.entry
}

FUNCTION {mastersthesis}
{ output.bibitem
  format.authors "author" output.check
  author format.key output
  format.date "year" output.check
  date.block
  format.title
  "title" output.check
  new.block
  bbl.mthesis format.thesis.type output.nonnull
  school "school" bibinfo.warn output
  address "address" bibinfo.check output
  month "month" bibinfo.check output
  new.block
  format.note output
  fin.entry
}

FUNCTION {misc}
{ output.bibitem
  format.authors output
  author format.key output
  format.date "year" output.check
  date.block
  format.title output
  new.block
  howpublished "howpublished" bibinfo.check output
  new.block
  format.note output
  fin.entry
}
FUNCTION {phdthesis}
{ output.bibitem
  format.authors "author" output.check
  author format.key output
  format.date "year" output.check
  date.block
  format.btitle
  "title" output.check
  new.block
  bbl.phdthesis format.thesis.type output.nonnull
  school "school" bibinfo.warn output
  address "address" bibinfo.check output
  new.block
  format.note output
  fin.entry
}

FUNCTION {proceedings}
{ output.bibitem
  format.editors output
  editor format.key output
  format.date "year" output.check
  date.block
  format.btitle "title" output.check
  format.bvolume output
  format.number.series output
  new.sentence
  publisher empty$
    { format.organization.address output }
    { organization "organization" bibinfo.check output
      new.sentence
      format.publisher.address output
    }
  if$
  new.block
  format.note output
  fin.entry
}

FUNCTION {techreport}
{ output.bibitem
  format.authors "author" output.check
  author format.key output
  format.date "year" output.check
  date.block
  format.title
  "title" output.check
  new.block
  format.tr.number output.nonnull
  institution "institution" bibinfo.warn output
  address "address" bibinfo.check output
  new.block
  format.note output
  fin.entry
}

FUNCTION {unpublished}
{ output.bibitem
  format.authors "author" output.check
  author format.key output
  format.date "year" output.check
  date.block
  format.title "title" output.check
  new.block
  format.note "note" output.check
  fin.entry
}

FUNCTION {default.type} { misc }
READ
FUNCTION {sortify}
{ purify$
  "l" change.case$
}
INTEGERS { len }
FUNCTION {chop.word}
{ 's :=
  'len :=
  s #1 len substring$ =
    { s len #1 + global.max$ substring$ }
    's
  if$
}
FUNCTION {format.lab.names}
{ 's :=
  "" 't :=
  s #1 "{vv~}{ll}" format.name$
  s num.names$ duplicate$
  #2 >
    { pop$
      " " * bbl.etal *
    }
    { #2 <
        'skip$
        { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" =
            {
              " " * bbl.etal *
            }
            { bbl.and space.word * s #2 "{vv~}{ll}" format.name$
              * }
          if$
        }
      if$
    }
  if$
}

FUNCTION {author.key.label}
{ author empty$
    { key empty$
        { cite$ #1 #3 substring$ }
        'key
      if$
    }
    { author format.lab.names }
  if$
}

FUNCTION {author.editor.key.label}
{ author empty$
    { editor empty$
        { key empty$
            { cite$ #1 #3 substring$ }
            'key
          if$
        }
        { editor format.lab.names }
      if$
    }
    { author format.lab.names }
  if$
}

FUNCTION {editor.key.label}
{ editor empty$
    { key empty$
        { cite$ #1 #3 substring$ }
        'key
      if$
    }
    { editor format.lab.names }
  if$
}

FUNCTION {calc.short.authors}
{ type$ "book" =
  type$ "inbook" =
  or
    'author.editor.key.label
    { type$ "proceedings" =
        'editor.key.label
        'author.key.label
      if$
    }
  if$
  'short.list :=
}

FUNCTION {calc.label}
{ calc.short.authors
  short.list
  "("
  *
  year duplicate$ empty$
  short.list key field.or.null = or
     { pop$ "" }
     'skip$
  if$
  *
  'label :=
}

FUNCTION {sort.format.names}
{ 's :=
  #1 'nameptr :=
  ""
  s num.names$ 'numnames :=
  numnames 'namesleft :=
    { namesleft #0 > }
    { s nameptr
      "{ll{ }}{  ff{ }}{  jj{ }}"
      format.name$ 't :=
      nameptr #1 >
        {
          "   "  *
          namesleft #1 = t "others" = and
            { "zzzzz" * }
            { t sortify * }
          if$
        }
        { t sortify * }
      if$
      nameptr #1 + 'nameptr :=
      namesleft #1 - 'namesleft :=
    }
  while$
}

FUNCTION {sort.format.title}
{ 't :=
  "A " #2
    "An " #3
      "The " #4 t chop.word
    chop.word
  chop.word
  sortify
  #1 global.max$ substring$
}
FUNCTION {author.sort}
{ author empty$
    { key empty$
        { "to sort, need author or key in " cite$ * warning$
          ""
        }
        { key sortify }
      if$
    }
    { author sort.format.names }
  if$
}
FUNCTION {author.editor.sort}
{ author empty$
    { editor empty$
        { key empty$
            { "to sort, need author, editor, or key in " cite$ * warning$
              ""
            }
            { key sortify }
          if$
        }
        { editor sort.format.names }
      if$
    }
    { author sort.format.names }
  if$
}
FUNCTION {editor.sort}
{ editor empty$
    { key empty$
        { "to sort, need editor or key in " cite$ * warning$
          ""
        }
        { key sortify }
      if$
    }
    { editor sort.format.names }
  if$
}
FUNCTION {presort}
{ calc.label
  label sortify
  "    "
  *
  type$ "book" =
  type$ "inbook" =
  or
    'author.editor.sort
    { type$ "proceedings" =
        'editor.sort
        'author.sort
      if$
    }
  if$
  #1 entry.max$ substring$
  'sort.label :=
  sort.label
  *
  "    "
  *
  title field.or.null
  sort.format.title
  *
  #1 entry.max$ substring$
  'sort.key$ :=
}

ITERATE {presort}
SORT
STRINGS { last.label next.extra }
INTEGERS { last.extra.num number.label }
FUNCTION {initialize.extra.label.stuff}
{ #0 int.to.chr$ 'last.label :=
  "" 'next.extra :=
  #0 'last.extra.num :=
  #0 'number.label :=
}
FUNCTION {forward.pass}
{ last.label label =
    { last.extra.num #1 + 'last.extra.num :=
      last.extra.num int.to.chr$ 'extra.label :=
    }
    { "a" chr.to.int$ 'last.extra.num :=
      "" 'extra.label :=
      label 'last.label :=
    }
  if$
  number.label #1 + 'number.label :=
}
FUNCTION {reverse.pass}
{ next.extra "b" =
    { "a" 'extra.label := }
    'skip$
  if$
  extra.label 'next.extra :=
  extra.label
  duplicate$ empty$
    'skip$
    { year field.or.null #-1 #1 substring$ chr.to.int$ #65 < 
      { "{\natexlab{" swap$ * "}}" * }
      { "{(\natexlab{" swap$ * "})}" * }
    if$ }
  if$
  'extra.label :=
  label extra.label * 'label :=
}
EXECUTE {initialize.extra.label.stuff}
ITERATE {forward.pass}
REVERSE {reverse.pass}
FUNCTION {bib.sort.order}
{ sort.label
  "    "
  *
  year field.or.null sortify
  *
  "    "
  *
  title field.or.null
  sort.format.title
  *
  #1 entry.max$ substring$
  'sort.key$ :=
}
ITERATE {bib.sort.order}
SORT
FUNCTION {begin.bib}
{ preamble$ empty$
    'skip$
    { preamble$ write$ newline$ }
  if$
  "\begin{thebibliography}{" number.label int.to.str$ * "}" *
  write$ newline$
  "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi"
  write$ newline$
}
EXECUTE {begin.bib}
EXECUTE {init.state.consts}
ITERATE {call.type$}
FUNCTION {end.bib}
{ newline$
  "\end{thebibliography}" write$ newline$
}
EXECUTE {end.bib}
%% End of customized bst file
%%
%% End of file `compling.bst'.


================================================
FILE: chapters/coqa/dataset.tex
================================================
%!TEX root = ../../thesis.tex

\section{\sys{CoQA}: A Conversational QA Challenge}
\label{sec:coqa-dataset}

In this section, we introduce \sys{CoQA}, a novel dataset for building \tf{Co}nversational \tf{Q}uestion \tf{A}nswering systems. We develop \sys{CoQA} with three main goals in mind. The first concerns the nature of questions in a human conversation. As an example seen in Figure~\ref{fig:coqa-example}, in this conversation, every question after the first is dependent on the conversation history. At present, there are no large scale reading comprehension datasets which contain questions that depend on a conversation history and this is what \sys{CoQA} is mainly developed for.\footnote{Concurrent with our work, \newcite{choi2018quac} also created a conversational dataset with a similar goal, but it differs in many key design decisions. We will discuss it in Section~\ref{sec:coqa-future}.}

The second goal of \sys{CoQA} is to ensure the naturalness of answers in a conversation. As we discussed in the earlier chapters, most existing reading comprehension datasets either restrict answers to a contiguous span in a given passage, or allow free-form answers with a low human agreement (e.g., \sys{NarrativeQA}). Our desiderata are 1) the answers should not be only span-based so that anything can be asked and the conversation can flow naturally. For example, there is no extractive answer for $Q_4$ \ti{How many?} in Figure~\ref{fig:coqa-example}. 2) It still supports reliable automatic evaluation with a a strong human performance. Therefore, we propose that the answers can be free-form text (abstractive answers), while the extractive spans act as rationales for the actual answers. Therefore, the answer for $Q_4$ is simply \ti{Three} while its rationale is spanned across multiple sentences.

The third goal of \sys{CoQA} is to enable building QA systems that perform robustly across domains. The current reading comprehension datasets mainly focus on a single domain which makes it hard to test the generalization ability of existing models. Hence we collect our dataset from seven different domains --- children's stories, literature, middle and high school English exams, news, Wikipedia, science articles and Reddit. The last two are used for out-of-domain evaluation.

\subsection{Task Definition}
\label{sec:coqa-task}

\begin{figure}[!t]
\begin{tabular}{p{\columnwidth}}
\toprule
The Virginia governor's race, billed as the marquee battle of an otherwise anticlimactic 2013 election cycle, is shaping up to be a foregone conclusion. Democrat Terry McAuliffe, the longtime political fixer and moneyman, hasn't trailed in a poll since May. Barring a political miracle, Republican Ken Cuccinelli will be delivering a concession speech on Tuesday evening in Richmond. In recent ...\\
\\
$Q_1$:               What are the candidates {\bf \color{magenta} running} for?\\
$A_1$:               Governor\\
$R_1$: The Virginia governor's race\\
\vspace{0em}
$Q_2$:               {\bf \color{magenta} Where}?\\
$A_2$:               Virginia \\
$R_2$: The Virginia governor's race\\
\vspace{0em}
$Q_3$:               Who is the democratic candidate?\\
\vspace{-0.6em}{\bf \color{blue} A$_3$}:               {\bf \color{orange} Terry McAuliffe} \\
$R_3$: Democrat Terry McAuliffe\\
\vspace{0em}
$Q_4$:               Who is {\bf \color{orange} his} opponent?\\
\vspace{-0.6em}{\bf \color{blue} A$_4$}:               {\bf \color{red} Ken Cuccinelli} \\
$R_4$ Republican Ken Cuccinelli\\
\vspace{0em}
$Q_5$:               What party does {\bf \color{red} he} belong to?\\
$A_5$:               Republican \\
$R_5$: Republican Ken Cuccinelli\\
\vspace{0em}
$Q_6$:               Which of {\bf \color{blue} them} is winning?\\
$A_6$:               Terry McAuliffe \\
$R_6$: Democrat Terry McAuliffe, the longtime political fixer and moneyman, hasn't trailed in a poll since May\\
\bottomrule
\end{tabular}
\longcaption{Another example in \sys{CoQA} with entity of focus changes}{\label{fig:coqa-example2}A conversation showing coreference chains in colors. The entity of focus changes in $Q_4$, $Q_5$, $Q_6$.}
\end{figure}

We first define the task formally. Given a passage $P$, a conversation consists of $n$ turns, and each turn consists of $(Q_i, A_i, R_i), i = 1, \ldots n$, where $Q_i$ and $A_i$ denote the question and the answer in the $i$-th turn, and $R_i$ is the rationale which supports the answer $A_i$ and must be a single span of the passage. The task is defined as to answer the next question $Q_i$ given the conversation so far: $Q_1, A_1, \ldots, Q_{i -1}, A_{i - 1}$. It is worth noting that we collect $R_i$ with the hope that they can help understand how answers are derived and improve training our models, while \ti{they are not provided during evaluation}.

For the example in Figure~\ref{fig:coqa-example2}, the conversation begins with question $Q_1$. We answer $Q_1$ with $A_1$ based on the evidence $R_1$ from the passage. In this example, the answerer wrote only the \ti{Governor} as the answer but selected a longer rationale \ti{The Virginia governor's race}. When we come to $Q_2$ \ti{Where?}, we must refer back to the conversation history since otherwise its answer could be \ti{Virginia} or \ti{Richmond} or something else. In our task, conversation history is indispensable for answering many questions. We use conversation history $Q_1$ and $A_1$ to answer $Q_2$ with $A_2$ based on the evidence $R_2$.  For an unanswerable question, we give \ti{unknown} as the final answer and do not highlight any rationale.

In this example, we observe that the entity of focus changes as the conversation progresses. The questioner uses \ti{his} to refer to \ti{Terry} in $Q_4$ and \ti{he} to \ti{Ken} in $Q_5$. If these are not resolved correctly, we end up with incorrect answers. The conversational nature of questions requires us to reason from multiple sentences (the current question and the previous questions or answers, and sentences from the passage). It is common that a single question may require a rationale spanned across multiple sentences (e.g., $Q_1$ $Q_4$ and $Q_5$ in Figure~\ref{fig:coqa-example}). We describe additional question and answer types in \ref{sec:coqa-data-analysis}.


\subsection{Dataset Collection}
We detail our dataset collection process as follows. For each conversation, we employ two annotators, a questioner and an answerer. This setup has several advantages over using a single annotator to act both as a questioner and an answerer:
1) when two annotators chat about a passage, their dialogue flow is natural compared to chatting with oneself; 2) when one annotator responds with a vague question or an incorrect answer, the other can raise a flag which we use to identify bad workers; and 3) the two annotators can discuss guidelines (through a separate chat window) when they have disagreements. These measures help to prevent spam and to obtain high agreement data.\footnote{Due to AMT terms of service, we allowed a single worker to act both as a questioner and an answerer after a minute of waiting. This constitutes around 12\% of the data.}

\begin{figure}[!t]
  \center
  \includegraphics[scale=0.18]{img/coqa_questioner.png}
  \longcaption{The questioner interface of \sys{CoQA}}{\label{fig:coqa-questioner}The questioner interface of our \sys{CoQA} dataset.}
\end{figure}

\begin{figure}[!t]
  \center
  \includegraphics[scale=0.18]{img/coqa_answerer.png}
  \longcaption{The answerer interface of \sys{CoQA}}{\label{fig:coqa-answerer}The answerer interface of our \sys{CoQA} dataset.}
\end{figure}

We use the Amazon Mechanical Turk (AMT) to pair workers on a on a passage for which we use the ParlAI Mturk API \cite{miller2017parlai}. On average, each passage costs 3.6 USD for conversation collection and another 4.5 USD for collecting three additional answers for development and test data.


\paragraph{Collection interface.} We have different interfaces for a questioner and an answerer (Figure~\ref{fig:coqa-questioner} and Figure~\ref{fig:coqa-answerer}). A questioner's role is to ask questions, and an answerer's role is to answer questions in addition to highlighting rationales. We want questioners to avoid using exact words in the passage in order to increase lexical diversity. When they type a word that is already present in the passage, we alert them to paraphrase the question if possible. For the answers, we want answerers to stick to the vocabulary in the passage in order to limit the number of possible answers. We encourage this by automatically copying the highlighted text into the answer box and allowing them to edit copied text in order to generate a natural answer. We found 78\% of the answers have at least one edit such as changing a word's case or adding a punctuation.

\paragraph{Passage selection.} We select passages from seven diverse domains: children's stories from MCTest \cite{richardson2013mctest}, literature from Project Gutenberg\footnote{Project Gutenberg \url{https://www.gutenberg.org}}, middle and high school English exams from RACE \cite{lai2017race}, news articles from CNN \cite{hermann2015teaching}, articles from Wikipedia, science articles from AI2 Science Questions \cite{welbl2017crowdsourcing} and Reddit articles from the Writing Prompts dataset \cite{fan2018hierarchical}.

Not all passages in these domains are equally good for generating interesting conversations.
A passage with just one entity often result in questions that entirely focus on that entity.
We select passages with multiple entities, events and pronominal references  using Stanford \sys{CoreNLP} \cite{manning2014stanford}. We truncate long articles to the first few paragraphs that result in around 200 words.

Table~\ref{tab:coqa-domains} shows the distribution of domains.
We reserve the Science and Reddit domains for out-of-domain evaluation. For each in-domain dataset, we split the data such that there are 100 passages in the development set, 100 passages in the test set, and the rest in the training set. In contrast, for each out-of-domain dataset, we just have 100 passages in the test set without any passages in the training or the development sets.

\begin{table}
\centering
\begin{tabular}{lrrrr}
\toprule
\tf{Domain} &  \tf{\# Passages} &  \tf{\# Q/A} & \tf{Passage}  &  \tf{\# Turns per} \\
 & & \tf{pairs} & \tf{length} & \tf{passage} \\
\midrule
Children's Stories  & 750 & 10.5k & 211 &  14.0 \\
Literature  & 1,815 & 25.5k & 284  & 15.6 \\
Mid/High School Exams & 1,911 & 28.6k & 306  & 15.0 \\
News & 1,902 & 28.7k & 268 &  15.1 \\
Wikipedia & 1,821 & 28.0k & 245  & 15.4 \\
\midrule
\multicolumn{5}{c}{Out of domain} \\
\midrule
Science & 100 & 1.5k & 251  & 15.3\\
Reddit & 100 & 1.7k & 361 & 16.6 \\
\midrule
Total & 8,399 & 127k  & 271 & 15.2 \\
\bottomrule
\end{tabular}
\longcaption{Distribution of domains in \sys{CoQA}.}{\label{tab:coqa-domains} Distribution of domains in \sys{CoQA}.}
\end{table}

\paragraph{Collecting multiple answers.} Some questions in \sys{CoQA} may have multiple valid answers. For example, another answer for Q$_4$ in Figure~\ref{fig:coqa-example2} is \ti{A Republican candidate}. In order to account for answer variations, we collect three additional answers for all questions in the development and test data. Since our data is conversational, questions influence answers which in turn influence the follow-up questions. In the previous example, if the original answer was \ti{A Republican Candidate}, then the following question \ti{Which party does he belong to?} would not have occurred in the first place. When we show questions from an existing conversation to new answerers, it is likely they will deviate from the original answers which makes the conversation incoherent. It is thus important to bring them to a common ground with the original answer.

We achieve this by turning the answer collection task into a game of predicting original answers.
First, we show a question to a new answerer, and when she answers it, we show the original answer and ask her to verify if her answer matches the original.
For the next question, we ask her to guess the original answer and verify again.
We repeat this process until the conversation is complete.
In our pilot experiment, the human F1 score increased by 5.4\% when we use this verification setup.


\subsection{Dataset Analysis}
\label{sec:coqa-data-analysis}

What makes the \sys{CoQA} dataset conversational compared to existing reading comprehension datasets like \sys{SQuAD}? How does the conversation flow from one turn to the other? What linguistic phenomena do the questions in \sys{CoQA} exhibit? We answer these questions below.

\paragraph{Comparison with \sys{SQuAD 2.0}.}

\begin{figure}[ht]
\begin{center}
\includegraphics[height=8cm]{img/coqa_squad_comparison.pdf}
\end{center}
\longcaption{A comparison of questions in \sys{CoQA} and \sys{SQuAD 2.0} }{\label{fig:coqa-squad-comparison} Distribution of trigram prefixes of questions in \sys{SQuAD 2.0} and \sys{CoQA}.}
\end{figure}

In the following, we perform an in-depth comparison of \sys{CoQA} and \sys{SQuAD 2.0}~\cite{rajpurkar2018know}.  Figure~\ref{fig:coqa-squad-comparison} shows the distribution of frequent trigram prefixes. While coreferences are non-existent in \sys{SQuAD 2.0}, almost every sector of \sys{CoQA} contains coreferences (\ti{he, him, she, it, they})  indicating \sys{CoQA} is highly conversational. Because of the free-form nature of answers, we expect a richer variety of questions in \sys{CoQA} than \sys{SQuAD 2.0}.
While nearly half of \sys{SQuAD 2.0} questions are dominated by \ti{what} questions, the distribution of \sys{CoQA} is spread across multiple question types. Several sectors indicated by prefixes \ti{did, was, is, does, and} are frequent in \sys{CoQA} but are completely absent in \sys{SQuAD 2.0}.

Since a conversation is spread over multiple turns, we expect conversational questions and answers to be shorter than in a standalone interaction. In fact, questions in \sys{CoQA} can be made up of just one or two words (\ti{who?}, \ti{when?},  \ti{why?}).
As seen in Table~\ref{tab:squad-coqa-length}, on average, a question in \sys{CoQA} is only 5.5 words long while it is 10.1 for \sys{SQuAD}. The answers are also usually shorter in \sys{CoQA} than \sys{SQuAD 2.0}.

Table~\ref{tab:squad-coqa-answers} provides insights into the type of answers in \sys{SQuAD 2.0} and \sys{CoQA}.
While the original version of \sys{SQuAD 2.0}  \cite{rajpurkar2016squad} does not have any unanswerable questions, \sys{SQuAD 2.0} \cite{rajpurkar2018know} focuses solely on obtaining them resulting in higher frequency than in \sys{CoQA}. \sys{SQuAD 2.0} has 100\% extractive answers by design, whereas in \sys{CoQA}, 66.8\% answers can be classified as extractive after ignoring punctuation and case mismatches.\footnote{If punctuation and case are not ignored, only 37\% of the answers are extractive.}
This is higher than we anticipated. Our conjecture is that human factors such as wage may have influenced workers to ask questions that elicit faster responses by selecting text. It is worth noting that \sys{CoQA} has 11.1\% and 8.7\% questions with \ti{yes} or \ti{no} as answers whereas \sys{SQuAD 2.0} has 0\%. Both datasets have a high number of named entities and noun phrases as answers.


\begin{table}[h]
\centering
\begin{tabular}{p{3cm} r r}
\toprule
 & \bf \sys{SQuAD 2.0}  & \bf \sys{CoQA} \\
\midrule
Passage Length & 117 & 271 \\
Question Length & 10.1 & 5.5 \\
Answer  Length & 3.2 & 2.7 \\
\midrule
\end{tabular}
\longcaption{Data statistics in \sys{SQuAD 2.0} and \sys{CoQA}}{\label{tab:squad-coqa-length} Average number of words in passage, question and answer in \sys{SQuAD 2.0} and \sys{CoQA}.}
\end{table}

\begin{table}[h]
\centering
\begin{tabular}{p{3.5cm} r r}
\toprule
& \bf \sys{SQuAD 2.0}   & \bf \sys{CoQA}  \\
\midrule
Answerable & 66.7\% & 98.7\% \\
Unanswerable & 33.3\% & 1.3\% \\
\midrule
Extractive & 100.0\% & 66.8\% \\
Abstractive & 0.0\% & 33.2\% \\
\midrule
Named Entity & 35.9\% & 28.7\% \\
Noun Phrase & 25.0\% & 19.6\% \\
Yes & 0.0\% & 11.1\% \\
No & 0.1\% & 8.7\% \\
Number & 16.5\% & 9.8\% \\
Date/Time & 7.1\% & 3.9\% \\
Other & 15.5\% & 18.1\% \\
\bottomrule
\end{tabular}
\longcaption{Distribution of answer types in \sys{SQuAD 2.0} and \sys{CoQA}}{\label{tab:squad-coqa-answers} Distribution of answer types in \sys{SQuAD 2.0} and \sys{CoQA}.}
\end{table}

\paragraph{Conversation flow.}
A coherent conversation must have smooth transitions between turns.
We expect the narrative structure of the passage to influence our conversation flow.
We split the passage into 10 uniform chunks, and identify chunks of interest of a given turn and its transition based on rationale spans.

\begin{figure}[!t]
\begin{center}
\includegraphics[height=9cm]{img/coqa_conversation_flow.pdf}
\end{center}
\longcaption{Conversation Flow in \sys{CoQA}}{\label{fig:coqa-conversation-flow} Chunks of interests as a conversation progresses. The x-axis indicates the turn number and
the y-axis indicates the passage chunk containing the rationale. The height of a chunk indicates the concentration of conversation in that chunk. The width of the bands is proportional to the frequency of transition between chunks from one turn to the other.}
\end{figure}


Figure~\ref{fig:coqa-conversation-flow} portrays the conversation flow of the top 10 turns.
The starting turns tend to focus on the first few chunks and as the conversation advances, the focus shifts to the later chunks. Moreover, the turn transitions are smooth, with the focus often remaining in the same chunk or moving to a neighbouring chunk. Most frequent transitions happen to the first and the last chunks, and likewise these chunks have diverse outward transitions.

\paragraph{Linguistic phenomena.}

\begin{table}[!t]
\centering
\small
\begin{tabular}{lp{7cm}c}
\toprule
\bf Phenomenon & \bf Example & \bf Percentage \\
\midrule
\multicolumn{3}{c}{Relationship between a question and its passage} \\
\midrule
Lexical match & Q: Who had to rescue her?& 29.8\% \\
& A: the coast guard \\
& R: Outen was rescued by the coast guard \\
Paraphrasing & Q: Did the wild dog approach? & 43.0\% \\
& A: Yes \\
& R: he drew cautiously closer \\
Pragmatics &  Q:               Is Joey a male or female?  &  27.2\% \\
 & A:  Male \\
& R: it looked like a stick man so she kept \textbf{him}. She named her new noodle friend Joey \\
\midrule
\multicolumn{3}{c}{Relationship between a question and its conversation history} \\
\midrule
No coreference & Q: What is IFL? & 30.5\% \\
Explicit coreference & Q: Who had Bashti forgotten? & 49.7\% \\
& A: the puppy \\
& Q: What was \textbf{his} name? \\
Implicit coreference & Q: When will Sirisena be sworn in? & 19.8\% \\
& A: 6 p.m local time  \\
& Q: \textbf{Where}?\\
\bottomrule
\end{tabular}
\longcaption{Linguistic phenomena in \sys{CoQA} questions}{\label{tab:ling-phenomena}Linguistic phenomena in \sys{CoQA} questions.}
\end{table}
We further analyze the questions for their relationship with the passages and the conversation history. We sample 150 questions in the development set and annotate various phenomena as shown in Table~\ref{tab:ling-phenomena}.

If a question contains at least one content word that appears in the passage, we classify it as \ti{lexical match}. These comprise around 29.8\% of the questions. If it has no lexical match but is a paraphrase of the rationale, we classify it as \ti{paraphrasing}. These questions contain phenomena such as synonymy, antonymy, hypernymy, hyponymy and negation.
These constitute a large portion of questions, around 43.0\%. The rest, 27.2\%, have no lexical cues, and we classify them under \ti{pragmatics}. These include phenomena like common sense and presupposition. For example, the question \ti{Was he loud and boisterous?} is not a direct paraphrase of the rationale \ti{he dropped his feet with the lithe softness of a cat} but the rationale combined with world knowledge can answer this question.

For the relationship between a question and its conversation history, we classify questions into whether they are dependent or independent on the conversation history. If dependent, whether the questions contain an explicit marker or not.

As a result, around 30.5\% questions do not rely on coreference with the conversational history and are answerable on their own. Almost half of the questions (49.7\%) contain explicit coreference markers such as \ti{he, she, it}. These either refer to an entity or an event introduced in the conversation.
The remaining 19.8\% do not have explicit coreference markers but refer to an entity or event implicitly.


================================================
FILE: chapters/coqa/discussions.tex
================================================
%!TEX root = ../../thesis.tex

\section{Discussion}
\label{sec:coqa-future}

So far, we have discussed the \sys{CoQA} dataset and several competitive baselines based on conversational models and reading comprehension models. We hope that our efforts can enable the first step to building conversational QA agents.

On the one hand, we think there is ample room for further improving performance on \sys{CoQA}: our hybrid system obtains an F1 score of 65.1\%, which is still 23.7 points behind the human performance (88.8\%). We encourage our research community to work on this dataset and push the limits of conversational question answering models. We think there are several directions for further improvement:

\begin{itemize}
    \item
        All the baseline models we built only use the conversation history by simply concatenating the previous questions and answers with the current question. We think that there should be better ways to connect the history and the current question. For the questions in Table~\ref{tab:ling-phenomena}, we should build models to actually understand that \ti{his} in the question \ti{What was his name?} refers to \ti{the puppy}, and the question \ti{Where?} means \ti{Where will Sirisena be sworn in?}. Indeed, a recent model \sys{FlowQA}~\cite{huang2018flowqa} proposed a solution to effectively stack single-turn models along the conversational flow and demonstrated a state-of-the-art performance on \sys{CoQA}.
    \item
        Our hybrid model aims to combine the advantages from the span prediction reading comprehension models and the pointer-generator network model to address the nature of abstractive answers. However, we implemented it as a pipeline model so the performance of the second component depends on whether the reading comprehension model can extract the right piece of evidence from the passage. We think that it is desirable to build an end-to-end model which can extract rationales while also rewriting the rationale into the final answer.
    \item
        We think the rationales that we collected can be better leveraged into training models.
\end{itemize}

On the other hand, \sys{CoQA} certainly has its limitations and we should explore more challenging and more useful datasets in the future. One clear limitation is that the conversations in \sys{CoQA} are only turns of question and answer pairs. That means the answerer is only responsible for answering questions while she can't ask any clarification questions or communicate with the questioner through conversations.  Another problem is that \sys{CoQA} has very few (1.3\%) unanswerable questions, which we think are crucial in practical conversational QA systems.


In parallel to our work, \newcite{choi2018quac} also created a dataset of conversations in the form of questions and answers on text passages. In our interface, we show a passage to both the questioner and the answerer, whereas their interface only shows a title to the questioner and the full passage to the answerer. Since their setup encourages the answerer to reveal more information for the following questions, their answers are as long as 15.1 words on average (ours is 2.7). While the human performance on our test set is 88.8 F1, theirs is 74.6 F1. Moreover, while \sys{CoQA}'s answers can be abstractive, their answers are restricted to only extractive text spans. Our dataset contains passages from seven diverse domains, whereas their dataset is built only from Wikipedia articles about people. Also, concurrently, \newcite{saeidi2018interpretation} created a conversational QA dataset for regulatory text such as tax and visa regulations. Their answers are limited to \textit{yes} or \textit{no} along with a positive characteristic of permitting to ask clarification questions when a given question cannot be answered.


================================================
FILE: chapters/coqa/experiments.tex
================================================
%!TEX root = ../../thesis.tex

\section{Experiments}
\label{sec:coqa-experiments}

\subsection{Setup}
For the \sys{seq2seq} and \sys{PGNet} experiments, we use the \sys{OpenNMT} toolkit \cite{klein2017opennmt}.
For the reading comprehension experiments, we use the same implementation that we used for \sys{SQuAD}~\cite{chen2017reading}.
We tune the hyperparameters on the development data: the number of turns to use from the conversation history, the number of layers, number of each hidden units per layer and dropout rate.
We initialize the word projection matrix with \sys{GloVe} \cite{pennington2014glove} for conversational models and \sys{fastText} \cite{bojanowski2017enriching} for reading comprehension models, based on empirical performance. We update the projection matrix during training in order to learn embeddings for delimiters such as $\mathrm{<}q\mathrm{>}$.

For all the experiments of \sys{seq2seq} and \sys{PGNet}, we use the default settings of \sys{OpenNMT}: 2-layers of LSTMs with $500$ hidden units for both the encoder and the decoder. The models are optimized using SGD, with an initial learning rate of $1.0$ and a decay rate of $0.5$. A dropout rate of $0.3$ is applied to all layers.

For all the reading comprehension experiments, the best configuration we find is 3 layers of LSTMs with $300$ hidden units for each layer. A dropout rate of $0.4$ is applied to all LSTM layers and a dropout rate of $0.5$ is applied to word embeddings.

\subsection{Experimental Results}
Table~\ref{tab:coqa-results} presents the results of the models on the development and the test data. Considering the results on the test set, the \sys{seq2seq} model performs the worst, generating frequently occurring answers irrespective of whether these answers appear in the passage or not, a well known behavior of conversational models \cite{li2016diversity}. \sys{PGNet} alleviates the frequent response problem by focusing on the vocabulary in the passage and it outperforms \sys{seq2seq} by 17.8 points. However, it still lags behind \sys{Stanford Attentive Reader} by 8.5 points.
A reason could be that \sys{PGNet} has to memorize the whole passage before answering a question, a huge overhead which \sys{Stanford Attentive Reader} avoids. But \sys{Stanford Attentive Reader} fails miserably in answering questions with free-form answers (see row \textit{Abstractive} in Table ~\ref{tab:error-analysis}).
When the \sys{Stanford Attentive Reader} is fed into \sys{PGNet}, we empower both \sys{Stanford Attentive Reader} and \sys{PGNet} --- \sys{Stanford Attentive Reader} in producing free-form answers; \sys{PGNet} in focusing on the rationale instead of the passage. This combination outperforms the \sys{PGNet} and the \sys{Stanford Attentive Reader} models by 21.0 and 12.5 points respectively.

\begin{table}
\small
\centering
\begin{tabular}{l | c c c c c | c c |  c}
\hline
&  \multicolumn{5}{c|}{\tf{In-domain}} & \multicolumn{2}{c|}{\tf{Out-of-domain}} & \tf{Overall} \\
&  Children & Literature & Exam & News & Wikipedia & Reddit & Science &  \\
\hline
\multicolumn{9}{c}{\tf{Development data}}\\
\hline
\sys{seq2seq} & 30.6 & 26.7 & 28.3 & 26.3 & 26.1 & N/A & N/A & 27.5 \\
\sys{PGNet} & 49.7 & 42.4 & 44.8 & 45.5 & 45.0 & N/A & N/A & 45.4 \\
\sys{SAR} & 52.4 & 52.6 & 51.4 & 56.8 & 60.3 & N/A & N/A & 54.7 \\
\sys{Hybrid} & \bf 64.5 & \bf 62.0 & \bf 63.8 & \bf 68.0 & \bf 72.6 & N/A & N/A & \bf 66.2 \\
\sys{Human} & 90.7 & 88.3 & 89.1 & 89.9 & 90.9 & N/A & N/A  & 89.8 \\
\hline
\multicolumn{9}{c}{\tf{Test data}}\\
\hline
\sys{seq2seq} & 32.8 & 25.6 & 28.0 & 27.0 & 25.3 & 25.6 & 20.1  & 26.3 \\
\sys{PGNet} & 49.0 & 43.3 & 47.5 & 47.5 & 45.1 & 38.6 & 38.1  & 44.1 \\
\sys{SAR} & 46.7 & 53.9 & 54.1 & 57.8 & 59.4 & 45.0 & 51.0 & 52.6 \\
\sys{Hybrid} & \bf 64.2 & \bf 63.7 & \bf  67.1 & \bf 68.3 & \bf 71.4 & \bf 57.8 & \bf 63.1  & \bf 65.1  \\
\sys{Human} & 90.2 & 88.4 & 89.8 & 88.6 & 89.9 & 86.7 & 88.1 & 88.8 \\
\hline
\end{tabular}
\longcaption{Models and human performance on \sys{CoQA}}{\label{tab:coqa-results}Models and human performance (F1 score) on the development and the test data. \sys{SAR}: \sys{Stanford Attentive Reader}.}
\end{table}

\paragraph{Models vs. Humans.}
The human performance on the test data is 88.8 F1, a strong agreement indicating that the \sys{CoQA}'s questions have concrete answers.
Our best model is 23.7 points behind humans, suggesting that the task is difficult to accomplish with current models.
We anticipate that using a state-of-the-art reading comprehension model \cite{devlin2018bert} may improve the results by a few points.

\paragraph{In-domain~vs.~Out-of-domain.}
All models perform worse on out-of-domain datasets compared to in-domain datasets. The best model drops by 6.6 points. For in-domain results, both the best model and humans find the literature domain harder than the others since literature's vocabulary requires proficiency in English. For out-of-domain results, the Reddit domain is apparently harder. This could be because Reddit requires reasoning on longer passages (see Table~\ref{tab:coqa-domains}).

While humans achieve high performance on children's stories,  models perform poorly, probably due to the fewer training examples in this domain compared to others.\footnote{We collect children's stories from MCTest which contains only 660 passages in total, of which we use 200 stories for development and test.}
Both humans and models find Wikipedia easy.

\subsection{Error Analysis}

\begin{table}[!t]
\centering
\begin{tabular}{p{4cm}ccccc}
\toprule
\tf{Type} & \sys{seq2seq} & \sys{PGNet} & \sys{SAR} & \sys{Hybrid} & \sys{Human}\\
\midrule
\multicolumn{6}{c}{\tf{Answer Type}} \\
\midrule
Answerable & 27.5 & 45.4 & 54.7 & 66.3 & 89.9 \\
Unanswerable & 33.9 & 38.2 & 55.0 & 51.2 & 72.3 \\
\midrule
Extractive & 20.2 & 43.6 & 69.8 & 70.5 & 91.1 \\
Abstractive & 43.1 & 49.0 & 22.7 & 57.0 & 86.8 \\
\midrule
Named Entity & 21.9 & 43.0 & 72.6 & 72.2 & 92.2 \\
Noun Phrase & 17.2 & 37.2 & 64.9 & 64.1 & 88.6 \\
Yes & 69.6 & 69.9 & 7.9\; & 72.7 & 95.6 \\
No & 60.2 & 60.3 & 18.4 & 58.7 & 95.7 \\
Number & 15.0 & 48.6 & 66.3 & 71.7 & 91.2 \\
Date/Time & 13.7\; & 50.2 & 79.0 & 79.1 & 91.5 \\
Other & 14.1 & 33.7 & 53.5 & 55.2 & 80.8 \\
\midrule
\multicolumn{6}{c}{\tf{Question Type}} \\
\midrule
Lexical Matching & 20.7 &  40.7 & 57.2 & 65.7 & 91.7 \\
Paraphrasing &  23.7 & 33.9 & 46.9 & 64.4 & 88.8 \\
Pragmatics  & 33.9 & 43.1 & 57.4 & 60.6 & 84.2 \\
\midrule
No coreference & 16.1  & 31.7 & 54.3 & 57.9 & 90.3  \\
Explicit coreference & 30.4 & 42.3 & 49.0 & 66.3 & 87.1 \\
Implicit coreference & 31.4 & 39.0 & 60.1 & 66.4 & 88.7 \\
\bottomrule
\end{tabular}
\longcaption{Error anlaysis on \sys{CoQA}}{\label{tab:error-analysis} Fine-grained results of different question and answer types in the development set. For the question type results, we only analyze 150 questions as described in Section~\ref{sec:coqa-data-analysis}.}
\end{table}

Table~\ref{tab:error-analysis} presents fine-grained results of models and humans on the development set. We observe that humans have the highest disagreement on the unanswerable questions.
Sometimes, people guess an answer even when it is not present in the passage, e.g., one can guess the age of \textit{Annie} in Figure~\ref{fig:coqa-example} based on her \textit{grandmother}'s age.
The human agreement on abstractive answers is lower than on extractive answers.
This is expected because our evaluation metric is based on word overlap rather than on the meaning of words.
For the question \textit{did Jenny like her new room?},  human answers \textit{she loved it} and \textit{yes} are both accepted.

Finding the perfect evaluation metric for abstractive responses is still a challenging problem \cite{liu2016not} and beyond the scope of our work.
For our models' performance, \sys{seq2seq} and \sys{PGNet} perform well on the questions with abstractive answers, and \sys{Stanford Attentive Reader} performs well on the questions with extractive answers, due to their respective designs.
The combined model improves on both categories.

Among the lexical question types, humans find the questions with lexical matches the easiest followed by paraphrasing, and the questions with pragmatics the hardest --- this is expected since questions with lexical matches and paraphrasing share some similarity with the passage, thus making them relatively easier to answer than pragmatic questions.
The best model also follows the same trend.
While humans find the questions without coreferences easier than those with coreferences (explicit or implicit), the models behave sporadically.
It is not clear why humans find implicit coreferences easier than explicit coreferences.
A conjecture is that implicit coreferences depend directly on the previous turn whereas explicit coreferences may have long distance dependency on the conversation.

\paragraph{Importance of conversation history.}
Finally, we examine how important the conversation history is for the dataset. Table \ref{tab:ablations} presents the results with a varied number of previous turns used as conversation history.
All models succeed at leveraging history but only up to a history of one previous turn (except \sys{PGNet}). It is surprising that using more turns could decrease the performance.

We also perform an experiment on humans to measure the trade-off between their performance and the number of previous turns shown.
Based on the heuristic that short questions likely depend on the conversation history, we sample 300 one or two word questions, and collect answers to these varying the number of previous turns shown.

When we do not show any history, human performance drops to 19.9 F1 as opposed to 86.4 F1 when full history is shown. When the previous question and answer is shown, their performance boosts to 79.8 F1, suggesting that the previous turn plays an important role in making sense of the current question. If the last two questions and answers are shown, they reach up to 85.3 F1, almost close to the performance when the full history is shown. This suggests that most questions in a conversation have a limited dependency within a bound of two turns.

\begin{table}[!t]
\centering
\begin{tabular}{ccccc}
\toprule
\tf{history size} & \sys{seq2seq} & \sys{PGNet} & \sys{SAR} & \sys{Hybrid} \\
\midrule
0 & 24.0 & 41.3 & 50.4 & 61.5 \\
1 & 27.5 & 43.9 & 54.7 &  66.2 \\
2 & 21.4 & 44.6 & 54.6 & 66.0 \\
all & 21.0 &  45.4 & 52.3 & 64.3 \\
\bottomrule
\end{tabular}
\longcaption{\sys{CoQA} results on the development set with different history sizes}{\label{tab:ablations} Results on the development set with different history sizes. History size indicates the number of previous turns prepended to the current question. Each turn contains a question and its answer. \sys{SAR}: \sys{Stanford Attentive Reader}. }
\end{table}


================================================
FILE: chapters/coqa/intro.tex
================================================
%!TEX root = ../../thesis.tex

% \section{Introduction}

In the last chapter, we discussed how we built a general-knowledge question-answering system from neural reading comprehension. However, most current QA systems are limited to answering isolated questions, i.e., every time we ask a question, the systems return an answer without the ability to consider any context. In this chapter, we set out to tackle another challenging problem \ti{Conversational Question Answering}, where a machine has to understand a text passage and answer a series of questions that appear in a conversation.

Humans gather information by engaging in conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. Figure~\ref{fig:coqa-example} shows a conversation between two humans who are reading a passage, one acting as a questioner and the other as an answerer. In this conversation, every question after the first is dependent on the conversation history. For instance, $Q_5$ \ti{Who?} is only a single word and is impossible to answer without knowing what has already been said. Posing short questions is an effective human conversation strategy, but such questions are really difficult for machines to parse. Therefore, conversational question answering combines the challenges from both dialogue and reading comprehension.

We believe that building systems which are able to answer such conversational questions will play a crucial role in our future conversational AI systems. To approach this problem, we need to build effective \ti{datasets} and conversational QA \ti{models} and we will describe both of them in this chapter.

\begin{figure}[!t]
\begin{tabular}{p{0.9\columnwidth}}
\midrule
Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well. Jessica had $\ldots$\\
\\
$Q_1$: Who had a birthday? \\
$A_1$: Jessica \\
$R_1$: Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80.\\
\vspace{0em}
$Q_2$: How old would she be?\\
$A_2$: 80 \\
$R_2$: she was turning 80 \\
\vspace{0em}
$Q_3$: Did she plan to have any visitors?\\
$A_3$: Yes \\
$R_3$: Her granddaughter Annie was coming over \\
\vspace{0em}
$Q_4$: How many?\\
$A_4$: Three \\
$R_4$: Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well. \\
\vspace{0em}
$Q_5$: Who?\\
$A_5$: Annie, Melanie and Josh \\
$R_5$: Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well.\\
\bottomrule
\end{tabular}
\longcaption{A conversation from \sys{CoQA}}{\label{fig:coqa-example} A conversation from our \sys{CoQA} dataset. Each turn contains a question ($Q_i$), an answer ($A_i$) and a rationale ($R_i$) that supports the answer.}
\end{figure}

This chapter is organized as follows. We first discuss related work in Section~\ref{sec:coqa-rw} and then we introduce \sys{CoQA}~\cite{reddy2019coqa} in Section~\ref{sec:coqa-dataset}, a \textbf{Co}nversational \textbf{Q}uestion \textbf{A}nswering challenge for measuring the ability of machines to participate in a question-answering style conversation.\footnote{We launch \sys{CoQA} as a challenge to the community at \href{https://stanfordnlp.github.io/coqa/}{https://stanfordnlp.github.io/coqa/}.} Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. We define our task and describe the dataset collection process. We also analyze the dataset in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. Next we describe several strong conversational and reading comprehension models we built for \sys{CoQA} in Section~\ref{sec:coqa-models} and present experimental results in Section~\ref{sec:coqa-experiments}. Finally, we discuss future work of conversational question answering (Section~\ref{sec:coqa-future}).


================================================
FILE: chapters/coqa/models.tex
================================================
%!TEX root = ../../thesis.tex

\section{Models}
\label{sec:coqa-models}

Given a passage $p$, the conversation history \{$q_1, a_1, \ldots q_{i-1}, a_{i-1}$\} and a question $q_i$, the task is to predict the answer ${a_i}$. Our task can be modeled as either a conversational response generation problem or a reading comprehension problem. We evaluate strong baselines from each class of models and a combination of the two on \sys{CoQA}.

\subsection{Conversational Models}

\begin{figure}[!t]
\begin{center}
\includegraphics[height=9.5cm]{img/coqa_pgnet.pdf}
\end{center}
\longcaption{The pointer-generator network used for conversational question answering}{\label{fig:coqa-pgnet} The pointer-generator network used for conversational question answering. The figure is adapted from \newcite{see2017get}.}
\end{figure}

The basic goal of conversational models is to predict the next utterance based on its conversation history. Sequence-to-sequence (seq2seq) models~\cite{sutskever2014sequence} have shown promising results for generating conversational responses \cite{vinyals2015neural,li2016diversity,zhang2018personalizing}. Motivated by their success, we use a standard sequence-to-sequence model with an attention mechanism for generating answers. We append the passage, the conversation history (the question/answer pairs in the last $n$ turns) and the current question as, $p\; \mathrm{<}q\mathrm{>}\; q_{i-n} \;\mathrm{<}a\mathrm{>}\; a_{i-n}\; \ldots$ $\mathrm{<}q\mathrm{>}\; q_{i-1} \;\mathrm{<}a\mathrm{>}\; a_{i-1}\;$  $\mathrm{<}q\mathrm{>}\;q_i$, and feed it into a bidirectional LSTM encoder, where $\mathrm{<}q\mathrm{>}$ and $\mathrm{<}a\mathrm{>}$ are special tokens used as delimiters. We then generate the answer using a LSTM decoder which attends to the encoder states.

Moreover, as the answer words are likely to appear in the original passage, we adopt a copy mechanism in the decoder proposed for summarization tasks \cite{gu2016incorporating,see2017get}, which allows to (optionally) copy a word from the passage and the conversation history. We call this model the Pointer-Generator network~\cite{see2017get}, \sys{PGNet}. Figure~\ref{fig:coqa-pgnet} illustrates a full model of \sys{PGNet}. Formally, we denote the encoder hidden vectors by $\{\tilde{\mf{h}}_i\}$, the decoder state at timestep $t$ by $\mf{h}_t$ and the input vector by $\mf{x}_t$, an attention function is computed based on $\{\tilde{\mf{h}}_i\}$ and $\mf{h}_t$  as $\alpha_i$ (Equation~\ref{eq:attention}) and the context vector is computed as $\mf{c} = \sum_{i}{\alpha_i \tilde{\mf{h}}_i}$ (Equation~\ref{eq:context-vector}).

For a copy mechanism, it first computes the \ti{generation probability} $p_{\text{gen}} \in [0, 1]$ which controls the probability that it generates a word from the full vocabulary $\mathcal{V}$ (rather than copying a word) as:

\begin{equation}
    p_{\text{gen}} = \sigma\left({\mf{w}^{(c)}}^{\intercal}\mf{c} + {\mf{w}^{(x)}}^{\intercal}\mf{x}_t + {\mf{w}^{(h)}}^{\intercal}\mf{h}_t + b\right).
\end{equation}

The final probability distribution of generating word $w$ is computed as:
\begin{equation}
    P(w) = p_{\text{gen}}P_{\text{vocab}}(w) + (1 - p_{\text{gen}})\sum_{i: w_i = w}\alpha_i,
\end{equation}
where $P_{\text{vocab}}(w)$ is the original probability distribution (computed based on $\mf{c}$ and $\mf{h}_t$) and $\{w_i\}$ refers to all the words in the passage and the dialogue history. For more details, we refer readers to \cite{see2017get}.


\subsection{Reading Comprehension Models}
The second class of models we evaluate is the neural reading comprehension models. In particular, the models for the span prediction problems can't be applied directly, as a large portion of the \sys{CoQA} questions don't have a single span in the passage as their answer, e.g., $Q_3$, $Q_4$ and $Q_5$ in Figure~\ref{fig:coqa-example}. Therefore, we modified the \sys{Stanford Attentive Reader} model we described in Section~\ref{sec:sar} for this problem. Since the model requires text spans as answers during training, we select the span which has the highest lexical overlap (F1 score) with the original answer as the gold answer. If the answer appears multiple times in the story we use the rationale to find the correct one. If any answer word does not appear in the passage, we fall back to an additional \textit{unknown} token as the answer (about 17\%). We prepend each question with its past questions and answers to account for conversation history, similar to the conversational models.

\subsection{A Hybrid Model}
The last model we build is a \ti{hybrid} model, which combines the advantages of the aforementioned two models. The reading comprehension models can predict a text span as an answer, while they can't produce answers that do not overlap with the passage. Therefore,  we combine \sys{Stanford Attentive Reader} with \sys{PGNet} to address this problem since \sys{PGNet} can generate free-form answers effectively. In this hybrid model, we use the reading comprehension model to first point to the answer evidence in text, and \sys{PGNet} naturalizes the evidence into the final answer. For example, for Q$_5$ in Figure~\ref{fig:coqa-example}, we expect that the reading comprehension model first predicts the rationale R$_5$ \ti{Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie’s husband Josh were coming as well.}, and then \sys{PGNet} generates A$_5$ \ti{Annie, Melanie and Josh} from R$_5$.

We make a few changes to both models based on empirical performance. For the \sys{Stanford Attentive Reader} model, we only use rationales as answers for the questions with an non-extractive answer. For \sys{PGNet}, we only provide current question and span predictions from the the \sys{Stanford Attentive Reader} model as input to the encoder. During training, we feed the oracle spans into \sys{PGNet}.


================================================
FILE: chapters/coqa/related_work.tex
================================================
%!TEX root = ../../thesis.tex

\section{Related Work}
\label{sec:coqa-rw}

Conversational question answering is directly related to \tf{dialogue}. Building conversational agents, or dialogue systems to converse with humans in natural language is one of the major goals of natural language understanding. The two most common classes of dialogue systems are: \ti{task-oriented}, and \ti{chit-chat} (or \ti{chatbot}) dialogue agents.  Task-oriented dialogue systems are designed for a particular task and set up to have short conversations (e.g., booking a flight or making a restaurant reservation). They are evaluated based on task-completion rate or time to task completion. In contrast, chit-chat dialogue systems are designed for extended, casual conversations, without a specific goal. Usually, the longer the user engagement and interaction, the better these systems are.

Answering questions is also a core task of dialogue systems, because one of the most common needs for humans to interact with dialogue agents is to seek information and ask questions of various topics. QA-based dialogue techniques have been developed extensively in automated personal assistant systems such as Amazon's \sys{Alexa}, Apple's \sys{Siri} or \sys{Google Assistant}, either based on structured knowledge bases, or unstructured text collections. Modern dialogue systems are mostly built on top of deep neural networks. For a comprehensive survey of neural approaches to different types of dialogue systems, we refer readers to \cite{gao2018neural}.

\begin{figure}[!t]
    \center
    \includegraphics[scale=0.45]{img/other_coqa_tasks.pdf}
    \longcaption{Other conversational question answering tasks on images and KBs}{\label{fig:other-coqa-tasks}Other conversational question answering tasks on images (left) and KBs (right). Images courtesy: \cite{das2017visual} and \cite{guo2018dialog} with modifications.}
\end{figure}

Our work is closely related to the \ti{Visual Dialog} task of \cite{das2017visual} and the \ti{Complex Sequential Question Answering} task of \cite{saha2018complex}, which perform conversational question answering on images and a knowledge graph (e.g. \sys{WikiData}) respectively, with the latter focusing on questions obtained by paraphrasing templates. Figure~\ref{fig:other-coqa-tasks} demonstrates an example from each task. We focus on conversations over a passage of text, which requires the ability of reading comprehension.

Another related line of research is \ti{sequential question answering}~\cite{iyyer2017search,talmor2018web}, in which a complex question is decomposed into a sequence of simpler questions. For example, a question \ti{What super hero from Earth appeared most recently?} can be decomposed into the following three questions: 1) \ti{Who are all of the super heroes?}, 2) \ti{Which of them come from Earth?}, and 3) \ti{Of those, who appeared most recently?}. Therefore, their focus is how to answer a complex question via sequential question answering, while we are more interested in a natural conversation of a variety of topics while the questions can be dependent on the dialogue history.


================================================
FILE: chapters/openqa/evaluation.tex
================================================
%!TEX root = ../../thesis.tex

\section{Evaluation}
\label{sec:drqa-eval}

We have all the basic elements of our \sys{DrQA} systems and let's take a look at the evaluation.

\subsection{Question Answering Datasets}
The first question is which question answering datasets we should evaluate on. As we discussed, \sys{SQuAD} is one of the largest general purpose QA datasets currently available for question answering but it is very different from open-domain QA setting. We propose to train and evaluate our system on other datasets developed for open-domain QA that have been constructed in different ways. We hence adopt the following three datasets:

\paragraph{TREC} This dataset is based on the benchmarks from the TREC QA tasks that have been curated by \newcite{baudivs2015modeling}. We use the large version, which contains a total of 2,180 questions extracted from the datasets from TREC 1999, 2000, 2001 and 2002.\footnote{This dataset is available at \url{https://github.com/brmson/dataset-factoid-curated}.} Note that for this dataset, all the answers are written in regular expressions, for example, the answer is \texttt{Sept(ember)?|Feb(ruary)?} to the question \ti{When is Fashion week in NYC?}, so answers \ti{Sept}, \ti{September}, \ti{Feb}, \ti{February} are all judged as correct.

\paragraph{WebQuestions} Introduced in \newcite{berant2013semantic}, this dataset is built to answer questions from the Freebase KB. It was created by crawling questions through the \sys{Google Suggest} API, and then obtaining answers using Amazon Mechanical Turk. We convert each answer to text by using entity names so that the dataset does not reference Freebase IDs and is purely made of plain text question-answer pairs.

\paragraph{WikiMovies} This dataset, introduced in \newcite{miller2016key}, contains 96k question-answer pairs in the domain of movies. Originally created from the \sys{OMDb} and \sys{MovieLens} databases, the examples are built such that they can also be answered by using a subset of Wikipedia as the knowledge source (the title and the first section of articles from the movie domain).

We would like to emphasize that these datasets are not necessarily collected in the context of answering from Wikipedia. The \sys{TREC} dataset was designed for text-based question answering (the primary TREC document sets consist mostly of newswire articles), while \sys{WebQuestions} and \sys{WikiMovies} were mainly collected for knowledge-based question answering. We put all these resources in one unified framework, and test how well our system can answer all the questions --- hoping that it can reflect the performance of general-knowledge QA.

Table~\ref{tab:qa-data-stats} and Figure~\ref{fig:qa-data-stats} give detailed statistics of these QA datasets. As we can see that, the distribution of \sys{SQuAD} examples is quite different from that of the other QA datasets. Due to the construction method, \sys{SQuAD} has longer questions (10.4 tokens vs 6.7--7.5 tokens on average). Also, all these datasets have short answers (although the answers in \sys{SQuAD} are slightly longer) and most of them are factoid.

Note that there are might be multiple answers for many of the questions in these QA datasets (see the \ti{\# answers} column of Table~\ref{tab:qa-data-stats}). For example, there are two valid answers: \ti{English} and \ti{Urdu} to the question \ti{What language do people speak in Pakistan?} on \sys{WebQuestions}. As our system is designed to return one answer, our evaluation considers the prediction as correct if it gives any of the gold answers.

\begin{figure}[h]
\center
\includegraphics[scale=0.7]{img/qa_stat.png}
\longcaption{The average length of questions and answers in our QA datasets}{\label{fig:qa-data-stats}The average length of questions and answers in our QA datasets. All the statistics are computed based on the training sets.}
\end{figure}

\begin{table}[t]
\begin{center}
\begin{tabular}{l | r r | r | r}
\toprule
\tf{Dataset} & \tf{\# Train} & \tf{\# DS Train} & \tf{\# Test} & \tf{\# answers} \\
\midrule
\sys{SQuAD} & 87,599 & 71,231 & N/A & 1.0  \\
\midrule
\sys{TREC} &  1,486$^{\dagger}$ & 3,464 & 694 & 3.2\footnote{As all the answer strings are regex expressions, it is difficult to estimate \# of answers. We only simply list the number of alternation symbols \texttt{|} in the answer.} \\
\sys{WebQuestions} &  3,778$^{\dagger}$ & 4,602 & 2,032 & 2.4 \\
\sys{WikiMovies} &  96,185$^{\dagger}$ & 36,301 & 9,952 & 1.9 \\
\bottomrule
\end{tabular}
\end{center}
\longcaption{Statistics of the QA datasets used for \sys{DrQA}.}{\label{tab:qa-data-stats} Statistics of the QA datasets used for \sys{DrQA}. DS Train: distantly supervised training data. $^{\dagger}$: These training sets are not used as is because no passage is associated with each question.}
\end{table}


\subsection{Implementation Details}

\subsubsection{Processing Wikipedia}
We use the 2016-12-21 dump\footnote{\url{https://dumps.wikimedia.org/enwiki/latest}} of English Wikipedia for all of our full-scale experiments as the knowledge source used to answer questions. For each page, only the plain text is extracted and all structured data sections such as lists and figures are stripped.\footnote{We use the WikiExtractor script: \url{https://github.com/attardi/wikiextractor}.} After discarding internal disambiguation, list, index, and outline pages, we retain 5,075,182 articles consisting of 9,008,962 unique uncased token types.


\subsubsection{Distantly-supervised data}
We use the following process for each question-answer pair from the training portion of each dataset to build our distantly-supervised training examples. First, we run our \sys{Document Retriever} on the question to retrieve the top 5 Wikipedia articles. All paragraphs from those articles without an exact match of the known answer are directly discarded. All paragraphs  shorter than 25 or longer than 1500  characters are also filtered out. If any named entities are detected in the question, we remove any paragraph that does not contain them at all. For every remaining paragraph in each retrieved page, we score all positions that match an answer using unigram and bigram overlap between the question and a 20 token window, keeping up to the top 5 paragraphs with the highest overlaps. If there is no paragraph with non-zero overlap, the example is discarded; otherwise we add each found pair to our DS training dataset. Some examples are shown in Figure~\ref{fig:ds_examples} and the number of distantly supervised examples we created for training are given in Table~\ref{tab:qa-data-stats} (column \ti{\# DS Train}).


\begin{figure}
\begin{center}
\small
\begin{tabularx}{\textwidth}{l|p{4.5cm}|p{7cm}}
\hline
\bf Dataset & \bf Example & \bf Article / Paragraph \\
\hline
\sys{TREC} & {\bf Q}: What U.S. state's motto is ``Live free or Die''? \newline {\bf A}: New Hampshire & {\bf Article}: Live Free or Die \newline {\bf Paragraph}: ``Live Free or Die'' is the official motto of the U.S. state of \hl{New Hampshire}, adopted by the state in 1945. It is possibly the best-known of all state mottos, partly because it conveys an assertive independence historically found in American political philosophy and partly because of its contrast to the milder sentiments found in other state mottos.\\
\hline
\sys{WebQuestions}  & {\bf Q}: What part of the atom did Chadwick discover?$^\dagger$  \newline {\bf A}: neutron  & {\bf Article}: Atom \newline {\bf Paragraph}: ... The atomic mass of these isotopes varied by integer amounts, called the whole number rule. The explanation for these different isotopes awaited the discovery of the \hl{neutron}, an uncharged particle with a mass similar to the proton, by the physicist James Chadwick in 1932.  ... \\
\hline
\sys{WikiMovies} & {\bf Q}: Who wrote the film Gigli? \newline {\bf A}: Martin Brest &  {\bf Article}: Gigli \newline {\bf Paragraph}: Gigli is a 2003 American romantic comedy film written and directed by \hl{Martin Brest} and starring Ben Affleck, Jennifer Lopez, Justin Bartha, Al Pacino, Christopher Walken, and Lainie Kazan. \\
\hline
\end{tabularx}
\end{center}
\longcaption{Examples of distantly-supervised examples from QA datasets}{\label{fig:ds_examples}Example training data from each QA dataset. In each case we show an associated paragraph where distant supervision (DS) correctly identified the answer within it, which is highlighted.}
\end{figure}


\Subsection{retrieval-eval}{Document Retriever Performance}
We first examine the performance of our retrieval module on all the QA datasets. Table~\ref{tab:ir-res} compares the performance of the two approaches described in Section~\ref{sec:doc-retriever} with that of the Wikipedia Search Engine\footnote{We use the Wikipedia Search API \url{https://www.mediawiki.org/wiki/API:Search}.} for the task of finding articles that contain the answer given a question.

Specifically, we compute the ratio of questions for which the text span of any of their associated answers appear in at least one the top 5 relevant pages returned by each system.

Results on all datasets indicate that our simple approach outperforms Wikipedia Search, especially with bigram hashing. We also compare doing retrieval with Okapi BM25 or by using cosine distance in the word embeddings space (by encoding questions and articles as bag-of-embeddings), both of which we find performed worse.

\begin{table}[t]
\begin{center}
\normalsize
\begin{tabular}{l r r r}
\toprule
\bf Dataset &  \sys{Wiki. Search} & \multicolumn{2}{c}{\sys{Document Retriever}} \\
&    & unigram &  bigram  \\
\midrule
% SQuAD & 62.7 &  76.1 & \bf 77.8 \\
% %\curq  & 82.8 & 84.2 & \bf 85.6 \\
\sys{TREC} & 81.0 & 85.2 & \bf 86.0 \\
\sys{WebQuestions} &    73.7 & \bf 75.5 & 74.4 \\
\sys{WikiMovies} & 61.7 &  54.4 &  \bf 70.3 \\
\bottomrule
\end{tabular}
\end{center}
\longcaption{Document retrieval results}{\label{tab:ir-res} Document retrieval results. \% of questions for which the answer segment appears in one of the top 5 pages returned by the method. }
\end{table}


\subsection{Final Results}
\label{sec:drqa-final-results}
Finally, we assess the performance of our full system \sys{DrQA} for answering open-domain questions using all these datasets. We compare three versions of \sys{DrQA} which evaluate the impact of using distant supervision and multitask learning across the training sources provided to \sys{Document Reader} (\sys{Document Retriever} remains the same for each case):

\begin{itemize}
\item
  \sys{SQuAD}: A single \sys{Document Reader} model is trained on the \sys{SQuAD} training set only and used on all evaluation sets. We used the model that we described in Section~\ref{sec:drqa} (the F1 score is 79.0\% on the test set of \sys{SQuAD}).
\item
  Fine-tune (DS): A \sys{Document Reader} model is pre-trained on \sys{SQuAD} and then fine-tuned for each dataset independently using its distant supervision (DS) training set.
\item
  Multitask (DS): A single \sys{Document Reader} model is jointly trained on the SQuAD training set and all the distantly-supervised examples.
\end{itemize}

For the full Wikipedia setting we use a streamlined model that does not use the \sys{CoreNLP} parsed $f_{token}$ features or lemmas for $f_{exact\_match}$. We find that while these help for more exact paragraph reading in \sys{SQuAD}, they don't improve results in the full setting. Additionally, \sys{WebQuestions} and \sys{WikiMovies} provide a list of candidate answers (1.6 million \sys{Freebase} entity strings for \sys{WebQuestions} and 76k movie-related entities for \sys{WikiMovies}) and we restrict that the answer span must be in these lists during prediction.

Table~\ref{tab:drqa-full-results} presents the results. We only consider top-1, exact-match accuracy, which is the most restricted and challenging setting. In the original paper \cite{chen2017reading}, we also evaluated the question/answer pairs in SQuAD. We omit them here because that at least a third of these questions are context-dependent and are not really suitable for open QA.

\begin{table}[t]
\begin{center}
\begin{tabular}{l c ccc cc}
\toprule
\textbf{Dataset} &  \tf{YodaQA} &  \multicolumn{3}{c}{\tf{DrQA}} & \multicolumn{2}{c}{\tf{DrQA*}} \\
&   &  {SQuAD} &  {FT} & {MT} & {SQuAD} & {FT} \\
\midrule
\sys{TREC} & 31.3 & 19.7 & 25.7 & 25.4 &  21.3 &  28.8 \\
\sys{WebQuestions} & 38.9 & 11.8 & 19.5 & 20.7 & 14.2 & 24.3 \\
\sys{WikiMovies} & N/A & 24.5 & 34.3 & 36.5 & 31.9 & 46.0 \\
\bottomrule
\end{tabular}
\end{center}
\longcaption{Final performance of DrQA}{\label{tab:drqa-full-results} Full Wikipedia results. Top-1 exact-match accuracy (\%). \tf{FT}: Fine-tune (DS). \tf{MT}: Multitask (DS). The \sys{DrQA*} results are taken from \newcite{raison2018weaver}.}
\end{table}

Despite the difficulty of the task compared to the reading comprehension task (where you are given the right paragraph) and unconstrained QA (using redundant resources), \sys{DrQA} still provides reasonable performance across all four datasets.

We are interested in a single, full system that can answer any question using Wikipedia. The single model trained only on \sys{SQuAD} is outperformed on all the datasets by the multitask model that uses distant supervision. However, performance when training on SQuAD alone is not far behind, indicating that task transfer is occurring. The majority of the improvement from \sys{SQuAD} to Multitask (DS) learning, however, is likely not from task transfer, as fine-tuning on each dataset alone using DS also gives improvements, showing that is the introduction of extra data in the same domain that helps. Nevertheless, the best single model that we can find is our overall goal, and that is the Multitask (DS) system.

We compare our system to \sys{YodaQA} \cite{baudivs2015yodaqa} (an unconstrained QA system using redundant resources), giving results which were previously reported on \sys{TREC} and \sys{WebQuestions}.\footnote{The results are extracted from \href{https://github.com/brmson/yodaqa/wiki/Benchmarks}{https://github.com/brmson/yodaqa/wiki/Benchmarks}.} Despite the increased difficulty of our task, it is reassuring that our performance is not too far behind on \sys{TREC} (31.3 vs 25.4). The gap is slightly bigger on \sys{WebQuestions}, likely because this dataset was created from the specific structure of \sys{Freebase} which \sys{YodaQA} uses directly.

We also include the results from an enhancement of our model named \sys{DrQA*}, presented in \newcite{raison2018weaver}. The biggest change is that this reading comprehension model is trained and evaluated directly on the Wikipedia articles instead of paragraphs (documents are on average 40 times larger than individual paragraphs). As we can see, the performance has been improved consistently on all the datasets, and the gap from \sys{YodaQA} is hence further reduced.

\clearpage
\begin{longtable}{l l p{12cm}}
\hline
(a) & \tf{Question} & What is question answering? \\
& \tf{Answer} & a computer science discipline within the fields of information retrieval and natural language processing \\
& \tf{Wiki. article} & \href{https://en.wikipedia.org/wiki/Question_answering}{Question Answering} \\
& \tf{Passage} & {\small Question Answering (QA) is \hl{a computer science discipline within the fields of information retrieval and natural language processing} (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.} \\
\hline
(b) & \tf{Question} & Which state is Stanford University located in? \\
& \tf{Answer} & California \\
& \tf{Wiki. article} & \href{https://en.wikipedia.org/wiki/Stanford_Memorial_Church}{Stanford Memorial Church} \\
& \tf{Passage} & {\small Stanford Memorial Church (also referred to informally as MemChu) is located on the Main Quad at the center of the Stanford University campus in Stanford, \hl{California}, United States. It was built during the American Renaissance by Jane Stanford as a memorial to her husband Leland. Designed by architect Charles A. Coolidge, a protégé of Henry Hobson Richardson, the church has been called "the University's architectural crown jewel".} \\
\hline
(c) & \tf{Question} & Who invented LSTM? \\
& \tf{Answer} & Sepp Hochreiter \& J\"urgen Schmidhuber \\
& \tf{Wiki. article}  & \href{https://en.wikipedia.org/wiki/Deep_learning}{Deep Learning} \\
& \tf{Passage} & {\small Today, however, many aspects of speech recognition have been taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by \hl{Sepp Hochreiter \& J\"urgen Schmidhuber} in 1997. LSTM RNNs avoid the vanishing gradient problem and can learn ``Very Deep Learning'' tasks that require memories of events that happened thousands of discrete time steps ago, which is important for speech. In 2003, LSTM started} \\
& & {\small  to become competitive with traditional speech recognizers on certain tasks. Later it was combined with CTC in stacks of LSTM RNNs. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49\% through CTC-trained LSTM, which is now available through Google Voice to all smartphone users, and has become a show case of deep learning.} \\
\hline
(d) & \tf{Question} & What is the answer to life, the universe, and everything? \\
& \tf{Answer} & 42 \\
& \tf{Wiki. article} & \href{https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy}{Phrases from The Hitchhiker's Guide to the Galaxy} \\
& \tf{Passage} & {\small The number 42 and the phrase, "Life, the universe, and everything" have attained cult status on the Internet. "Life, the universe, and everything" is a common name for the off-topic section of an Internet forum and the phrase is invoked in similar ways to mean "anything at all". Many chatbots, when asked about the meaning of life, will answer "42". Several online calculators are also programmed with the Question. Google Calculator will give the result to "the answer to life the universe and everything" as 42, as will Wolfram's Computational Knowledge Engine. Similarly, DuckDuckGo also gives the result of "the answer to the ultimate question of life, the universe and everything" as \hl{42}. In the online community Second Life, there is a section on a sim called "42nd Life." It is devoted to this concept in the book series, and several attempts at recreating Milliways, the Restaurant at the End of the Universe, were made.} \\
\hline
\longcaption{Sample predictions of our \sys{DrQA} system}{\label{tab:drqa-output}Sample predictions of our \sys{DrQA} system.}
\end{longtable}

Lastly, our \sys{DrQA} system is open-sourced at \href{https://github.com/facebookresearch/DrQA}{https://github.com/facebookresearch/DrQA} (the Multitask (DS) system was deployed). Table~\ref{tab:drqa-output} lists some sample predictions that we tried by ourselves (not in any of these datasets). As is seen, our system is able to return a precise answer to all these factoid questions and answering some of these questions is not trivial:

\begin{enumerate}[(a)]
    \item It is not trivial to identify that \ti{a computer science discipline within the fields of information retrieval and natural language processing} is the complete noun phrase and the correct answer although the question is pretty simple.
    \item Our system finds the answer in another Wikipedia article \ti{Stanford Memorial Church}, and gives the exactly correct answer \ti{California} as the \ti{state} (instead of \ti{Stanford} or \ti{United States}).
    \item To get the correct answer, the system needs to understand the syntactic structure of the question and the context \ti{Who invented LSTM?} and \ti{a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter \& J\"urgen Schmidhuber in 1997.}
\end{enumerate}

Conceptually, our system is simple and elegant, and doesn't rely on any additional linguistic analysis or external or hand-coded resources (e.g., dictionaries). We think this approach holds great promise for a new generation of open-domain question answering systems. In the next section, we discuss current limitations and possible directions for further improvement.


================================================
FILE: chapters/openqa/future.tex
================================================
%!TEX root = ../../thesis.tex

\section{Future Work}
\label{sec:openqa-future}

Our \sys{DrQA} demonstrates that combining information retrieval and neural reading comprehension is an effective approach for open-domain question answering. We hope that our work takes the first step in this research direction. However, our system is still at an early stage and many implementation details can be further improved.

We think the following research directions will (greatly) improve our \sys{DrQA} system and should be pursued as future work. Indeed, some of the ideas have already been implemented in the following year after we published our \sys{DrQA} system and we will also describe them in detail in this section.

\paragraph{Aggregating evidence from multiple paragraphs.} Our system adopted the most simple and straightforward approach: we took the argmax over the unnormalized scores of all the retrieved passages. This is not ideal because 1) It implies that each passage must contain the correct answer (as \sys{SQuAD} examples) so our system will output one and only one answer for each passage. This is indeed not the case for most retrieved passages. 2) Our current training paradigm doesn't guarantee that the scores in different passages are comparable which causes a gap between the training and the evaluation process.

Training on full Wikipedia articles is a solution to alleviate this problem (see the \sys{DrQA*} results in Table~\ref{tab:drqa-full-results}), however, these models are running slowly and difficult to parallelize. \newcite{clark2018simple} proposed to perform multi-paragraph training with modified training objectives, where the span start and end scores are normalized across all paragraphs sampled from the same context. They demonstrated that it works much better than training on individual passages independently. Similarly, \newcite{wang2018r} and \newcite{wang2018evidence} proposed to train an explicit passage re-ranking component on the retrieved articles: \newcite{wang2018r} implemented this in a reinforcement learning framework so the re-ranker component and answer extraction components are jointly trained; \newcite{wang2018evidence} proposed a strength-based re-ranker and a coverage-based re-ranker which aggregate evidence from multiple paragraphs more directly.

\paragraph{Using more and better training data.} The second aspect which makes a big impact is the training data. Our \sys{DrQA} system only collected 44k distantly-supervised training examples from \sys{TREC}, \sys{WebQuestions} and \sys{WikiMovies}, and we demonstrated their effectiveness in Section~\ref{sec:drqa-final-results}. The system should be further improved if we can leverage more supervised training data --- from either \sys{TriviaQA}~\cite{joshi2017triviaqa} or generating more data from other QA resources. Moreover, these distantly supervised examples inevitably suffer from the noise problem (i.e., the paragraph doesn't imply the answer to the question even if the answer is contained) and \newcite{lin2018denoising} proposed a solution to de-noise these distantly supervised examples and demonstrated gains in an evaluation.

We also believe that adding negative examples should improve the performance of our system substantially. We can either create some negative examples using our full pipeline: we can leverage the \sys{Document Retrieval} module to help us find relevant paragraphs while they don't contain the correct answer. We can also incorporate existing resources such as \sys{SQuAD 2.0}~\cite{rajpurkar2018know} into our training process, which contains curated, high-quality negative examples.

\paragraph{Making the \sys{Document Retriever} trainable.} A third promising direction that has not been fully studied yet is to employ a machine learning approach for the \sys{Document Retriever} module. Our system adopted a straightforward, non-machine learning model and further improvement on the retrieval performance (Table~\ref{tab:ir-res}) should lead to an improvement on the full system. A training corpus for the \sys{Document Retriever} component can be collected either from other resources or from the QA data (e.g., using whether an article contains the answer to the question as a label). Joint training of the \sys{Document Retrieval} and the \sys{Document Reader} component will be a very desirable and promising direction for future work.

Related to this, \newcite{clark2018simple} also built an open-domain question answering system\footnote{The demo is at \href{https://documentqa.allenai.org}{https://documentqa.allenai.org}.} on top of a search engine (Bing web search) and demonstrated superior performance compared to ours. We think the results are not directly comparable and the two approaches (using a commercial search engine or building an independent IR component) both have pros and cons. Building our own IR component gets rid of an existing API call and can run faster and easily adapt to new domains.

\paragraph{Better \sys{Document Reader} module.} For our \sys{DrQA} system, we used the neural reading comprehension model which achieved F1 of 79.0\% on the test set of \sys{SQuAD 1.1}. With the recent development of neural reading comprehension models (Section~\ref{sec:advances}), we are confident that if we replace our current \sys{Document Reader} model with the state-of-the-art models~\cite{devlin2018bert}, the performance of our full system will be improved as well.

\paragraph{More analysis is needed.} Another important missing work is to conduct an in-depth analysis of our current systems: to understand which questions they can answer, and which they can't. We think it is important to compare our modern systems to the earlier TREC QA results under the same conditions. It will help us understand where we make genuine progress and what techniques we can still use from the pre-deep learning era, to build better question answering systems in the future.

Concurrent to our work, there are several works in a similar spirit to ours, including \sys{SearchQA}~\cite{dunn2017searchqa} and \sys{Quasar-T}~\cite{dhingra2017quasar}, which both collected relevant documents for trivia or \sys{Jeopardy!} questions --- the former one retrieved documents from \sys{ClueWeb} using the \sys{Lucene} index and the latter used \sys{Google} search. \sys{TriviaQA}~\cite{joshi2017triviaqa} also has an open-domain setting where all the retrieved documents from Bing web search are kept.
However, these datasets still focus on the task of question answering from the retrieved documents, while we are more interested in building an end-to-end QA system.


================================================
FILE: chapters/openqa/intro.tex
================================================
%!TEX root = ../../thesis.tex

% \section{Introduction}
In \sys{Part I}, we described the task of reading comprehension: its formulation and development over recent years, the key components of neural reading comprehension systems, and future research directions. However, it is unclear yet whether reading comprehension is merely used as a task of measuring language understanding abilities, or it can enable any useful applications.  In \sys{Part II}, we will answer this question and discuss our efforts at building applications which leverage neural reading comprehension as their core component.

In this chapter, we view \ti{open domain question answering} as an application of reading comprehension. Open domain question answering has been a long-standing problem in the history of NLP\@. The goal of open domain question answering is to build automated computer systems which are able to answer any sort of (factoid) questions that humans might ask, based on a large collection of unstructured natural language documents, structured data (e.g., knowledge bases), semi-structured data (e.g., tables) or even other modalities such as images or videos.

We are the first to test how the neural reading comprehension methods can perform in an open-domain QA framework. We believe that the high performance of these systems can be a key ingredient in building a new generation of open-domain question answering systems, when combined with effective information retrieval techniques.

This chapter is organized as follows. We first give a high-level overview of open domain question answering and some notable systems in the history (Section~\ref{sec:openqa-rw}). Next, we introduce an open-domain question answering system that we built called \sys{DrQA}, designed to answer questions from English Wikipedia (Section~\ref{sec:drqa}). It essentially combines an information retrieval module and the high-performing neural reading comprehension module that we described in Section~\ref{sec:sar}. We further talk about how we can improve the system by creating distantly-supervised training examples from the retrieval module. We then present a comprehensive evaluation on multiple question answering benchmarks (Section~\ref{sec:drqa-eval}). Finally, we discuss current limitations, follow-up work and future directions in Section~\ref{sec:openqa-future}.


================================================
FILE: chapters/openqa/related_work.tex
================================================
%!TEX root = ../../thesis.tex

\section{A Brief History of Open-domain QA}
\label{sec:openqa-rw}

Question answering was one of the earliest tasks for NLP systems since 1960s. One early system, which prefigures modern text-based question answering systems, was the \sys{Protosynthex} system of \cite{simmons1964indexing}. The system first formulated a query based on the content words in the question, retrieved candidate answer sentences based on the frequency-weighted term overlap with the question, and finally performed a dependency parse match to get the final answer. Another notable system \sys{MURAX} \cite{kupiec1993murax}, was designed to answer general-knowledge questions over \sys{Grolier}'s on-line encyclopedia, using shallow linguistic processing and information retrieval (IR) techniques.

The interest in open domain question answering has increased since 1999, when the QA track was first included as part of the annual TREC competitions\footnote{\url{http://trec.nist.gov/data/qamain.html}}. The task was at first defined such that the systems were to retrieve small snippets of text that contained an answer for open-domain questions. It has spurred a wide range of QA systems developed at the time, and the majority of the systems consisted of two stages: an IR system used to select the top $n$ documents or passages which match a query that has been generated from the question, and a window-based word scoring system used to pinpoint likely answers. For more details, readers are referred to \cite{voorhees1999trec,moldovan2000structure}.

More recently, with the development of knowledge bases (KBs) such as \sys{Freebase}~\cite{bollacker2008freebase} and \sys{DBpedia}~\cite{auer2007dbpedia}, many innovations have occurred in the context of question answering from KBs with the creation of resources like \sys{WebQuestions} \cite{berant2013semantic} and \sys{SimpleQuestions} \cite{bordes2015large} based on \sys{Freebase}, or on automatically extracted KBs, e.g., OpenIE triples and \sys{NELL} \cite{fader2014open}. A lot of progress has been made on knowledge-based question answering and the major approaches are either based on semantic parsing or information extraction techniques~\cite{yao2014freebase}. However, KBs have inherent limitations (incompleteness and fixed schemas) that motivated researchers to return to the original setting of answering from raw text lately.

\begin{figure}[t]
    \center
    \includegraphics[scale=0.25]{img/deepqa.png}
    \longcaption{The high-level architecture of IBM's \sys{DeepQA} used in \sys{Watson}.}{\label{fig:watson}The high-level architecture of IBM's \sys{DeepQA} used in \sys{Watson}. Image courtesy: \href{https://en.wikipedia.org/wiki/Watson_(computer)}{https://en.wikipedia.org/wiki/Watson\_(computer)}.}
\end{figure}

There are also a number of highly developed full pipeline QA approaches using a myriad of resources, including both text collections (Web pages, Wikipedia, newswire articles) and structured knowledge bases (\sys{Freebase}, \sys{DBpedia} etc.). A few notable systems include Microsoft's \sys{AskMSR} \cite{brill2002askmsr},
IBM's \sys{DeepQA} \cite{ferrucci2010building} and \sys{YodaQA} \cite{baudivs2015yodaqa} --- the latter of which is open source and hence reproducible for comparison purposes. \sys{AskMSR} is a search-engine based QA system that relies on ``data redundancy rather than sophisticated linguistic analyses of either questions or candidate answers''. \sys{DeepQA} is the most representative modern question answering system and its victory at the TV game-show \sys{Jeopardy!} in 2011 received a great deal of attention. It is a very sophisticated system that consists of many different pieces in the pipeline and it relies on unstructured information as well as structured data to generate candidate answers or vote over evidence. A high-level architecture is illustrated in Figure~\ref{fig:watson}. \sys{YodaQA} is an open source system modeled after \sys{DeepQA}, similarly combining websites, databases and Wikipedia in particular. Comparing against these methods provides a useful datapoint for an ``upper bound'' benchmark on performance.

Finally, there are other types of question answering problems based on different types of resources, including Web tables~\cite{pasupat2015compositional}, images~\cite{antol2015vqa}, diagrams~\cite{kembhavi2017you} or even videos~\cite{tapaswi2016movieqa}. We are not going into further details as our work focuses on text-based question answering.

Our \sys{DrQA} system (Section~\ref{sec:drqa}) focuses on question answering using Wikipedia as the unique knowledge source, such as one does when looking for answers in an encyclopedia.  QA using Wikipedia as a resource has been explored previously. \newcite{ryu2014open} perform open-domain QA using a Wikipedia-based knowledge model. They combine article content with multiple other answer matching modules based on different types of semi-structured knowledge such as infoboxes, article structure, category structure, and definitions. Similarly, \newcite{Ahn2004using} also combine Wikipedia as a text resource with other resources, in this case with information retrieval over other documents. \newcite{buscaldi2006mining} also mine knowledge from Wikipedia for QA. Instead of using it as a resource for seeking answers to questions, they focus on validating answers returned by their QA system, and use Wikipedia categories for determining a set of patterns that should fit with the expected answer. In our work, we consider the comprehension of text only, and use Wikipedia text documents as the sole resource in order to emphasize the task of reading comprehension. We believe that adding other knowledge sources or information will further improve the performance of our system.


================================================
FILE: chapters/openqa/system.tex
================================================
%!TEX root = ../../thesis.tex

\section{Our System: \sys{DrQA}}
\label{sec:drqa}

\subsection{An Overview}

In the following we describe our system \sys{DrQA}, which focuses on answering questions using English Wikipedia as the unique knowledge source for documents. We are interested in building a general-knowledge question answering system, which can answer any sort of factoid questions where the answer is contained in and can be extracted from Wikipedia.

There are several reasons that we choose to use Wikipedia: 1) Wikipedia is a constantly evolving source of large-scale, rich, detailed information that could facilitate intelligent machines. Unlike knowledge bases (KBs) such as \sys{Freebase} or \sys{DBPedia}, which are easier for computers to process but too sparsely populated for open-domain question answering, Wikipedia contains up-to-date knowledge that humans are interested in. 2) Many reading comprehension datasets (e.g., \sys{SQuAD}) are built on Wikipedia so that we can easily leverage these resources and we will describe it soon. 3) Generally speaking, Wikipedia articles are clean, high-quality and well-formed and thus they are highly useful resources for open domain question answering.

Using Wikipedia articles as the knowledge source causes the task of question answering (QA) to combine the challenges of both large-scale open-domain QA and of machine comprehension of text. In order to answer any question, one must first retrieve the few relevant articles among more than 5 million items, and then scan them carefully to identify the answer. This is reminiscent of how classical TREC QA systems worked, but we believe that neural reading comprehension models will play a crucial role of \ti{reading} the retrieved articles/passages to obtain the final answer. As shown in Figure \ref{fig:drqa-system}, our system essentially consists of two components: (1) the \sys{Document Retriever} module for finding relevant articles and (2) a reading comprehension model, \sys{Document Reader}, for extracting answers from a single document or a small collection of documents.

Our system treats Wikipedia as a collection of articles and does not rely on its internal graph structure. As a result, our approach is generic and could be switched to other collections of documents, books, or even daily updated newspapers. We detail the two components next.

\subsection{Document Retriever}
\label{sec:doc-retriever}
Following classical QA systems, we use an efficient (non-machine learning) document retrieval system to first narrow our search space and focus on reading only articles that are likely to be relevant. A simple inverted index lookup followed by term vector model scoring performs quite well on this task for many question types, compared to the built-in ElasticSearch based Wikipedia Search API \cite{gormley2015elasticsearch}. Articles and questions are compared as TF-IDF weighted bag-of-word vectors.

We further improve our system by taking local word order into account with n-gram features. Our best performing system uses bigram counts while preserving speed and memory efficiency by using the hashing of \cite{weinberger2009feature} to map the bigrams to $2^{24}$ bins with an unsigned \emph{murmur3} hash.

We use the \sys{Document Retriever} as the first part of our full model, by setting it to return 5 Wikipedia
articles given any question. Those articles are then processed by the \sys{Document Reader}.


\begin{figure}[t]
\begin{center}
\includegraphics[height=8cm]{img/drqa_system.pdf}
\end{center}
\longcaption{An overview of DrQA system}{\label{fig:drqa-system} An overview of our question answering system DrQA.}
\end{figure}

\subsection{Document Reader}
The \sys{Document Reader} takes the top 5 Wikipedia articles and aims to read all the paragraphs and extracts the possible answers from them. This is exactly the setup as we did in span-based reading comprehension problems, and the \sys{Stanford Attentive Reader} model that we described in Section~\ref{sec:sar} can be directly plugged into this pipeline.

We apply our trained \sys{Document Reader} for each single paragraph that appears in the top 5 Wikipedia articles and it predicts an answer span with a confidence score. To make scores compatible across paragraphs in one or several retrieved documents, we use the unnormalized exponential and take argmax over all considered paragraph spans for our final prediction. This is just a very simple heuristic and there are better ways to aggregate evidence over different paragraphs. We will discuss future work in Section~\ref{sec:openqa-future}.

\subsection{Distant Supervision}
We have built a complete pipeline which integrates a classical retrieval module and our previous neural reading comprehension component. The remaining key question is how can we train this reading comprehension module for the open-domain question answering setting?

The most direct approach is just to reuse the SQuAD dataset~\cite{rajpurkar2016squad} as the training corpus, which was also built on top of Wikipedia paragraphs. However, this approach is limited in the following ways:

\begin{itemize}
    \item
        As we discussed earlier in Section~\ref{sec:future-datasets}, the questions in \sys{SQuAD} were crowdsourced after the annotators see the paragraphs to ensure they can be answered by a span in the passage. This distribution is quite specific and different from that of real-world question-answering when people have a question in mind first and try to find out he answers from the Web or other sources.
    \item
        Many \sys{SQuAD} questions are indeed context-dependent. For example, a question is \ti{What individual is the school named after?} posed on one passage of the Wikipedia article \ti{Harvard University}, or another question is \ti{What did Luther call these donations?} based on a passage that describes \ti{Martin Luther}. Basically, these questions cannot be understood by themselves and thus are useless for open-domain QA problems. \newcite{clark2018simple} estimated around 32.6\% of the questions in \sys{SQuAD} are either document-dependent or passage-dependent.
    \item
        Finally, the size of SQuAD is rather small (80k training examples). It should further improve the system performance if wen can collect more training examples.
\end{itemize}

To overcome these problems, we propose a procedure to automatically create additional training examples from other question answering resources. The idea is to re-use the efficient information retrieval module that we built: if we already have a question answer pair $(q, a)$ and the retrieval module can help us find a paragraph relevant to the question $q$ and the answer segment $a$ appears in the paragraph, then we can create a \ti{distantly-supervised} training example in the form of a $(p, q, a)$ triple for training the reading comprehension models:

\begin{eqnarray}
   & f: (q, a) \Longrightarrow (p, q, a) \\
    & \text{ if } p \in \text{ Document\_Retriever }(q) \text{ and } a \text{ appears in } p \nonumber
\end{eqnarray}

This idea is a similar spirit to the popular approach of using distant supervision (DS) for relation extraction \cite{mintz2009distant} \footnote{The idea for relation extraction is to pair textual mentions which contain the two entities which is known as a relation between them in an existing knowledge base.}. Despite that these examples can be noisy to some extent, it offers a cheap solution to create distantly supervised examples for open-domain question answering and will be a useful addition to \sys{SQuAD} examples. We will describe the effectiveness of these distantly supervised examples in Section~\ref{sec:drqa-eval}.


================================================
FILE: chapters/rc_future/datasets.tex
================================================
%!TEX root = ../../thesis.tex

\section{Future Work: Datasets}
\label{sec:future-datasets}

We have mostly focused on \sys{CNN/Daily Mail} and \sys{SQuAD} and demonstrated that both 1) neural models are able to achieve either super-human or the ceiling performance on them; 2) although these datasets are highly useful, most of the examples are rather simple and don't require much reasoning yet.  What desired properties are still missing in these datasets? What kind of datasets should we work on next? And how to collect better datasets?

% still a quite restricted setup: (a) the crowdworkers can see the passage when they write the questions. As a result, there is usually a high lexical overlap between the question and the paragraph and thus it greatly eases the difficulty of answering these questions;  (b) questions are only allowed when they can be answered using a single span in the passage and this excludes many possible questions from the dataset such as those \ti{yes/no}, \ti{counting} or \ti{why} questions; (c) it is known that most of the questions in \sys{SQuAD} don't really need complex reasoning (combining facts from multiple sentences or background knowledge) and they are usually not compositional (which needs to be decomposed into multiple steps of simple questions).

We think that datasets like \sys{SQuAD} mainly have the following limitations:
\begin{itemize}
    \item
        The questions are \ti{posed based on the passage}.  That said, if a questioner is looking at the passage while they ask a question, they are quite likely to mirror the sentence structure and to reuse the same words. This eases the difficulty of answering questions as many questions words are overlapping with the passage words.
    \item
        It only allows questions that are \ti{answerable by a single span in the passage}. This not only implies all the questions are answerable, but also excludes many possible questions to be posed such as \ti{yes/no}, \ti{counting} questions. As we discussed earlier, most of the questions in \sys{SQuAD} are factoid questions and the answers are generally short (3.1 tokens on average). Therefore, there are also very few \ti{why} (cause and effect) and \ti{how} (procedure) questions in the dataset.
    \item
        Most of the questions can be answered by \ti{a single supporting sentence} in the passage and don't require multiple-sentence reasoning. \newcite{rajpurkar2016squad} estimated that only $13.6\%$ of the examples need multiple sentence reasoning. Among them, we think that most of the cases are resolving conferences, which might be solved by a coreference system.
\end{itemize}

To address these limitations, there have been a number of new datasets collected recently. They follow a similar paradigm of \sys{SQuAD} but are constructed in various ways. Table~\ref{tab:recent-datasets} gives an overview of a few representative datasets. As we can see, these datasets are of a similar order of magnitude (ranging from 33k to 529k training examples), and there is still a gap between the state-of-the-art and the human performance (some gaps are bigger than the others though). In the following, we describe these datasets in detail and discuss how they tackle the aforementioned limitations and their advantages/disadvantages:

\begin{table}[t]
    \centering
    \small
    \begin{tabular}{l | c c c | c | c c c}
      \toprule
      \tf{Dataset} & \tf{\#Train} & \tf{\#Dev} & \tf{\#Test} & \tf{Domain} & \tf{Metric} & \tf{Human} & \tf{SOTA} \\
      \midrule
      \sys{TriviaQA} (Web) & 528,979 & 68,621 & 65,059 & Web & F1 & N/A\footnote{\newcite{joshi2017triviaqa} provided oracles scores of \ti{exact match} accuracies of 82.8\% and 83.0\% of the Web and Wikipedia domain respectively. These numbers measure the percentage of examples that answer can be found in the documents and differ from human performance.} & 71.3 \\
      \sys{TriviaQA} (Wiki.)\footnote{In contrast to the Web domain of \sys{TriviaQA}, the Wikipedia domain is evaluated over questions instead of documents.} & 61,888 & 9,951 & 9,509 & { Wikipedia} & F1 & N/A & 68.9 \\
      \sys{RACE} & 87,866 & 4,887 & 4,934 & Exams & Accuracy & 100.0 & 59.0 \\
      \sys{NarrativeQA}\footnote{We only list the setting where the summaries are given.} & 32,747 & 3,461 & 10,557 & Wikipedia & ROUGE-L & 57.0 & 36.3 \\
      \sys{SQuAD 2.0} & 130,319 & 11,873 & 8,862 & Wikipedia & F1 & 89.5 & 83.1 \\
      \sys{HotpotQA}\footnote{We only list the ``distractor'' setting.}  & 90,564 & 7,405 & 7,405 & {Wikipedia} & F1 & 91.4 & 59.0 \\
      \bottomrule
    \end{tabular}
    \longcaption{A summary of more recent reading comprehension datasets}{\label{tab:recent-datasets}A summary of more recent reading comprehension datasets. We only show the F1 results for span-prediction tasks and ROUGE-L for free-form answer tasks. The state-of-the-art results are taken from \newcite{clark2018simple} for \sys{TriviaQA}~\cite{joshi2017triviaqa}, \newcite{radford2018improving} for \sys{RACE}~\cite{lai2017race}, \newcite{kovcisky2018narrativeqa} for \sys{NarrativeQA}, \newcite{devlin2018bert} for \sys{SQuAD 2.0}~\cite{rajpurkar2018know} and \newcite{yang2018hotpotqa} for \sys{HotpotQA}.}
\end{table}

\paragraph{TriviaQA~\cite{joshi2017triviaqa}.} The key idea of this dataset is that question/answer pairs were collected \ti{before} constructing the corresponding passages. More specifically, they gathered 95k question-answer pairs from trivia and quiz-league websites and collected textual evidence which contained the answer from either Web search results or Wikipedia pages corresponding to the entities which are mentioned in the question. As a result, they collected 650k (passage, question, answer) triples in total. This paradigm effectively solved the problem that questions were dependent on the passage and also it is easier to construct a large dataset cheaply. It is worth noting that the passages used in this dataset are mostly long documents (the average document length is 2,895 words and it is 20 times longer than that of \sys{SQuAD}), and also posed a challenge of scalability for existing models.  However, it has a similar problem to the \sys{CNN/Daily Mail} dataset --- as the dataset was curated heuristically, there is no guarantee that the passage really provides the answer to the question and this influences the quality of the training data.

\paragraph{RACE~\cite{lai2017race}.} Humans' standardized tests are a proper way to evaluate machines' reading comprehension abilities. \sys{RACE} is a multiple choice dataset collected from the English exams for middle-school and high-school Chinese students within the 12–-18 age range. All the questions and answer options were created by experts. As a result, the dataset is more difficult than most existing datasets and it was estimated that 26\% of the questions require multiple sentence reasoning. The state-of-the-art performance is only 59\% so far (each question has 4 candidate answers).

\paragraph{NarrativeQA~\cite{kovcisky2018narrativeqa}.} This is a challenging dataset and it required crowdworkers to ask questions based on the plot summaries of a book or a movie from Wikipedia. The answers are free-form human-generated text and in particular, the annotators were encouraged to use their own words and copying is not allowed in the interface. The plot summaries usually contain more characters and events and more complex to follow than news articles or Wikipedia paragraphs. The dataset consists of two settings: one is to answer questions base on the summary (659 tokens on average) which is more similar to \sys{SQuAD}, and the other is to answer questions based on the full book or movie script (62,528 tokens on average). The second setting is especially difficult, as it requires IR components to locate relevant information in the long documents. One problem with this dataset is that human agreement is low due to its free-form answer form and thus it is difficult to evaluate.

\paragraph{SQuAD 2.0~\cite{rajpurkar2018know}.} \sys{SQuAD 2.0} proposed to add 53,775 negative examples to the original \sys{SQuAD} dataset. These questions are not answerable from the passage, but look similar to the positive ones (relevant and the passage contains a plausible answer). To work well on the dataset, systems need to not only answer questions but also determine when no answer is supported by the paragraph and abstain from answering. This is an important aspect in practical applications but has been omitted in previous datasets.

\paragraph{HotpotQA~\cite{yang2018hotpotqa}.} This dataset aims to construct questions which need multiple supporting documents to answer. To approach this, the crowdworkers were required to ask questions based on two relevant Wikipedia paragraphs (there is a hyperlink from the first paragraph of one article to the other). It also offers a new type of factoid comparison question, for which systems need to compare two entities on some shared properties. The dataset consists of two settings for evaluation -- one is called the \ti{distractor} setting in which each question is provided 10 passages, including the two passages used for constructing the question and 8 distractor passages retrieved from Wikipedia; the second setting is to use the full Wikipedia to answer the question.

Compared to \sys{SQuAD}, these datasets either require more complex reasoning cross sentences or documents, or need to handle longer documents, or need to generate free-form answers instead of extracting a single span, or predict when there is no answer in the passage. They posed new challenges and many are still beyond the scope of existing models. We believe that these datasets will further inspire a series of modeling innovations in the future. After our models can reach the next level of performance, we will need to set out to construct even more difficult datasets to solve.


================================================
FILE: chapters/rc_future/models.tex
================================================
%!TEX root = ../../thesis.tex

\section{Future Work: Models}
\label{sec:future-models}

Next we turn to the research directions of models for future work. We first describe the desiderata of reading comprehension models. Most of the existing work only focuses on \ti{accuracy} --- given a standard training/development/testing split of a dataset, the major goal is to get the best accuracy score on the testing set. However, we argue that there are other important aspects which have been overlooked that we will need to work on in the future, including \ti{speed and scalability}, \ti{robustness} and \ti{interpretability}. Lastly, we discuss what important elements are still missing in the current models, to solve more difficult reading comprehension problems.

\subsection{Desiderata}
Besides \ti{accuracy} (achieving a better performance score on a standard dataset), the following desiderata are also very important for future work:

\paragraph{Speed and Scalability.} How to build faster models (for both training and inference) and how to scale to longer documents is an important direction to pursue. Building faster models for training can lead to lower turnaround time for experimentation and also enable us to train on bigger datasets. Building faster models for inference is highly useful when we deploy the models in practical use. Also, it is unrealistic to encode a very long document (e.g., \sys{TriviaQA}) or even a full book (e.g., \sys{NarrativeQA}) using an RNN and this still remains a severe challenge. For example, the average document length of \sys{TriviaQA} is 2,895 tokens and the authors truncated the documents to the first 800 tokens for the sake of scalability. The average document length of \sys{NarrativeQA} is 62,528 tokens and the authors have to first retrieve a small number of relevant passages from the story using an IR system.

Existing solutions to these problems include:
\begin{itemize}
    \item
        Replacing LSTMs with non-recurrent models such as \sys{Transformer}~\cite{vaswani2017attention} or lighter recurrent units such as \sys{SRU}~\cite{lei2018simple} as we discussed in Section~\ref{sec:alt-lstms}.
    \item
        Training models which learn to skip part of the documents so they don't need to read all of the content. These models can run much faster while still retaining a similar performance. Representative works in this line include \newcite{yu2017learning} and \newcite{seo2018neural}.
    \item
        The choice of optimization algorithms can also greatly affect the convergence speed. Multi-GPU training and hardware performance are also important aspects to consider but they are beyond the scope of this thesis. \newcite{coleman2017dawnbench} provide a benchmark\footnote{\href{https://dawn.cs.stanford.edu/benchmark/}{https://dawn.cs.stanford.edu/benchmark/}} which measures the end-to-end training and inference time to achieve a state-of-the-art accuracy level for a wide range of tasks, including \sys{SQuAD}.
\end{itemize}


\paragraph{Robustness.} We discussed in Section~\ref{sec:squad-errors} that existing models are very brittle to adversarial examples which will become a severe problem when we deploy these models in the real world. Moreover, most of the current works follow the standard paradigm: training and evaluating on the splits of one dataset. It is known that if we train our models on one dataset and evaluate on another dataset, the performance will drop dramatically due to their different source of text and construction methods. For future work, we need to consider:
\begin{itemize}
    \item How to create better adversarial training examples and incorporate them into the training process.
    \item Researching more on transfer learning and multi-task learning, so that we can build models with high performance across various datasets.
    \item We might need to break the standard paradigm of supervised learning, and think about how to create better ways of evaluating our current models for the sake of building more robust models.
\end{itemize}

\paragraph{Interpretability.} The last important aspect is \ti{interpretability} and it has been mostly ignored in the current systems.  Our future systems should not only be able to provide the final answers, but also provide the rationales behind their predictions, so users can decide if they can trust the outputs and leverage them or not. Neural networks are especially notorious for the fact that the end-to-end training paradigm makes these models like a black box and it is hard to interpret their predictions. This is especially crucial if we want to apply these systems to medical or legal domains.

Interpretability can have different definitions. In our context, we think there could be several ways to approach that:
\begin{itemize}
    \item
        The easiest way is to require the models to learn to extract input pieces from the documents as supporting evidence. This has been studied before (e.g., \cite{lei2016rationalizing}) for sentence classification problems but not yet in reading comprehension problems.
    \item
        A more complex way is that the models can indeed generate rationales. Instead of only highlighting the relevant piece of information in the passage, the models need to interpret how these pieces are connected and finally get to the answer. Take Figure~\ref{fig:sar-squad-errors} (c) as an example, the systems need to interpret that the two cities are the two largest and 3.7 million is bigger than 1.3 million thus it is the second largest. We think this desiderata is very important but far beyond the scope of current models.
    \item
        Finally, another important aspect to consider is what training resources we can get to approach this level of interpretability. Inferring rationales from the final answers is feasible but quite difficult. We should consider collecting human explanations as the supervision of training rationales in the future.
\end{itemize}

\subsection{Structures and Modules}
In this section, we are going to discuss what are the missing elements in the current models, if we want to solve more difficult reading comprehension problems.

First of all, current models are all built on either sequence models or tackle all pairs of words symmetrically (e.g., \sys{Transformer}), and omit the inherent structure of language. On the one hand, this forces our models to learn all the relevant linguistic information from scratch, which makes the learning of our models more difficult. On the other hand, the NLP community has put a lot of effort into studying linguistic representation tasks (e.g., syntactic parsing, coreference) and building many linguistic resources and tools for years. Language encodes meaning in terms of hierarchical, nested structures on sequences of words. Would it be still useful to encode linguistic structures more explicitly in our reading comprehension models?

Figure~\ref{fig:corenlp-output} illustrates the \sys{CoreNLP}~\cite{manning2014stanford} output of several examples in \sys{SQuAD}. We believe that this structural information would be useful as follows:

\begin{enumerate}[(a)]
    \item
        The information that \ti{2,400} is a \ti{numeric modifier} of \ti{professors} should help answer the question \ti{What is the total number of professors, instructors, and lecturers at Harvard?} (We have seen this example as an error case in Figure~\ref{fig:sar-squad-errors}).
    \item
        The coreference information that \ti{it} refers to \ti{Harvard} should help answer the question \ti{Starting in what year has Harvard topped the Academic Rankings of World Universities?}.
\end{enumerate}

Therefore, we think that these linguistic knowledge/structures would be still a useful addition to the current models. The remaining questions that we need to answer are: 1) What are the best ways to incorporate these structures into sequence models? 2) Do we want to model the structures as a latent variable or rely on off-the-shelf linguistic tools? For the latter case, are the current tools good enough so that the models can benefit more (rather than suffering from noise)? Can we further improve the performance of these representation tasks?

\begin{figure}[t]
  \center
  (a)
  \includegraphics[scale=0.20]{img/dep_example.png}
  (b)
  \includegraphics[scale=0.42]{img/coref_example.png}
  \longcaption{Example output of \sys{CoreNLP}: dependencies and coreference}{\label{fig:corenlp-output} Example output of \sys{CoreNLP}: (a) dependencies and (b) coreference. The image is taken from \href{http://corenlp.run}{http://corenlp.run}.}
\end{figure}

Another aspect we think is still missing from most existing models is \ti{modules}. The task of reading comprehension is inherently very complex and different types of examples require different types of reasoning capabilities. It still remains a grand challenge if we want to learn everything through a giant neural network (This is reminiscent of why the attention mechanism was proposed because we don't want to squash the meaning of a sentence or a paragraph into one vector!). We believe that, if we want to approach deeper level of reading comprehension, our future models will be more structured, more modularized, and solving one comprehensive task can be decomposed into many subproblems and we can tackle each smaller subproblem (e.g., each reasoning type) separately and combine all of them in the end.

The idea of \ti{modules} has been implemented in \sys{Neural Module Networks (NMN)} \cite{andreas2016learning} before. They first perform a dependency parse of the question, and then decompose the question answering problem into several ``modules'' based on the parse structure. One example they used for a visual question answering (VQA) task is: a question ``What color is the bird?'' can be decomposed as two modules. One module is used to detect the bird in the given image, and another module is used to detect the color of the found region (bird). We believe that this sort of approach holds promise to answer questions such as \ti{What is the population of the second largest city in California?} (Figure~\ref{fig:sar-squad-errors} (c)). However, \sys{NMN} has only been studied on visual question answering or small knowledge-base question question problems so far, and applying to reading comprehension problems can be more challenging due to the flexibility of language variations and question types.


================================================
FILE: chapters/rc_future/overview.tex
================================================
%!TEX root = ../../thesis.tex

% \section{Introduction}
In the previous chapter, we have described how neural reading comprehension models succeeded in current reading comprehension benchmarks and their key insights. Despite its rapid progress, there is still a long way to go towards genuine human-level reading comprehension. In this chapter, we will discuss future work and open questions.

We first examine the error cases of existing models in Section~\ref{sec:squad-errors}, and conclude that they still fail on ``easy'' or ``trivial'' cases despite their high accuracies on average.

As we discussed earlier, the success of recent reading comprehension is attributed to both the creation of large-scale datasets and the development of neural reading comprehension models. In the future, we believe both components will be still equally important. We  discuss the future work of datasets and models respectively in Section~\ref{sec:future-datasets} and \ref{sec:future-models}. What is still missing in the existing datasets and models? How can we approach that?

Finally, we review several important research questions in this field in Section~\ref{sec:research-questions}.

\section{Is SQuAD Solved Yet?}
\label{sec:squad-errors}

Although we have already achieved super-human performance on the \sys{SQuAD} dataset, does it indicate that our reading comprehension models are capable of solving all the \sys{SQuAD} examples or any examples with the same level of difficulty?

Figure~\ref{fig:sar-squad-errors} demonstrates some failure cases of our \sys{Stanford Attentive Reader} model described in Section \ref{sec:sar}. As we can see, the model predicts the answer type perfectly for all these examples: it predicts a number for the question \ti{what is the total number of \ldots ?} and \ti{what is the population \ldots ?} and a team name for the question \ti{Which team won Super Bowl 50?}. However, the model failed to understand the subtleties expressed in the text and can't distinguish among the candidate answers. In more detail,


\begin{enumerate}[(a)]
  \item The number \ti{2,400} modifies \ti{professors, lecturers, and instructors} while \ti{7,200} modifies \ti{undergraduates}. However, the system failed to identify that and we believe that linguistic structures (e.g., syntactic parsing) can help resolve this case.
  \item Both teams \ti{Denver Broncos} and \ti{Carolina Panthers} are modified by the word \ti{champion}, but the system failed to infer that ``X defeated Y'' so ``X won''.
  \item The system predicted \ti{100,000} probably because it is closer to the word \ti{population}. However, to answer the question correctly, the system has to identify that \ti{3.7 million} is the population of \ti{Los Angles}, and \ti{1.3 million} is the population of \ti{San Diego} and compare the two numbers and infer that \ti{1.3 million} is the answer because it is \ti{second largest}. This is a difficult example and probably beyond the scope of all the existing systems.
\end{enumerate}

\begin{figure}[p]
    \centering
    \begin{tabular}{l | p{13.5cm}}
    \hline
    (a) &\tf{Question}: What is the total number of professors, instructors, and lecturers at Harvard? \\
    & \tf{Passage}: Harvard's \blue{2,400} professors, lecturers, and instructors instruct \red{7,200} undergraduates and 14,000 graduate students. The school color is crimson, which is also the name of the Harvard sports teams and the daily newspaper, The Harvard Crimson. The color was unofficially adopted (in preference to magenta) by an 1875 vote of the student body, although the association with some form of red can be traced back to 1858, when Charles William Eliot, a young graduate student who would later become Harvard's 21st and longest-serving president (1869–-1909), bought red bandanas for his crew so they could more easily be distinguished by spectators at a regatta. \\
    & \tf{Gold answer}: 2,400 \\
    & \tf{Predicted answer}: 7,200 \\
    \hline
    (b) & \tf{Question}: Which team won Super Bowl 50? \\
    & \tf{Passage}: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion \blue{Denver Broncos} defeated the National Football Conference (NFC) champion \red{Carolina Panthers} 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. \\
    & \tf{Gold answer}: Denver Broncos \\
    & \tf{Predicted answer}: Carolina Panthers \\
    \hline
    (c) & \tf{Question}: What is the population of the second largest city in California? \\
    & \tf{Passage}: Los Angeles (at 3.7 million people) and San Diego (at \blue{1.3 million} people), both in southern California, are the two largest cities in all of California (and two of the eight largest cities in the United States). In southern California there are also twelve cities with more than 200,000 residents and 34 cities over \red{100,000} in population. Many of southern California's most developed cities lie along or in close proximity to the coast, with the exception of San Bernardino and Riverside. \\
    & \tf{Gold answer}: 1.3 million \\
    & \tf{Predicted answer}: 100,000 \\
    \hline
    \end{tabular}
    \longcaption{Failure cases of our model on SQuAD}{\label{fig:sar-squad-errors}Several failure cases of our model on \sys{SQuAD}. Gold answers are marked as \blue{blue} and predicted answers are marked as \red{red}.}
\end{figure}

\begin{figure}[p]
    \centering
    \small
    \begin{tabular}{l | p{13.5cm}}
    \hline
    (d) &\tf{Question}: What is the least number of members a board of trustees can have? \\
    & \tf{Passage}: The Book of Discipline is the guidebook for local churches and pastors and describes in considerable detail the organizational structure of local United Methodist churches. All UM churches must have a board of trustees with at least \blue{three} members and no more than \red{nine} members and it is recommended that no gender should hold more than a 2/3 majority. All churches must also have a nominations committee, a finance committee and a church council or administrative council. Other committees are suggested but not required such as a missions committee, or evangelism or worship committee. Term limits are set for some committees but not for all. The church conference is an annual meeting of all the officers of the church and any interested members. This committee has the exclusive power to set pastors' salaries (compensation packages for tax purposes) and to elect officers to the committees. \\
    & \tf{Gold answer}: three \\
    & \tf{Predicted answer}: nine \\
    \hline
    (e) & \tf{Question}: Where does centripetal force go? \\
    & \tf{Passage}: where  is the mass of the object,  is the velocity of the object and  is the distance to the center of the circular path and  is the unit vector pointing in the radial direction outwards from the center. This means that the unbalanced centripetal force felt by any object is always directed toward \blue{the center of the curving path}. Such forces act perpendicular to the velocity vector associated with the motion of an object, and therefore do not change the speed of the object (magnitude of the velocity), but only the direction of the velocity vector. The unbalanced force that accelerates an object can be resolved into a component that is perpendicular to the path, and one that is tangential to the path. This yields both the tangential force, which accelerates the object by either slowing it down or speeding it up, and the radial (centripetal) force, which \red{changes its direction}. \\
    & \tf{Gold answer}: the center of the curving path \\
    & \tf{Predicted answer}: changes its direction \\
    \hline
    (f) & \tf{Question}: How many times have the Panthers been in the Super Bowl? \\
    & \tf{Passage}: The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP). They defeated the Arizona Cardinals 49–15 in the NFC Championship Game and advanced to their \blue{second} Super Bowl appearance since the franchise was founded in 1995. The Broncos finished the regular season with a 12–4 record, and denied the New England Patriots a chance to defend their title from Super Bowl XLIX by defeating them 20–18 in the AFC Championship Game. They joined the Patriots, Dallas Cowboys, and Pittsburgh Steelers as one of four teams that have made \red{eight} appearances in the Super Bowl. \\
    & \tf{Gold answer}: second \\
    & \tf{Predicted answer}: eight \\
    \hline
    \end{tabular}
    \longcaption{Failure cases of the currently best model (\sys{BERT} ensemble) on SQuAD}{\label{fig:bert-squad-errors}Several failure cases of the currently best model (\sys{BERT} ensemble) on \sys{SQuAD}. Gold answers are marked as \blue{blue} and predicted answers are marked as \red{red}.}
\end{figure}

\begin{figure}[!h]
    \centering
    \begin{tabular}{p{13.5cm}}
    \hline
      \tf{Question}: What is the name of the quarterback who was 38 in Super Bowl XXXIII? \\
      \tf{Passage}: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by \blue{John Elway}, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. \ti{Quarterback \red{Jeff Dean} had jersey number 37 in Champ Bowl XXXIV.} \\
    \hline
    \end{tabular}
    \longcaption{An adversarial example used in ~\cite{jia2017adversarial}}{\label{fig:adversarial-example}An adversarial example used in ~\cite{jia2017adversarial}, where a distracting sentence is added to the end of the passage (italicized). \blue{Blue}: the correct answer and \red{red}: the predicted answer.}
\end{figure}

We also took a closer look at the predictions of the best SQuAD model so far --- an ensemble of 7 \sys{BERT} models \cite{devlin2018bert}. As is demonstrated in Figure~\ref{fig:bert-squad-errors}, we can see that this strong model still makes some simple mistakes which humans hardly make. It is fair to conjecture that these models have been doing very sophisticated matching of text while they still have difficulty understanding the inherent structure between entities and the events expressed in the text.


Lastly, \newcite{jia2017adversarial} find that if we add a distracting sentence to the end of the passage (see an example in Figure~\ref{fig:adversarial-example}), the average performance of current reading comprehension systems will drop drastically from 75.4\% to 36.4\%. These distracting sentences have word overlap with the question while not actually contradict the correct answer and not mislead human understanding. The performance is even worse if the distracting sentence is allowed to add ungrammatical sequences of words. These results suggest that 1) The current models strongly depend on the lexical cues between the passage and the question. That's why the distracting sentences can be so destructive; 2) Even though the models achieved a high accuracy on the original development set, they are really not robust to the adversarial examples. This is a critical problem of the standard supervised learning paradigm and it makes existing models difficult to deploy in the real world. We will discuss the property of robustness more in Section~\ref{sec:future-models}.

To sum up, we believe that, although a very high accuracy was already obtained on the \sys{SQuAD} dataset, the current models only focus on the surface-level information of the text, and still make simple errors when it comes to a (slightly) deeper level of understanding. On the other hand, the high accuracies also indicate that most of the \sys{SQuAD} examples are rather easy and require little understanding. There are difficult examples which require complex reasoning in \sys{SQuAD} (for example, (c) in Figure~\ref{fig:sar-squad-errors}), but due to their scarcity, their accuracies aren't really reflected in the averaged metric. Furthermore, the high accuracies only hold when training and development come from the same distribution, and it still remains a severe problem when they differ. In the next two sections, we discuss the possibilities of creating more challenging datasets and building more effective models.


================================================
FILE: chapters/rc_future/questions.tex
================================================
%!TEX root = ../../thesis.tex

\section{Research Questions}
\label{sec:research-questions}

In the last section, we discuss a few central research questions in this field, which still remain as open questions and yet to be answered in the future.

\subsection{How to Measure Progress?}
The first question is: \ti{How can we measure the progress of this field?} The evaluation metrics are certainly clear indicators of measuring progress on our reading comprehension benchmarks. Does this indicate that we make real progress on reading comprehension in general? How can we tell if some progress on one benchmark can generalize to others? How about if model $A$ works better than model $B$ on one dataset, while model $B$ works better on the other dataset? How to tell how far these computer systems are sill from genuine human-level reading comprehension?

On the one hand, we think that taking human's standardized tests could be a good strategy for evaluating the performance of machine reading comprehension systems. These questions are usually carefully curated and designed to test human's reading comprehension abilities at different levels. To get computer systems aligned with human measurements is a proper way in building natural language understanding systems.
% {\red{TODO: Not always correct --- some questions are easy for humans to answer but difficult for machines}}.

On the other hand, we believe that it would be desirable to integrate many reading comprehension datasets as a testing suite for evaluation in the future, instead of only testing on one single dataset. This will help us better distinguish what are genuine progress for reading comprehension and what might be just overfitting to one specific dataset.

More importantly, we need to understand our existing datasets better: characterizing their quality and what skills are required to answer the questions. This will be a crucial step for building more challenging datasets and analyzing the behavior of our models. Besides our work on analyzing the \sys{CNN/Daily Mail} examples in \newcite{chen2016thorough}, \newcite{sugawara2017evaluation} attempted to separate reading comprehension skills into two disjoint sets: \ti{prerequisite skills} and \ti{readability}.  Prerequisite skills measure different types of reasoning and knowledge required to answer the question and 13 skills are defined: object tracking, mathematical reasoning, coreference resolution, logical reasoning, analogy, causal relation, spatiotemporal relation, ellipsis, bridging, elaboration, meta-knowledge, schematic clause relation and punctuation. Readability measures the “text ease of processing”, and a wide range of linguistic features/human readability measurements are used. The authors concluded that these two sets are weakly correlated and it is possible to design difficult questions from the contexts that are easy to read. These studies suggest that we could construct datasets and develop models based on these properties separately.

In addition, \newcite{sugawara2018what} designed a few simple filtering heuristics and divided the examples from many existing datasets into a hard subset and an easy subset, based on 1) whether the question can be answered using only the first few words; 2) whether the answer is contained in the most similar sentence in the passage. They observed that the baseline performances for the hard subsets remarkably degrade compared to those of the entire datasets. Moreover, \newcite{kaushik2018how} analyzed the performance of existing models using passage-only or question-only information, and found that these models sometimes can work surprisingly well and hence there exists annotation artifacts in some of the existing datasets.

In conclusion, we believe that if we want to make steady progress on reading comprehension in the future, we will have to answer these basic questions about the difficulty of examples first. Understanding what is required for the datasets, what our current systems can do and can't do will help us identify the challenges we are facing and measure the progress.

\subsection{Representations vs. Architecture: Which is More Important?}
\label{sec:rep-vs-arch}

\begin{figure}[!t]
    \center
    \includegraphics[scale=0.45]{img/rep_vs_arch.pdf}
    \longcaption{A comparison of a complex architecture vs. a simple architecture with pre-training}{\label{fig:rep-vs-arch}A comparison of a complex architecture (left) vs. a simple architecture with pre-training (right). The parameters in the dashed box can be pre-trained from unlabeled text, while all the remaining parameters are initialized randomly and learned from the reading comprehension datasets.}
\end{figure}

The second important question is to understand the role of representations vs. architectures to the performance of reading comprehension models. Since \sys{SQuAD} was created, there has been a recent trend of increasing the complexity of neural architectures. In particular, more and more complex attention mechanisms have been proposed to capture the similarity between the passage and the question (Section~\ref{sec:attention-mechanisms}). However, recent works~\cite{radford2018improving,devlin2018bert} proposed that if we can pretrain a deep language model on large text corpora, a simple model which takes the concatenation of the question and the passage without modeling any direct interactions between the two can work extremely well on reading comprehension datasets such as \sys{SQuAD} and \sys{RACE} (see Table~\ref{tab:squad-results} and Table~\ref{tab:recent-datasets}).

As illustrated in Figure~\ref{fig:rep-vs-arch}, the first class of models (left) only builds on top of word embeddings (each word type has a vector representation) pre-trained from unlabeled text, while all the remaining parameters (including all the weights to compute various attention functions) need to be learned from the limited training data. The second class of models (right) makes the model architecture very simple and it only models the question and passage as a single sequence. The whole model is pre-trained and all the parameters are kept. Only a few new parameters are added (e.g., the parameters for predicting the start and end positions for \sys{SQuAD}) and the other parameters will be fine-tuned on the training set of the reading comprehension tasks.

We think these two classes of models indicate two extremes. On the one hand, it certainly demonstrates the incredible power of unsupervised representations. As we have a powerful language model pre-trained from huge amount of text, the model already encodes a great deal of properties about language while a simple model which concatenates the passage and the question is sufficient to learn the dependency between the two. On the other hand, when only word embeddings are given, it seems that modeling the interactions between the passage and the question carefully (or giving the model more prior knowledge)  helps. In the future, we suspect that we will need to combine the two and a model like \sys{BERT} is too coarse to handle the examples which require complex reasoning.


\subsection{How Many Training Examples Are Needed?}
The third question is \ti{how many training examples are actually needed?} We have discussed many times that the success of neural reading comprehension is driven by large-scale supervised datasets. All the datasets that we have been actively working on contain at least 50,000 examples. Can we always embrace data abundance and further improve the performance of our systems? Is it possible to train a neural reading comprehension model with only hundreds of annotated examples today?

We think there isn't a clear answer yet. On the one hand, there is clear evidence that having more data helps. \newcite{bajgar2016embracing} demonstrated that inflating the cloze-style training data constructed from books available through project Gutenberg can provide a boost of 7.4\%--14.8\% on the \sys{Children Book Test (CBT)} dataset~\cite{hill2016goldilocks} using the same model. We discussed before that using data augmentation techniques~\cite{yu2018qanet} or augmentating the training data with \sys{TriviaQA} can help improve the performance on \sys{SQuAD} (\# training examples = 87,599).

On the other hand, pre-trained (language) models~\cite{radford2018improving,devlin2018bert} can help us reduce the dependence on large-scale datasets. In these models, most of the parameters are already pretrained on abundant unlabeled data and will be only fine-tuned during training.

In the future, we should encourage more research on unsupervised learning and transfer learning. Leveraging unlabeled data (e.g., text) or other cheap resources or supervision (e.g., datasets like \sys{CNN/Daily Mail}) will relieve us from collecting expensive annotated data. We also should seek better and cheaper ways of collecting supervised datasets.
%
% \red{Chris: I think the main substantive thing missing here is a discussion of more difficult types of questions that probe deeper levels of Reading Comprehension. That is a middle school reading comprehension exercise normally is not so much about answering factoid style questions that but showing that you understood the reasoning and implications of the text and what the author is trying to convey. Often this is done with how/why questions: In the story, why is Cynthia upset with her mother? How does John attempt to make up for his original mistake? How does the author indicate that Benjamin is scared to be left alone? But there are other aspects of deeper comprehension too. We can argue about how successful they have been, but I think very clearly the goal of the AI2 Aristo work has been to try to have comprehension tests where you actually have to understand the underlying science of what is being discussed, rather than just answering from text matching. It would be good to have a paragraph or two on issues like this --- assessing deeper reading comprehension than question text matching.}


================================================
FILE: chapters/rc_models/advances.tex
================================================
%!TEX root = ../../thesis.tex

\section{Further Advances}
\label{sec:advances}

In this section, we summarize recent advances in neural reading comprehension. We divide them into the following four categories: {word representations}, {attention mechanisms}, {alternatives to LSTMs}, and {others} (such as training objectives, data augmentation). We give a summary and discuss their importance in the end.


\subsection{Word Representations}
The first category is better word representations for question and passage words, so the neural models are built off of better grounds. Learning better distributed word representations from text or finding the best set of word embeddings for specific tasks still remains an active research topic --- for example, \newcite{mikolov2017advances} find that replacing \sys{GloVe} pre-trained vectors with the new \sys{fastText} vectors~\cite{bojanowski2017enriching} in our model brings about 1 point of improvement on \sys{SQuAD}. More than that, there are two key ideas which have been proved (highly) useful:

\subsubsection*{Character embeddings}
The first idea is to use character-level embeddings to represent words, which are especially helpful for rare or out-of-vocabulary words. Most of the existing works employ a \sys{convolutional neural network} (CNN), which can usefully exploit the surface patterns of $n$-gram characters. More concretely, let  $\mathcal{C}$ be the vocabulary of characters and each word type $x$ can be represented as a sequence of characters $(c_1, \ldots, c_{|x|}), \forall c_i \in \mathcal{C}$. We first map each character in $\mathcal{C}$ into a $d_c$-dimensional vector, so word $x$ can be represented as $\mf{c}_1, \ldots, \mf{c}_{|x|}$.

Next we apply a convolution layer with a filter $\mf{w} \in \R^{d_c \times w}$ of width $w$, and we denote $\mf{c}_{i:i+j}$ as the concatenation of $\mf{c}_i, \mf{c}_{i + 1}, \ldots, \mf{c}_{i + j}$. Therefore, for $i = 1, \ldots, |x| - w + 1$, we can apply this filter $\mf{w}$ and after which we add a bias $b$ and apply a nonlinearity $\tanh$ as follows:
\begin{equation}
    f_i = \tanh\left(\mf{w}^{\intercal} \mf{c}_{i:i+w-1} + b \right).
\end{equation}
Finally we can apply a \ti{max-pooling} operation on $f_1, \ldots, f_{|x| - w + 1}$ and obtain one scalar feature:
\begin{equation}
    f = \max_{i}{\{f_i\}}
\end{equation}
This feature essentially picks out a character $n$-gram, where the size of the $n$-gram corresponds to the filter width $w$. We can repeat the above process by repeating $d^*$ different filters $\mf{w}_1, \ldots, \mf{w}_{d^*}$. As a result, we can obtain a character-based word representation for each word type $\mf{E}_c(x) \in \R^{d^*}$. All the character embeddings, filter weights $\{\mf{w}\}$ and biases $\{b\}$ are learned during training. More details can be found in \newcite{kim2014convolutional}.  In practice, the dimension of character embeddings $d_c$ usually takes a small value (e.g., 20), width $w$ usually takes $3 - 5$, while $100$ is a typical value for $d^*$.

\subsubsection*{Contextualized word embeddings}
Another important idea is \ti{contextualized word embeddings}. Different from traditional word embeddings in which each word type is mapped to one single vector, contextualized word embeddings assign each word a vector as a function of the entire input sentence. These word embeddings can model better complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., polysemy).

A concrete implementation is \sys{ELMo} detailed in \newcite{peters2018deep}: their contextualized word embeddings are learned functions of the internal states of a deep bidirectional language model, which is pretrained on a large text corpus. Basically, given a sequence of words $(x_1, x_2, \ldots, x_n)$, they run an $L$-layer forward LSTM and models the sequence probability as:
\begin{equation}
    P(x_1, x_2, \ldots, x_n) =  \prod_{k = 1}^{n}P(x_k \mid x_1, \ldots, x_{k - 1})
\end{equation}
Only the top layer of the LSTM $\overrightarrow{\mf{h}}^{(L)}_k$ is used to predict the next token $x_{k + 1}$. Similarly, another $L$-layer LSTM is run backward and $\overleftarrow{\mf{h}}^{(L)}_k$ is used to predict the token $x_{k - 1}$. The overall training objective is to maximize the log-likelihood from both directions:
\begin{equation}
  \small
    \sum_{k=1}^{n}\left({\log P (x_k \mid x_1, \ldots, x_{k-1}; {\Theta}_x, \overrightarrow{{\Theta}}_{\text{LSTM}}, {\Theta}_s ) + \log P (x_k \mid x_{k+1}, \ldots, x_{n}; {\Theta}_x, \overleftarrow{{\Theta}}_{\text{LSTM}}, {\Theta}_s )}\right),
\end{equation}
where $\Theta_x$ and $\Theta_s$ are word embeddings and softmax parameters and shared for both LSTMs. The final contextualized word embeddings are computed as a linear combination of all the biLSTM layers and the input word embeddings, multiplied by a linear scalar:
\begin{equation}
    \sys{ELMo}(x_k) = \gamma \left(s_0 \mf{x}_k + \sum_{j=1}^{L}{\overrightarrow{s}_{j} \overrightarrow{\mf{h}}^{(j)}_k} + \sum_{j=1}^{L}{\overleftarrow{s}_{j} \overleftarrow{\mf{h}}^{(j)}_k} \right)
\end{equation}
All the weights $\gamma, s_0, \overrightarrow{s}_{j}, \overleftarrow{s}_{j}$ are task-specific and learned during the training process.

These contextualized word embeddings are usually used in conjunction with traditional word type embeddings and character embeddings. It turns out that this sort of contextualized word embeddings pre-trained on very large text corpora (e.g., 1B Word Benchmark~\cite{chelba2014one}) has been highly effective. \newcite{peters2018deep} demonstrated that adding ELMo embeddings ($L = 2$ biLSTM layers with $4096$ units and $512$ dimension projections) to an existing competitive model can bring the F1 score on \sys{SQuAD} from $81.1$ to $85.8$ directly, a $4.7$ point of absolute improvement.

Earlier than \sys{Elmo}, \newcite{mccann2017learned} proposed \sys{CoVe}, which learned contextualized word embeddings in a neural machine translation framework, and the resulting encoder can be used in a similar way as an addition to the word embeddings. They also demonstrated a $4.3$ point of absolute improvement on \sys{SQuAD}.

Very recently, \newcite{radford2018improving} and \newcite{devlin2018bert} find that these contextualized word embeddings can not only be used as features of word representations in a task-specific neural architecture (a reading comprehension model in our context), but we can fine-tune the deep language models directly with minimal modifications to perform downstream tasks. This is indeed a very striking result at the time of writing this thesis and we will discuss it more in Section~\ref{sec:rep-vs-arch} and there still remain many open questions to answer in the future. Additionally, \newcite{devlin2018bert} proposed a clever way to train bidirectional language models: instead of always stacking LSTMs in one direction and predicting the next word,\footnote{To make it clear, although ELMo adopts a biLSTM, it is essentially the use of two unidirectional LSTMs for predicting the next word in both directions.} they mask out some words at random at the input layer, stack bidirectional layers and predict these masked words at the top layer. They find this training strategy extremely useful empirically.

\subsection{Attention Mechanisms}
\label{sec:attention-mechanisms}

There have been a multitude of attention variants proposed for neural reading comprehension models, and they aim to capture semantic similarity between the question and the passage, at different levels, multiple granularity, or in a hierarchical way. A typical complex example in this direction can be found at \cite{huang2018fusionnet}. To our best knowledge, there isn't a conclusion yet if there is one single variant that stands out. Our \sys{Stanford Attentive Reader} (Section~\ref{sec:sar}) takes the most simple possible form of attention (Figure~\ref{fig:att-overview} illustrates an overview of different layers of attention). Besides that, we think there are two ideas which can generally further improve the performance of these systems:

\begin{figure}[t]
\centering
\vspace{1em}
\includegraphics[scale=0.25]{img/gen_fusion.pdf}
\vspace{1em}

\begin{tabular}{l|ccccc}
\hline
\bf Architectures & \bf (1) & \bf (2) & \bf (2') & \bf (3) & \bf (3') \\ \hline
Match-LSTM \citep{wang2017machine} & & \checkmark & & & \\
DCN \citep{xiong2017dynamic} & & \checkmark & & & \checkmark \\
BiDAF \citep{seo2017bidirectional} & & \checkmark & & & \checkmark \\
RaSoR \citep{lee2016learning} & \checkmark & & \checkmark & & \\
R-net \citep{wang2017gated} & & \checkmark & & \checkmark & \\
\hline
Our model & \checkmark & & & &  \\
\hline
\end{tabular}
\longcaption{A summary of different layers of attention.}{\label{fig:att-overview} A summary of different layers of attention. Image courtesy: \cite{huang2018fusionnet} with minimal modifications.}
\end{figure}

\subsubsection*{Bidirectional attention}

\newcite{seo2017bidirectional} first introduced the idea of \ti{bidirectional attention}. In addition to what we already have, the key difference is that they have the \ti{question-to-passage} attention, which signifies which passage words have the closest similarity to each of the question words. In practice, this can be implemented as: for each word in the question, we can compute an attention map over all the passage words, similar as we did in Equation~\ref{eq:aligned_question} and \ref{eq:aligned_question_attention}, but in opposite directions:

\begin{equation}
    f_{q\_align}(q_i) = \sum_j{b_{i, j} \mf{E}(p_j)}.
\end{equation}
After this, we can simply feed $f_{q\_align}(q_i)$ into the input layer of the question encoding (Section~\ref{sec:question-encoding}).

The attention mechanism in \newcite{seo2017bidirectional} is a bit more complex, but we think it is similar. We also argue that the attention function in this direction is less useful, as also demonstrated in \newcite{seo2017bidirectional}. This is because the questions are generally short (10-20 words on average) and using one single LSTM for question encoding (without extra attention) is usually sufficient.

\subsubsection*{Self-attention over passage}
The second idea is \ti{self-attention} over the passage words, first introduced in \newcite{wang2017gated}.\footnote{They named it as ``self-matching attention mechanism'' in the paper.} The intuition is that the passage words can be aligned to the other passage words, with the hope that it can address coreference problems and aggregate information (of the same entity) from multiple places in the passage.

In detail, \newcite{wang2017gated} first compute the hidden vectors for the passage: $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p}$ (Equation~\ref{eq:passage-lstm}), and then for each $\mf{p}_i$, they apply an attention function over $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p}$ via one hidden layer of MLP (Equation~\ref{eq:mlp-att}):
\begin{eqnarray}
    a_{i, j} & =&  \frac{\exp\left(g_{\text{MLP}}(\mf{p}_i, \mf{p}_j)\right)}{\sum_{j'}\exp\left(g_{\text{MLP}}(\mf{p}_i, \mf{p}_{j'})\right)} \\
    \mf{c}_i & = & \sum_{j}{a_{i, j}\mf{p}_j}
\end{eqnarray}
Later, $\mf{c}_i$ and $\mf{p}_i$ are concatenated and fed into another BiLSTM: $\mf{h}^{(p)}_i = \text{BiLSTM}(\mf{h}^{(p)}_{i-1}, [\mf{p}_i, \mf{c}_i])$, and can be used as the final passage representations.

\subsection{Alternatives to LSTMs}
\label{sec:alt-lstms}
All the models we discussed so far are based on recurrent neural networks (RNNs), in particular, LSTMs. It is well known that increasing the depth of neural networks can improve the capacity of models and bring gains in performance~\cite{he2016deep}. We also discussed earlier that deep BiLSTMs of $3$ or $4$ layers usually perform better than a single layer of BiLSTM (Section~\ref{sec:imp-details}). However, we are facing two challenges as we further increase the depth of the LSTM models: 1) It gets more difficult to optimize due to the vanishing gradient problem; 2) Scalability becomes an issue as the training/inference time increases linearly as the number of layer grows. It is known that LSTMs are difficult to parallelize and thus scale poorly due to their sequential nature.

On the one hand, there are works which attempt to add highway connections~\cite{srivastava2015training} or residual connections~\cite{he2016deep} between layers, so it eases the optimization and enables training more layers of LSTMs. On the other hand, people set out to find replacements for LSTMs, getting rid of recurrent structures while still performing similarly or even better.

The most notable work in this line is the \sys{Transformer} model proposed by Google researchers~\cite{vaswani2017attention}. \sys{Transformer} only builds on top of word embeddings and simple positional encodings with stacked self-attention layers and position-wise fully connected layers. With residual connections, this model is able to be trained fast with many layers. It first demonstrated superior performance on a machine translation task with $L = 6$ layers (each layer consists of a self-attention and a fully connected feedforward network), and then later was adapted by~\cite{yu2018qanet} for reading comprehension.

The model called \sys{QANet} \cite{yu2018qanet} stacks multiple convolutional layers followed by the self-attention and fully connected layer, as a building block, for both question and passage encoding as well as a few more layers stacked before the final prediction. The model demonstrated state-of-the-art performance at the time (Table~\ref{tab:squad-results}) while showing significant speed-ups.

Another research work by \newcite{lei2018simple} proposed a lightweight recurrent unit called \sys{Simple Recurrent Unit} (SRU) by simplifying the LSTM formulation while enabling CUDA-level optimizations for high parallelization. Their results suggest that simplified recurrence retains strong modeling capacity through layer stacking. They also demonstrate that replacing the LSTMs in our model with their \sys{SRU} unit can improve the F1 score by 2 points while being faster for training and inference.

\subsection{Others}

\paragraph{Training objectives.} It is also possible to make further progress by improving the training objectives. It is usually straightforward to employ a cross-entropy or max-margin loss for the cloze style or multiple choice problems. However, for span prediction problems, \newcite{xiong2018dcn+} suggest that there is a discrepancy between the cross-entropy loss of predicting two endpoints of the answer and the final evaluation metrics, which concerns the word overlap between gold answer and ground truth. For the following example:

\begin{displayquote}
\tf{passage}: Some believe that the Golden State Warriors team of 2017 is one of the greatest teams in NBA history \ldots \\
\tf{question}: Which team is considered to be one of the greatest teams in NBA history? \\
\tf{ground truth answer}: the Golden State Warriors team of 2017
\end{displayquote}
Span ``Warriors'' is also a correct answer, however, from the perspective of cross entropy based training it is no better than the span ``history''. \newcite{xiong2018dcn+} propose to use a mixed training objective which combines cross entropy loss over positions and the word overlap measure trained with reinforcement learning. Basically, they use $P^{(\text{start})}(i)$ and $P^{(\text{end})}(i)$ trained with cross-entroy loss to sample the start and end positions of the answer and then use the F1 score as reward function.

For the free-form answer of reading comprehension problems, there has been many recent advances in training better \sys{seq2seq} models especially in the context of neural machine translation, such as sentence-level training~\cite{ranzato2016sequence} and minimum risk training~\cite{shen2016minimum}. However, we don't see many such applications in reading comprehension problems yet.

\paragraph{Data augmentation.} Data augmentation has been a very successful approach for image recognition, while it is less explored in NLP problems. \newcite{yu2018qanet} proposed a simple technique for creating more training data for reading comprehension models. The technique is called \ti{backtranslation} --- basically they leverage two state-of-the-art neural machine translation models: one model from English to French and the other model from French to English, and paraphrase each single sentence in the passage by running through the two models (with some modifications to the answer if needed). They obtained ~2 points gain in F1 by doing this on \sys{SQuAD}. \newcite{devlin2018bert} also find that joint training of \sys{SQuAD} and \sys{TriviaQA}~\cite{joshi2017triviaqa} can help improve the performance on \sys{SQuAD} modestly.

\subsection{Summary}
So far, we have discussed recent advances in different aspects, which, in sum, contribute to the latest progress on current reading comprehension benchmarks (esp. \sys{SQuAD}). Which components are more important than the others? Do we need to add up all of these? Are these recent advances able to generalize to other reading comprehension tasks? How are they correlated with different capacities of language understanding? We think there isn't a clear answer to most of these questions yet and it still requires a lot of investigation.

\begin{table}[!t]
    \centering
    \begin{tabular}{p{6cm} | c l}
    \hline
      \tf{Components} & \tf{F1 improvement} & \tf{References} \\
    \hline
      \sys{Glove}$\Rightarrow$\sys{Fasttext} & 78.9 $\Rightarrow$ 79.8: $+0.9$ & \cite{mikolov2017advances} \\
      Character embeddings & 75.4 $\Rightarrow$ 77.3: $+1.9$ & \cite{seo2017bidirectional} \\
      {\small Contextualized embeddings: \sys{ELMo}} & 81.1 $\Rightarrow$ 85.8: $+4.7$ & \cite{peters2018deep} \\
    \hline
      Question to passage attention & 73.7 $\Rightarrow$ 77.3: $+3.6$ & \cite{seo2017bidirectional} \\
      Self-attention over passage & 76.7 $\Rightarrow$ 79.5: $+2.8$ & \cite{wang2017gated} \\
    \hline
      3-layer LSTMs $\Rightarrow$ 6-layer SRUs & 78.8 $\Rightarrow$ 80.2: $+1.4$ & \cite{lei2018simple} \\
    \hline
      Mixed training objective & 82.1 $\Rightarrow$ 83.1: $+1.0$ & \cite{xiong2018dcn+} \\
      Data augmentation & 82.7 $\Rightarrow$ 83.8: $+1.1$ & \cite{yu2018qanet} \\
    \hline
    \end{tabular}
    \longcaption{A summary of recent advances on \sys{SQuAD}}{\label{tab:impr-squad} A summary of recent advances on \sys{SQuAD}. The numbers are taken from the corresponding papers, on the development set of \sys{SQuAD}.}
\end{table}

We compiled the improvements of different components on \sys{SQuAD} in Table~\ref{tab:impr-squad}. We would like to caution readers that these numbers are really not directly comparable, as they are built on different model architectures and different implementations. We hope that this table at least reflects some ideas regarding the importance of these components on the \sys{SQuAD} dataset. As is seen, all these components contribute to the final performance, more or less. The most important innovation is probably the use of contextualized word embeddings (e.g., \sys{ELMo}), while the formulation of attention functions is also crucial. It will be important to investigate whether these advances can generalize to other reading comprehension tasks in the future.


================================================
FILE: chapters/rc_models/experiments.tex
================================================
%!TEX root = ../../thesis.tex

\section{Experiments}
\label{sec:sar-experiments}

\subsection{Datasets}

We evaluate our model on \sys{CNN/Daily Mail}~\cite{hermann2015teaching} and \sys{SQuAD}~\cite{rajpurkar2016squad}, the two most popular and competitive reading comprehension datasets. We have described them before in Section~\ref{sec:deep-learning-era} regarding their importance in the development of neural reading comprehension and the way the datasets were constructed. Now we give a brief review of these datasets and the statistics.

\begin{itemize}
\item
The \sys{CNN/Daily Mail} datasets were made from articles on the news websites CNN and Daily Mail, utilizing articles and their bullet point summaries. One bullet point is converted to a question with one entity replaced by a placeholder and the answer is this entity. The text has been run through a Google NLP pipeline. It it tokenized, lowercased, and named entity recognition and coreference resolution have been run. For each coreference chain containing at least one named entity, all items in the chain are replaced by an @entity$n$ marker, for a distinct index $n$ (Table~\ref{tab:rc-examples} (a)). On average, both the \sys{CNN} and \sys{Daily Mail} contain 26.2 different entities in the article. The training, development, and testing examples were collected from the news articles at different times. The accuracy (percentage of examples predicting the correct entity) is used for evaluation.

\item
The \sys{SQuAD} dataset was collected based on Wikipedia articles. 536 high-quality Wikipedia articles were sampled and crowdworkers created questions based on each individual paragraph (paragraphs shorter than 500 characters were discarded), requiring that answer must be highlighted from the paragraph (Table~\ref{tab:rc-examples} (c)). The training/development/testing splits were made randomly based on articles (80\% vs. 10\% vs. 10\%). To estimate human performance and also make evaluation more reliable, they collected a few additional answers for each question (each question in the development set has 3.3 answers on average). Exact match and macro-averaged F1 scores are used for evaluation, as we discussed in Section~\ref{sec:evaluation}. Note that \sys{SQuAD} 2.0~\cite{rajpurkar2018know} was proposed more recently, which added 53,775 unanswerable questions to the original dataset and we will discuss it in Section~\ref{sec:future-datasets}. For most of this thesis, \sys{SQuAD} refers to \sys{SQuAD} 1.1 unless stated otherwise.
\end{itemize}


\begin{table}[t]
  \centering
  \begin{tabular}{l | r r | r }
  \hline
    & \multicolumn{2}{c|}{cloze style} & span prediction \\
    & \tf{CNN} & \tf{Daily Mail} & \tf{SQuAD} \\
  \hline
  \#Train & 380,298 & 879,450 & 87,599 \\
  \#Dev & 3,924 & 64,835 & 10,570 \\
  \#Test & 3,198 & 53,182 & 9,533 \\
  \hline
  Passage: avg. tokens & 761.8 & 813.1 & 134.4 \\
  Question: avg. tokens & 12.5 & 14.3 & 11.3 \\
  Answer: avg. tokens & 1.0 & 1.0 & 3.1 \\
  \hline
  \end{tabular}
  \longcaption{Data statistics of \sys{CNN/Daiily Mail} and \sys{SQuAD}}{\label{tab:data-statistics}Data statistics of \sys{CNN/Daily Mail} and \sys{SQuAD}. The average numbers of tokens are computed based on the training set.}
\end{table}


Table~\ref{tab:data-statistics} gives more detailed statistics of the datasets. As it is shown, the \sys{CNN/Daily Mail} datasets are much larger than \sys{SQuAD} (almost one order of magnitude bigger) due to the way the datasets were constructed. The passages used in \sys{CNN/Daily Mail} are also much longer --- 761.8 and 813.1 tokens for \sys{CNN} and \sys{Daily Mail} respectively, while it is 134.4 tokens for SQuAD. Finally, the answers in \sys{SQuAD} consists of only 3.1 tokens on average, which reflects the fact the most of the \sys{SQuAD} questions are factoid and a large portion of the answers are common nouns or named entities.


\subsection{Implementation Details}
\label{sec:imp-details}

Besides different model architecture designs, implementation details also play a crucial role in the final performance of these neural reading comprehension systems. In the following we highlight a few important aspects that we haven't covered yet and finally give the model specifications that we used on the two datasets.

\paragraph{Stacked BiLSTMs.} One simple idea is to increase the depth of bidirectional LSTMs for question and passage encoding. It computes $\mf{h}_t = [\overrightarrow{\mf{h}}_t; \overleftarrow{\mf{h}}_t] \in \R^{2h}$ and then regard $\mf{h}_t$ as the input $\mf{x}_t$ of the next layer and pass them into another BiLSTM, and so on. We generally find that stacking BiLSTMs works better than a one-layer BiLSTM and we used $3$ layers for the SQuAD experiment.\footnote{We only used a shallow one-layer BiLSTM for the CNN/Daily Mail experiments in 2016 though.}

\paragraph{Dropout.} Dropout is an effective and widely used approach to regularization in neural networks. Simply put, dropout refers to masking out some units at random during the training process. For our model, dropout can be added to the word embeddings, input vectors and hidden vectors of every LSTM layer. Finally, the variational dropout approach \cite{gal2016theoretically} has demonstrated to work better than the standard dropout on regularizing RNNs. The idea is to apply the same dropout mask at each time step for both inputs, outputs and recurrent layers, i.e., the same units are dropped at each time step. We suggest readers to use this variant in practice.\footnote{We didn't include variational dropout in our published paper results but later found it useful.}

\paragraph{Handling word embeddings.} One common way (and also our default choice) to handle word embeddings is to keep the most frequent $K$ (e.g., $K = 500,000$) word types in the training set and map all other words to an $\left<unk\right>$ token and then use pre-trained word embeddings to initialize the $K$ words. Typically, when the training set is large enough, we fine tune all the word embeddings; when the training set is relatively small (e.g., \sys{SQuAD}), we usually keep all the word embeddings fixed as static features. In \newcite{chen2017reading}, we find that it helps to fine-tune the most frequent question words because the representations of these key words such as \ti{what}, \ti{how}, \ti{which} could be crucial for reading comprehension systems. Some studies such as \cite{dhingra2017comparative} demonstrated the use of pre-trained embeddings and the ways of handling out-of-vocabulary words have a large impact on the performance of reading comprehension tasks.

\paragraph{Model specifications.}
For all the experiments which require linguistic annotations (lemma, part-of-speech tags, named entity tags, dependency parses), we use the Stanford CoreNLP toolkit~\cite{manning2014stanford} for preprocessing. For training all the neural models, we sort all the examples by the length of its passage, and randomly sample a mini-batch of size 32 for each update.

For the results on \sys{CNN/Daily Mail}, we use the lowercased, 100-dimensional pre-trained \sys{Glove} word embeddings~\cite{pennington2014glove} trained on Wikipedia and Gigaword for initialization. The attention and output parameters are initialized from a uniform distribution between $(-0.01, 0.01)$, and the LSTM weights are initialized from a Gaussian distribution $\mathcal{N}(0, 0.1)$. We use a 1-layer BiLSTM of hidden size $h = 128$ for \sys{CNN} and $h = 256$ for \sys{Daily Mail}. Optimization is carried out using vanilla stochastic gradient descent (SGD), with a fixed learning rate of $0.1$.  We also apply dropout with probability $0.2$ to the embedding layer and gradient clipping when the norm of gradients exceeds $10$.

For the results on \sys{SQuAD}, we use 3-layer BiLSTMs with $h = 128$ hidden units for both paragraph and question encoding. We use \sys{Adamax} for optimization as described in \cite{kingma2014adam}. Dropout with probably $0.3$ is applied to word embeddings and all the hidden units of LSTMs. We used the $300$-dimensional \sys{Glove} word embeddings trained from 840B Web crawl data for initialization and only fine-tune the 1000 most frequent question words.

Other implementation details can be found in the following two Github repositories:
\begin{itemize}
    \item
        \href{https://github.com/danqi/rc-cnn-dailymail}{https://github.com/danqi/rc-cnn-dailymail} for our experiments in \newcite{chen2016thorough}.
    \item
        \href{https://github.com/facebookresearch/DrQA}{https://github.com/facebookresearch/DrQA} for our experiments in \newcite{chen2017reading}.
\end{itemize}

We also would like to caution readers that our experimental results were published in two papers (2016 and 2017) and they differ in various places. A key difference is that our results on \sys{CNN/Daily Mail} didn't include manual features $f_{token}(p_i)$, exact match features $f_{exact\_match}(p_i)$, aligned question embeddings $f_{align}(p_i)$ and $\tilde{\mf{p}}_i$ just takes the word embedding $\mf{E}(p_i)$. Another difference is that we didn't have the attention layer in question encoding before but simply concatenated the last hidden vectors from the LSTMs in both directions. We believe that that these additions are useful on \sys{CNN/Daily Mail} and other cloze style tasks as well, but we didn't further investigate it.


\subsection{Experimental Results}

\subsubsection{Results on \sys{CNN/Daily Mail}}

\begin{table}[t]
\centering
\begin{tabular}{l c c c c}
\toprule
\multirow{2}{*}{\tf{Model}} & \multicolumn{2}{c}{\sys{CNN}} &  \multicolumn{2}{c}{\sys{Daily Mail}} \\
& \tf{Dev} & \tf{Test} & \tf{Dev} & \tf{Test} \\
\midrule
 Frame-semantic model $^\dagger$ &36.3  & 40.2 & 35.5 & 35.5 \\
 Word distance model $^\dagger$ & 50.5 & 50.9 & 56.4 & 55.5 \\
 Deep LSTM Reader $^\dagger$ & 55.0 & 57.0 & 63.3 & 62.2 \\
Attentive Reader $^\dagger$ & 61.6 & 63.0 & 70.5 & 69.0 \\
 Impatient Reader $^\dagger$ & 61.8 & 63.8 & 69.0 & 68.0 \\
\midrule
MemNNs (window memory) $^\ddagger$ & 58.0 & 60.6 & N/A & N/A \\
MemNNs (window memory + self-sup.) $^\ddagger$ & 63.4 & 66.8 & N/A & N/A\\
MemNNs (ensemble) $^\ddagger$ & 66.2\rlap{$^*$} & 69.4\rlap{$^*$} & N/A & N/A \\
\midrule
Our feature-based classifier & 67.1 & 67.9 & 69.1 & 68.3 \\
\midrule
Stanford Attentive Reader & 72.5 & 72.7 & 76.9 & 76.0 \\
Stanford Attentive Reader (ensemble) &  76.2\rlap{$^*$} & 76.5\rlap{$^*$} & 79.5\rlap{$^*$} & 78.7\rlap{$^*$} \\
\bottomrule
\end{tabular}
\longcaption{Evaluation results on CNN/Daily Mail}{\label{tab:cnn-dm-results}Accuracy of all models on the \sys{CNN} and \sys{Daily Mail} datasets. Results marked $^\dagger$ are from \newcite{hermann2015teaching} and results marked $^\ddagger$ are from \newcite{hill2016goldilocks}. The numbers marked with $^*$ indicate that the results are from ensemble models.}
\end{table}


Table~\ref{tab:cnn-dm-results} presents the results that we reported in \newcite{chen2016thorough}. We run our neural models 5 times independently with different random seeds and report average performance across the runs. We also report ensemble results which average the prediction probabilities of the 5 models. We also present the results for the feature-based classifier we described in Section~\ref{sec:feature-models}.

\paragraph{Baselines.} We were among the earliest groups to study this first large-scale reading comprehension dataset. At the time, \newcite{hermann2015teaching} and \newcite{hill2016goldilocks} proposed a few baselines, both symbolic approaches and neural models, for this task. The baselines include:
\begin{itemize}
    \item
        A \sys{frame-semantic} model in \newcite{hermann2015teaching}, which they run a state-of-the-art semantic parser, and extract entity-predicate triples denoted as $(e_1, V, e_2)$ from both the question and the passage, and attempt to match the correct entity using a number of heuristic rules.
    \item
        A \sys{word distance} model in \newcite{hermann2015teaching}, in which they align the placeholder of the question with each possible entity, and compute a distance measure between the question and the passage around the aligned entity.
    \item
        Several LSTM-based neural models proposed in \newcite{hermann2015teaching}, named \sys{Deep LSTM Reader}, \sys{Attentive Reader} and \sys{Impatient Reader}. The \sys{Deep LSTM Reader} just processes the question and the passage as one sequence using a deep LSTM (without attention mechanism), and makes a prediction in the end. The \sys{Attentive Reader} is similar in spirit to ours, as it computes an attention function between the question vector and all the passage vectors; while the \sys{Impatient Reader} computes an attention function for all the question words and recurrently accumulates information as the model reads each question word.
    \item
        \sys{Window-based memory networks} proposed by \newcite{hill2016goldilocks} is based on the memory network architecture \cite{weston2015memory}. We think the model is also similar to ours and the biggest difference is their way of encoding passages: they only use a 5-word context window when evaluating a candidate entity and they use a positional unigram approach to encode the contextual embeddings. If a window consists of $5$ words $x_1, x_2, \ldots, x_5$, then it is encoded as $\sum{\mf{E}_i(x_i)}$, resulting in $5$ separate embedding matrices to learn. They encode the $5$-word window surrounding the placeholder in a similar way and all other words in the question text are ignored. In addition, they simply use a dot product to compute the ``relevance'' between the question and a contextual embedding.
\end{itemize}

As seen in Table~\ref{tab:cnn-dm-results}, our feature-based classifier obtains 67.9\% accuracy on the \sys{CNN} test set and 68.3\% accuracy on the \sys{Daily Mail} test set. It significantly outperforms any of the symbolic approaches reported in \newcite{hermann2015teaching}. We feel that their frame-semantic model is not suitable for these tasks due to the poor coverage of the parser and is not representative of what a straightforward NLP system can achieve. Indeed, the frame-semantic model is even markedly inferior to the word distance model. To our surprise, our feature-based classifier even outperforms all the neural network systems in \newcite{hermann2015teaching} and the best single-system result reported from \newcite{hill2016goldilocks}.   Moreover, our single-model neural network surpasses the previous results by a large margin (over 5\%), pushing up the state-of-the-art accuracies to 72.7\% and 76.0\% respectively. The ensembles of 5 models consistently bring further 2-4\% gains.

\subsubsection{Results on \sys{SQuAD}}
\begin{table}[t]
\begin{center}
\begin{tabular}{p{8.5cm} c c c c}
\hline
 \bf Method &  \multicolumn{2}{c}{\bf Dev} & \multicolumn{2}{c}{\bf Test} \\
&  \tf{EM} & \tf{F1} & \tf{EM} & \tf{F1} \\
\hline
Logistic regression \cite{rajpurkar2016squad} & 40.0 & 51.0 & 40.4 & 51.0 \\
\hline
Match-LSTM~\cite{wang2017machine} &  64.1 & 73.9 & 64.7 & 73.7 \\
RaSoR~\cite{lee2016learning} & 66.4 & 74.9 & 67.4 & 75.5 \\
DCN~\cite{xiong2017dynamic} & 65.4 & 75.6 & 66.2 & 75.9 \\
BiDAF~\cite{seo2017bidirectional}  & 67.7 & 77.3 & 68.0 & 77.3 \\
\hline
\tf{Our model}~\cite{chen2017reading} & 69.5 & 78.8 & 70.0 &  79.0\\
\hline
R-NET~\cite{wang2017gated} & 71.1 & 79.5 & 71.3 & 79.7 \\
BiDAF + self-attention~\cite{peters2018deep} & N/A & N/A & 72.1 & 81.1 \\
FusionNet~\cite{huang2018fusionnet} & N/A & N/A & 76.0 & 83.9 \\
QANet~\cite{yu2018qanet} & 73.6 & 82.7 & N/A & N/A \\
SAN~\cite{liu2018stochastic} & 76.2 & 84.1 & 76.8 & 84.4 \\
{\small BiDAF + self-attention + ELMo}~\cite{peters2018deep} & N/A & N/A & 78.6 & 85.8 \\
BERT~\cite{devlin2018bert} & 84.1 & 90.9 & N/A & N/A \\
\hline
Human performance \cite{rajpurkar2016squad} & 80.3 & 90.5 & 82.3 & 91.2 \\
\hline
\end{tabular}
\end{center}
\longcaption{Evaluation results on SQuAD}{\label{tab:squad-results} Evaluation results on the SQuAD dataset (single model only). The results below ``our model'' were released after we finished the paper in Feb 2017. We only list representative models and report the results from the published papers. For a fair comparison, we didn't include the results which use other training resources (e.g., TriviaQA) or data augmentation techniques, except pre-trained language models, but we will discuss them in Section~\ref{sec:advances}. }
\end{table}

Table~\ref{tab:squad-results} presents our evaluation results on both the development and testing sets. SQuAD has been a very competitive benchmark since it was created and we only list a few representative models and the single-model performance. It is well known that the ensemble models can further improve the performance by a few points. We also included results from the logistic regression baseline (i.e., feature-based classifiers) created by the original authors \cite{rajpurkar2016squad}.


Our system can achieve 70.0\% exact match and 79.0\% F1 scores on the test set, which surpassed all the published results and matched the top performance on the SQuAD leaderboard\footnote{\href{https://stanford-qa.com}{https://stanford-qa.com}.} at the time we wrote the paper~\cite{chen2017reading}. Additionally, we think that our model is conceptually simpler than most of the existing systems. Compared to the logistic regression baseline, which gets $\text{F1} = 51.0$, this model is already close to a 30\% absolute improvement and it is a big win for neural models.

Since then, \sys{SQuAD} has received tremendous attention and great progress has been made on this dataset, as seen in Table~\ref{tab:squad-results}. Recent advances include pre-trained language models for initialization, more fine-grained attention mechanisms, data augmentation techniques and even better training objectives. We will discuss them in Section~\ref{sec:advances}.


\subsubsection{Ablation studies}

\begin{table}[h]
	\begin{center}
	\begin{tabular}{l | l}
    \hline
    \bf Features & \bf F1\\
    \hline
    Full & 78.8 \\
    \hline
    No $f_{token}$ & 78.0 (-0.8)\\
    No $f_{exact\_match}$ & 77.3 (-1.5)\\
    No $f_{aligned}$ & 77.3 (-1.5)\\
    No $f_{aligned}$ and $f_{exact\_match}$ & 59.4 (-19.4) \\
    \hline
    \end{tabular}
    \end{center}
    \longcaption{Feature ablation analysis on SQuAD}{\label{tab:feature-ablation}Feature ablation analysis of the paragraph representations of our model. Results are reported on the SQuAD development set.}
\end{table}

In \newcite{chen2017reading}, we conducted an ablation analysis on the components of the passage representations. As shown in Table~\ref{tab:feature-ablation}, all the components contribute to the performance of our final system. We find that, without the aligned question embeddings (only word embeddings and a few manual features), our system is still able to achieve F1 over ~77\%. The effectiveness of exact match features $f_{exact\_match}$ also indicates that there are a lot of words overlapping between the passage and the question on this dataset. More interestingly, if we remove both $f_{aligned}$ and $f_{exact\_match}$, the performance drops dramatically, so we conclude that both features play a similar but complementary role in the feature representation, like the hard and soft alignments between question and passage words.

% \subsubsection{Attention visualization}
% \red{TODO}


\subsection{Analysis: What Have the Models Learned?}

In \newcite{chen2016thorough}, we attempted to understand better what these models have actually learned, and what depth of language understanding is needed to solve these problems. We approach this by doing a careful hand-analysis of 100 randomly sampled examples from the development set of the \sys{CNN} dataset.

We roughly classify them into the following categories (if an example satisfies more than one category, we classify it into the earliest one):
\begin{description}
   \item[\tf{Exact match}] The nearest words around the placeholder are also found in the passage surrounding an entity marker; the answer is self-evident.
   \item[\tf{Sentence-level paraphrasing}] The question text is entailed\slash rephrased by \ti{exactly} one sentence in the passage, so the answer can definitely be identified from that sentence.
   \item[\tf{Partial clue}] In many cases, even though we cannot find a complete semantic match between the question text and some sentence, we are still able to infer the answer through partial clues, such as some word/concept overlap.
   \item[\tf{Multiple sentences}] Multiple sentences must be processed to infer the correct answer.
   \item[\tf{Coreference errors}] It is unavoidable that there are many coreference errors in the dataset. This category includes those examples with critical coreference errors for the answer entity or key entities appearing in the question. Basically we treat this category as ``not answerable''.
   \item[\tf{Ambiguous or hard}] This category includes examples for which we think humans are not able to obtain the correct answer (confidently).
\end{description}

Table~\ref{tab:cnn-ex-breakdown} provides our estimate of the percentage for each category, and Figure~\ref{fig:cnn-examples} presents one representative example from each category. We observe that \ti{paraphrasing} accounts for 41\% of the examples and 19\% of the examples are in the \ti{partial clue} category. Adding the most simple \ti{exact match} category, we hypothesize a large portion (73\% in this subset) of the examples are able to be answered by identifying the most relevant (single) sentence and inferring the answer based upon it. Additionally, only 2 examples require multiple sentences for inference. This is a lower rate than we expected and this suggests that the dataset requires less reasoning than previously thought. To our surprise, “coreference errors” and “ambiguous/hard” cases account for 25\% of this sample set, based on our manual analysis, and this certainly will be a barrier for training models with an accuracy much above 75\% (although, of course, a model can sometimes make a lucky guess). In fact, our ensemble neural network model is already able to achieve 76.5\% on the development set, and we think that the prospect for further improving on this dataset is small.

\begin{figure}[p]
\centering
\begin{tabular}{l p{4.5cm} p{6.5cm}}
\toprule
Category & Question & Passage \\
\midrule
Exact Match & \ti{it 's clear @entity0 is leaning toward} {\tf{@placeholder}} ,  says an expert who monitors @entity0 & \ldots @entity116 , who follows @entity0 's operations and propaganda closely , recently told @entity3 , \ti{it 's clear @entity0 is leaning toward} \tf{@entity60}  in terms of doctrine , ideology and an emphasis on holding territory after operations . \ldots  \\
\midrule
Paraphrasing & {\tf{@placeholder} says he understands why @entity0 wo n't play at his tournament} &  \ldots @entity0 called me personally to let me know that he would n't be playing here at @entity23 , " \tf{@entity3} said on his @entity21 event 's website . \ldots \\
\midrule
Partial clue & a tv movie based on @entity2 's book \tf{@placeholder} casts a @entity76 actor as @entity5 & \ldots  to @entity12  @entity2 professed that his \tf{@entity11} is not a religious book . \ldots \\
\midrule
Multiple sent. &  he 's doing a his - and - her duet all by himself ,  @entity6 said of \tf{@placeholder} &  \ldots we got some groundbreaking performances , here too , tonight ,  @entity6 said . we got \tf{@entity17} , who will be doing some musical performances . he 's doing a his - and - her duet all by himself .  \ldots \\
\midrule
Coref. Error & rapper \tf{@placeholder} " disgusted , " cancels upcoming show for @entity280 & \ldots with hip - hop star \tf{@entity246} saying on @entity247 that he was canceling an upcoming show for the @entity249 . \ldots  (but @entity249 = @entity280 = SAEs)\\
\midrule
Hard & pilot error and snow were reasons stated for \tf{@placeholder} plane crash  & \ldots a small aircraft carrying \tf{@entity5} , @entity6 and @entity7 the @entity12  @entity3 crashed a few miles from @entity9 , near @entity10 , @entity11 . \ldots \\
\bottomrule
\end{tabular}
\longcaption{Some representative examples from each category}{\label{fig:cnn-examples}Some representative examples from each category on the \sys{CNN} dataset.}
\end{figure}

\begin{table}[!t]
  \centering
    \begin{tabular}{l  l  r}
      \toprule
    \tf{\#} & \tf{Category} &    \\
    \midrule
    1 & Exact match & 13\%   \\
    2 & Paraphrasing & 41\% \\
    3 & Partial clue & 19\%  \\
    4 & Multiple sentences & 2\%  \\
    \midrule
    5 & Coreference errors & 8\% \\
    6 & Ambiguous / hard &  17\% \\
    \bottomrule
    \end{tabular}
    \longcaption{An estimate of the breakdown of \sys{CNN} examples}{\label{tab:cnn-ex-breakdown}An estimate of the breakdown of the dataset into classes, based on the analysis of our sampled 100 examples from the \sys{CNN} dataset.}
\end{table}

\begin{figure}[!t]
    \center
    \includegraphics[scale=0.6]{img/cnn_analysis.png}
    \longcaption{The per-category performance of our two systems}{\label{fig:category-performance} The per-category performance of our two systems: the \sys{Stanford Attentive Reader} and the feature-based classifier, on the sampled 100 examples of the \sys{CNN} dataset.}
\end{figure}

% \begin{table}[h]
%    \centering
%     \begin{tabular}{@{} l  r @{\hspace*{0.25em}} r r @{\hspace*{0.25em}} r @{}}
%       \toprule
%      {Category} &  \multicolumn{2}{c}{{Classifier}} & \multicolumn{2}{c}{{Neural net}} \\
%     \midrule
%      Exact match & 13 & (100.0\%) & 13 & (100.0\%) \\
%      Paraphrasing &  32 & (78.1\%) & 39 & (95.1\%) \\
%      Partial clue & 14 & (73.7\%) &  17 & (89.5\%) \\
%      Multiple sentences &  1 & (50.0\%) & 1 & (50.0\%) \\
%     \midrule
%      Coreference errors &  4 & (50.0\%) & 3 & (37.5\%)\\
%      Ambiguous / hard &  2 & (11.8\%) & 1 & (5.9\%)  \\
%      \midrule
%      All & 66 & (66.0\%) & 74 & (74.0\%) \\
%     \bottomrule
%     \end{tabular}
%     \longcaption{The per-category performance of our two systems}{\label{tab:category-performance} The per-category performance of our two systems: the \sys{Stanford Attentive Reader} and the feature-based classifier, on the sampled 100 examples of the \sys{CNN} dataset.}
% \end{table}


Let's further take a closer look at the per-category performance of our neural network and feature-based classifier, based on the above categorization. As shown in Figure~\ref{fig:category-performance}, we have the following observations: (i)~The exact-match cases are quite simple and both systems get 100\% correct. (ii)~For the ambiguous\slash hard and entity-linking-error cases, meeting our expectations, both of the systems perform poorly. (iii)~The two systems mainly differ in the paraphrasing cases, and some of the ``partial clue'' cases. This clearly shows how neural networks are better capable of learning semantic matches involving paraphrasing or lexical variation between the two sentences. (iv)~We believe that the neural network model already achieves near-optimal performance on all the single-sentence and unambiguous cases.

To sum up, we find that neural networks are certainly more powerful at recognizing lexical matches and paraphrases compared to conventional feature-based models; while it is still unclear whether they also win out on the examples which require more complex textual reasoning as the current datasets are still quite limited in that respect.


================================================
FILE: chapters/rc_models/feature_classifier.tex
================================================
%!TEX root = ../../thesis.tex

\section{Previous Approaches: Feature-based Models}
\label{sec:feature-models}

% \red{TODO: What is the space of possible entities? How do you keep it from being too large?}

We first describe a strong feature-based model that we built in \newcite{chen2016thorough} for cloze style problems, in particular, the \sys{CNN/Daily Mail} dataset~\cite{hermann2015teaching}. We will then discuss similar models built for multiple choice and span prediction problems.

For the cloze style problems, the task is formulated as predicting the correct entity $a \in \mathcal{E}$ that can fill in the blank of the question $q$ based on reading the passage $p$ (one example can be found in Table~\ref{tab:rc-examples}), where $\mathcal{E}$ denotes the candidate set of entities. Conventional linear, feature-based classifiers usually need to construct a feature vector $f_{{p}, {q}}(e) \in \R^d$ for each candidate entity $e \in \mathcal{E}$, and to learn a weight vector $\mf{w} \in \R^d$ such that the correct answer $a$ is expected to rank higher than all other candidate entities:
\begin{equation}
\mf{w}^{\intercal}f_{p, q}(a) > \mf{w}^{\intercal}f_{{p}, {q}}(e), \forall e \in \mathcal{E} \setminus \{{a}\},
\end{equation}

After all the feature vectors are constructed for each entity $e$, we can then apply any popular machine learning algorithms (e.g., logistic regression or SVM). In \newcite{chen2016thorough}, we chose to use \sys{LambdaMART}~\cite{wu2010adapting}, as it is naturally a ranking problem and forests of boosted decision trees have been very successful lately.

\begin{table}[t]
\centering
\begin{tabular}{l p{14cm}}
\toprule
\tf{\#} & \tf{Feature} \\
\midrule
1 & Whether entity $e$ occurs in the passage. \\
2 & Whether entity $e$ occurs in the question. \\
3 & The \tf{frequency} of entity $e$ in the passage. \\
4 & The \tf{first position} of occurrence of entity $e$ in the passage. \\
5 & \tf{Word distance}: we align the placeholder with each occurrence of entity $e$, and compute the average minimum distance of each non-stop question word from the entity in the passage. \\
6 & \tf{Sentence co-occurrence}: whether entity $e$ co-occurs with another entity or verb that appears in the question, in some sentence of the passage. \\
7 & \tf{$n$-gram exact match}: whether there is an exact match between the text surrounding the placeholder and the text surrounding entity $e$. We have features for all combinations of matching left and/or right one or two words. \\
8 & \tf{Dependency parse match}: we dependency parse both the question and all the sentences in the passage, and extract an indicator feature of whether $w \xrightarrow{r} \text{@placeholder}$ and $w \xrightarrow{r} e$ are both found; similar features are constructed for $\text{@placeholder} \xrightarrow{r} w$ and $e \xrightarrow{r} w$. \\
\bottomrule
\end{tabular}
\longcaption{Features used in our entity-centric classifier}{\label{tab:classifier-features}Features used in our entity-centric classifier in \newcite{chen2016thorough}.}
\end{table}

The key question left is how can we build useful feature vectors from the passage $p$, the question $q$ and every entity $e$? Table~\ref{tab:classifier-features} lists 8 sets of features that we proposed for the \sys{CNN/Daily Mail} task. As shown in the table, these features are well designed and characterize the information of the entity (e.g., frequency, position and whether it is a question/passage word) and how they are aligned with the passage/question (e.g., co-occurrence, distance, linear and syntactic matching). Some features (\#6 and \#8) also rely on linguistic tools such as dependency parsing and part-of-speech tagging (deciding whether a word is a verb or not).  Generally speaking, for non-neural models, how to construct a useful set of features always remains as a challenge. Useful features need to be informational and well-tailored to specific tasks, while not too sparse to generalize well from the training set. We have argued before in Sec~\ref{sec:ml-approaches} that this is a common problem in most of the feature-based models. Also, using the off-the-shelf linguistic tools makes the model more expensive and their final performance depends on the the accuracy level of these annotations.

\newcite{rajpurkar2016squad} and \newcite{joshi2017triviaqa} also attempted to build feature-based models for the \sys{SQuAD} and \sys{TriviaQA} datasets respectively. The models are similar in spirit to ours, except that for these span prediction tasks, they need to first decide on a set of possible answers.
For \sys{SQuAD}, \newcite{rajpurkar2016squad} consider all constituents in parses generated by Stanford CoreNLP~\cite{manning2014stanford} as candidate answers; while for \sys{TriviaQA}, \newcite{joshi2017triviaqa} consider all $n$-gram ($1 \leq n \leq 5$) that occurs in the sentences which contain at least one word in common with the question. They also tried to add more lexicalized features and labels from constituency parses. Other attempts have been made for multiple choice problems such as \cite{wang2015machine} for the \sys{MCTest} dataset and a rich set of features have been used including semantic frames, word embeddings and coreference resolution.

We will demonstrate the empirical results of these feature-based classifiers and compare them to the neural models in Section~\ref{sec:sar-experiments}.


================================================
FILE: chapters/rc_models/intro.tex
================================================
%!TEX root = ../../thesis.tex

% \section{Introduction}

In this chapter, we will cover the essence of neural network models: from the basic building blocks, to more recent advances.

Before delving into the details of neural models, we give a brief introduction to non-neural, feature-based models for reading comprehension in Section~\ref{sec:feature-models}. In particular, we describe a model that we built in \newcite{chen2016thorough}. We hope this will give readers a better sense about how these two approaches differ fundamentally.

In Section~\ref{sec:sar}, we present a neural approach to reading comprehension called \sys{The Stanford Attentive Reader}, which we first proposed in \newcite{chen2016thorough} for the cloze style reading comprehension tasks, and then later adapted to the span prediction problems \cite{chen2017reading} for \sys{SQuAD}. We first briefly review the basic building blocks of modern neural NLP models, and then describe how our model is built on top of them. We discuss its extensions to the other types of reading comprehension problems in the end.

Next we present the empirical results of our model on the \sys{CNN/Daily Mail} and the \sys{SQuAD} datasets, and provide more implementation details in Section~\ref{sec:sar-experiments}. We further conduct careful error analyses to help us better understand: 1) which components are most important for final performance; 2) where the neural models excel compared to non-neural feature-based models empirically.

Finally, we summarize recent advances in neural reading comprehension in Section~\ref{sec:advances}.

% This chapter is going to cover the following topics:
% \begin{itemize}
%     \item
%        Talk about non-neural approaches and use my baseline in the ACL'16 paper as an example
%     \item
%           Introduce SAR (and its variants on different RC tasks) -- I am hoping to give more intuitions (\red{how?})
%     \item
%        Probably need to give some background of neural NLP: word embeddings and recurrent neural networks etc
%     \item
%        Talk about experiments on CNN/Daily Mail and SQuAD: the architectures are slightly different but it should be fine...
%     \item
%        Analysis: 1) ablation studies of SQuAD from the ACL17 paper 2) comparison between the neural approach and non-neural approach on the CNN dataset
%     \item
%        Further advances: 1) word representations 2) alternatives of RNNs 3) attention mechanisms 4) better training objectives
% \end{itemize}


================================================
FILE: chapters/rc_models/sar.tex
================================================
%!TEX root = ../../thesis.tex

\section{A Neural Approach: The Stanford Attentive Reader}
\label{sec:sar}

\subsection{Preliminaries}
In the following, we outline a minimal set of elements and the key ideas which form the basis of modern neural NLP models. For more details, we refer readers to \cite{cho2015natural,goldberg2017neural}.

\subsubsection*{Word embeddings}
The first key idea is to represent words as low-dimensional (e.g., 300), real-valued vectors. Before the deep learning age, it was common to represent a word as an index into the vocabulary, which is a notational variant of using one-hot word vectors: each word is represented as a high-dimensional, sparse vector where only one entry of that word is 1 and all other entires are 0's:
\begin{eqnarray*}
\mf{v}_{\text{car}} = [0, 0, \ldots, 0, 0, 1, 0, \ldots, 0]^{\intercal} \\
\mf{v}_{\text{vehicle}} = [0, 1, \ldots, 0, 0, 0, 0, \ldots, 0]^{\intercal}
\end{eqnarray*}

The biggest problem with these sparse vectors is that they don't share any semantic similarity between words, i.e., for any pair of different words $a, b$, $\cos(\mf{v}_a, \mf{v}_b) = 0$. Low-dimensional word embeddings effectively alleviated this problem and similar words can be encoded as similar vectors in geometry space: $\cos(\mf{v}_{\text{car}}, \mf{v}_{\text{vechicle}}) < \cos(\mf{v}_{\text{car}}, \mf{v}_{\text{man}})$.

These word embeddings can be effectively learned from large unlabeled text corpora, based on the assumption that words occur in similar contexts tend to have similar meanings (a.k.a. the \ti{distributional hypothesis}). Indeed, learning word embeddings from text has a long-standing history and has been finally popularized by recent scalable algorithms and released sets of pretrained word embeddings such as \sys{word2vec}~\cite{mikolov2013distributed}, \sys{glove}~\cite{pennington2014glove} and \sys{fasttext}~\cite{bojanowski2017enriching}. They have become the mainstay of modern NLP systems.

\subsubsection*{Recurrent neural networks}
The second important idea is the use of recurrent neural networks (RNNs) to model sentences or paragraphs in NLP. \ti{Recurrent neural networks} are a class of neural networks which are suitable to handle sequences of variable length. More concretely, they apply a parameterized function recursively on a sequence $\mf{x}_1, \ldots, \mf{x}_n$:
\begin{equation}
    \mf{h}_t = f(\mf{h}_{t-1}, \mf{x}_t; \Theta)
\end{equation}
For NLP applications, we represent a sentence or a paragraph as a sequence of words where each word is transformed into a vector (usually through pre-trained word embeddings): $\mf{x} = \mf{x}_1, \mf{x}_2, \ldots, \mf{x}_n \in \R^d$ and $\mf{h}_t \in \R^h$ can be used to model the contextual information of $\mf{x}_{1:t}$.

Vanilla RNNs take the form of
\begin{equation}
    \mf{h}_t = \tanh(\mf{W}^{hh}\mf{h}_{t-1} + \mf{W}^{hx}\mf{x}_t + \mf{b}),
\end{equation}
where $\mf{W}^{hh} \in \R^{h \times h}, \mf{W}^{hx} \in \R^{h\times d}$, $\mf{b} \in \R^h$ are the parameters to be learned. To ease the optimization, many variants of RNNs have been proposed. Among them, long short-term memory networks (LSTMs)~\cite{hochreiter1997} and gated recurrent units (GRUs)~\cite{cho2014learning} are the commonly used ones. Arguably, LSTM is still the most competitive RNN variant for NLP applications today and also our default choice for the neural models that we will describe. Mathematically, LSTMs can be formulated as follows:
\begin{eqnarray}
    \mf{i}_t & = & \sigma(\mf{W}^{ih}\mf{h}_{t-1} + \mf{W}^{ix}\mf{x_t} + \mf{b}^{i}) \\
    \mf{f}_t & = & \sigma(\mf{W}^{fh}\mf{h}_{t-1} + \mf{W}^{fx}\mf{x_t} + \mf{b}^{f}) \\
    \mf{o}_t & = & \sigma(\mf{W}^{oh}\mf{h}_{t-1} + \mf{W}^{ox}\mf{x_t} + \mf{b}^{o}) \\
    \mf{g}_t & = & \tanh(\mf{W}^{gh}\mf{h}_{t-1} + \mf{W}^{gx}\mf{x_t} + \mf{b}^{g}) \\
    \mf{c}_t & = & \mf{f}_t \odot \mf{c}_{t-1} + \mf{i}_t \odot \mf{g}_t \\
    \mf{h}_t & = & \mf{o}_t \odot \tanh(\mf{c}_t),
\end{eqnarray}
where $\mf{W}^{ih}, \mf{W}^{fh}, \mf{W}^{oh}, \mf{W}^{gh} \in \R^{h \times h}$, $\mf{W}^{ix}, \mf{W}^{fx}, \mf{W}^{ox}, \mf{W}^{gx} \in \R^{h \times d}$ and $\mf{b}^{i}, \mf{b}^{f}, \mf{b}^{o}, \mf{b}^{g} \in \R^h$ are the parameters to be learned.

Finally, a useful elaboration of an RNN is a \ti{bidirectional RNN}. The idea is simple: for a sentence or a paragraph, $\mf{x} = \mf{x}_1, \ldots, \mf{x}_n$, a forward RNN is used from left to right and then another backward RNN is used from right to left:
\begin{eqnarray}
    \overrightarrow{\mf{h}}_t & = & f(\overrightarrow{\mf{h}}_{t-1}, \mf{x}_t; \overrightarrow{\Theta}), \quad t = 1, \ldots, n\\
    \overleftarrow{\mf{h}}_t & = & f(\overleftarrow{\mf{h}}_{t+1}, \mf{x}_t; \overleftarrow{\Theta}), \quad t = n, \ldots, 1
\end{eqnarray}
We define $\mf{h}_t = [\overrightarrow{\mf{h}}_t; \overleftarrow{\mf{h}}_t] \in \R^{2h}$ which takes the concatenation of the hidden vectors from the RNNs in both directions. These representations can usefully encode information from both the left context and the right context and are suitable for general-purpose trainable feature-extracting component of many NLP tasks.

\subsubsection*{Attention mechanism}
The third important component is an attention mechanism. It was first introduced in the \textit{sequence-to-sequence} (seq2seq) models \cite{sutskever2014sequence} for neural machine translation \cite{bahdanau2015neural,luong2015effective} and has later been extended to other NLP tasks.

The key idea is, if we want to predict the sentiment of a sentence, or translate a sentence of one language to the other, we usually apply recurrent neural networks to encode a single sentence (or the source sentence for machine translation): $\mf{h}_1, \mf{h}_2, \ldots, \mf{h}_n$ and use the last time step $\mf{h}_n$ to predict the sentiment label or the first word in the target language:

\begin{equation}
  P(Y = y) = \frac{\exp(\mf{W}_y\mf{h}_n)}{\sum_{y'}{\exp\left(\mf{W}_{y'}\mf{h}_n\right)}}
\end{equation}

This requires the model to be able to compress all the necessary information of a sentence into a fixed-length vector, which causes an information bottleneck in improving performance. An attention mechanism is designed to solve this problem: instead of squashing all the information into the last hidden vector, it looks at the hidden vectors at all time steps and chooses a subset of these vectors adaptively:
\begin{eqnarray}
    \alpha_i & = & \frac{\exp\left(g(\mf{h}_i, \mf{w}; \Theta_g)\right)}{\sum_{i'=1}^{n}\exp\left(g(\mf{h}_{i'}, \mf{w}; \Theta_g)\right)} \label{eq:attention} \\
    \mf{c} & = & \sum_{i=1}^{n}{\alpha_i \mf{h}_i} \label{eq:context-vector}
\end{eqnarray}

Here $\mf{w}$ can be a task-specific vector learned from the training process, or taken as the current target hidden state in machine translation and $g$ is a parameteric function which can be chosen in various ways, such as dot product, bilinear product, or one hidden layer of MLP:
\begin{eqnarray}
    g_{\text{dot}}(\mf{h}_i, \mf{w}) &=& {\mf{h}_i}^{\intercal}\mf{w} \\
    g_{\text{bilinear}}(\mf{h}_i, \mf{w}) &=& {\mf{h}_i}^\intercal\mf{W}\mf{w} \\
    g_{\text{MLP}}(\mf{h}_i, \mf{w}) &=& {\mf{v}}^\intercal\tanh(\mf{W}^h\mf{h}_i + \mf{W}^w\mf{w}) \label{eq:mlp-att}
\end{eqnarray}

Roughly, an attention mechanism computes a similarity score for each $\mf{h}_i$ and then a softmax function is applied which returns a discrete probability distribution over all the time steps. Thus $\alpha$ essentially captures which parts of the sentence are indeed relevant and $\mf{c}$ aggregates over all the time steps with a weighted sum and can be used for final prediction. We are not going into more details and interested readers are referred to \newcite{bahdanau2015neural,luong2015effective}.

Attention mechanisms have been proved widely effective in numerous applications and become an integral part of neural NLP models. Recently, \newcite{parikh2016decomposable} and \newcite{vaswani2017attention} conjectured that attention mechanisms don't have to be used in conjunction with recurrent neural networks and can be built purely based on word embeddings and feed-forward networks, while providing minimal sequence information. This class of models usually requires less parameters and is more parallelizable and scalable --- in particular, the \sys{Transformer} model proposed in \newcite{vaswani2017attention} has become a recent trend and we will discuss it more in Section~\ref{sec:alt-lstms}.

\subsection{The Model}
At this point, we are equipped with all the building blocks. How can we build effective neural models out of them for reading comprehension? What are the key ingredients? Next we introduce our model: the \sys{Stanford Attentive Reader}. Our model is inspired by the \sys{Attentive Reader} described in \newcite{hermann2015teaching} and other concurrent works, with a goal of making the model simple yet powerful. We first describe its full form for span prediction problems that we introduced in \newcite{chen2017reading} and then later we discuss its other variants.

\begin{figure}[t]
\begin{center}
\includegraphics[height=8cm]{img/drqa_reader.pdf}
\end{center}
\longcaption{A full model of \sys{Stanford Attentive Reader}}{\label{fig:sar} A full model of \sys{Stanford Attentive Reader}. Image courtesy: \\ \href{https://web.stanford.edu/~jurafsky/slp3/23.pdf}{https://web.stanford.edu/~jurafsky/slp3/23.pdf}.}
\end{figure}

Let's first recap the setting of span-based reading comprehension problems: Given a single passage $p$ consisting of $l_p$ tokens $(p_1, p_2, \ldots, p_{l_p})$ and a question $q$ consisting of $l_q$ tokens $(q_1, q_2, \ldots, q_{l_q})$, the goal is to predict a span $(a_{\text{start}}, a_{\text{end}})$ where $1 \leq a_{\text{start}} \leq a_{\text{end}} \leq l_p$ so that the corresponding string $p_{a_{\text{start}}}, p_{a_{\text{start}} + 1}, \ldots, p_{a_{\text{end}}}$ gives the answer to the question.

The full model is illustrated in Figure~\ref{fig:sar}. At a high level, the model first builds a vector representation for the question and builds a vector representation for each token in the passage. It then computes a similarity function between the question and its passage word in context, and then uses the question-passage similarity scores to decide the starting and ending positions of the answer span. The model builds on top of the low-dimensional, pre-trained word embeddings for each word in the passage and question (with linguistic annotations optionally). All the parameters for passage/question encoding and similarity functions are optimized jointly for the final answer prediction. Let's go into further details of each component:

\subsubsection*{Question encoding}
\label{sec:question-encoding}
The question encoding is relatively simple: we first map each question word $q_i$ into its word embedding $\mf{E}(q_i) \in \R^d$ and then we apply a bi-directional LSTM on top of them and finally obtain:
\begin{equation}
    \mf{q}_{1}, \mf{q}_2, \ldots, \mf{q}_{l_q} = \text{BiLSTM}(\mf{E}(q_1), \mf{E}(q_2), \ldots, \mf{E}(q_{l_q}); \Theta^{(q)}) \in \R^{h}
\end{equation}

We then aggregate these hidden units into one single vector through an attention layer:
\begin{eqnarray}
    b_j & = & \frac{\exp({\mf{w}^{q}}^\intercal \mf{q}_j)}{\sum_{j'}{\exp({\mf{w}^{q}}^\intercal \mf{q}_{j'})}} \\
    \mf{q} & = & \sum_j{b_j \mf{q}_j}
\end{eqnarray}
$b_j$ measures the importance of each question word and $\mf{w}^{q} \in \R^h$ is a weight vector to be learned. Therefore, $\mf{q} \in \R^h$ is the final vector representation for the question. Indeed, it was simpler (and also common) to represent $\mf{q}$ as the concatenation of the last hidden vector from the LSTMs in both directions. However, based on the empirical performance, we find that adding this attention layer helps consistently as it adds more weight to the more relevant question words.

\subsubsection*{Passage encoding}
Passage encoding is similar, as we also first form an input representation $\tilde{\mf{p}}_i \in \R^{\tilde{d}}$ for each word in the passage and pass them through another bidirectional LSTM:
\begin{equation}
  \label{eq:passage-lstm}
    \mf{p}_{1}, \mf{p}_2, \ldots, \mf{p}_{l_p} = \text{BiLSTM}\left(\tilde{\mf{p}}_1, \tilde{\mf{p}}_2, \ldots, \tilde{\mf{p}}_{l_p}; \Theta^{(p)}\right) \in \R^{h}
\end{equation}

The input representation $\tilde{\mf{p}}_i$ can be divided into two categories: one is to encode \ti{the properties of each word itself}, and the other is to encode \ti{its relevance with respect to the question}.

For the first category, in addition to word embedding $f_{emb}(p_i) = \mf{E}(p_i) \in \R^d$, we also add some manual features which reflect the properties of word $p_i$ in its context, including its part-of-speech (POS) and named entity recognition (NER) tags and its (normalized) term frequency (TF): $f_{token}(p_i) = \left(\text{POS}(p_i), \text{NER}(p_i), \text{TF}(p_i)\right)$. For POS and NER tags, we run off-the-shelf tools and convert it into a one-hot representation as the set of tags is small. The TF feature is real-valued number which measures how many times the word appears in the passage divided by the total number of words.

For the second category, we consider two types of representations:
\begin{itemize}
  \item
  \tf{Exact match}: $f_{exact\_match}(p_i) = \mathbb{I}(p_i \in q) \in \R$. In practice, we use three simple binary features, indicating whether $p_i$ can be exactly matched to one question word in $q$, either in its original, lowercase or lemma form.
  \item
  \tf{Aligned question embeddings}: The exact match features encode the hard alignment between question words and passage words. Aligned question embeddings aim to encode a soft notion of alignment between words in the word embedding space, so that similar (but non-identical) words, e.g., \textit{car} and \textit{vehicle}, can be well aligned. Concretely, we use
  \begin{equation}
      \label{eq:aligned_question}
    f_{align}(p_i) = \sum_j{a_{i, j} \mf{E}(q_j)}
  \end{equation}
  where $a_{i, j}$ are the attention weights which capture the similarity between $p_i$ and each question words $q_j$ and $\mf{E}(q_j) \in \R^d$ is the word embedding for each question word. $a_{i, j}$ is computed by the dot product between nonlinear mappings of word embeddings:
  \begin{equation}
    \label{eq:aligned_question_attention}
    a_{i, j} = \frac{\exp\left(\text{MLP}(\mf{E}(p_i))^{\intercal} \text{MLP}(\mf{E}(q_{j}))\right)}{\sum_{j'}{\exp\left(\text{MLP}(\mf{E}(p_i)) ^{\intercal} \text{MLP}(\mf{E}(q_{j'}))\right)}},
  \end{equation} and $\text{MLP}(\mf{x}) = \max(0, \mf{W}_{\text{MLP}}\mf{x} + \mf{b}_{\text{MLP}})$ is a single dense layer with ReLU nonlinearity, where $\mf{W}_{\text{MLP}} \in \R^{d \times d}$ and $\mf{b}_{\text{MLP}} \in \R^d$.
\end{itemize}
Finally, we simply concatenate the four components and form the input representation:
\begin{equation}
    \tilde{\mf{p}_i} = (f_{emb}(p_i), f_{token}(p_i), f_{exact\_match}(p_i), f_{align}(p_i)) \in \R^{\tilde{d}}
\end{equation}

\subsubsection*{Answer prediction}
We have vector representations for both the passage $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p} \in \R^h$ and the question $\mf{q} \in \R^h$ and the goal is to predict the span that is most likely the correct answer. We employ the idea of attention mechanism again and train two separate classifiers, one is to predict the start position of the span while the other is to predict the end position. More specifically, we use a bilinear product to capture the similarity between $\mf{p}_i$ and $\mf{q}$:
\begin{eqnarray}
P^{(\text{start})}(i) & = & \frac{\exp\left(\mf{p}_i \mf{W}^{(\text{start})} \mf{q}\right)}{\sum_{i'}\exp\left(\mf{p}_{i'} \mf{W}^{(\text{start})} \mf{q}\right)} \\
P^{(\text{end})}(i) & = & \frac{\exp\left(\mf{p}_i \mf{W}^{(\text{end})} \mf{q}\right)}{\sum_{i'}\exp\left(\mf{p}_{i'} \mf{W}^{(\text{end})} \mf{q}\right)},
\end{eqnarray}
where $\mf{W}^{(\text{start})}, \mf{W}^{(\text{end})} \in \R^{h \times h}$ are additional parameters to be learned. This is slightly different from the formulation of attention as we don't need to take the weighted sum of all the vector representations. Instead, we use the normalized weights to make direct predictions. We use bilinear products because we find them to work well empirically.

\subsubsection*{Training and inference}
The final training objective is to minimize the cross-entropy loss:
\begin{equation}
    \mathcal{L} = - \sum \log{P^{(\text{start})}(a_{\text{start}})} - \sum \log{P^{(\text{end})}(a_{\text{end}})},
\end{equation}
and all the parameters $\Theta = \Theta^{(p)}, \Theta^{(q)}, \mf{w}^{(q)}, \mf{W}_{\text{MLP}}, \mf{b}_{\text{MLP}}, \mf{W}^{(\text{start})}, \mf{W}^{(\text{end})}$ are optimized jointly with stochastic gradient methods.\footnote{We exclude word embeddings here but it is also common to treat all or a subset of the word embeddings as parameters and fine-tune them during training.}

During inference, we choose the span $p_i, \ldots, p_{i'}$ such that $i \leq i' \leq i + max\_len$ and $P^{(\text{start})}(i) \times P^{(\text{end})}(i')$ is maximized. $max\_len$ is a pre-defined constant (e.g., 15) which controls the maximum length of the answer.

\subsection{Extensions}
In the following, we give a few variants of the \sys{Stanford Attentive Reader} for other types of reading comprehension problems.
All these models follow the same process of passage encoding and question encoding as described above, hence we have $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p} \in \R^h$ and $\mf{q} \in \R^h$. We only discuss the answer prediction component and training objectives.

\paragraph{\tf{Cloze style.}} Similarly, we can compute an attention function using a bilinear product of the question over all the words in the passage, and then compute an output vector $\mf{o}$ which takes a weighted sum of all the paragraph representations:
\begin{eqnarray}
    \alpha_i & = & \frac{\exp\left(\mf{p}_i \mf{W} \mf{q}\right)}{\sum_{i'}\exp\left(\mf{p}_{i'} \mf{W} \mf{q}\right)} \\
    \mf{o} & = & \sum_{i}{\alpha_i \mf{p}_i}   \label{eqn:output_vector}
\end{eqnarray}
The output vector $\mf{o}$ can be used to predict the missing word or entity:
\begin{equation}
    P(Y = e \mid p, q) = \frac{\exp(\mf{W}^{(a)}_e \mf{o})}{\sum_{e' \in \mathcal{E}}\exp\left(\mf{W}^{(a)}_{e'} \mf{o}\right)},
\end{equation}
where $\mathcal{E}$ denotes the candidate set of entities or words. It is straightforward to adopt a negative log-likelihood objective for training and choose $e \in \mathcal{E}$ which maximizes $\mf{W}^{(a)}_{e} \mf{o}$ during prediction. This model has been studied in our earlier paper \cite{chen2016thorough} for the \sys{CNN/Daily Mail} dataset and \cite{onishi2016did} for the \sys{Who-Did-What} dataset.

\paragraph{\tf{Multiple choice.}} In this setting, $k$ hypothesized answers are given $\mathcal{A} = \{a_1, \ldots, a_k\}$ and we can encode each of them into a vector $\mf{a}_i$ by applying a third BiLSTM, similar to our question encoding step. We can then compute the output vector $\mf{o}$ as in Equation~\ref{eqn:output_vector} and compare it with each hypothesized answer vector $\mf{a}_i$ through another similarity function using a bilinear product:
\begin{equation}
    P(Y = i \mid p, q) = \frac{\exp(\mf{a}_i \mf{W}^{(a)} \mf{o})}{\sum_{i'=1, \ldots, k}\exp\left(\mf{a}_{i'}\mf{W}^{(a)} \mf{o}\right)}
\end{equation}
The cross-entroy loss is also used for training. This model has been studied in \newcite{lai2017race} for the \sys{RACE} dataset.

\paragraph{\tf{Free-form answer.}} For this type of problems, the answer isn't restricted to a single entity or a span in the passage and can take any sequence of words and the most common solution is to incorporate an LSTM sequence decoder into the current framework. In more detail, assume the answer string is $a = (a_1, a_2, \ldots, a_{l_a})$ and a special ``end-of-sequence'' token $\left<eos\right>$ is added to the end of each answer. We can compute the output vector $\mf{o}$ again as in Equation~\ref{eqn:output_vector}. and the decoder generates a word at a time and hence the conditional probability can be decomposed as:
\begin{equation}
    P(a \mid p, q) =  P(a \mid \mf{o}) = \prod_{j = 1}^{l_a}P(a_j \mid a_{<j}, \mf{o})
\end{equation}

$P(a_j \mid a_{<j}, \mf{o})$ is parameterized as an LSTM which takes $\mf{o}$ as the initial hidden vector, and $a_j$ is predicted based on the hidden vector $\mf{h}_j$ for the full vocabulary $\mathcal{V} \cup \{\left<eos\right>\}$. The training objective is
\begin{equation}
\mathcal{L} = -\log{P(a \mid p, q)} = -\log\sum_{j = 1}^{l_a}P(a_j \mid a_{<j}, \mf{o})
\end{equation}
For prediction, one word is predicted at a time which maximizes $P(a_j \mid a_{<j}, \mf{o})$ and then fed into the next time step, until the token $\left<eos\right>$ is predicted. We are not going to elaborate more on it as they are standard components in sequence-to-sequence models \cite{sutskever2014sequence}.

This class of models has been studied on the \sys{MS MARCO}~\cite{nguyen2016ms} and the \sys{NarrativeQA}~\cite{kovcisky2018narrativeqa} datasets. However, as the free-form answer reading comprehension problems are more complex and more difficult to evaluate, we think that these methods haven't been fully explored yet, compared to other types of problems. Lastly, we believe that a \textit{copy mechanism} proposed for summarization tasks  \cite{gu2016incorporating,see2017get}, which allows the decoder to choose either to copy a word from the source text or to generate a word from the vocabulary, would be highly useful for reading comprehension tasks as well, as answer words are still likely to appear in the paragraph or question. We will discuss one model with a copy mechanism in Section~\ref{sec:coqa-models}.


================================================
FILE: chapters/rc_overview/discussions.tex
================================================
%!TEX root = ../../thesis.tex

\section{Reading Comprehension vs. Question Answering}
\label{sec:rc-qa-diff}

There is a close relationship between reading comprehension and question answering.  We can view reading comprehension as an instance of question answering because it is essentially a question answering problem over a short passage of text. Nevertheless, although reading comprehension and general question answering share many common characteristics in problem formulation, approaches and evaluation, we think they emphasize different thing as their final goals:

\begin{itemize}
    \item
        The ultimate goal of question answering is to build computer systems which are able to automatically \ti{answer questions} posed by humans, no matter what sort of resources they depend on. These resources can be structured knowledge bases, unstructured text collections (encyclopedias, dictionaries, newswire articles and general Web documents), semi-structured tables or even other modalities. Towards the better performance of QA systems, a lot of efforts have been put into (1) how to search and identify relevant resources, (2) how to integrate answers from different pieces of information, or even (3) to study what types of questions humans usually ask in the real world.
    \item
        However, reading comprehension puts more emphasis on \ti{text understanding} with answering questions regarded as a way to measure language understanding. Therefore it requires a deep understanding of the given passage in order to answer the question. Due to this key difference, early works in this field mostly focused on fictional stories ~\cite{lehnert1977process} (later extended to Wikipedia or Web documents), so all the information to answer comprehension questions comes from the passage itself instead of any world knowledge. The questions are also specifically devised to test different aspects of text comprehension. This distinction is akin to what questions people usually ask on search engines versus what sorts of questions are usually posed in human reading comprehension tests.
\end{itemize}

Similarly, early work \cite{mitchell2009populating} used the terms \tf{micro-reading} and \tf{macro-reading} to differentiate these two scenarios. Micro-reading focuses on reading a single text document and aims to extract the full information content of that document (similar to our reading comprehension setting), while macro-reading takes a large text collection (e.g., the Web) as input and extracts a large collection of facts expressed in the text, without requiring that every single fact is extracted. Macro-reading can effectively leverage the \ti{redundancy} of information across documents by focusing on analyzing simple wordings of the fact in the text, while micro-reading has to investigate deeper level of language understanding.

This thesis mostly focuses on reading comprehension. In Chapter~\ref{chapter:openqa}, we will come back to more general question answering problems, discuss its related work and also demonstrate that reading comprehension can be also helpful in building question answering systems.

\section{Datasets and Models}
\label{sec:rc-drive}


As seen in Section~\ref{sec:deep-learning-era}, the recent success of reading comprehension has been mainly driven by two key components: \ti{large-scale reading comprehension datasets} and \ti{end-to-end neural reading comprehension models}. They work together to advance the field and push the boundaries of building better reading comprehension systems:

\begin{description}
\item
On the one hand, the creation of large-scale reading comprehension datasets has made it possible to train neural models, while demonstrating their competitiveness over symbolic NLP systems. The availability of these datasets further attracted a lot of attention in our research community and inspired a series of modeling innovations. Tremendous progress has been made thanks to all these efforts.
\item
On the other hand, understanding the performance of existing models further helps identify the limitations of existing datasets. This motivates us to seek better ways to construct more challenging datasets, towards the ultimate goal of machine comprehension of text.
\end{description}


\begin{figure}[!t]
    \center
    \includegraphics[scale=1.0]{img/timeline.pdf}
    \longcaption{The recent development of datasets and models in neural reading comprehension}{\label{fig:timeline}The recent development of datasets (black) and models (blue) in neural reading comprehension. For the timeline, we use the date that the corresponding papers were published, except \sys{BERT}~\cite{devlin2018bert}.}
\end{figure}

Figure~\ref{fig:timeline} shows a timeline of the recent development of key datasets and models since 2016. As is seen, although it has been only 3 years, the field has been moving strikingly fast. The innovations in building better datasets and more effective models have occurred alternately and both contributed to the development of the field. In the future, we believe it will be equally important to continue to develop both components.

In the next chapter, we will mainly focus on the modeling aspect, using the two representative datasets that we described earlier: \sys{CNN/Daily Mail} and \sys{SQuAD}. In Chapter~\ref{chapter:rc-future}, we will discuss more about the advances and future work, for both datasets and models.


================================================
FILE: chapters/rc_overview/history.tex
================================================
%!TEX root = ../../thesis.tex

\section{History}
\label{sec:rc-history}

\subsection{Early Systems}
The history of building automated reading comprehension systems dates back to over forty years ago. In the 1970s, researchers already recognized the importance of reading comprehension as an appropriate way of testing the language understanding abilities of computer programs.

% \red{TODO: want to cite \cite{charniak1972toward} but I don't know much of its context.}

One of the most notable early works is the \sys{QUALM} system detailed in \newcite{lehnert1977process}. Built on top of the framework of scripts and plans as devices for modeling human story comprehension \cite{schank1977scripts}, \newcite{lehnert1977process} devised a theory of question answering and focused on pragmatic issues and the importance of the context of the story in responding to questions. This early work set a strong vision for language understanding, but the actual systems built at that time were very small and limited to hand-coded scripts, and difficult to generalize to broader domains.

Due to the complexity of the problem, this line of research was mostly neglected in the 1980s and 1990s.\footnote{There has been a large body of work in story comprehension developed within the psychology community, see \cite{kintsch1998comprehension}.} In the late 1990s, there was some small revival of interest, following the creation of a reading comprehension dataset by \newcite{hirschman1999deep} and a subsequent Workshop on Reading Comprehension Tests as Evaluation for Computer-based Understanding Systems at ANLP/NAACL 2000. The dataset consists of 60 stories for development and 60 stories for testing of 3rd to 6th grade material, followed by short-answer \ti{who}, \ti{what}, \ti{when}, \ti{where} and \ti{why} questions. It only requires systems to return a sentence which contains the right answer. The systems developed at this stage were mostly rule-based bag-of-words approaches with shallow linguistic processing such as stemming, semantic class identification and pronoun resolution in the \sys{Deep Read} system \cite{hirschman1999deep}, or manually generated rules based on lexical and semantic correspondence in the \sys{Quarc} system \cite{riloff2000rule} or their combinations \cite{charniak2000reading}. These systems achieved $30\%$--$40\%$ accuracy on retrieving the correct sentence.

% {\red{TODO: add a sentence about TREC QA?}}

\subsection{Machine Learning Approaches}
\label{sec:ml-approaches}

\afterpage{
\LTcapwidth=\textwidth
\begin{longtable}{l | p{13.5cm}}
\toprule
\text{(a)} & \tf{CNN/Daily Mail} (cloze style) \\
& \tf{passage}: {\small ( @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel `` @entity11 '' will feature a capable but flawed @entity13 official named @entity14 who `` also happens to be a lesbian . '' the character is the first gay figure in the official @entity6 -- the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 -- according to @entity24 , editor of `` @entity6 '' books at @entity28 imprint @entity26 .} \\
& \tf{question}: {\small characters in `` \underline{\hspace{1cm}} '' movies have gradually become more diverse} \\
& \tf{answer}: {\small @entity6} \\
\midrule
\text{(b)} & \tf{MCTest} (multiple choice) \\
& \tf{passage}: {\small Once upon a time, there was a cowgirl named Clementine. Orange was her favorite color. Her favorite food was the strawberry. She really liked her Blackberry phone, which allowed her to call her friends and family when out on the range. One day Clementine thought she needed a new pair of boots, so she went to the mall. Before Clementine went inside the mall, she smoked a cigarette. Then she got a new pair of boots. She couldn't choose between brown and red. Finally she chose red, which the seller really liked. Once she got home, she found that her red boots didn't match her blue cowgirl clothes, so she knew she needed to return them. She traded them for a brown pair. While she was there, she also bought a pretzel from Auntie Anne's.} \\
&\tf{question}: {\small What did the cowgirl do before buying new boots?} \\
&\tf{hypothesized answers}: {\small A. She ate an orange B. She ate a strawberry C. She called her friend D. She smoked a cigarette} \\
&\tf{answer}: {\small D. She smoked a cigarette} \\
\midrule
\text{(c)} &\tf{SQuAD} (span prediction) \\
&\tf{passage}: {\small Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion \hl{Denver Broncos} defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.}  \\
&\tf{question}: {\small Which NFL team won Super Bowl 50?} \\
&\tf{answer}: {\small Denver Broncos} \\
\midrule
\text{(d)} &\tf{NarrativeQA} (free-form text) \\
&\tf{passage}: {\small \ldots In the eyes of the city, they are now considered frauds. Five years later, Ray owns an occult bookstore and works as an unpopular children s entertainer with Winston; Egon has returned to Columbia University to conduct experiments into human emotion; and Peter hosts a pseudo-psychic television show. Peter's former girlfriend Dana Barrett has had a son, Oscar, with a violinist whom she married then divorced when he received an offer to join the London Symphony Orchestra.\ldots }  \\
&\tf{question}: {\small How is Oscar related to Dana?} \\
&\tf{answer}: {\small He is her son} \\
\bottomrule
\longcaption{Examples from representative reading comprehension datasets}{\label{tab:rc-examples} A few examples from representative reading comprehension datasets: (a) \sys{CNN/Daily Mail}~\cite{hermann2015teaching}, (b) \sys{MCTest}~\cite{richardson2013mctest}, (c) \sys{SQuAD}~\cite{rajpurkar2016squad} and (d) \sys{NarrativeQA}~\cite{kovcisky2018narrativeqa}.}
\end{longtable}
}

Between 2013 and 2015, there were remarkable efforts of formulating reading comprehension as a \ti{supervised learning} problem: researchers collected human-labeled training examples in the form of (passage, question, answer) triples, with the hope that we can train statistical models which learn to map a passage and question pair into their corresponding answer: $f: (\text{passage}, \text{question}) \longrightarrow \text{answer}.$

Two notable datasets during this period are \sys{MCTest}~\cite{richardson2013mctest} and \sys{ProcessBank}~\cite{berant2014modeling}. \sys{MCTest} collects 660 fictional stories, with 4 multiple choice questions per story (each question comes with 4 hypothetical answers and one of them is correct) (Table~\ref{tab:rc-examples} (b)). \sys{ProcessBank} is designed to answer binary-choice questions in a paragraph describing a biological process and requires an understanding of the relations between entities and events in the process. The dataset comprises 585 questions spread over the 200 paragraphs.

In the original \sys{MCTest} paper, \newcite{richardson2013mctest} proposed several rule-based baselines without leveraging any training data. One is a heuristic sliding window approach, which measures the weighted word overlap/distance information between words in the question, the answer and the sliding window; the other is to run an off-the-shelf textual entailment system by converting each question-answer pair into a statement. This dataset later inspired a strand of machine learning models \cite{sachan2015learning,narasimhan2015machine,wang2015machine}. These models were mostly built on top of a simple max-margin learning framework with a rich set of hand-engineered linguistic features, including syntactic dependencies, semantic frames, coreference resolution, discourse relations and word embeddings. The performance was improved modestly from 63\% to around 70\% on the \sys{MC500} portion. On the \sys{ProcessBank} dataset, \newcite{berant2014modeling} proposed a statistical model which learns to predict the process structure first and then maps the question to formal queries that can be executed against the structure. Similarly, the model incorporates a large set of manual features,\footnote{See \href{https://nlp.stanford.edu/pubs/berant-srikumar-manning-emnlp14-supp.pdf}{https://nlp.stanford.edu/pubs/berant-srikumar-manning-emnlp14-supp.pdf}.} and eventually obtains 66.7\% accuracy on the binary classification task.

These machine learning models have achieved modest progress compared to earlier rule-based heuristic methods. However, their improvements are still rather limited and their weaknesses are summarized as follows:
\begin{itemize}
    \item
        These models relied heavily on existing linguistic tools such as dependency parsers and semantic role labeling (SRL) systems. However, these linguistic representation tasks are far from solved and off-the-shelf tools are often trained from one single domain (e.g., newswire articles) and suffer from generalization problems in practical use. Therefore, leveraging existing linguistic annotations as features sometimes adds noise in these feature-based machine learning models and the situation gets worse for higher level annotations (e.g., discourse relations vs. part-of-speech tagging).
    \item
        Simulating human-level comprehension is an elusive challenge and it is always the case that it is difficult to construct effective features from current linguistic representations. For example, for the third question in Figure~\ref{fig:mctest-example}: \ti{How many friends does Alyssa have in this story?}, it is impossible to construct an effective feature when the evidence is spread over the passage.
    \item
        Although it is inspiring that we can train models from human-labeled reading comprehension examples, these datasets are still too small to support expressive statistical models. For example, the English Penn Treebank dataset for training dependency parsers consists of 39,832 examples, while in \sys{MCTest}, only 1,480 examples are used for training --- let alone reading comprehension which, as a comprehensive language understanding task, is more complex and requires different reasoning capabilities.
\end{itemize}

\subsection{A Resurgence: The Deep Learning Era}
\label{sec:deep-learning-era}

A turning point for this field came in 2015. The DeepMind researchers \newcite{hermann2015teaching} proposed a novel and cheap solution for creating large-scale supervised training data for learning reading comprehension models. They also proposed a neural network model --- an attention-based LSTM model named \sys{The Attentive Reader} --- and demonstrated that it outperformed symbolic NLP approaches by a large margin. In their experiments, the \sys{Attentive Reader} achieved 63.8\% accuracy while symbolic NLP systems obtained 50.9\% at most on the \sys{CNN} dataset.  The idea of the data creation is as follows: CNN and Daily Mail are accompanied by a number of bullet points, summarizing aspects of the information contained in the article. They take a news article as the passage and convert one of its bullet points as a cloze style question by replacing one entity at at time with a placeholder, and the answer is this replaced entity. In order to ensure that systems approaching this task need to genuinely understand the passage, rather than using world knowledge or a language model to answer questions, they run entity recognition and coreference resolution systems and replace all the entity mentions in each coreference chain by an abstract entity marker e.g., \ti{@entity6} (see an example in Table~\ref{tab:rc-examples} (a)). As a result, nearly 1 million data examples were collected at almost no cost.

Taking a step further, our work \cite{chen2016thorough} investigated this first-ever large reading comprehension dataset and demonstrated that a simple, carefully designed neural network model (Section~\ref{sec:sar}) is able to push the performance to 72.4\% on the \sys{CNN} dataset, another 8.6\% absolute improvement. More importantly, we justified that the neural network models are better at recognizing lexical matches and paraphrases compared to conventional feature-based classifiers. However,  although this semi-synthetic dataset provides a promising avenue for training effective statistical models, we concluded that the dataset appears to be noisy due to its method of data creation and coreference errors and is limited for driving further progress.

To address these limitations, \newcite{rajpurkar2016squad} collected a new dataset named \sys{the Stanford Question Answering Dataset (SQuAD)}. The dataset contains 107,785 question-answer pairs on 536 Wikipedia articles and the questions were posed by crowdworkers, and the answer to every question is a span of text from the corresponding reading passage (Table~\ref{tab:rc-examples} (c)). \sys{SQuAD} was the first large-scale reading comprehension dataset with natural questions. Thanks to its high quality and reliable automatic evaluation, this dataset has spurred tremendous interest in the NLP community and become the central benchmark in this field. It in turn inspired a wide array of new reading comprehension models \cite{wang2017machine,seo2017bidirectional,chen2017reading,wang2017gated,yu2018qanet} and the progress has been rapid --- as of Oct 2018, the best-performing single system achieved an F1 score of 91.8\% \cite{devlin2018bert} which already exceeds the estimated human performance of 91.2\%, while a feature-based classifier built by the original authors in 2016 only obtained an F1 of 51.0\%, as shown in Figure~\ref{fig:squad-progress}.

\begin{figure}[!t]
\center
\includegraphics[scale=0.8]{img/squad_progress.png}
\longcaption{The progress on \sys{SQuAD} 1.1}{\label{fig:squad-progress}The progress on \sys{SQuAD} 1.1 (single model) since the dataset was released in June 2016. The data points are taken from the leaderboard at \href{http://stanford-qa.com/}{http://stanford-qa.com/}.}
\end{figure}

All the current top-performing systems on \sys{SQuAD} are built on \ti{end-to-end neural networks}, or \ti{deep learning} models. These models usually start off from the idea of representing every single word in the passage and question as a dense vector (e.g., 300 dimensions), passing through several modeling or interaction layers, and finally making predictions. All the parameters can be optimized jointly using the gradient descent algorithm or its variants. This class of models can be referred to as \ti{neural reading comprehension} and we will describe it in detail in Chapter~\ref{chapter:rc-models}. Differing from feature-based classifiers, neural reading comprehension models have several great advantages:
\begin{itemize}
    \item
        They don't rely on any downstream linguistic features (e.g., dependency parsing or coreference resolution) and all the features are learned on their own in one unified end-to-end framework. This can avoid noise in linguistic annotations while also providing great flexibility in the space of useful features.
    \item
        Conventional symbolic NLP systems suffer from one severe problem: features are usually very sparse and generalize poorly. For example, to answer a question ``\ti{How many individual libraries \tf{make up} the main school library?}'' from a passage ``\ldots\quad\quad\ti{Harvard Library, which is the world's largest academic and private library system, \tf{comprising} 79 individual libraries with over 18 million volumes.}'', a system has to learn the correspondence between \ti{comprising} and \ti{make up} based on indicator features such as:
        $$\text{pw}_i = \text{comprising} \wedge \text{qw}_{j} = \text{make} \wedge \text{qw}_{j + 1} = \text{up}.$$
        There is insufficient data to correctly weight most such features. It is a common problem in all non-neural NLP models. Making use of low-dimensional, dense word embeddings can effectively alleviate sparsity by sharing statistical strength between similar words.
    \item
        They are relieved from the labor of constructing a large set of manual features. Therefore, neural models are conceptually simpler and the focus can move to the design of neural architectures instead. Thanks to the development of modern deep learning frameworks such as \sys{Tensorflow} and \sys{PyTorch}, great progress has been made, and now it is fast and easy to develop new models.
\end{itemize}

% \red{TODO: add ``power of end-to-end optimization for a final goal''}
% \red{TODO: add ``more effective utilization of context in interpretation for WSD etc. This is what LSTMs give you!''}

There is no doubt that achieving human-performance on \sys{SQuAD} is incredible and arguably one of the biggest results we have seen in the NLP community in the past few years. Nevertheless, solving the \sys{SQuAD} task isn't equivalent to solving machine reading comprehension. We need to acknowledge that SQuAD is restricted in that questions must be answered using a single span in the passage and most SQuAD examples are fairly simple and don't really need complex reasoning.

The field has been further evolving. Following the theme of creating large-scale and more challenging reading comprehension datasets, a multitude of datasets have been collected recently: \sys{TriviaQA} \cite{joshi2017triviaqa}, \sys{RACE} \cite{lai2017race}, \sys{QAngaroo} \cite{welbl2018constructing}, \sys{NarrativeQA} \cite{kovcisky2018narrativeqa}, \sys{MultiRC} \cite{khashabi2018looking}, SQuAD 2.0~\cite{rajpurkar2018know}, \sys{HotpotQA}~\cite{yang2018hotpotqa} and many others. These datasets were collected from a variety of sources (Wikipedia, newswire articles, fictional stories or other Web resources) and constructed in very different ways and they aim to tackle many challenges that haven't been addressed before --- questions which are curated independent of the passages, questions which require multiple sentences or even multiple documents to answer, questions based on long documents like a full book, or questions which are not answerable from the passage. At the time of this writing, most of these datasets have not been solved yet and there remains a large gap between state-of-the-art methods and human performance levels. Reading comprehension has become one of the most active fields in NLP today and there are still many open questions to solve. We will discuss the recent development of reading comprehension datasets in more detail in Section~\ref{sec:future-datasets}.


================================================
FILE: chapters/rc_overview/intro.tex
================================================
%!TEX root = ../../thesis.tex

\epigraph{When a person understands a story, he can demonstrate his understanding by answering questions about the story. Since questions can be devised to query any aspect of text comprehension, the ability to answer questions is the strongest possible demonstration of understanding. If a computer is said to understand a story, we must demand of the computer the same demonstration of understanding that we require of people. Until such demands are met, we have no way of evaluating text understanding programs.}{Wendy Lehnert, 1977}


In this chapter, we aim to provide readers with an overview of reading comprehension. We begin with the history of reading comprehension (Section~\ref{sec:rc-history}), from the early systems developed in the 1970s, to the attempts to build machine learning models for this task, to the more recent resurgence of neural (deep learning) approaches. This field has been completely reshaped by neural reading comprehension, and the progress is very exciting.

We then formally define the reading comprehension task as a supervised learning problem in Section~\ref{sec:task-definition} and describe four different categories based on the answer type. We end by discussing their evaluation metrics.

Next we discuss briefly how reading comprehension differs from question answering, especially in their final goals (Section~\ref{sec:rc-qa-diff}). Finally, we discuss that how the interplay of large-scale datasets and neural models contributes to the development of modern reading comprehension in Section~\ref{sec:rc-drive}.

% This chapter is going to cover the following topics:
% \begin{itemize}
% \item
%   Task definition: formalize 4 different types of RC tasks and their evaluation metrics.
% \item
%   Disucss how RC and QA are different
% \item
%   Recap the history of reading comprehension: early systems, machine learning models and the deep learning era.
% \item
%   Finally,
% \end{itemize}


================================================
FILE: chapters/rc_overview/task.tex
================================================
%!TEX root = ../../thesis.tex

\section{Task Definition}
\label{sec:task-definition}

\subsection{Problem Formulation}

The task of reading comprehension can be formulated as a supervised learning problem: given a collection of training examples $\{({p}_i, {q}_i, {a}_i)\}_{i=1}^{n}$, the goal is to learn a predictor $f$ which takes a passage of text ${p}$ and a corresponding question ${q}$ as inputs and gives the answer ${a}$ as output.
\begin{equation}
  f: ({p}, {q}) \longrightarrow {a}
\end{equation}

Let ${p} = (p_1, p_2, \ldots, p_{l_p})$ and ${q} = (q_1, q_2, \ldots, q_{l_q})$\footnote{A preprocessing step of tokenization is usually required on most current reading comprehension datasets.} where $l_p$ and $l_q$ denote the length of the passage and the question, $p_i \in \mathcal{V}$ for $i = 1, \ldots, l_p$ and $q_i \in \mathcal{V}$ for $i = 1, \ldots, l_q$ where $\mathcal{V}$ is a pre-defined vocabulary. Here we only consider the passage ${p}$ as a short paragraph represented as a sequence of $l_p$ words. It is straightforward to extend it to a multi-paragraph setting \cite{clark2018simple} where ${p}$ is a set of paragraphs or decompose it into smaller linguistic units such as sentences.\footnote{There have been some efforts (e.g., \cite{xie2017constituent}) which model the paragraph as a sequence of sentences, but there is no clear evidence that it outperforms methods that treat the whole paragraph as a long sequence at this point.}

Depending on the answer type, the answer ${a}$ can take very different forms. Generally, we can divide existing reading comprehension tasks into four categories:

\begin{description}
\item[Cloze style.] The question contains a placeholder. For instance,
\begin{displayquote}
Tottenham manager Juande Ramos has hinted he will allow \underline{\hspace{1cm}} to leave if the Bulgaria striker makes it clear he is unhappy.
\end{displayquote}
In these tasks, the systems must guess which word or entity completes the sentence (question), based on the passage, and the answer ${a}$ is either chosen from a pre-defined set of choices $\mathcal{A}$ or the full vocabulary $\mathcal{V}$. For example, in the \sys{Who-did-What} dataset \cite{onishi2016did}, ${a}$ must be one of the person named entities in the passage and $|\mathcal{A}| = 3.5$ on average.

\item[Multiple choice.] In this category, the correct answer is chosen from $k$ hypothesized answers (e.g., $k = 4$):
$$\mathcal{A} = \{{a}_1, \ldots, {a}_k\}  \text{ where } {a}_{k} = (a_{k, 1}, a_{k, 2} \ldots, a_{k, l_{a, k}}), a_{k, i} \in \mathcal{V},$$
can be a word, a phrase or a sentence. One of the hypothesized answers is correct and thus ${a}$ must be chosen from $\{{a}_1, \ldots, {a}_k\}$.


\item[Span prediction.] This category is also referred to as \ti{extractive question answering} and the answer ${a}$ must be a single span in the passage. Therefore, ${a}$ can be represented as $(a_{start}, a_{end})$ where $1 \leq a_{start} \leq a_{end} \leq l_p$, and the answer string corresponds to $p_{a_{start}}, \ldots, p_{a_{end}}.$

\item[Free-form answer.] The last category allows the answer to be any free-text form (i.e., a word sequence of arbitrary length), formally, ${a} \in \mathcal{V}^*$.
\end{description}

Table~\ref{tab:rc-examples} gives an example in each of the categories from four representative datasets: \sys{CNN/Daily Mail}~\cite{hermann2015teaching} (cloze style), \sys{MCTest}~\cite{richardson2013mctest} (multiple choice), \sys{SQuAD}~\cite{rajpurkar2016squad} (span prediction) and \sys{NarrativeQA}~\cite{kovcisky2018narrativeqa} (free-form answer).


\subsection{Evaluation}
\label{sec:evaluation}
We have formally defined the four different categories of reading comprehension tasks, next we discuss their evaluation metrics.

For \tf{multiple choice} or \tf{cloze style} tasks, it is quite straightforward to measure the accuracy: the percentage of questions for which systems give the exactly correct answer, as the answer is chosen from a small set of hypothesized answers.

For \tf{span prediction} tasks, we need to compare the predicted answer string to the ground truth. Typically, we use the two evaluation metrics proposed in \newcite{rajpurkar2016squad}, which measure both exact match and partial scores:

\begin{itemize}
    \item
        \tf{Exact match (EM)} assigns a full credit $1.0$ if the predicted answer is equal to the gold answer and $0.0$ otherwise.
    \item
        \tf{F1 score} computes the average word overlap between predicted and gold answers. The prediction and the gold answer are treated as a bag of tokens and a token-level F1 score is calculated as: $$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. $$
\end{itemize}


Following \newcite{rajpurkar2016squad}, all punctuation is ignored in the evaluation and for English, articles  \ti{a}, \ti{an}, and \ti{the} are also ignored.

To make the evaluation more reliable, it is also common to collect multiple gold answers to each question. Therefore, the exact match score is only required to match any of the gold answers while the F1 score is to compute the maximum over all of the gold answers and then averaged over all of the questions.

Lastly, for the \tf{free-form answer} reading comprehension tasks, there isn't a consensus yet on what is the most ideal evaluation metric. A common way is to use standard evaluation metrics in natural language generation (NLG) tasks such as machine translation or summarization, including BLEU \cite{papineni2002bleu}, Meteor \cite{banerjee2005meteor} and ROUGE \cite{lin2004rouge}.


================================================
FILE: conclude.tex
================================================
%!TEX root = thesis.tex


In this dissertation, we gave readers a thorough overview of neural reading comprehension: the foundations (\sys{Part I}) and the applications (\sys{Part II}), as well as how we contributed to the development of this field since it emerged in late 2015.

In Chapter~\ref{chapter:rc-overview}, we walked through the history of reading comprehension, which dates back to the 1970s. At the time, researchers already recognized its importance as a proper way of testing the language understanding abilities of computer programs. However, it was not until the 2010s that, reading comprehension started to be formulated as a supervised learning problem by collecting human-labeled training examples in the form of (passage, question, answer) triples. Since 2015, the field has been completed reshaped, by the creation of large-scale supervised datasets, and the development of neural reading comprehension models.  Although it has been only 3 years so far, the field has been moving strikingly fast. Innovations in building better datasets and more effective models have occurred alternately, and both contributed to the development of the field. We also formally defined the task of reading comprehension, and described the four most common types of problems: \ti{cloze style}, \ti{multiple choice}, \ti{span prediction} and \ti{free-form answers} and their evaluation metrics.


In Chapter~\ref{chapter:rc-models}, we covered all the elements of modern neural reading comprehension models. We introduced the \sys{Stanford Attentive Reader}, which we first proposed for the \sys{CNN/Daily Mail} cloze style task, and is one of the earliest neural reading comprehension models in this field. Our model has been studied extensively on other cloze style and multiple choice tasks. We later adapted it to the \sys{SQuAD} dataset and achieved what was then state-of-the-art performance.  Compared to conventional feature-based models, this model doesn't rely on any downstream linguistic features and all the parameters are jointly optimized together. Through empirical experiments and a careful hand-analysis, we concluded that neural models are more powerful at recognizing lexical matches and paraphrases. We also discussed recent advances in developing neural reading comprehension models, including better \ti{word representations}, \ti{attention mechanisms}, \ti{alternatives to LSTMs}, and other advances such as training objectives and data augmentation.

In Chapter~\ref{chapter:rc-future}, we discussed future work and open questions in this field. We examined error cases on \sys{SQuAD} (for both our model and the state-of-the-art model which surpasses the human performance). We concluded that these models have been doing very sophisticated matching of text but they still have difficulty understanding the inherent structure between entities and the events expressed in the text. We later discussed future work in both models and datasets. For models, we argued that besides \ti{accuracy}, there are other important aspects which have been overlooked that we will need to work on in the future, including \ti{speed and scalability}, \ti{robustness}, and \ti{interpretability}. We also believe that future models will need more structures and modules to solve more difficult reading comprehension problems. For datasets, we discussed more recent datasets developed after \sys{SQuAD} --- these datasets either require more complex reasoning across sentences or documents, or need to handle longer documents, or need to generate free-form answers instead of extracting a single span, or predict when there is no answer in the passage. Lastly, we examined several questions we think are important to the future of neural reading comprehension.

In \sys{Part II}, the key questions we wanted to answer are: Is reading comprehension only a task of measuring language understanding? If we can build high-performing reading comprehension systems which can answer comprehension questions over a short passage of text, can it enable useful applications?

In Chapter~\ref{chapter:openqa}, we showed that we can combine information retrieval techniques and neural reading comprehension
models to build an open-domain question-answering system: answering general questions over a large encyclopedia or the Web. In particular, we implemented this idea in the \sys{DrQA} project, a large-scale, factoid question answering system over English Wikipedia. We demonstrated the feasibility of doing this by evaluating the system on multiple question answering benchmarks. We also proposed a procedure to automatically create additional distantly-supervised training examples from other question answering resources and demonstrated the effectiveness of this approach. We hope that our work takes the first step in this research direction and this new paradigm of combining information retrieval and neural reading comprehension will eventually lead to a new generation of open-domain question answering systems.

In Chapter~\ref{chapter:coqa}, we addressed the conversational question answering problem, where a computer system needs to understand a text passage and answer a series of questions that appear in a conversation. To approach this, we built \sys{CoQA}: a Conversational Question Answering challenge for measuring the ability of machines to participate in a question-answering style conversation. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. We also built several competitive baselines for this new task, based on conversational and reading comprehension models. We believe that building such systems will play a crucial role in our future conversational AI systems.

All together, we are really excited about the progress that has been made in this field for the past 3 years and have been glad to be able to contribute to this field. At the same time, we also deeply believe there is still a long way to go towards genuine human-level reading comprehension, and we are still facing enormous challenges and a lot of open questions that we will need to address in the future. One key challenge is that we still don't have good ways to approach deeper levels of reading comprehension --- those questions which require understanding the reasoning and implications of the text. Often this occurs with \ti{how} or \ti{why} questions, such as \ti{In the story, why is Cynthia upset with her mother?}, \ti{How does John attempt to make up for his original mistake?} In the future, we will have to address the underlying science of what is being discussed, rather than just answering from text matching, to achieve this level of reading comprehension.

We also hope to encourage more researchers to work on the applications, or apply neural reading comprehension to new domains or tasks. We believe that it will lead us towards building better question answering and conversational agents and hope to see these ideas implemented and developed in industry applications.


================================================
FILE: fitch.sty
================================================
% Macros for Fitch-style natural deduction. 
% Author: Peter Selinger, University of Ottawa
% Created: Jan 14, 2002
% Modified: Feb 8, 2005
% Version: 0.5
% Copyright: (C) 2002-2005 Peter Selinger
% Filename: fitch.sty
% Documentation: fitchdoc.tex
% URL: http://quasar.mathstat.uottawa.ca/~selinger/fitch/

% License:
%
% This program is free software; you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation; either version 2, or (at your option)
% any later version.
%
% This program is distributed in the hope that it will be useful, but
% WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program; if not, write to the Free Software Foundation, 
% Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

% USAGE EXAMPLE:
% 
% The following is a simple example illustrating the usage of this
% package.  For detailed instructions and additional functionality, see
% the user guide, which can be found in the file fitchdoc.tex.
% 
% \[
% \begin{nd}
%   \hypo{1}  {P\vee Q}   
%   \hypo{2}  {\neg Q}                         
%   \open                              
%   \hypo{3a} {P}
%   \have{3b} {P}        \r{3a}
%   \close                   
%   \open
%   \hypo{4a} {Q}
%   \have{4b} {\neg Q}   \r{2}
%   \have{4c} {\bot}     \ne{4a,4b}
%   \have{4d} {P}        \be{4c}
%   \close                             
%   \have{5}  {P}        \oe{1,3a-3b,4a-4d}                 
% \end{nd}
% \]

{\chardef\x=\catcode`\*
\catcode`\*=11
\global\let\nd*astcode\x}
\catcode`\*=11

% References

\newcount\nd*ctr
\def\nd*render{\expandafter\ifx\expandafter\nd*x\nd*base\nd*x\the\nd*ctr\else\nd*base\ifnum\nd*ctr<0\the\nd*ctr\else\ifnum\nd*ctr>0+\the\nd*ctr\fi\fi\fi}
\expandafter\def\csname nd*-\endcsname{}

\def\nd*num#1{\nd*numo{\nd*render}{#1}\global\advance\nd*ctr1}
\def\nd*numopt#1#2{\nd*numo{$#1$}{#2}}
\def\nd*numo#1#2{\edef\x{#1}\mbox{$\x$}\expandafter\global\expandafter\let\csname nd*-#2\endcsname\x}
\def\nd*ref#1{\expandafter\let\expandafter\x\csname nd*-#1\endcsname\ifx\x\relax%
  \errmessage{Undefined natdeduction reference: #1}\else\mbox{$\x$}\fi}
\def\nd*noop{}
\def\nd*set#1#2{\ifx\relax#1\nd*noop\else\global\def\nd*base{#1}\fi\ifx\relax#2\relax\else\global\nd*ctr=#2\fi}
\def\nd*reset{\nd*set{}{1}}
\def\nd*refa#1{\nd*ref{#1}}
\def\nd*aux#1#2{\ifx#2-\nd*refa{#1}--\def\nd*c{\nd*aux{}}%
  \else\ifx#2,\nd*refa{#1}, \def\nd*c{\nd*aux{}}%
  \else\ifx#2;\nd*refa{#1}; \def\nd*c{\nd*aux{}}%
  \else\ifx#2.\nd*refa{#1}. \def\nd*c{\nd*aux{}}%
  \else\ifx#2)\nd*refa{#1})\def\nd*c{\nd*aux{}}%
  \else\ifx#2(\nd*refa{#1}(\def\nd*c{\nd*aux{}}%
  \else\ifx#2\nd*end\nd*refa{#1}\def\nd*c{}%
  \else\def\nd*c{\nd*aux{#1#2}}%
  \fi\fi\fi\fi\fi\fi\fi\nd*c}
\def\ndref#1{\nd*aux{}#1\nd*end}

% Layer A

% define various dimensions (explained in fitchdoc.tex):
\newlength{\nd*dim} 
\newdimen\nd*depthdim
\newdimen\nd*hsep
\newdimen\ndindent
\ndindent=1em
% user command to redefine dimensions
\def\nddim#1#2#3#4#5#6#7#8{\nd*depthdim=#3\relax\nd*hsep=#6\relax%
\def\nd*height{#1}\def\nd*thickness{#8}\def\nd*initheight{#2}%
\def\nd*indent{#5}\def\nd*labelsep{#4}\def\nd*justsep{#7}}
% set initial dimensions
\nddim{4.5ex}{3.5ex}{1.5ex}{1em}{1.6em}{.5em}{2.5em}{.2mm}

\def\nd*v{\rule[-\nd*depthdim]{\nd*thickness}{\nd*height}}
\def\nd*t{\rule[-\nd*depthdim]{0mm}{\nd*height}\rule[-\nd*depthdim]{\nd*thickness}{\nd*initheight}}
\def\nd*i{\hspace{\nd*indent}} 
\def\nd*s{\hspace{\nd*hsep}}
\def\nd*g#1{\nd*f{\makebox[\nd*indent][c]{$#1$}}}
\def\nd*f#1{\raisebox{0pt}[0pt][0pt]{$#1$}}
\def\nd*u#1{\makebox[0pt][l]{\settowidth{\nd*dim}{\nd*f{#1}}%
    \addtolength{\nd*dim}{2\nd*hsep}\hspace{-\nd*hsep}\rule[-\nd*depthdim]{\nd*dim}{\nd*thickness}}\nd*f{#1}}

% Lists

\def\nd*push#1#2{\expandafter\gdef\expandafter#1\expandafter%
  {\expandafter\nd*cons\expandafter{#1}{#2}}}
\def\nd*pop#1{{\def\nd*nil{\gdef#1{\nd*nil}}\def\nd*cons##1##2%
    {\gdef#1{##1}}#1}}
\def\nd*iter#1#2{{\def\nd*nil{}\def\nd*cons##1##2{##1#2{##2}}#1}}
\def\nd*modify#1#2#3{{\def\nd*nil{\gdef#1{\nd*nil}}\def\nd*cons##1##2%
    {\advance#2-1 ##1\advance#2 1 \ifnum#2=1\nd*push#1{#3}\else%
      \nd*push#1{##2}\fi}#1}}

\def\nd*cont#1{{\def\nd*t{\nd*v}\def\nd*v{\nd*v}\def\nd*g##1{\nd*i}%
    \def\nd*i{\nd*i}\def\nd*nil{\gdef#1{\nd*nil}}\def\nd*cons##1##2%
    {##1\expandafter\nd*push\expandafter#1\expandafter{##2}}#1}}

% Layer B

\newcount\nd*n
\def\nd*beginb{\begingroup\nd*reset\gdef\nd*stack{\nd*nil}\nd*push\nd*stack{\nd*t}%
  \begin{array}{l@{\hspace{\nd*labelsep}}l@{\hspace{\nd*justsep}}l}}
\def\nd*resumeb{\begingroup\begin{array}{l@{\hspace{\nd*labelsep}}l@{\hspace{\nd*justsep}}l}}
\def\nd*endb{\end{array}\endgroup}
\def\nd*hypob#1#2{\nd*f{\nd*num{#1}}&\nd*iter\nd*stack\relax\nd*cont\nd*stack\nd*s\nd*u{#2}&}
\def\nd*haveb#1#2{\nd*f{\nd*num{#1}}&\nd*iter\nd*stack\relax\nd*cont\nd*stack\nd*s\nd*f{#2}&}
\def\nd*havecontb#1#2{&\nd*iter\nd*stack\relax\nd*cont\nd*stack\nd*s\nd*f{\hspace{\ndindent}#2}&}
\def\nd*hypocontb#1#2{&\nd*iter\nd*stack\relax\nd*cont\nd*stack\nd*s\nd*u{\hspace{\ndindent}#2}&}

\def\nd*openb{\nd*push\nd*stack{\nd*i}\nd*push\nd*stack{\nd*t}}
\def\nd*closeb{\nd*pop\nd*stack\nd*pop\nd*stack}
\def\nd*guardb#1#2{\nd*n=#1\multiply\nd*n by 2 \nd*modify\nd*stack\nd*n{\nd*g{#2}}}

% Layer C

\def\nd*clr{\gdef\nd*cmd{}\gdef\nd*typ{\relax}}
\def\nd*sto#1#2#3{\gdef\nd*typ{#1}\gdef\nd*byt{}%
  \gdef\nd*cmd{\nd*typ{#2}{#3}\nd*byt\\}}
\def\nd*chtyp{\expandafter\ifx\nd*typ\nd*hypocontb\def\nd*typ{\nd*havecontb}\else\def\nd*typ{\nd*haveb}\fi}
\def\nd*hypoc#1#2{\nd*chtyp\nd*cmd\nd*sto{\nd*hypob}{#1}{#2}}
\def\nd*havec#1#2{\nd*cmd\nd*sto{\nd*haveb}{#1}{#2}}
\def\nd*hypocontc#1{\nd*chtyp\nd*cmd\nd*sto{\nd*hypocontb}{}{#1}}
\def\nd*havecontc#1{\nd*cmd\nd*sto{\nd*havecontb}{}{#1}}
\def\nd*by#1#2{\ifx\nd*x#2\nd*x\gdef\nd*byt{\mbox{#1}}\else\gdef\nd*byt{\mbox{#1, \ndref{#2}}}\fi}

% multi-line macros
\def\nd*mhypoc#1#2{\nd*mhypocA{#1}#2\\\nd*stop\\}
\def\nd*mhypocA#1#2\\{\nd*hypoc{#1}{#2}\nd*mhypocB}
\def\nd*mhypocB#1\\{\ifx\nd*stop#1\else\nd*hypocontc{#1}\expandafter\nd*mhypocB\fi}
\def\nd*mhavec#1#2{\nd*mhavecA{#1}#2\\\nd*stop\\}
\def\nd*mhavecA#1#2\\{\nd*havec{#1}{#2}\nd*mhavecB}
\def\nd*mhavecB#1\\{\ifx\nd*stop#1\else\nd*havecontc{#1}\expandafter\nd*mhavecB\fi}
\def\nd*mhypocontc#1{\nd*mhypocB#1\\\nd*stop\\}
\def\nd*mhavecontc#1{\nd*mhavecB#1\\\nd*stop\\}

\def\nd*beginc{\nd*beginb\nd*clr}
\def\nd*resumec{\nd*resumeb\nd*clr}
\def\nd*endc{\nd*cmd\nd*endb}
\def\nd*openc{\nd*cmd\nd*clr\nd*openb}
\def\nd*closec{\nd*cmd\nd*clr\nd*closeb}
\let\nd*guardc\nd*guardb

% Layer D

% macros with optional arguments spelled-out
\def\nd*hypod[#1][#2]#3[#4]#5{\ifx\relax#4\relax\else\nd*guardb{1}{#4}\fi\nd*mhypoc{#3}{#5}\nd*set{#1}{#2}}
\def\nd*haved[#1][#2]#3[#4]#5{\ifx\relax#4\relax\else\nd*guardb{1}{#4}\fi\nd*mhavec{#3}{#5}\nd*set{#1}{#2}}
\def\nd*havecont#1{\nd*mhavecontc{#1}}
\def\nd*hypocont#1{\nd*mhypocontc{#1}}
\def\nd*base{undefined}
\def\nd*opend[#1]#2{\nd*cmd\nd*clr\nd*openb\nd*guard{#1}#2}
\def\nd*close{\nd*cmd\nd*clr\nd*closeb}
\def\nd*guardd[#1]#2{\nd*guardb{#1}{#2}}

% Handling of optional arguments.

\def\nd*optarg#1#2#3{\ifx[#3\def\nd*c{#2#3}\else\def\nd*c{#2[#1]{#3}}\fi\nd*c}
\def\nd*optargg#1#2#3{\ifx[#3\def\nd*c{#1#3}\else\def\nd*c{#2{#3}}\fi\nd*c}

\def\nd*five#1{\nd*optargg{\nd*four{#1}}{\nd*two{#1}}}
\def\nd*four#1[#2]{\nd*optarg{0}{\nd*three{#1}[#2]}}
\def\nd*three#1[#2][#3]#4{\nd*optarg{}{#1[#2][#3]{#4}}}
\def\nd*two#1{\nd*three{#1}[\relax][]}

\def\nd*have{\nd*five{\nd*haved}}
\def\nd*hypo{\nd*five{\nd*hypod}}
\def\nd*open{\nd*optarg{}{\nd*opend}}
\def\nd*guard{\nd*optarg{1}{\nd*guardd}}

\def\nd*init{%
  \let\open\nd*open%
  \let\close\nd*close%
  \let\hypo\nd*hypo%
  \let\have\nd*have%
  \let\hypocont\nd*hypocont%
  \let\havecont\nd*havecont%
  \let\by\nd*by%
  \let\guard\nd*guard%
  \def\ii{\by{$\Rightarrow$I}}%
  \def\ie{\by{$\Rightarrow$E}}%
  \def\Ai{\by{$\forall$I}}%
  \def\Ae{\by{$\forall$E}}%
  \def\Ei{\by{$\exists$I}}%
  \def\Ee{\by{$\exists$E}}%
  \def\ai{\by{$\wedge$I}}%
  \def\ae{\by{$\wedge$E}}%
  \def\ai{\by{$\wedge$I}}%
  \def\ae{\by{$\wedge$E}}%
  \def\oi{\by{$\vee$I}}%
  \def\oe{\by{$\vee$E}}%
  \def\ni{\by{$\neg$I}}%
  \def\ne{\by{$\neg$E}}%
  \def\be{\by{$\bot$E}}%
  \def\nne{\by{$\neg\neg$E}}%
  \def\r{\by{R}}%
}

\newenvironment{nd}{\begingroup\nd*init\nd*beginc}{\nd*endc\endgroup}
\newenvironment{ndresume}{\begingroup\nd*init\nd*resumec}{\nd*endc\endgroup}

\catcode`\*=\nd*astcode

% End of file fitch.sty


================================================
FILE: img/scripts/gen_cnn_analysis.py
================================================
from pylab import figure, ylabel, xticks, bar, \
                  legend, savefig, text

from pylab import rcParams
rcParams['figure.figsize'] = 10, 5

groups = ["Exact match", "Paraphrasing", "Partial clue", "Multi. sent.", "Coref. errors", "Hard", "All"]

data = [[100, 95.1, 89.5, 50.0, 37.5, 5.9, 74.0],
        [100, 78.1, 73.7, 50.0, 50.0, 11.8, 66.0]]

figure()
ylabel('Accuracy')

x1 = [2.0 + 10.0 * k for k in range(7)]

xticks([x + 0.75 for x in x1], groups)

width = 2.5
bar(x1, data[0], width=width, color="#56B4E9", label="Stanford Attentive Reader", edgecolor='k')
bar([x + width for x in x1], data[1], width=width, color="#E69F00", label="Feature-based Classifier", edgecolor='k')
# bar([x + 1.0 for x in x1], data[2], width=0.45, color="#94c6da", label="WebQuestions", edgecolor='k')
# bar([x + 1.5 for x in x1], data[3], width=0.45, color="#1770ab", label="WikiMovies", edgecolor='k')

for j in range(len(data[0])):
    text(x1[j] - 1.5, data[0][j] + 0.75, str(data[0][j]))

for j in range(len(data[1])):
    text(x1[j] + width - 1.0, data[1][j] + 0.75, str(data[1][j]))

legend()
savefig('barplot.png')


================================================
FILE: img/scripts/gen_qa_stat.py
================================================
from pylab import figure, ylabel, xticks, bar, \
                  legend, savefig, text

groups = ["Question length", "Answer length"]

data = [
    [10.4, 3.2],
    [7.2, 1.8],
    [6.7, 2.4],
    [7.5, 2.1]
    ]

figure()
ylabel('#tokens')

x1 = [2.0, 5.0]

xticks([x + 0.75 for x in x1], groups)

bar(x1, data[0], width=0.45, color="#c30d24", label="SQuAD", edgecolor='k')
bar([x + 0.5 for x in x1], data[1], width=0.45, color="#cccccc", label="TREC", edgecolor='k')
bar([x + 1.0 for x in x1], data[2], width=0.45, color="#94c6da", label="WebQuestions", edgecolor='k')
bar([x + 1.5 for x in x1], data[3], width=0.45, color="#1770ab", label="WikiMovies", edgecolor='k')

for i in range(len(data)):
    for j in range(len(data[i])):
        text(x1[j] + i * 0.5 - 0.1, data[i][j] + 0.1, str(data[i][j]))

legend()
savefig('barplot.png')


================================================
FILE: img/scripts/gen_squad_progress.py
================================================
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
matplotlib.rcParams['legend.handlelength'] = 0
matplotlib.rcParams['legend.numpoints'] = 1


def isfloat(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


mapping = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
           'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}


def get_month(s):
    assert s in mapping
    return mapping[s]


def earlier(date1, date2):
    if date1[0] != date2[0]:
        return date1[0] < date2[0]
    if date1[1] != date2[1]:
        return date1[1] < date2[1]
    return False


records = []
with open('squad_leaderboard.txt') as f:
    for line in f.readlines():
        sp = line.strip().split('\t')
        if len(sp) == 2 and ('2016' in sp[0] or '2017' in sp[0] or '2018' in sp[0]):
            date = sp[0]
            system = sp[1]
        if len(sp) >= 2 and isfloat(sp[-2]) and isfloat(sp[-1]):
            em = float(sp[-2])
            f1 = float(sp[-1])

            if 'ensemble' not in system:
                print('-' * 100)
                year = int(date.split(' ')[-1])
                month = get_month(date.split(' ')[0])
                date = int(date.split(' ')[1][:-1])

                print(year, month)
                print(system)
                print(em, f1)

                if len(records) == 0 or earlier((year, month), (records[-1][0], records[-1][1])):
                    records.append((year, month, date, f1))
print(records)
records.append((2016, 6, 16, 51.0))

x = []
y = []
for rec in records:
    x.append(dt.date(rec[0], rec[1], rec[2]))
    y.append(rec[3])

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.plot([x[-1], x[-2]], [y[-1], y[-2]], '-b', marker='v', label='non-neural', markevery=5)
plt.plot(x[:-1], y[:-1], '-b', marker='x', label='neural')
plt.gca().set_ylim([45.0, 100.0])
plt.gcf().autofmt_xdate()
plt.ylabel('F1')
plt.text(x[-1], 92.0, 'human: 91.2')
plt.legend(loc=4)
plt.axhline(y=91.221, color='r', linestyle='--')
plt.savefig('squad_progress.png')


================================================
FILE: img/scripts/gen_timeline.py
================================================
"""
    Fourth example on:
    https://kristw.github.io/d3kit-timeline/

    Labella.py doesn't implement the full colorscale from d3, just category 10
    and category 20. Below we show an example of implementing specific colors
    based on the input data.
"""

from datetime import date

from labella.timeline import TimelineTex, TimelineSVG


def color(d):
    idx = d['episode']
    if idx == 1:
        return '#000000'
    else:
        return '#1F77B4'


def main():
    items = [
            {'time': date(2016, 11, 1), 'episode': 1,
                'text': 'SQuAD 1.1'},
            {'time': date(2015, 12, 7), 'episode': 1,
                'text': 'CNN/Daily Mail'},
            {'time': date(2016, 5, 2), 'episode': 1,
                'text': 'Children Book Test'},
            {'time': date(2017, 7, 30), 'episode': 1,
                'text': 'TriviaQA'},
            {'time': date(2017, 9, 1), 'episode': 1,
                'text': 'RACE'},
            {'time': date(2018, 5, 1), 'episode': 1,
                'text': 'NarrativeQA'},
            {'time': date(2018, 7, 15), 'episode': 1,
                'text': 'SQuAD 2.0'},
            {'time': date(2018, 11, 2), 'episode': 1,
                'text': 'HotpotQA'},
            {'time': date(2015, 12, 7), 'episode': 2,
                'text': 'Attentive Reader'},
            {'time': date(2016, 8, 7), 'episode': 2,
                'text': 'Stanford Attentive Reader'},
            {'time': date(2017, 4, 24), 'episode': 2,
                'text': 'Match-LSTM'},
            {'time': date(2017, 4, 24), 'episode': 2,
                'text': 'BiDAF'},
            {'time': date(2017, 7, 30), 'episode': 2,
                'text': 'R-Net'},
            {'time': date(2018, 4, 30), 'episode': 2,
                'text': 'QANet'},
            {'time': date(2018, 6, 1), 'episode': 2,
                'text': 'BiDAF+self-att.+ELMo'},
            {'time': date(2018, 10, 11), 'episode': 2,
                'text': 'BERT'},
            ]

    options = {
        'initialWidth': 400,
        'initialHeight': 400,
        'direction': 'right',
        'dotColor': color,
        'labelBgColor': color,
        'linkColor': color,
        'textFn': lambda d: d['text'],
        'margin': {'left': 0, 'right': 0, 'top': 0, 'bottom': 0},
        'layerGap': 20,
        'labella': {
            'maxPos': 1200,
            'algorithm': 'simple'
            }
        }

    tl = TimelineSVG(items, options=options)
    tl.export('timeline.svg')

    tl = TimelineTex(items, options=options)
    tl.export('timeline.tex')


if __name__ == '__main__':
    main()


================================================
FILE: img/scripts/squad_leaderboard.txt
================================================
1
Oct 05, 2018	BERT (ensemble)
Google AI Language
https://arxiv.org/abs/1810.04805	87.433	93.160
2
Oct 05, 2018	BERT (single model)
Google AI Language
https://arxiv.org/abs/1810.04805	85.083	91.835
2
Sep 09, 2018	nlnet (ensemble)
Microsoft Research Asia
85.356	91.202
2
Sep 26, 2018	nlnet (ensemble)
Microsoft Research Asia
85.954	91.677
3
Jul 11, 2018	QANet (ensemble)
Google Brain & CMU
84.454	90.490
4
Jul 08, 2018	r-net (ensemble)
Microsoft Research Asia
84.003	90.147
5
Mar 19, 2018	QANet (ensemble)
Google Brain & CMU
83.877	89.737
5
Sep 09, 2018	nlnet (single model)
Microsoft Research Asia
83.468	90.133
5
Jun 20, 2018	MARS (ensemble)
YUANFUDAO research NLP
83.982	89.796
6
Sep 01, 2018	MARS (single model)
YUANFUDAO research NLP
83.185	89.547
7
Jan 03, 2018	r-net+ (ensemble)
Microsoft Research Asia
82.650	88.493
7
May 09, 2018	MARS (single model)
YUANFUDAO research NLP
82.587	88.880
7
Feb 19, 2018	Reinforced Mnemonic Reader + A2D (ensemble model)
Microsoft Research Asia & NUDT
82.849	88.764
7
Jan 22, 2018	Hybrid AoA Reader (ensemble)
Joint Laboratory of HIT and iFLYTEK Research
82.482	89.281
7
Jun 20, 2018	QANet (single)
Google Brain & CMU
82.471	89.306
7
Mar 06, 2018	QANet (ensemble)
Google Brain & CMU
82.744	89.045
7
Jun 21, 2018	MARS (single model)
YUANFUDAO research NLP
83.122	89.224
8
Jan 05, 2018	SLQA+ (ensemble)
Alibaba iDST NLP
82.440	88.607
9
Feb 02, 2018	Reinforced Mnemonic Reader (ensemble model)
NUDT and Fudan University
https://arxiv.org/abs/1705.02798	82.283	88.533
9
Feb 27, 2018	QANet (single model)
Google Brain & CMU
82.209	88.608
10
Dec 22, 2017	AttentionReader+ (ensemble)
Tencent DPDAC NLP
81.790	88.163
11
May 09, 2018	Reinforced Mnemonic Reader + A2D (single model)
Microsoft Research Asia & NUDT
81.538	88.130
11
Dec 17, 2017	r-net (ensemble)
Microsoft Research Asia
http://aka.ms/rnet	82.136	88.126
12
May 09, 2018	Reinforced Mnemonic Reader + A2D + DA (single model)
Microsoft Research Asia & NUDT
81.401	88.122
12
Apr 23, 2018	r-net (single model)
Microsoft Research Asia
81.391	88.170
12
Apr 03, 2018	KACTEIL-MRC(GF-Net+) (ensemble)
Kangwon National University, Natural Language Processing Lab.
81.496	87.557
12
Feb 27, 2018	QANet (single model)
Google Brain & CMU
80.929	87.773
12
Nov 17, 2017	BiDAF + Self Attention + ELMo (ensemble)
Allen Institute for Artificial Intelligence
81.003	87.432
12
Feb 19, 2018	Reinforced Mnemonic Reader + A2D (single model)
Microsoft Research Asia & NUDT
80.919	87.492
13
Feb 12, 2018	Reinforced Mnemonic Reader + A2D (single model)
Microsoft Research Asia & NUDT
80.489	87.454
13
Apr 12, 2018	AVIQA+ (ensemble)
aviqa team
80.615	87.311
14
Mar 20, 2018	DNET (ensemble)
QA geeks
80.164	86.721
14
Jan 22, 2018	Hybrid AoA Reader (single model)
Joint Laboratory of HIT and iFLYTEK Research
80.027	87.288
14
Jan 12, 2018	EAZI+ (ensemble)
Yiwise NLP Group
80.426	86.912
14
Jan 13, 2018	SLQA+
single model
80.436	87.021
14
Jan 04, 2018	{EAZI} (ensemble)
Yiwise NLP Group
80.436	86.912
15
Feb 12, 2018	BiDAF + Self Attention + ELMo + A2D (single model)
Microsoft Research Asia & NUDT
79.996	86.711
16
Jan 29, 2018	Reinforced Mnemonic Reader (single model)
NUDT and Fudan University
https://arxiv.org/abs/1705.02798	79.545	86.654
16
Apr 10, 2018	Unnamed submission by null
80.027	86.612
16
Feb 23, 2018	MAMCN+ (single model)
Samsung Research
79.692	86.727
16
Jan 03, 2018	r-net+ (single model)
Microsoft Research Asia
79.901	86.536
16
Dec 28, 2017	SLQA+ (single model)
Alibaba iDST NLP
79.199	86.590
16
Dec 05, 2017	SAN (ensemble model)
Microsoft Business AI Solutions Team
https://arxiv.org/abs/1712.03556	79.608	86.496
17
Oct 17, 2017	Interactive AoA Reader+ (ensemble)
Joint Laboratory of HIT and iFLYTEK
79.083	86.450
18
Oct 24, 2017	FusionNet (ensemble)
Microsoft Business AI Solutions Team
https://arxiv.org/abs/1711.07341	78.978	86.016
18
Jun 01, 2018	MDReader
single model
79.031	86.006
18
Feb 01, 2018	Unnamed submission by null
78.999	86.151
19
Oct 24, 2018	WDNet (single model)
Beijing Normal University
78.926	85.810
19
Oct 22, 2017	DCN+ (ensemble)
Salesforce Research
https://arxiv.org/abs/1711.00106	78.852	85.996
20
Mar 29, 2018	KACTEIL-MRC(GF-Net+) (single model)
Kangwon National University, Natural Language Processing Lab.
78.664	85.780
20
Nov 03, 2017	BiDAF + Self Attention + ELMo (single model)
Allen Institute for Artificial Intelligence
78.580	85.833
21
May 09, 2018	KakaoNet (single model)
Kakao NLP Team
78.401	85.724
22
Nov 30, 2017	SLQA(ensemble)
Alibaba iDST NLP
78.328	85.682
22
Jan 02, 2018	Conductor-net (ensemble)
CMU
https://arxiv.org/abs/1710.10504	78.433	85.517
22
Jun 01, 2018	MDReader0
single model
78.171	85.543
22
Sep 18, 2018	BiDAF++ with pair2vec (single model)
UW and FAIR
78.223	85.535
22
Jan 03, 2018	MEMEN (single model)
Zhejiang University
https://arxiv.org/abs/1707.09098	78.234	85.344
22
Mar 19, 2018	aviqa (ensemble)
aviqa team
78.496	85.469
23
Jan 29, 2018	test
single
78.087	85.348
24
Jul 25, 2017	Interactive AoA Reader (ensemble)
Joint Laboratory of HIT and iFLYTEK Research
77.845	85.297
25
Jan 10, 2018	Unnamed submission by null
77.436	85.130
26
Dec 06, 2017	AttentionReader+ (single)
Tencent DPDAC NLP
77.342	84.925
26
Sep 18, 2018	BiDAF++ (single model)
UW and FAIR
77.573	84.858
26
Dec 13, 2017	RaSoR + TR + LM (single model)
Tel-Aviv University
https://arxiv.org/abs/1712.03609	77.583	84.163
26
Apr 10, 2018	Unnamed submission by null
77.489	84.735
26
Mar 20, 2018	DNET (single model)
QA geeks
77.646	84.905
27
Nov 06, 2017	Conductor-net (ensemble)
CMU
https://arxiv.org/abs/1710.10504	76.996	84.630
27
Sep 26, 2018	{gqa} (single model)
FAIR
77.090	83.931
27
Dec 21, 2017	Jenga (ensemble)
Facebook AI Research
77.237	84.466
27
Jan 23, 2018	MARS (single model)
YUANFUDAO research NLP
76.859	84.739
28
Nov 01, 2017	SAN (single model)
Microsoft Business AI Solutions Team
https://arxiv.org/abs/1712.03556	76.828	84.396
29
Oct 13, 2017	r-net (single model)
Microsoft Research Asia
http://aka.ms/rnet	76.461	84.265
29
Dec 19, 2017	FRC (single model)
in review
76.240	84.599
29
May 14, 2018	VS^3-NET (single model)
Kangwon National University in South Korea
76.775	84.491
30
Oct 22, 2017	Conductor-net (ensemble)
CMU
76.146	83.991
31
Sep 08, 2017	FusionNet (single model)
Microsoft Business AI Solutions team
https://arxiv.org/abs/1711.07341	75.968	83.900
31
Oct 18, 2018	KAR (single model)
York University
https://arxiv.org/abs/1809.03449	76.125	83.538
32
Jul 14, 2017	smarnet (ensemble)
Eigen Technology & Zhejiang University
75.989	83.475
32
Oct 22, 2017	Interactive AoA Reader+ (single model)
Joint Laboratory of HIT and iFLYTEK
75.821	83.843
32
Mar 15, 2018	AVIQA-v2 (single model)
aviqa team
75.926	83.305
33
Oct 05, 2018	Unnamed submission by null
74.950	83.294
33
Aug 18, 2017	RaSoR + TR (single model)
Tel-Aviv University
https://arxiv.org/abs/1712.03609	75.789	83.261
34
Oct 23, 2017	DCN+ (single model)
Salesforce Research
https://arxiv.org/abs/1711.00106	75.087	83.081
35
Feb 13, 2018	SSR-BiDAF
ensemble model
74.541	82.477
35
Nov 01, 2017	Mixed model (ensemble)
Sean
75.265	82.769
36
Jan 02, 2018	Conductor-net (single model)
CMU
https://arxiv.org/abs/1710.10504	74.405	82.742
36
Nov 17, 2017	two-attention-self-attention (ensemble)
guotong1988
75.223	82.716
36
May 21, 2017	MEMEN (ensemble)
Eigen Technology & Zhejiang University
https://arxiv.org/abs/1707.09098	75.370	82.658
37
Mar 09, 2017	ReasoNet (ensemble)
MSR Redmond
https://arxiv.org/abs/1609.05284	75.034	82.552
38
Aug 14, 2018	eeAttNet (single model)
BBD NLP Team
https://www.bbdservice.com	74.604	82.501
38
Jul 10, 2017	DCN+ (single model)
Salesforce Research
https://arxiv.org/abs/1711.00106	74.866	82.806
38
Feb 06, 2018	Jenga (single model)
Facebook AI Research
74.373	82.845
38
Oct 27, 2017	Unnamed submission by null
74.489	82.312
38
Oct 31, 2017	SLQA (single model)
Alibaba iDST NLP
74.489	82.815
39
Jul 14, 2017	Mnemonic Reader (ensemble)
NUDT and Fudan University
https://arxiv.org/abs/1705.02798	74.268	82.371
40
Dec 23, 2017	S^3-Net (ensemble)
Kangwon National University in South Korea
74.121	82.342
41
Jul 25, 2017	Interactive AoA Reader (single model)
Joint Laboratory of HIT and iFLYTEK Research
73.639	81.931
41
Jul 29, 2017	SEDT (ensemble model)
CMU
https://arxiv.org/abs/1703.00572	74.090	81.761
42
Dec 14, 2017	Jenga (single model)
Facebook AI Research
73.303	81.754
42
Nov 06, 2017	Conductor-net (single)
CMU
https://arxiv.org/abs/1710.10504	73.240	81.933
42
Apr 22, 2017	SEDT+BiDAF (ensemble)
CMU
https://arxiv.org/abs/1703.00572	73.723	81.530
42
Feb 22, 2017	BiDAF (ensemble)
Allen Institute for AI & University of Washington
https://arxiv.org/abs/1611.01603	73.744	81.525
42
Jan 24, 2017	Multi-Perspective Matching (ensemble)
IBM Research
https://arxiv.org/abs/1612.04211	73.765	81.257
42
Jul 06, 2017	SSAE (ensemble)
Tsinghua University
74.080	81.665
43
May 01, 2017	jNet (ensemble)
USTC & National Research Council Canada & York University
https://arxiv.org/abs/1703.04617	73.010	81.517
44
Oct 22, 2017	Conductor-net (single)
CMU
72.590	81.415
44
Apr 17, 2018	Unnamed submission by null
72.831	80.622
44
Nov 16, 2017	two-attention-self-attention (single model)
guotong1988
72.600	81.011
44
Apr 12, 2017	T-gating (ensemble)
Peking University
72.758	81.001
44
Sep 20, 2017	BiDAF + Self Attention (single model)
Allen Institute for Artificial Intelligence
https://arxiv.org/abs/1710.10723	72.139	81.048
45
Dec 15, 2017	S^3-Net (single model)
Kangwon National University in South Korea
71.908	81.023
45
Apr 17, 2018	Unnamed submission by null
72.831	80.622
46
Mar 03, 2018	AVIQA (single model)
aviqa team
72.485	80.550
47
Nov 06, 2017	attention+self-attention (single model)
guotong1988
71.698	80.462
48
Nov 01, 2016	Dynamic Coattention Networks (ensemble)
Salesforce Research
https://arxiv.org/abs/1611.01604	71.625	80.383
49
Jul 14, 2017	smarnet (single model)
Eigen Technology & Zhejiang University
https://arxiv.org/abs/1710.02772	71.415	80.160
49
Apr 13, 2017	QFASE
NUS
71.898	79.989
50
Oct 27, 2017	M-NET (single)
UFL
71.016	79.835
50
Apr 22, 2018	MAMCN (single model)
Samsung Research
70.985	79.939
50
Jul 14, 2017	Mnemonic Reader (single model)
NUDT and Fudan University
https://arxiv.org/abs/1705.02798	70.995	80.146
50
May 23, 2018	AttReader (single)
College of Computer & Information Science, SouthWest University, Chongqing, China
71.373	79.725
50
Mar 24, 2017	jNet (single model)
USTC & National Research Council Canada & York University
https://arxiv.org/abs/1703.04617	70.607	79.821
50
Apr 02, 2017	Ruminating Reader (single model)
New York University
https://arxiv.org/abs/1704.07415	70.639	79.456
50
Mar 14, 2017	Document Reader (single model)
Facebook AI Research
https://arxiv.org/abs/1704.00051	70.733	79.353
50
Dec 28, 2016	FastQAExt
German Research Center for Artificial Intelligence
https://arxiv.org/abs/1703.04816	70.849	78.857
50
May 13, 2017	RaSoR (single model)
Google NY, Tel-Aviv University
https://arxiv.org/abs/1611.01436	70.849	78.741
50
Mar 08, 2017	ReasoNet (single model)
MSR Redmond
https://arxiv.org/abs/1609.05284	70.555	79.364
51
Apr 14, 2017	Multi-Perspective Matching (single model)
IBM Research
https://arxiv.org/abs/1612.04211	70.387	78.784
52
Aug 30, 2017	SimpleBaseline (single model)
Technical University of Vienna
69.600	78.236
52
Feb 05, 2018	SSR-BiDAF
single model
69.443	78.358
53
Apr 12, 2017	SEDT+BiDAF (single model)
CMU
https://arxiv.org/abs/1703.00572	68.478	77.971
54
Jun 25, 2017	PQMN (single model)
KAIST & AIBrain & Crosscert
68.331	77.783
55
Apr 12, 2017	T-gating (single model)
Peking University
68.132	77.569
56
Nov 28, 2016	BiDAF (single model)
Allen Institute for AI & University of Washington
https://arxiv.org/abs/1611.01603	67.974	77.323
56
Feb 22, 2018	Unnamed submission by null
68.478	77.220
57
Feb 22, 2018	Unnamed submission by null
68.425	77.077
57
Dec 28, 2016	FastQA
German Research Center for Artificial Intelligence
https://arxiv.org/abs/1703.04816	68.436	77.070
57
Jul 29, 2017	SEDT (single model)
CMU
https://arxiv.org/abs/1703.00572	68.163	77.527
58
Oct 26, 2016	Match-LSTM with Ans-Ptr (Boundary) (ensemble)
Singapore Management University
https://arxiv.org/abs/1608.07905	67.901	77.022
58
Jan 22, 2018	FABIR (Single Model)
in review
67.744	77.605
59
Sep 19, 2017	AllenNLP BiDAF (single model)
Allen Institute for AI
http://allennlp.org/	67.618	77.151
60
Feb 05, 2017	Iterative Co-attention Network
Fudan University
67.502	76.786
61
Jan 03, 2018	newtest
single model
66.527	75.787
61
Nov 01, 2016	Dynamic Coattention Networks (single model)
Salesforce Research
https://arxiv.org/abs/1611.01604	66.233	75.896
62
Feb 24, 2018	Unnamed submission by null
65.992	75.469
63
Jan 10, 2018	Unnamed submission by null
64.796	74.272
64
Dec 09, 2017	Unnamed submission by ravioncodalab
64.439	73.921
64
Oct 26, 2016	Match-LSTM with Bi-Ans-Ptr (Boundary)
Singapore Management University
https://arxiv.org/abs/1608.07905	64.744	73.743
65
Feb 19, 2017	Attentive CNN context with LSTM
NLPR, CASIA
63.306	73.463
66
Nov 02, 2016	Fine-Grained Gating
Carnegie Mellon University
https://arxiv.org/abs/1611.01724	62.446	73.327
66
Sep 21, 2017	OTF dict+spelling (single)
University of Montreal
https://arxiv.org/abs/1706.00286	64.083	73.056
67
Sep 21, 2017	OTF spelling (single)
University of Montreal
https://arxiv.org/abs/1706.00286	62.897	72.016
68
Sep 21, 2017	OTF spelling+lemma (single)
University of Montreal
https://arxiv.org/abs/1706.00286	62.604	71.968
69
Sep 28, 2016	Dynamic Chunk Reader
IBM
https://arxiv.org/abs/1610.09996	62.499	70.956
70
Aug 27, 2016	Match-LSTM with Ans-Ptr (Boundary)
Singapore Management University
https://arxiv.org/abs/1608.07905	60.474	70.695
71
Sep 11, 2018	Unnamed submission by Will_Wu
59.058	69.436
72
Jan 10, 2018	Unnamed submission by null
58.764	69.276
73
Aug 27, 2016	Match-LSTM with Ans-Ptr (Sentence)
Singapore Management University
https://arxiv.org/abs/1608.07905	54.505	67.748
74
Oct 26, 2018	Unnamed submission by minjoon
52.533	62.757


================================================
FILE: intro.tex
================================================
%!TEX root = thesis.tex

\section{Motivation}

Teaching machines to understand human language documents is one of the most elusive and long-standing challenges in Artificial Intelligence. Before we proceed, we must ask what it means to understand human language? Figure~\ref{fig:mctest-example} demonstrates a children's story from the \sys{MCTest} dataset~\cite{richardson2013mctest}, with simple vocabulary and grammar. To process such a passage of text, the NLP community has put decades of effort into solving different tasks for various aspects of text understanding, including:
\begin{enumerate}[(a)]
    \item
        \tf{part-of-speech tagging}. It requires our machines to understand that, for example, in the first sentence \ti{Alyssa got to the beach after a long trip.}, \ti{Alyssa} is a proper noun, \ti{beach} and \ti{trip} are common nouns, \ti{got} is a verb in its past tense, \ti{long} is an adjective, \ti{after} is a preposition.
    \item
        \tf{named entity recognition}. Our machines also should understand that \ti{Alyssa}, \ti{Ellen}, \ti{Kristen} are the names of people in the story while \ti{Charlotte}, \ti{Atlanta} and \ti{Miami} are the names of locations.
    \item
        \tf{syntactic parsing}. To understand the meaning of each single sentence, our machines also need to understand the relationship between words, or the syntactical (grammatical) structure. Taking the first sentence \ti{Alyssa got to the beach after a long trip.} as an example again, the machines should understand that \ti{Alyssa} is the subject, and \ti{beach} is the object of the verb \ti{got}, while \ti{after a long trip} as a whole is a prepositional phrase which describes a temporal relationship with the verb.
    \item
        \tf{coreference resolution}. Furthermore, our machines even need to understand the interplay between sentences. For example, the mention \ti{She} in the sentence \ti{She's now in Miami} refers to \ti{Alyssa} mentioned in the first sentence, while the mention \ti{The girls} refers to \ti{Alyssa, Ellen, Kristen and Rachel} in the previous sentences.
\end{enumerate}

\begin{figure}[!t]
\center
\begin{tabular}{l p{13cm}}
\toprule
    &{\tf{Alyssa}} got to the beach after a long trip. She's from Charlotte. She traveled from Atlanta. She's now in Miami. She went to Miami to visit some friends. But she wanted some time to herself at the beach, so she went there first. After going swimming and laying out, she went to her friend \tf{Ellen}'s house. \tf{Ellen} greeted {\tf{Alyssa}} and they both had some lemonade to drink. {\tf{Alyssa}} called her friends \tf{Kristen} and \tf{Rachel} to meet at \tf{Ellen}'s house. The girls traded stories and caught up on their lives. It was a happy time for everyone. The girls went to a restaurant for dinner. The restaurant had a special on catfish. \tf{Alyssa} enjoyed the restaurant's special. \tf{Ellen} ordered a salad. \tf{Kristen} had soup. \tf{Rachel} had a steak. After eating, the ladies went back to \tf{Ellen}'s house to have fun. They had lots of fun. They stayed the night because they were tired. {\tf{Alyssa}} was happy to spend time with her friends again. \\
\midrule
  (a) & \tf{Question:} What city is Alyssa in? \\
  &\tf{Answer}: Miami \\
\vspace{0.25em}
  (b) &\tf{Question}: What did Alyssa eat at the restaurant? \\
  & \tf{Answer}: catfish \\
\vspace{0.25em}
  (c) &\tf{Question}: How many friends does Alyssa have in this story? \\
  & \tf{Answer}: 3 \\
\bottomrule
\end{tabular}
\longcaption{A sample story and comprehension questions from \sys{MCTest}}{\label{fig:mctest-example} A sample story and comprehension questions from the \sys{MCTest} dataset  \\ \cite{richardson2013mctest}.}
\end{figure}

Is there a comprehensive evaluation that can test all these aspects and probe even deeper levels of understanding? We argue that the task of \tf{reading comprehension} --- answering comprehension questions over a passage of text --- is a proper and important way to approach that. Just as we use reading comprehension tests to measure how well a human has understood a piece of text, we believe that it can play the same role for evaluating how well computer systems understand human language.

Let's take a closer look at some comprehension questions posed on the same passage (Figure~\ref{fig:mctest-example}):
\begin{enumerate}[(a)]
    \item
        To answer the first question \ti{What city is Alyssa in?}, our machines need to pick out the sentence \ti{She's now in Miami.}, and resolve the \ti{coreference resolution} problem that \ti{She} refers to \ti{Alyssa}, and then finally give the correct answer \ti{Miami}.
    \item
        For the second question \ti{What did Alyssa eat at the restaurant?}, they need to first locate the two sentences \ti{The restaurant had a special on catfish.} and \ti{Alyssa enjoyed the restaurant's special.} and understand the \ti{special} that \ti{Alyssa enjoyed} in the second sentence refers back to the first sentence. Based on the fact that \ti{catfish} modifies \ti{special}, the answer is hence \ti{catfish}.
    \item
        The last question is especially challenging. To arrive at the correct answer, the machines have to keep track of all the names of people mentioned in the text and their relationships, perform some arithmetic reasoning (counting), and finally give the answer \ti{3}.
\end{enumerate}

As we can see, our computer systems have to understand many different aspects of text to answer these questions correctly. Since questions can be designed to query the aspects that we care about, \ti{reading comprehension could be the most suitable task for evaluating language understanding}. This is a central theme of this thesis.

In this thesis, we study the problem of reading comprehension: how can we build computer systems to read a passage and answer these comprehension questions? In particular, we focus on \tf{neural reading comprehension}, a class of reading comprehension models built using deep neural networks, which have been proven much more effective than non-neural, feature-based models.

The field of reading comprehension has a long history --- as early as the 1970s, researchers already recognized that it is an important way to test the language understanding capabilities of computer programs~\cite{lehnert1977process}. However, the field has been neglected for decades and only recently, it has received a great deal of attention and rapid progress has been made (see Figure~\ref{fig:squad-progress} as an example), including our efforts that we will detail in this thesis. The recent success of reading comprehension can be attributed to two reasons: 1) the creation of large-scale supervised datasets in the form of (passage, question, answer) triples; 2) the development of neural reading comprehension models.

In this thesis, we will cover the essence of modern neural reading comprehension: the formulation of the problem, the building blocks and key ingredients of these systems, and understanding of where current neural reading comprehension systems can excel and where they still lag behind.

\begin{figure}[!t]
    \center
    \includegraphics[scale=0.5]{img/google_search.pdf}
    \longcaption{A search result on \sys{Google}}{\label{fig:google-search}A search result on \sys{Google}. It not only returns a list of search documents but gives more precise answers within the documents.}
\end{figure}

The second central theme of the thesis is that we deeply believe that, if we can build high-performing reading comprehension systems, \ti{they would be a crucial technology for applications such as question answering and dialogue systems}. Indeed, these language technologies are already very relevant to our daily lives. For example, today if we enter a search query into \sys{Google} ``How many people work at Stanford University?'' (Figure~\ref{fig:google-search}), \sys{Google} not only returns a list of search documents, but also attempts to read these Web documents and finally highlight the most plausible answers and display them at the top of the search results. We believe this is exactly where reading comprehension can help and thus can facilitate more intelligent search engines. Additionally, with the development of digital personal assistants such as Amazon's \sys{Alexa}, Apple's \sys{Siri}, \sys{Google Assistant} or Microsoft's \sys{Cortana}, more and more users engage with these devices by having conversations and asking informational questions.\footnote{A recent study \href{https://www.stonetemple.com/digital-personal-assistants-study/}{https://www.stonetemple.com/digital-personal-assistants-study/} reported that asking general questions is indeed the number one use for such digital personal assistants.} We believe that building machines which are able to read and comprehend text will also greatly improve the capabilities of these personal assistants.

\begin{figure}[!h]
\small
\center
\begin{tabular}{p{0.85\columnwidth}}
\midrule
Fort Lauderdale, Florida (CNN) -- Just taking a sip of water or walking to the bathroom is excruciatingly painful for 15-year-old Michael Brewer, who was burned over 65 percent of his body after being set on fire, allegedly by a group of teenagers. \\
``It hurts my heart to see him in pain, but it enlightens at the same time to know my son is strong enough to make it through on a daily basis,'' his mother, Valerie Brewer, told CNN on Wednesday. \\
Brewer and her husband, Michael Brewer, Sr., spoke to CNN's Tony Harris, a day after a 13-year-old boy who witnessed last month's attack publicly read a written statement: \\
``I want to express my deepest sympathy to Mikey and his family,'' Jeremy Jarvis said.``I will pray for Mikey to grow stronger every day and for Mikey's speedy recovery.'' \\
Jarvis' older brother has been charged in the October 12 attack in Deerfield Beach, Florida. When asked about the teen's statement, Valerie Brewer -- who knows the Jarvis family -- said she ``can't focus on that.'' \\
``I would really like to stay away from that because that brings negative energy to me and I don't need that right now,'' she said. \\
Her son remains in guarded condition at the University of Miami's Jackson Memorial Hospital Burn Center. He suffered second- and third-degree burns over about two-thirds of his body, according to the hospital's associate director, Dr. Carl Schulman. \\
The teen faces a lifelong recovery from his injuries, Schulman told CNN's Harris.  \\
\vspace{0em}
$Q_1$: What is the subject of the story? \\
$A_1$: Michael Brewer \\
\vspace{0em}
$Q_2$: What happened to him?\\
$A_2$: He was burned \\
\vspace{0em}
$Q_3$: How badly?\\
$A_3$: Over 65\% of his body \\
\vspace{0em}
$Q_4$: Do we know who caused the burns?\\
$A_4$: Yes \\
\bottomrule
\end{tabular}
\longcaption{A conversation from \sys{CoQA} based on an CNN article}{\label{fig:coqa-cnn-example} A conversation from \sys{CoQA} based on an CNN article.}
\end{figure}

Therefore, in this thesis, we are also interested in how we can build practical applications from the recent success of neural reading comprehension. We explore two research directions which employ neural reading comprehension as a key component:
\begin{description}
    \item \tf{Open-domain question answering} combines the challenges from both information retrieval and reading comprehension and aims to answer general questions from either the Web or a large encyclopedia (e.g., Wikipedia).
    \item \tf{Conversational question answering} combines the challenges from dialogue and reading comprehension, and tackles the problem of multi-turn question answering over a passage of text, like how users would engage with conversational agents. Figure~\ref{fig:coqa-cnn-example} demonstrates an example from our \sys{CoQA} dataset~\cite{reddy2019coqa}. In this example, a person can ask a series of interconnected questions based on the content of a \sys{CNN} article.
\end{description}


\section{Thesis Outline}

Following the two central themes that we just discussed, this thesis consists of two parts --- \sys{Part I Neural Reading Comprehension: Foundations} and \sys{Part II Neural Reading Comprehension: Applications}.

\sys{Part I} focuses on the task of reading comprehension, with an emphasis on close reading of a short paragraph so that computer systems are able to answer comprehension questions.
\begin{description}
    \item In Chapter~\ref{chapter:rc-overview}, we first give an overview of the history and recent development of the field of reading comprehension. Next we formally define the problem formulation and its main categories. We then briefly discuss the differences of reading comprehension and general question answering.  Finally, we argue that the recent success of neural reading comprehension is driven by both large-scale datasets and neural models.
    \item In Chapter~\ref{chapter:rc-models}, we present the family of neural reading comprehension models. We begin with describing non-neural, feature-based classifiers, and discuss how they differ from the end-to-end neural approaches. We then introduce a neural approach that we proposed named \sys{the Stanford Attentive Reader} and we describe its basic building blocks and extensions. We present experimental results on two representative reading comprehension datasets: \sys{CNN/Daily Mail} and \sys{SQuAD}, and more importantly, we conduct an in-depth analysis of the neural models to understand better what these models have actually learned. Finally, we summarize recent advances of neural reading comprehension models in different aspects. This chapter is based on our works \cite{chen2016thorough} and \cite{chen2017reading}.
    \item In Chapter~\ref{chapter:rc-future}, we discuss future work and open questions in this field. We first examine error cases of existing models despite their high accuracies on the current benchmarks. We then discuss future directions, in terms of both the datasets and the models. Finally, we review several important research questions in this field, which still remain as open questions and yet to be answered in the future.
\end{description}

\sys{Part II} views reading comprehension as an important building block for practical applications such as question answering systems and conversational agents. Detailedly,
\begin{description}
    \item In Chapter~\ref{chapter:openqa}, we address the problem of open domain question answering as an application of reading comprehension. We discuss how we can combine a high-performing neural reading comprehension system and effective information retrieval techniques, to build a new generation of open-domain question answering systems. We describe a system we built named \sys{DrQA}: its key components and how we create training data for it, and we then present a comprehensive evaluation on multiple question answering benchmarks. We discuss its current limitations and the future work in the end. This chapter is based on our work \cite{chen2017reading}.
    \item In Chapter~\ref{chapter:coqa}, we study the problem of conversational question answering, where a machine has to understand a text passage and answer a series of questions that appear in a conversation. We first briefly review the literature on dialogue and argue that conversational question answering is the key to building information-seeking dialogue agents. We introduce \sys{CoQA}: a novel dataset for building \tf{Co}nversational \tf{Q}uestion \tf{A}nswering systems, comprising 127k questions with answers, obtained from 8k conversations about text passages. We analyze the dataset in depth and build several competitive models on top of conversational and neural reading comprehension models and present the experimental results. We finally discuss the future work in this area. This chapter is based on our work \cite{reddy2019coqa}.
\end{description}
We will finally conclude in Chapter~\ref{chapter:conclusions}.

\section{Contributions}
The contributions of this thesis are summarized as follows:
\begin{itemize}
    \item
        We were among the first to research neural reading comprehension. In particular, we proposed the \sys{Stanford Attentive Reader} model, which has demonstrated superior performance on various modern reading comprehension tasks.
    \item
        We made the effort to understand better what neural reading comprehension models have actually learned, and what depth of language understanding is needed to solve current tasks. We concluded that neural models are better at learning lexical matches and paraphrases compared to conventional feature-based classifiers, while the reasoning capabilities of existing systems are still rather limited.
    \item
        We pioneered the research direction of employing neural reading comprehension as a core component of open domain question answering, and examined how to generalize the model for this case. In particular, we implemented this idea in the \sys{DrQA} system, a large-scale, factoid question answering system over English Wikipedia.
    \item
        Finally, we set out to tackle the conversational question answering problem, in which computer systems need to answer comprehension questions in a dialogue context, so each question needs to be understood with its conversation history. To tackle this, we proposed the \sys{CoQA} challenge and also built neural reading comprehension models adapted to this problem. We believe that this is a first but important step to building conversational QA agents.
\end{itemize}


================================================
FILE: macros.tex
================================================
% (tweaks)
\definecolor{darkred}{rgb}{0.5451, 0.0, 0.0}
\definecolor{darkgreen}{rgb}{0.0, 0.3922, 0.0}

\def\blue#1{\textcolor{blue}{#1}}
\def\darkblue#1{\textcolor{blue}{#1}}
\def\red#1{\textcolor{red}{#1}}
\def\darkred#1{\textcolor{darkred}{#1}}
\def\green#1{\textcolor{green}{#1}}
\def\darkgreen#1{\textcolor{darkgreen}{#1}}
\def\yellow#1{\textcolor{yellow}{#1}}
\definecolor{burntorange}{HTML}{BF5700}
\def\orange#1{\textcolor{burntorange}{#1}}
\def\gray#1{\textcolor{gray}{#1}}
\def\darkgray#1{\textcolor{darkgray}{#1}}

\newcommand\sys[1]{\textsc{#1}}
\newcommand\ti[1]{\textit{#1}}
\newcommand\tf[1]{\textbf{#1}}
\newcommand\mf[1]{\mathbf{#1}}
\newcommand{\indentitem}{\setlength\itemindent{25pt}}
\newcommand{\nth}{$^{\textrm{th}}$}

\newcommand\denote[1]{\ensuremath{\llbracket\ti{#1}\rrbracket}}

\newcommand\forward{\ensuremath{\sqsubseteq}}
\newcommand\nforward{\ensuremath{\not\sqsubseteq}}
\newcommand\reverse{\ensuremath{\sqsupseteq}}
\newcommand\alternate{\ensuremath{\downharpoonleft\hspace{-1.25mm}\upharpoonright}}
\newcommand\cover{\ensuremath{\smallsmile}}
\newcommand\equivalent{\ensuremath{\equiv}}
\newcommand\negate{\ensuremath{\curlywedge}}
\newcommand\independent{\ensuremath{\#}}
\newcommand\tagUp[1]{#1\ensuremath{^\uparrow}}
\newcommand\tagDown[1]{#1\ensuremath{^\downarrow}}
\newcommand\join{\ensuremath{\bowtie}}

\newcommand\h[1]{\textbf{#1}}
\newcommand\hh[1]{\textbf{\textcolor[rgb]{0.5,0,0}{#1}}}
\def\ent#1{\text{\small{\textsc{#1}}}}
\def\typ#1{\textit{#1}}
%\makeatletter
%\newcommand{\xRightarrow}[2][]{\ext@arrow 0359\Rightarrowfill@{#1}{#2}}
%\makeatother

\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}
\newcommand{\xmark}{\textrm{\ding{55}}}
%\newcommand{\valid}{\ensuremath{\Rightarrow}}
%\newcommand{\invalid}{\ensuremath{\Rightarrow\lnot}}
\newcommand{\valid}{\ensuremath{\Leftarrow}}
\newcommand{\invalid}{\reflectbox{\ensuremath{\Rightarrow\lnot}}}
\newcommand{\noop}{\textcolor{white}{NOOP}}
\newcommand{\noopTab}{\begin{tabular}{c} \textcolor{white}{NOOP} \\ \textcolor{white}{NOOP} \end{tabular}}

\newcommand\true[1]{\darkgreen{\checkmark\textit{#1}}}
\newcommand\false[1]{\darkred{\xmark$~\,$\textit{#1}}}
\newcommand\unknown[1]{?\orange{\textit{#1}}}

\newcommand{\verticalcenter}[1]{\begingroup
\setbox0=\hbox{#1}%
\parbox{\wd0}{\box0}\endgroup}


%
% KBP MACROS
%
% A variable, abstracted (e.g., x)
\def\var#1{\ensuremath{\mathbf{{#1}}}}
% A variable instance (e.g., Obama)
\def\vari#1{\ensuremath{#1}}


% Aliases
\def\reverb{ReVerb}
\def\hydra{Stanford KBP}
\def\knowbot{\sys{Knowbot}}


% KBP Specific
% An entity
\def\ent#1{\text{\small{\textsc{#1}}}}

% An extraction, e.g., "Obama born_in Hawaii"
\newcommand\extr[3]{\mbox{\ent{#1}\ $~$\rel{#2}\ $~$\ent{#3}}}
\newcommand\triple[3]{(\mbox{\ent{#1}; $~$\rel{#2}; $~$\ent{#3}})}
% A clause in a logical form, e.g., "born_in(Obama, Hawaii)"
\newcommand\clause[3]{\mbox{\rel{#2}\ensuremath{(#1, #3)}}}

\newcommand\subj[1]{\textcolor{darkblue}{#1}}
\newcommand\obj[1]{\textcolor{burntorange}{#1}}
\newcommand\rel[1]{\textrm{#1}}

\def\blue#1{\textcolor{blue}{#1}}
\def\red#1{\textcolor{red}{#1}}
\def\green#1{\textcolor{green}{#1}}
\def\yellow#1{\textcolor{yellow}{#1}}

\newcommand\posterline{
  \begin{center}
    \noindent\rule{10cm}{0.4pt}
  \end{center}
}

\newcommand\entailmentExample[2]{
\vspace{0.5cm}
\noindent \hspace{0.5cm}\begin{tabular}{lp{0.80\textwidth}}
\textbf{P}: & \hspace*{-1mm}\textit{#1} \\
\textbf{H}: & \hspace*{-1mm}\textit{#2}
\end{tabular}
\vspace{0.5cm}
}

\tikzset{
    invisible/.style={opacity=0},
    visible on/.style={alt=#1{}{invisible}},
    alt/.code args={<#1>#2#3}{%
      \alt<#1>{\pgfkeysalso{#2}}{\pgfkeysalso{#3}} % \pgfkeysalso doesn't change the path
    },
  }

\newenvironment{lquote}{%
  \list{}{%
    \rightmargin0pt}%
    \item\relax
  }
{\endlist}

\newcommand\circled[1]{\tikz[baseline=(char.base)]{
            \node[shape=circle,draw=darkred,inner sep=2pt] (char) {#1};}
}
\newcommand\noncircled[1]{\tikz[baseline=(char.base)]{
            \node[shape=circle,draw=white,inner sep=2pt] (char) {#1};}
}


\newcommand{\hnode}[1]{|(#1)| \w{#1}}
\newcommand{\rnode}[2]{|(#1#2)| \w{\textcolor{darkblue}{\textbf{#1}}}\textcolor{white}{#2}}
\newcommand{\bnode}[2]{|(#1#2)| \w{\textcolor{darkblue}{\textcolor{darkred}{\textbf{#1}}}}\textcolor{white}{#2}}


\newcommand\longcaption[2]{\caption[#1]{#2}}


================================================
FILE: preface.tex
================================================
%!TEX root = thesis.tex

\prefacesection{Abstract}

Teaching machines to understand human language documents is one of the most elusive and long-standing challenges in Artificial Intelligence. This thesis tackles the problem of reading comprehension: how to build computer systems to read a passage of text and answer  comprehension questions. On the one hand, we think that reading comprehension is an important task for evaluating how well computer systems understand human language. On the other hand, if we can build high-performing reading comprehension systems, they would be a crucial technology for applications such as question answering and dialogue systems.

In this thesis, we focus on neural reading comprehension: a class of reading comprehension models built on top of deep neural networks. Compared to traditional sparse, hand-designed feature-based models, these end-to-end neural models have proven to be more effective in learning rich linguistic phenomena and improved performance on all the modern reading comprehension benchmarks by a large margin.

This thesis consists of two parts. In the first part, we aim to cover the essence of neural reading comprehension and present our efforts at building effective neural reading comprehension models, and more importantly, understanding what neural reading comprehension models have actually learned, and what depth of language understanding is needed to solve current tasks. We also summarize recent advances and discuss future directions and open questions in this field.

In the second part of this thesis, we investigate how we can build practical applications based on the recent success of neural reading comprehension. In particular, we pioneered two new research directions: 1) how we can combine information retrieval techniques with neural reading comprehension to tackle large-scale open-domain question answering; and 2) how we can build conversational question answering systems from current single-turn, span-based reading comprehension models. We implemented these ideas in the \sys{DrQA} and \sys{CoQA} projects and we demonstrate the effectiveness of these approaches. We believe that they hold great promise for future language technologies.


================================================
FILE: ref.bib
================================================
%%%%%%%%%%%%%%%%
% Bibliography
%%%%%%%%%%%%%%%%

@string{iclr = "International Conference on Learning Representations (ICLR)"}
@string{aaai = "Conference on Artificial Intelligence (AAAI)"}
@string{emnlp = "Empirical Methods in Natural Language Processing (EMNLP)"}
@string{acl = "Association for Computational Linguistics (ACL)"}
@string{acl_demo = "Association for Computational Linguistics (ACL): System Demonstrations"}
@string{aistats = "Artificial Intelligence and Statistics (AISTATS)"}
@string{nips = "Advances in Neural Information Processing Systems (NIPS)"}
@string{icml = "International Conference on Machine Learning (ICML)"}
@string{naacl = "North American Association for Computational Linguistics (NAACL)"}
@string{conll = "Computational Natural Language Learning (CoNLL)"}
@string{ijcnlp = "International Joint Conference on Natural Language Processing (IJCNLP)"}
@string{cvpr = "Conference on computer vision and pattern recognition (CVPR)"}
@string{iccv = "International Conference on Computer Vision (ICCV)"}
@string{acl_hlt = "Association for Computational Linguistics: Human Language Technologies (ACL-HLT)"}
@string{jmlr = "The Journal of Machine Learning Research (JMLR)"}
@string{tacl = "Transactions of the Association of Computational Linguistics (TACL)"}
@string{lrec = "International Conference on Language Resources and Evaluation (LREC)"}
@string{coling = "International Conference on Computational Linguistics (COLING)"}
@string{cl = "Computational Linguistics"}

@article{simmons1964indexing,
  title={Indexing and dependency logic for answering {English} questions},
  author={Simmons, Robert F and Klein, Sheldon and McConlogue, Keren},
  journal={American Documentation},
  volume={15},
  number={3},
  pages={196--204},
  year={1964}
}

@phdthesis{charniak1972toward,
  title={Toward a model of children's story comprehension},
  author={Charniak, Eugene},
  year={1972},
  school={Massachusetts Institute of Technology}
}

@book{schank1977scripts,
  title={Scripts, plans, goals and understanding: An inquiry into human knowledge structures},
  author={Schank, Roger C and Abelson, Robert P},
  year={1977},
  publisher={Lawrence Erlbaum}
}

@phdthesis{lehnert1977process,
  title={The process of question answering},
  author={Lehnert, Wendy Grace},
  year={1977},
  school={Yale University}
}

@article{hochreiter1997,
  title={Long short-term memory},
  author={Hochreiter, Sepp and Schmidhuber, J{\"u}rgen},
  journal={Neural Computation},
  volume={9},
  pages={1735--1780},
  year={1997}
}

@inproceedings{kupiec1993murax,
  title={{MURAX}: A robust linguistic approach for question answering using an on-line encyclopedia},
  author={Kupiec, Julian},
  booktitle={ACM SIGIR conference on Research and development in information retrieval},
  pages={181--190},
  year={1993}
}

@book{kintsch1998comprehension,
  title={Comprehension: A paradigm for cognition.},
  author={Kintsch, Walter},
  year={1998},
  publisher={Cambridge University Press}
}

@inproceedings{voorhees1999trec,
  title={The {TREC-8} Question Answering Track Report},
  author={Voorhees, Ellen M},
  booktitle={Text {RE}trieval Conference (TREC)},
  pages={77--82},
  year={1999}
}

@inproceedings{hirschman1999deep,
  title={Deep read: A reading comprehension system},
  author={Hirschman, Lynette and Light, Marc and Breck, Eric and Burger, John D},
  booktitle=acl,
  pages={325--332},
  year={1999}
}

@inproceedings{riloff2000rule,
  title={A rule-based question answering system for reading comprehension tests},
  author={Riloff, Ellen and Thelen, Michael},
  booktitle={ANLP/NAACL Workshop on Reading comprehension tests as evaluation for computer-based language understanding sytems},
  pages={13--19},
  year={2000}
}

@inproceedings{charniak2000reading,
  title={Reading comprehension programs in a statistical-language-processing class},
  author={Charniak, Eugene and Altun, Yasemin and Braz, Rodrigo de Salvo and Garrett, Benjamin and Kosmala, Margaret and Moscovich, Tomer and Pang, Lixin and Pyo, Changhee and Sun, Ye and Wy, Wei and others},
  booktitle={ANLP/NAACL Workshop on Reading comprehension tests as evaluation for computer-based language understanding sytems},
  pages={1--5},
  year={2000}
}

@inproceedings{moldovan2000structure,
  title={The structure and performance of an open-domain question answering system},
  author={Moldovan, Dan and Harabagiu, Sanda and Pasca, Marius and Mihalcea, Rada and Girju, Roxana and Goodrum, Richard and Rus, Vasile},
  booktitle=acl,
  pages={563--570},
  year={2000}
}

@inproceedings{brill2002askmsr,
  title={An analysis of the {AskMSR} question-answering system},
  author={Brill, Eric and Dumais, Susan and Banko, Michele},
  booktitle=emnlp,
  pages={257--264},
  year={2002}
}

@inproceedings{papineni2002bleu,
  title={{BLEU}: a method for automatic evaluation of machine translation},
  author={Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing},
  booktitle=acl,
  pages={311--318},
  year={2002}
}

@inproceedings{Ahn2004using,
  author = {Ahn, David and Jijkoun, Valentin and Mishne, Gilad and Müller, Karin and de Rijke, Maarten and Schlobach., Stefan},
  booktitle = {Text {RE}trieval Conference (TREC)},
  title = {Using {Wikipedia} at the {TREC} {QA} {Track}},
  year = {2004}
}

@article{lin2004rouge,
  title={{ROUGE}: A package for automatic evaluation of summaries},
  author={Lin, Chin-Yew},
  journal={Text Summarization Branches Out},
  year={2004}
}

@inproceedings{banerjee2005meteor,
  title={{METEOR}: An automatic metric for MT evaluation with improved correlation with human judgments},
  author={Banerjee, Satanjeev and Lavie, Alon},
  booktitle={ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization},
  pages={65--72},
  year={2005}
}

@inproceedings{buscaldi2006mining,
  title={Mining knowledge from {Wikipedia} for the question answering task},
  author={Buscaldi, Davide and Rosso, Paolo},
  booktitle=lrec,
  pages={727--730},
  year={2006}
}

@incollection{auer2007dbpedia,
  title={{DBpedia}: A nucleus for a web of open data},
  author={Auer, S{\"o}ren and Bizer, Christian and Kobilarov, Georgi and Lehmann, Jens and Cyganiak, Richard and Ives, Zachary},
  booktitle={The Semantic Web},
  pages={722--735},
  year={2007},
  publisher={Springer}
}

@inproceedings{bollacker2008freebase,
  title={Freebase: a collaboratively created graph database for structuring human knowledge},
  author={Bollacker, Kurt and Evans, Colin and Paritosh, Praveen and Sturge, Tim and Taylor, Jamie},
  booktitle={Proceedings of the 2008 ACM SIGMOD international conference on Management of data},
  pages={1247--1250},
  year={2008}
}

@inproceedings{mitchell2009populating,
  title={Populating the semantic web by macro-reading internet text},
  author={Mitchell, Tom M and Betteridge, Justin and Carlson, Andrew and Hruschka, Estevam and Wang, Richard},
  booktitle={International Semantic Web Conference (IWSC)},
  pages={998--1002},
  year={2009}
}

@inproceedings{mintz2009distant,
  author    = {Mintz, Mike  and  Bills, Steven  and  Snow, Rion  and  Jurafsky, Daniel},
  title     = {Distant supervision for relation extraction without labeled data},
  booktitle = acl,
  year      = {2009},
  pages     = {1003--1011}
}

@inproceedings{weinberger2009feature,
  title={Feature hashing for large scale multitask learning},
  author={Weinberger, Kilian and Dasgupta, Anirban and Langford, John and Smola, Alex and Attenberg, Josh},
  booktitle=icml,
  pages={1113--1120},
  year={2009}
}

@article{wu2010adapting,
  title={Adapting boosting for information retrieval measures},
  author={Wu, Qiang and Burges, Christopher JC and Svore, Krysta M and Gao, Jianfeng},
  journal={Information Retrieval},
  volume={13},
  number={3},
  pages={254--270},
  year={2010},
  publisher={Springer}
}

@article{ferrucci2010building,
  title={Building {Watson}: An overview of the {DeepQA} project},
  author={Ferrucci, David and Brown, Eric and Chu-Carroll, Jennifer and Fan, James and Gondek, David and Kalyanpur, Aditya A and Lally, Adam and Murdock, J William and Nyberg, Eric and Prager, John and others},
  journal={AI magazine},
  volume={31},
  number={3},
  pages={59--79},
  year={2010}
}

@inproceedings{krizhevsky2012imagenet,
  title={Imagenet classification with deep convolutional neural networks},
  author={Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E},
  booktitle=nips,
  pages={1097--1105},
  year={2012}
}

@inproceedings{graves2013speech,
  title={Speech recognition with deep recurrent neural networks},
  author={Graves, Alex and Mohamed, Abdel-rahman and Hinton, Geoffrey},
  booktitle={International Conference on Acoustics, Speech and Signal processing (ICASSP)},
  pages={6645--6649},
  year={2013}
}

@inproceedings{richardson2013mctest,
  author    = {Richardson, Matthew  and  Burges, Christopher J.C.  and  Renshaw, Erin},
  title     = {{MCTest}: A Challenge Dataset for the Open-Domain Machine Comprehension of Text},
  booktitle = emnlp,
  pages     = {193--203},
  year      = {2013}
}

@inproceedings{berant2013semantic,
  title={Semantic Parsing on {Freebase} from Question-Answer Pairs},
  author={Berant, Jonathan and Chou, Andrew and Frostig, Roy and Liang, Percy},
  booktitle=emnlp,
  pages={1533--1544},
  year={2013}
}

@inproceedings{mikolov2013distributed,
  title={Distributed representations of words and phrases and their compositionality},
  author={Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff},
  booktitle=nips,
  pages={3111--3119},
  year={2013}
}

@article{kingma2014adam,
  title={Adam: A method for stochastic optimization},
  author={Kingma, Diederik and Ba, Jimmy},
  journal={arXiv preprint arXiv:1412.6980},
  year={2014}
}


@inproceedings{manning2014stanford,
  title={The {Stanford} {CoreNLP} natural language processing toolkit},
  author={Manning, Christopher D and Surdeanu, Mihai and Bauer, John and Finkel, Jenny and Bethard, Steven J and McClosky, David},
  booktitle=acl_demo,
  pages={55--60},
  year={2014}
}

@inproceedings{fader2014open,
  title={Open Question Answering Over Curated and Extracted Knowledge Bases},
  author={Fader, Anthony and Zettlemoyer, Luke and Etzioni, Oren},
  booktitle={SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)},
  year={2014}
}

@inproceedings{yao2014freebase,
  title={Freebase {QA}: Information Extraction or Semantic Parsing?},
  author={Yao, Xuchen and Berant, Jonathan and Van Durme, Benjamin},
  booktitle={ACL 2014 Workshop on Semantic Parsing},
  pages={82--86},
  year={2014}
}

@inproceedings{berant2014modeling,
  author    = {Berant, Jonathan  and  Srikumar, Vivek  and  Chen, Pei-Chun  and  Vander Linden, Abby  and  Harding, Brittany  and  Huang, Brad  and  Clark, Peter  and  Manning, Christopher D.},
  title     = {Modeling Biological Processes for Reading Comprehension},
  booktitle = emnlp,
  year      = {2014},
  pages     = {1499--1510}
}

@inproceedings{cho2014learning,
  title={Learning Phrase Representations using {RNN} Encoder-Decoder for Statistical Machine Translation},
  author={Cho, Kyunghyun and Merrienboer, Bart and Gulcehre, Caglar and Bougares, Fethi and Schwenk, Holger and Bengio, Yoshua},
  booktitle=emnlp,
  year={2014},
  pages = {1724--1734}
}

@inproceedings{pennington2014glove,
  title={Glove: Global vectors for word representation},
  author={Pennington, Jeffrey and Socher, Richard and Manning, Christopher},
  booktitle=emnlp,
  pages={1532--1543},
  year={2014}
}

@inproceedings{kim2014convolutional,
  title={Convolutional Neural Networks for Sentence Classification},
  author={Kim, Yoon},
  booktitle=emnlp,
  pages={1746--1751},
  year={2014}
}

@article{ryu2014open,
  title={Open domain question answering using {Wikipedia-based} knowledge model},
  author={Ryu, Pum-Mo and Jang, Myung-Gil and Kim, Hyun-Ki},
  journal={Information Processing \& Management},
  volume={50},
  pages={683--692},
  year={2014},
  publisher={Elsevier}
}

@inproceedings{sutskever2014sequence,
  title={Sequence to sequence learning with neural networks},
  author={Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V},
  booktitle=nips,
  pages={3104--3112},
  year={2014}
}

@inproceedings{chelba2014one,
  title={One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling},
  author={Chelba, Ciprian and Mikolov, Tomas and Schuster, Mike and Ge, Qi and Brants, Thorsten and Koehn, Phillipp and Robinson, Tony},
  booktitle={Conference of the International Speech Communication Association (Interspeech)},
  year={2014}
}

@inproceedings{antol2015vqa,
  title={{VQA}: Visual {Q}uestion {A}nswering},
  author={Antol, Stanislaw and Agrawal, Aishwarya and Lu, Jiasen and Mitchell, Margaret and Batra, Dhruv and Lawrence Zitnick, C and Parikh, Devi},
  booktitle=iccv,
  pages={2425--2433},
  year={2015}
}

@article{vinyals2015neural,
	title = {A Neural Conversational Model},
	journal = {arXiv preprint arXiv:1506.05869},
	author = {Vinyals, Oriol and Le, Quoc},
	year = {2015}
}

@inproceedings{pasupat2015compositional,
  title={Compositional Semantic Parsing on Semi-Structured Tables},
  author={Pasupat, Panupong and Liang, Percy},
  booktitle=acl,
  pages={1470--1480},
  year={2015}
}

@inproceedings{baudivs2015modeling,
  title={Modeling of the question answering task in the {YodaQA} system},
  author={Baudi{\v{s}}, Petr and {\v{S}}ediv{\`y}, Jan},
  booktitle={International Conference of the Cross-Language Evaluation Forum for European Languages},
  pages={222--228},
  year={2015},
  organization={Springer}
}

@inproceedings{baudivs2015yodaqa,
  title={{YodaQA}: a modular question answering system pipeline},
  author={Baudi{\v{s}}, Petr},
  booktitle={POSTER 2015---19th International Student Conference on Electrical Engineering},
  pages={1156--1165},
  year={2015}
}

@book{gormley2015elasticsearch,
  title={Elasticsearch: The Definitive Guide},
  author={Gormley, Clinton and Tong, Zachary},
  year={2015},
  publisher={O'Reilly Media, Inc}
}

@article{bordes2015large,
  title={Large-scale Simple Question Answering with Memory Networks},
  author={Bordes, Antoine and Usunier, Nicolas and Chopra, Sumit and Weston, Jason},
  journal={arXiv preprint arXiv:1506.02075},
  year={2015}
}

@inproceedings{weston2015memory,
  author = {Weston, Jason and Chopra, Sumit and Bordes, Antoine},
  title = {Memory Networks},
  booktitle = iclr,
  year = {2015}
}

@inproceedings{bahdanau2015neural,
  author    = {Dzmitry Bahdanau and Kyunghyun Cho and Yoshua Bengio},
  title     = {Neural Machine Translation by Jointly Learning to Align and Translate},
  booktitle = iclr,
  year={2015}
}

@inproceedings{hermann2015teaching,
  author = {Karl Moritz Hermann and Tom\'a\v{s} Ko\v{c}isk\'y and Edward Grefenstette and Lasse Espeholt and Will Kay and Mustafa Suleyman and Phil Blunsom},
  title = {Teaching Machines to Read and Comprehend},
  booktitle = nips,
  pages={1693--1701},
  year = {2015},
}

@inproceedings{srivastava2015training,
  title={Training very deep networks},
  author={Srivastava, Rupesh K and Greff, Klaus and Schmidhuber, J{\"u}rgen},
  booktitle=nips,
  pages={2377--2385},
  year={2015}
}

@inproceedings{narasimhan2015machine,
  title={Machine comprehension with discourse relations},
  author={Narasimhan, Karthik and Barzilay, Regina},
  booktitle=acl,
  volume={1},
  pages={1253--1262},
  year={2015}
}

@inproceedings{sachan2015learning,
  title={Learning answer-entailing structures for machine comprehension},
  author={Sachan, Mrinmaya and Dubey, Kumar and Xing, Eric and Richardson, Matthew},
  booktitle=acl,
  volume={1},
  pages={239--249},
  year={2015}
}

@inproceedings{wang2015machine,
  title={Machine comprehension with syntax, frames, and semantics},
  author={Wang, Hai and Bansal, Mohit and Gimpel, Kevin and McAllester, David},
  booktitle=acl,
  volume={2},
  pages={700--706},
  year={2015}
}

@inproceedings{luong2015effective,
  title={Effective Approaches to Attention-based Neural Machine Translation},
  author={Luong, Thang and Pham, Hieu and Manning, Christopher D},
  booktitle=emnlp,
  pages={1412--1421},
  year={2015}
}

@inproceedings{sun2015open,
  title={Open domain question answering via semantic enrichment},
  author={Sun, Huan and Ma, Hao and Yih, Wen-tau and Tsai, Chen-Tse and Liu, Jingjing and Chang, Ming-Wei},
  booktitle={International Conference on World Wide Web (WWW)},
  pages={1045--1055},
  year={2015}
}

@article{cho2015natural,
  title={Natural language understanding with distributed representation},
  author={Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1511.07916},
  year={2015}
}

@inproceedings{he2016deep,
  title={Deep residual learning for image recognition},
  author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
  booktitle=cvpr,
  pages={770--778},
  year={2016}
}

@inproceedings{tapaswi2016movieqa,
  title={{MovieQA}: Understanding stories in movies through question-answering},
  author={Tapaswi, Makarand and Zhu, Yukun and Stiefelhagen, Rainer and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja},
  booktitle=cvpr,
  pages={4631--4640},
  year={2016}
}

@inproceedings{ranzato2016sequence,
  title={Sequence level training with recurrent neural networks},
  author={Ranzato, Marc'Aurelio and Chopra, Sumit and Auli, Michael and Zaremba, Wojciech},
  booktitle=iclr,
  year={2016}
}

@article{nguyen2016ms,
  title={{MS MARCO}: A human generated machine reading comprehension dataset},
  author={Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}

@article{lee2016learning,
  title={Learning recurrent span representations for extractive question answering},
  author={Lee, Kenton and Salant, Shimi and Kwiatkowski, Tom and Parikh, Ankur and Das, Dipanjan and Berant, Jonathan},
  journal={arXiv preprint arXiv:1611.01436},
  year={2016}
}


@inproceedings{li2016diversity,
  title={A Diversity-Promoting Objective Function for Neural Conversation Models},
  author={Li, Jiwei and Galley, Michel and Brockett, Chris and Gao, Jianfeng and Dolan, Bill},
  booktitle=naacl,
  pages={110--119},
  year={2016}
}

@article{bajgar2016embracing,
  title={Embracing data abundance: {BookTest} dataset for reading comprehension},
  author={Bajgar, Ondrej and Kadlec, Rudolf and Kleindienst, Jan},
  journal={arXiv preprint arXiv:1610.00956},
  year={2016}
}

@inproceedings{chen2016thorough,
    title={A Thorough Examination of the {CNN/Daily Mail} Reading Comprehension Task},
    author={Chen, Danqi and Bolton, Jason and Manning, Christopher D},
    booktitle=acl,
    volume={1},
    year={2016},
    pages = {2358--2367},
}

@inproceedings{shen2016minimum,
  title={Minimum Risk Training for Neural Machine Translation},
  author={Shen, Shiqi and Cheng, Yong and He, Zhongjun and He, Wei and Wu, Hua and Sun, Maosong and Liu, Yang},
  booktitle=acl,
  volume={1},
  pages={1683--1692},
  year={2016}
}

@inproceedings{gu2016incorporating,
  author    = {Gu, Jiatao  and  Lu, Zhengdong  and  Li, Hang  and  Li, Victor O.K.},
  title     = {Incorporating Copying Mechanism in Sequence-to-Sequence Learning},
  booktitle = acl,
  year      = {2016},
  pages     = {1631--1640}
}

@inproceedings{lei2016rationalizing,
  title={Rationalizing Neural Predictions},
  author={Lei, Tao and Barzilay, Regina and Jaakkola, Tommi},
  booktitle=emnlp,
  pages={107--117},
  year={2016}
}

@inproceedings{rajpurkar2016squad,
  author = {Rajpurkar, Pranav  and  Zhang, Jian  and  Lopyrev, Konstantin  and  Liang, Percy},
  booktitle = emnlp,
  title = {{SQuAD}: 100,000+ Questions for Machine Comprehension of Text},
  year = {2016},
  pages = {2383--2392}
}

@inproceedings{andreas2016learning,
  title={Learning to Compose Neural Networks for Question Answering},
  author={Andreas, Jacob and Rohrbach, Marcus and Darrell, Trevor and Klein, Dan},
  booktitle=naacl,
  pages={1545--1554},
  year={2016}
}

@inproceedings{parikh2016decomposable,
  title={A Decomposable Attention Model for Natural Language Inference},
  author={Parikh, Ankur and T{\"a}ckstr{\"o}m, Oscar and Das, Dipanjan and Uszkoreit, Jakob},
  booktitle=emnlp,
  pages={2249--2255},
  year={2016}
}

@inproceedings{onishi2016did,
  title={Who did What: A Large-Scale Person-Centered Cloze Dataset},
  author={Onishi, Takeshi and Wang, Hai and Bansal, Mohit and Gimpel, Kevin and McAllester, David},
  booktitle=emnlp,
  pages={2230--2235},
  year={2016}
}

@inproceedings{miller2016key,
  title={Key-Value Memory Networks for Directly Reading Documents},
  author={Miller, Alexander and Fisch, Adam and Dodge, Jesse and Karimi, Amir-Hossein and Bordes, Antoine and Weston, Jason},
  booktitle=emnlp,
  pages={1400--1409},
  year={2016}
}

@inproceedings{liu2016not,
  title={How {NOT} To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation},
  author={Liu, Chia-Wei and Lowe, Ryan and Serban, Iulian and Noseworthy, Mike and Charlin, Laurent and Pineau, Joelle},
  booktitle=emnlp,
  pages={2122--2132},
  year={2016}
}

@inproceedings{hill2016goldilocks,
  title={The {Goldilocks} {Principle}: Reading Children's Books with Explicit Memory Representations},
  author={Hill, Felix and Bordes, Antoine and Chopra, Sumit and Weston, Jason},
  booktitle=iclr,
  year={2016}
}

@inproceedings{hewlett2016wiki,
  author    = {Hewlett, Daniel  and  Lacoste, Alexandre  and  Jones, Llion  and  Polosukhin, Illia  and  Fandrianto, Andrew  and  Han, Jay  and  Kelcey, Matthew  and  Berthelot, David},
  title     = {WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia},
  booktitle = acl,
  pages     = {1535--1545},
  year      = {2016}
}

@inproceedings{gal2016theoretically,
  title={A theoretically grounded application of dropout in recurrent neural networks},
  author={Gal, Yarin and Ghahramani, Zoubin},
  booktitle=nips,
  pages={1019--1027},
  year={2016}
}

@book{goldberg2017neural,
  title={Neural network methods for natural language processing},
  author={Goldberg, Yoav},
  journal={Synthesis Lectures on Human Language Technologies},
  volume={10},
  number={1},
  pages={1--309},
  year={2017},
  publisher={Morgan \& Claypool Publishers}
}

@inproceedings{klein2017opennmt,
  title={{OpenNMT}: Open-Source Toolkit for Neural Machine Translation},
  author={Klein, Guillaume and Kim, Yoon and Deng, Yuntian and Senellart, Jean and Rush, Alexander},
  journal=acl_demo,
  pages={67--72},
  year={2017}
}

@inproceedings{das2017visual,
  title={Visual Dialog},
  author={Das, Abhishek and Kottur, Satwik and Gupta, Khushi and Singh, Avi and Yadav, Deshraj and Moura, Jose MF and Parikh, Devi and Batra, Dhruv},
  booktitle=cvpr,
  pages={1080--1089},
  year={2017}
}

@article{mikolov2017advances,
  title={Advances in pre-training distributed word representations},
  author={Mikolov, Tomas and Grave, Edouard and Bojanowski, Piotr and Puhrsch, Christian and Joulin, Armand},
  journal={arXiv preprint arXiv:1712.09405},
  year={2017}
}

@inproceedings{wang2017gated,
  title={Gated self-matching networks for reading comprehension and question answering},
  author={Wang, Wenhui and Yang, Nan and Wei, Furu and Chang, Baobao and Zhou, Ming},
  booktitle=acl,
  volume={1},
  pages={189--198},
  year={2017}
}

@inproceedings{yu2017learning,
  title={Learning to Skim Text},
  author={Yu, Adams Wei and Lee, Hongrae and Le, Quoc},
  booktitle=acl,
  volume={1},
  pages={1880--1890},
  year={2017}
}

@inproceedings{weissenborn2017making,
  title={Making Neural QA as Simple as Possible but not Simpler},
  author={Weissenborn, Dirk and Wiese, Georg and Seiffe, Laura},
  booktitle=conll,
  pages={271--280},
  year={2017}
}

@inproceedings{seo2017bidirectional,
  title={Bidirectional attention flow for machine comprehension},
  author={Seo, Minjoon and Kembhavi, Aniruddha and Farhadi, Ali and Hajishirzi, Hannaneh},
  booktitle=iclr,
  year={2017}
}

@inproceedings{xiong2017dynamic,
  title={Dynamic coattention networks for question answering},
  author={Xiong, Caiming and Zhong, Victor and Socher, Richard},
  booktitle=iclr,
  year={2017}
}

@inproceedings{wang2017machine,
  title={Machine Comprehension using {Match-LSTM} and Answer Pointer},
  author={Wang, Shuohang and Jiang, Jing},
  booktitle=iclr,
  year={2017}
}

@inproceedings{chen2017reading,
    title={Reading {Wikipedia} to Answer Open-Domain Questions},
    author={Chen, Danqi and Fisch, Adam and Weston, Jason and Bordes, Antoine},
    booktitle=acl,
    volume={1},
    year={2017},
    pages={1870--1879}
}

@inproceedings{sugawara2017evaluation,
  title={Evaluation metrics for machine reading comprehension: Prerequisite skills and readability},
  author={Sugawara, Saku and Kido, Yusuke and Yokono, Hikaru and Aizawa, Akiko},
  booktitle=acl,
  volume={1},
  pages={806--817},
  year={2017}
}

@inproceedings{kembhavi2017you,
  title={Are You Smarter Than a Sixth Grader? {Textbook} Question Answering for Multimodal Machine Comprehension},
  author={Kembhavi, Aniruddha and Seo, Minjoon and Schwenk, Dustin and Choi, Jonghyun and Farhadi, Ali and Hajishirzi, Hannaneh},
  booktitle=cvpr,
  pages={5376--5384},
  year={2017}
}

@inproceedings{see2017get,
  title={Get to the point: Summarization with pointer-generator networks},
  author={See, Abigail and Liu, Peter J and Manning, Christopher D},
  booktitle=acl,
  volume={1},
  year={2017},
  pages={1073--1083}
}

@inproceedings{joshi2017triviaqa,
  title={{TriviaQA}: A large scale distantly supervised challenge dataset for reading comprehension},
  author={Joshi, Mandar and Choi, Eunsol and Weld, Daniel S and Zettlemoyer, Luke},
  booktitle=acl,
  volume={1},
  year={2017},
  pages={1601--1611}
}

@inproceedings{iyyer2017search,
  title={Search-based neural structured learning for sequential question answering},
  author={Iyyer, Mohit and Yih, Wen-tau and Chang, Ming-Wei},
  booktitle=acl,
  volume={1},
  pages={1821--1831},
  year={2017}
}

@inproceedings{xie2017constituent,
  title={A constituent-centric neural architecture for reading comprehension},
  author={Xie, Pengtao and Xing, Eric},
  booktitle=acl,
  volume={1},
  pages={1405--1414},
  year={2017}
}

@article{dhingra2017comparative,
  title={A comparative study of word embeddings for reading comprehension},
  author={Dhingra, Bhuwan and Liu, Hanxiao and Salakhutdinov, Ruslan and Cohen, William W},
  journal={arXiv preprint arXiv:1703.00993},
  year={2017}
}

@article{dhingra2017quasar,
  title={Quasar: Datasets for Question Answering by Search and Reading},
  author={Dhingra, Bhuwan and Mazaitis, Kathryn and Cohen, William W},
  journal={arXiv preprint arXiv:1707.03904},
  year={2017}
}

@inproceedings{miller2017parlai,
  title={{ParlAI}: A Dialog Research Software Platform},
  author={Miller, Alexander and Feng, Will and Batra, Dhruv and Bordes, Antoine and Fisch, Adam and Lu, Jiasen and Parikh, Devi and Weston, Jason},
  booktitle=emnlp,
  pages={79--84},
  year={2017}
}

@inproceedings{lai2017race,
  title={{RACE}: Large-scale ReAding Comprehension Dataset From Examinations},
  author={Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard},
  booktitle=emnlp,
  pages={785--794},
  year={2017}
}

@inproceedings{welbl2017crowdsourcing,
  title={Crowdsourcing Multiple Choice Science Questions},
  author={Welbl, Johannes and Liu, Nelson F and Gardner, Matt},
  booktitle={3rd Workshop on Noisy User-generated Text},
  pages={94--106},
  year={2017}
}

@inproceedings{jia2017adversarial,
  title={Adversarial Examples for Evaluating Reading Comprehension Systems},
  author={Jia, Robin and Liang, Percy},
  booktitle=emnlp,
  pages={2021--2031},
  year={2017}
}

@article{dunn2017searchqa,
  title={{SearchQA}: A new {Q\&A} dataset augmented with context from a search engine},
  author={Dunn, Matthew and Sagun, Levent and Higgins, Mike and Guney, V Ugur and Cirik, Volkan and Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1704.05179},
  year={2017}
}

@inproceedings{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
  booktitle=nips,
  pages={5998--6008},
  year={2017}
}

@inproceedings{coleman2017dawnbench,
  title={{DAWNBench}: An End-to-End Deep Learning Benchmark and Competition},
  author={Coleman, Cody and Narayanan, Deepak and Kang, Daniel and Zhao, Tian and Zhang, Jian and Nardi, Luigi and Bailis, Peter and Olukotun, Kunle and R{\'e}, Chris and Zaharia, Matei},
  booktitle={NIPS ML Systems Workshop},
  year={2017}
}


@inproceedings{mccann2017learned,
  title={Learned in translation: Contextualized word vectors},
  author={McCann, Bryan and Bradbury, James and Xiong, Caiming and Socher, Richard},
  booktitle=nips,
  pages={6297--6308},
  year={2017}
}

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal=tacl,
  volume={5},
  pages={135--146},
  year={2017}
}

@inproceedings{wang2018r,
  title={R\^{}3: Reinforced Reader-Ranker for Open-Domain Question Answering},
  author={Wang, Shuohang and Yu, Mo and Guo, Xiaoxiao and Wang, Zhiguo and Klinger, Tim and Zhang, Wei and Chang, Shiyu and Tesauro, Gerald and Zhou, Bowen and Jiang, Jing},
  booktitle=aaai,
  year={2018}
}

@inproceedings{wang2018evidence,
  title={Evidence Aggregation for Answer Re-Ranking in Open-Domain Question Answering},
  author={Wang, Shuohang and Yu, Mo and Jiang, Jing and Zhang, Wei and Guo, Xiaoxiao and Chang, Shiyu and Wang, Zhiguo and Klinger, Tim and Tesauro, Gerald and Campbell, Murray},
  booktitle=iclr,
  year={2018}
}

@inproceedings{talmor2018web,
  title={The Web as a Knowledge-Base for Answering Complex Questions},
  author={Talmor, Alon and Berant, Jonathan},
  booktitle=naacl,
  volume={1},
  pages={641--651},
  year={2018}
}

@inproceedings{yu2018qanet,
  title={{QANet}: Combining Local Convolution with Global Self-Attention for Reading Comprehension},
  author={Yu, Adams Wei and Dohan, David and Luong, Minh-Thang and Zhao, Rui and Chen, Kai and Norouzi, Mohammad and Le, Quoc V},
  booktitle=iclr,
  year={2018}
}

@inproceedings{peters2018deep,
  title={Deep Contextualized Word Representations},
  author={Peters, Matthew and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
  booktitle=naacl,
  volume={1},
  pages={2227--2237},
  year={2018}
}

@inproceedings{khashabi2018looking,
  title={Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences},
  author={Khashabi, Daniel and Chaturvedi, Snigdha and Roth, Michael and Upadhyay, Shyam and Roth, Dan},
  booktitle=naacl,
  volume={1},
  pages={252--262},
  year={2018}
}

@inproceedings{huang2018fusionnet,
  title={{FusionNet}: Fusing via Fully-aware Attention with Application to Machine Comprehension},
  author={Huang, Hsin-Yuan and Zhu, Chenguang and Shen, Yelong and Chen, Weizhu},
  booktitle=iclr,
  year={2018}
}

@inproceedings{zhang2018personalizing,
	title = {Personalizing Dialogue Agents: {I} have a dog, do you have pets too?},
	booktitle = acl,
	author = {Zhang, Saizheng and Dinan, Emily and Urbanek, Jack and Szlam, Arthur and Kiela, Douwe and Weston, Jason},
	year = {2018},
  volume={1},
  pages={2204--2213}
}


@inproceedings{fan2018hierarchical,
  title={Hierarchical Neural Story Generation},
  author={Fan, Angela and Lewis, Mike and Dauphin, Yann},
  booktitle=acl,
  volume={1},
  pages={889--898},
  year={2018}
}

@inproceedings{rajpurkar2018know,
  title={Know What You Don't Know: Unanswerable Questions for {SQuAD}},
  author={Rajpurkar, Pranav and Jia, Robin and Liang, Percy},
  booktitle=acl,
  volume={2},
  pages={784--789},
  year={2018}
}

@inproceedings{chaganty2018price,
  title={The price of debiasing automatic metrics in natural language evaluation},
  author={Chaganty, Arun Tejasvi and Mussman, Stephen and Liang, Percy},
  booktitle=acl,
  volume={1},
  pages={643--653},
  year={2018}
}

@inproceedings{liu2018stochastic,
  title={Stochastic answer networks for machine reading comprehension},
  author={Liu, Xiaodong and Shen, Yelong and Duh, Kevin and Gao, Jianfeng},
  booktitle=acl,
  volume={1},
  pages={1694--1704},
  year={2018}
}

@inproceedings{lin2018denoising,
  title={Denoising distantly supervised open-domain question answering},
  author={Lin, Yankai and Ji, Haozhe and Liu, Zhiyuan and Sun, Maosong},
  booktitle=acl,
  volume={1},
  pages={1736--1745},
  year={2018}
}

@inproceedings{saha2018complex,
    title = {Complex Sequential Question Answering: Towards Learning to Converse Over Linked Question Answer Pairs with a Knowledge Graph},
    booktitle = aaai,
    author = {Saha, Amrita and Pahuja, Vardaan and Khapra, Mitesh M. and Sankaranarayanan, Karthik and Chandar, Sarath},
    year = {2018}
}

@inproceedings{clark2018simple,
  title={Simple and Effective Multi-Paragraph Reading Comprehension},
  author={Clark, Christopher and Gardner, Matt},
  booktitle=acl,
  volume={1},
  pages={845-855},
  year={2018}
}

@inproceedings{xiong2018dcn+,
  title={{DCN+}: Mixed objective and deep residual coattention for question answering},
  author={Xiong, Caiming and Zhong, Victor and Socher, Richard},
  booktitle=iclr,
  year={2018}
}

@inproceedings{seo2018neural,
  title={Neural Speed Reading via {Skim-RNN}},
  author={Seo, Minjoon and Min, Sewon and Farhadi, Ali and Hajishirzi, Hannaneh},
  booktitle=iclr,
  year={2018}
}

@inproceedings{guo2018dialog,
  title = {Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base},
  author = {Guo, Daya and Tang, Duyu and Duan, Nan and Zhou, Ming and Yin, Jian},
  booktitle = nips,
  pages = {2943--2952},
  year = {2018}
}

@inproceedings{choi2018quac,
	title = {{QuAC}: Question Answering in Context},
	booktitle = emnlp,
	author = {Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke},
  pages={2174--2184},
	year = {2018}
}

@inproceedings{saeidi2018interpretation,
  title={Interpretation of Natural Language Rules in Conversational Machine Reading},
  author={Saeidi, Marzieh and Bartolo, Max and Lewis, Patrick and Singh, Sameer and Rockt{\"a}schel, Tim and Sheldon, Mike and Bouchard, Guillaume and Riedel, Sebastian},
  booktitle=emnlp,
  pages={2087--2097},
  year={2018}
}

@inproceedings{yang2018hotpotqa,
  title={Hotpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering},
  author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D},
  booktitle=emnlp,
  pages={2369--2380},
  year={2018}
}

@inproceedings{sugawara2018what,
  author = 	{Sugawara, Saku and Inui, Kentaro and Sekine, Satoshi and Aizawa, Akiko},
  title = 	{What Makes Reading Comprehension Questions Easier?},
  booktitle = emnlp,
  year = 	{2018},
  pages = {4208--4219}
}

@inproceedings{kaushik2018how,
  author = {Kaushik, Divyansh and Lipton, Zachary C.},
  title = {How Much Reading Does Reading Comprehension Require? {A} Critical Investigation of Popular Benchmarks},
  booktitle = emnlp,
  year = 	{2018},
  pages = 	{5010--5015}
}

@inproceedings{lei2018simple,
  title={Simple recurrent units for highly parallelizable recurrence},
  author={Lei, Tao and Zhang, Yu and Wang, Sida I and Dai, Hui and Artzi, Yoav},
  booktitle=emnlp,
  pages={4470--4481},
  year={2018}
}

@article{kovcisky2018narrativeqa,
  title={The {NarrativeQA} reading comprehension challenge},
  author={Ko{\v{c}}isk{\`y}, Tom{\'a}{\v{s}} and Schwarz, Jonathan and Blunsom, Phil and Dyer, Chris and Hermann, Karl Moritz and Melis, G{\'a}abor and Grefenstette, Edward},
  journal=tacl,
  volume={6},
  pages={317--328},
  year={2018}
}

@article{welbl2018constructing,
  title={Constructing Datasets for Multi-hop Reading Comprehension Across Documents},
  author={Welbl, Johannes and Stenetorp, Pontus and Riedel, Sebastian},
  journal={Transactions of the Association for Computational Linguistics},
  volume={6},
  pages={287--302},
  year={2018}
}

@article{reddy2019coqa,
     title={{CoQA}: A Conversational Question Answering Challenge},
     author={Reddy, Siva and Chen, Danqi and Manning, Christopher D},
     journal={Transactions of the Association of Computational Linguistics (TACL).},
     year={2019},
     note={accepted pending revisions}
}

@article{raison2018weaver,
  title={Weaver: Deep Co-Encoding of Questions and Documents for Machine Reading},
  author={Raison, Martin and Mazar{\'e}, Pierre-Emmanuel and Das, Rajarshi and Bordes, Antoine},
  journal={arXiv preprint arXiv:1804.10490},
  year={2018}
}

@article{devlin2018bert,
  title={{BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

@techreport{radford2018improving,
  title={Improving language understanding by generative pre-training},
  author={Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya},
  year={2018},
  institution={OpenAI}
}


@article{gao2018neural,
  title={Neural Approaches to Conversational {AI}},
  author={Gao, Jianfeng and Galley, Michel and Li, Lihong},
  journal={arXiv preprint arXiv:1809.08267},
  year={2018}
}

@article{huang2018flowqa,
  title={{FlowQA}: Grasping Flow in History for Conversational Machine Comprehension},
  author={Huang, Hsin-Yuan and Choi, Eunsol and Yih, Wen-tau},
  journal={arXiv preprint arXiv:1810.06683},
  year={2018}
}


================================================
FILE: std-macros.tex
================================================
% version 1.2 05/21/08
\newcommand\sa{\ensuremath{\mathcal{a}}}
\newcommand\sd{\ensuremath{\mathcal{d}}}
\newcommand\se{\ensuremath{\mathcal{e}}}
\newcommand\sg{\ensuremath{\mathcal{g}}}
\newcommand\sh{\ensuremath{\mathcal{h}}}
\newcommand\seye{\ensuremath{\mathcal{i}}}
\newcommand\sj{\ensuremath{\mathcal{j}}}
\newcommand\sk{\ensuremath{\mathcal{k}}}
\newcommand\sm{\ensuremath{\mathcal{m}}}
\newcommand\sn{\ensuremath{\mathcal{n}}}
\newcommand\sq{\ensuremath{\mathcal{q}}}
\newcommand\sr{\ensuremath{\mathcal{r}}}
\newcommand\su{\ensuremath{\mathcal{u}}}
\newcommand\sv{\ensuremath{\mathcal{v}}}
\newcommand\sw{\ensuremath{\mathcal{w}}}
\newcommand\sx{\ensuremath{\mathcal{x}}}
\newcommand\sy{\ensuremath{\mathcal{y}}}
\newcommand\sz{\ensuremath{\mathcal{z}}}
\newcommand\sA{\ensuremath{\mathcal{A}}}
\newcommand\sB{\ensuremath{\mathcal{B}}}
\newcommand\sC{\ensuremath{\mathcal{C}}}
\newcommand\sD{\ensuremath{\mathcal{D}}}
\newcommand\sE{\ensuremath{\mathcal{E}}}
\newcommand\sF{\ensuremath{\mathcal{F}}}
\newcommand\sG{\ensuremath{\mathcal{G}}}
\newcommand\sH{\ensuremath{\mathcal{H}}}
\newcommand\sI{\ensuremath{\mathcal{I}}}
\newcommand\sJ{\ensuremath{\mathcal{J}}}
\newcommand\sK{\ensuremath{\mathcal{K}}}
\newcommand\sL{\ensuremath{\mathcal{L}}}
\newcommand\sM{\ensuremath{\mathcal{M}}}
\newcommand\sN{\ensuremath{\mathcal{N}}}
\newcommand\sO{\ensuremath{\mathcal{O}}}
\newcommand\sP{\ensuremath{\mathcal{P}}}
\newcommand\sQ{\ensuremath{\mathcal{Q}}}
\newcommand\sR{\ensuremath{\mathcal{R}}}
\newcommand\sS{\ensuremath{\mathcal{S}}}
\newcommand\sT{\ensuremath{\mathcal{T}}}
\newcommand\sU{\ensuremath{\mathcal{U}}}
\newcommand\sV{\ensuremath{\mathcal{V}}}
\newcommand\sW{\ensuremath{\mathcal{W}}}
\newcommand\sX{\ensuremath{\mathcal{X}}}
\newcommand\sY{\ensuremath{\mathcal{Y}}}
\newcommand\sZ{\ensuremath{\mathcal{Z}}}
\newcommand\ba{\ensuremath{\mathbf{a}}}
\newcommand\bb{\ensuremath{\mathbf{b}}}
\newcommand\bc{\ensuremath{\mathbf{c}}}
\newcommand\bd{\ensuremath{\mathbf{d}}}
\newcommand\be{\ensuremath{\mathbf{e}}}
\newcommand\bef{\ensuremath{\mathbf{f}}}
\newcommand\bg{\ensuremath{\mathbf{g}}}
\newcommand\bh{\ensuremath{\mathbf{h}}}
\newcommand\bi{\ensuremath{\mathbf{i}}}
\newcommand\bj{\ensuremath{\mathbf{j}}}
\newcommand\bk{\ensuremath{\mathbf{k}}}
\newcommand\bl{\ensuremath{\mathbf{l}}}
\newcommand\bn{\ensuremath{\mathbf{n}}}
\newcommand\bo{\ensuremath{\mathbf{o}}}
\newcommand\bp{\ensuremath{\mathbf{p}}}
\newcommand\bq{\ensuremath{\mathbf{q}}}
\newcommand\br{\ensuremath{\mathbf{r}}}
\newcommand\bs{\ensuremath{\mathbf{s}}}
\newcommand\bt{\ensuremath{\mathbf{t}}}
\newcommand\bu{\ensuremath{\mathbf{u}}}
\newcommand\bv{\ensuremath{\mathbf{v}}}
\newcommand\bw{\ensuremath{\mathbf{w}}}
\newcommand\bx{\ensuremath{\mathbf{x}}}
\newcommand\by{\ensuremath{\mathbf{y}}}
\newcommand\bz{\ensuremath{\mathbf{z}}}
\newcommand\bA{\ensuremath{\mathbf{A}}}
\newcommand\bB{\ensuremath{\mathbf{B}}}
\newcommand\bC{\ensuremath{\mathbf{C}}}
\newcommand\bD{\ensuremath{\mathbf{D}}}
\newcommand\bE{\ensuremath{\mathbf{E}}}
\newcommand\bF{\ensuremath{\mathbf{F}}}
\newcommand\bG{\ensuremath{\mathbf{G}}}
\newcommand\bH{\ensuremath{\mathbf{H}}}
\newcommand\bI{\ensuremath{\mathbf{I}}}
\newcommand\bJ{\ensuremath{\mathbf{J}}}
\newcommand\bK{\ensuremath{\mathbf{K}}}
\newcommand\bL{\ensuremath{\mathbf{L}}}
\newcommand\bM{\ensuremath{\mathbf{M}}}
\newcommand\bN{\ensuremath{\mathbf{N}}}
\newcommand\bO{\ensuremath{\mathbf{O}}}
\newcommand\bP{\ensuremath{\mathbf{P}}}
\newcommand\bQ{\ensuremath{\mathbf{Q}}}
\newcommand\bR{\ensuremath{\mathbf{R}}}
\newcommand\bS{\ensuremath{\mathbf{S}}}
\newcommand\bT{\ensuremath{\mathbf{T}}}
\newcommand\bU{\ensuremath{\mathbf{U}}}
\newcommand\bV{\ensuremath{\mathbf{V}}}
\newcommand\bW{\ensuremath{\mathbf{W}}}
\newcommand\bX{\ensuremath{\mathbf{X}}}
\newcommand\bY{\ensuremath{\mathbf{Y}}}
\newcommand\bZ{\ensuremath{\mathbf{Z}}}
\newcommand\Ba{\ensuremath{\mathbb{a}}}
\newcommand\Bb{\ensuremath{\mathbb{b}}}
\newcommand\Bc{\ensuremath{\mathbb{c}}}
\newcommand\Bd{\ensuremath{\mathbb{d}}}
\newcommand\Be{\ensuremath{\mathbb{e}}}
\newcommand\Bf{\ensuremath{\mathbb{f}}}
\newcommand\Bg{\ensuremath{\mathbb{g}}}
\newcommand\Bh{\ensuremath{\mathbb{h}}}
\newcommand\Bi{\ensuremath{\mathbb{i}}}
\newcommand\Bj{\ensuremath{\mathbb{j}}}
\newcommand\Bk{\ensuremath{\mathbb{k}}}
\newcommand\Bl{\ensuremath{\mathbb{l}}}
\newcommand\Bm{\ensuremath{\mathbb{m}}}
\newcommand\Bn{\ensuremath{\mathbb{n}}}
\newcommand\Bo{\ensuremath{\mathbb{o}}}
\newcommand\Bp{\ensuremath{\mathbb{p}}}
\newcommand\Bq{\ensuremath{\mathbb{q}}}
\newcommand\Br{\ensuremath{\mathbb{r}}}
\newcommand\Bs{\ensuremath{\mathbb{s}}}
\newcommand\Bt{\ensuremath{\mathbb{t}}}
\newcommand\Bu{\ensuremath{\mathbb{u}}}
\newcommand\Bv{\ensuremath{\mathbb{v}}}
\newcommand\Bw{\ensuremath{\mathbb{w}}}
\newcommand\Bx{\ensuremath{\mathbb{x}}}
\newcommand\By{\ensuremath{\mathbb{y}}}
\newcommand\Bz{\ensuremath{\mathbb{z}}}
\newcommand\BA{\ensuremath{\mathbb{A}}}
\newcommand\BB{\ensuremath{\mathbb{B}}}
\newcommand\BC{\ensuremath{\mathbb{C}}}
\newcommand\BD{\ensuremath{\mathbb{D}}}
\newcommand\BE{\ensuremath{\mathbb{E}}}
\newcommand\BF{\ensuremath{\mathbb{F}}}
\newcommand\BG{\ensuremath{\mathbb{G}}}
\newcommand\BH{\ensuremath{\mathbb{H}}}
\newcommand\BI{\ensuremath{\mathbb{I}}}
\newcommand\BJ{\ensuremath{\mathbb{J}}}
\newcommand\BK{\ensuremath{\mathbb{K}}}
\newcommand\BL{\ensuremath{\mathbb{L}}}
\newcommand\BM{\ensuremath{\mathbb{M}}}
\newcommand\BN{\ensuremath{\mathbb{N}}}
\newcommand\BO{\ensuremath{\mathbb{O}}}
\newcommand\BP{\ensuremath{\mathbb{P}}}
\newcommand\BQ{\ensuremath{\mathbb{Q}}}
\newcommand\BR{\ensuremath{\mathbb{R}}}
\newcommand\BS{\ensuremath{\mathbb{S}}}
\newcommand\BT{\ensuremath{\mathbb{T}}}
\newcommand\BU{\ensuremath{\mathbb{U}}}
\newcommand\BV{\ensuremath{\mathbb{V}}}
\newcommand\BW{\ensuremath{\mathbb{W}}}
\newcommand\BX{\ensuremath{\mathbb{X}}}
\newcommand\BY{\ensuremath{\mathbb{Y}}}
\newcommand\BZ{\ensuremath{\mathbb{Z}}}
\newcommand\balpha{\ensuremath{\mbox{\boldmath$\alpha$}}}
\newcommand\bbeta{\ensuremath{\mbox{\boldmath$\beta$}}}
\newcommand\btheta{\ensuremath{\mbox{\boldmath$\theta$}}}
\newcommand\bphi{\ensuremath{\mbox{\boldmath$\phi$}}}
\newcommand\bpi{\ensuremath{\mbox{\boldmath$\pi$}}}
\newcommand\bpsi{\ensuremath{\mbox{\boldmath$\psi$}}}
\newcommand\bmu{\ensuremath{\mbox{\boldmath$\mu$}}}
% Basic
\newcommand\T{\text}
\newcommand\sign{\text{sign}}
\newcommand\tr{\text{tr}}
\newcommand\fig[1]{\begin{center} \includegraphics{#1} \end{center}}
\newcommand\Fig[5]{\begin{figure}[tb] \begin{center} \includegraphics[scale=#2]{#1} \end{center} \longcaption{#4}{\label{fig:#3} #5} \end{figure}}
\newcommand\FigTop[4]{\begin{figure}[t] \begin{center} \includegraphics[scale=#2]{#1} \end{center} \caption{\label{fig:#3} #4} \end{figure}}
\newcommand\FigStar[4]{\begin{figure*}[tb] \begin{center} \includegraphics[scale=#2]{#1} \end{center} \caption{\label{fig:#3} #4} \end{figure*}}
\newcommand\aside[1]{\quad\text{[#1]}}
\newcommand\homework[3]{\title{#1} \author{#2} \date{#3} \maketitle}
% Math
\newcommand\argmin{\mathop{\text{argmin}}}
\newcommand\argmax{\mathop{\text{argmax}}}
\newcommand\p[1]{\ensuremath{\left( #1 \right)}} % Parenthesis ()
\newcommand\pb[1]{\ensuremath{\left[ #1 \right]}} % []
\newcommand\pc[1]{\ensuremath{\left\{ #1 \right\}}} % {}
\newcommand\eval[2]{\ensuremath{\left. #1 \right|_{#2}}} % Evaluation
\newcommand\inv[1]{\ensuremath{\frac{1}{#1}}}
\newcommand\half{\ensuremath{\frac{1}{2}}}
\newcommand\R{\ensuremath{\mathbb{R}}} % Real numbers
\newcommand\Z{\ensuremath{\mathbb{Z}}} % Integers
\newcommand\inner[2]{\ensuremath{\left< #1, #2 \right>}} % Inner product
\newcommand\mat[2]{\ensuremath{\left(\begin{array}{#1}#2\end{array}\right)}} % Matrix
\newcommand\eqn[1]{\begin{eqnarray} #1 \end{eqnarray}} % Equation (array)
\newcommand\eqnl[2]{\begin{eqnarray} \label{eqn:#1} #2 \end{eqnarray}} % Equation (array) with label
\newcommand\eqdef{\ensuremath{\stackrel{\rm def}{=}}} % Equal by definition
%\newcommand{\1}{\mathbb{I}} % Indicator (don't use \mathbbm{1} because bbm is not TrueType though)
\newcommand{\1}{\ensuremath{\mathbbm{1}}}
\newcommand{\bone}{\mathbf{1}} % for vector one
\newcommand{\bzero}{\mathbf{0}} % for vector zero
\newcommand\refeqn[1]{(\ref{eqn:#1})}
\newcommand\refeqns[2]{(\ref{eqn:#1}) and (\ref{eqn:#2})}
\newcommand\refchp[1]{Chapter~\ref{chp:#1}}
\newcommand\refsec[1]{Section~\ref{sec:#1}}
\newcommand\refsecs[2]{Sections~\ref{sec:#1} and~\ref{sec:#2}}
\newcommand\reffig[1]{Figure~\ref{fig:#1}}
\newcommand\reffigs[2]{Figures~\ref{fig:#1} and~\ref{fig:#2}}
\newcommand\reffigss[3]{Figures~\ref{fig:#1},~\ref{fig:#2}, and~\ref{fig:#3}}
\newcommand\reffigsss[4]{Figures~\ref{fig:#1},~\ref{fig:#2},~\ref{fig:#3}, and~\ref{fig:#4}}
\newcommand\reftab[1]{Table~\ref{tab:#1}}
\newcommand\refapp[1]{Appendix~\ref{sec:#1}}
\newcommand\refthm[1]{Theorem~\ref{thm:#1}}
\newcommand\refthms[2]{Theorems~\ref{thm:#1} and~\ref{thm:#2}}
\newcommand\reflem[1]{Lemma~\ref{lem:#1}}
\newcommand\reflems[2]{Lemmas~\ref{lem:#1} and~\ref{lem:#2}}
\newcommand\refprop[1]{Proposition~\ref{prop:#1}}
\newcommand\refdef[1]{Definition~\ref{def:#1}}
\newcommand\refcor[1]{Corollary~\ref{cor:#1}}
\newcommand\refalg[1]{Algorithm~\ref{alg:#1}}

\newcommand\Chapter[2]{\chapter{#2}\label{chp:#1}}
\newcommand\Section[2]{\section{#2}\label{sec:#1}}
\newcommand\Subsection[2]{\subsection{#2}\label{sec:#1}}
\newcommand\Subsubsection[2]{\subsubsection{#2}\label{sec:#1}}
%\newtheorem{definition}{Definition}
%\newtheorem{assumption}{Assumption}
%\newtheorem{proposition}{Proposition}
%\newtheorem{theorem}{Theorem}
%\newtheorem{lemma}{Lemma}
%\newtheorem{corollary}{Corollary}
% Probability
\newcommand\cv{\ensuremath{\to}} % Convergence
\newcommand\cvL{\ensuremath{\xrightarrow{\mathcal{L}}}} % Convergence in law
\newcommand\cvd{\ensuremath{\xrightarrow{d}}} % Convergence in distribution
\newcommand\cvP{\ensuremath{\xrightarrow{P}}} % Convergence in probability
\newcommand\cvas{\ensuremath{\xrightarrow{a.s.}}} % Convergence almost surely
\newcommand\eqdistrib{\ensuremath{\stackrel{d}{=}}} % Equal in distribution
\newcommand\E[1]{\ensuremath{\mathbb{E}{\left[#1\right]}}} % Expectation
\newcommand\Ex[2]{\ensuremath{\mathbb{E}_{#1}\left[#2\right]}} % Expectation
%\newcommand\var{\ensuremath{\text{var}}} % Variance
\newcommand\cov{\ensuremath{\text{cov}}} % Covariance
\newcommand\diag{\ensuremath{\text{diag}}} % Diagnonal matrix
\newcommand\cE[2]{\ensuremath{\E \left( #1 \mid #2 \right)}} % Conditional expectation
\newcommand\KL[2]{\ensuremath{\T{KL}\left( #1 \,||\, #2 \right)}} % KL-divergence
\newcommand\D[2]{\ensuremath{\bD\left( #1 \,||\, #2 \right)}} % KL-divergence

% Utilities
\newcommand\lte{\leq}
\newcommand\gte{\geq}
\newcommand\lone[1]{\ensuremath{\|#1\|_1}}
\newcommand\ltwo[1]{\ensuremath{\|#1\|_2^2}}
\newcommand\naive{na\"{\i}ve}
\newcommand\Naive{Na\"{\i}ve}

% Debug
\usepackage{color}
\newcommand{\tred}[1]{\textcolor{red}{#1}}
\newcommand{\hly}[1]{\hl{yellow}{#1}}
% \def\todo#1{\hl{{\bf TODO:} #1}{yellow}}
\def\needcite{\hl{{$^{\tt\small[citation\ needed]}$}}{blue}}
\def\needfig{\hl{Figure X}{green}}
\def\needtab{\hl{Table Y}{green}}
\def\note#1{\hl{{\bf NOTE:} #1}{yellow}}
\def\dome{\hl{{\bf TODO:} write me!}{yellow}}


================================================
FILE: suthesis.sty
================================================
% Stanford University PhD thesis style -- modifications to the report style
% This is unofficial so you should always double check against the
% Registrar's office rules
%
% People are free to borrow as long as they change the name and date
% in the \typeout lines, the name of the file, and acknowledge the
% work that has been done by previous people.  Ideally they should
% comment their changes.

% Original version by Joseph Pallas back in 1989
% Modified by Emma Pease 5/7/92
%   added singlespace environment from doublespace.sty
%   added switches for variant title pages
%   modified the figure environment according to changes in latex.tex
%   corrected the signature page due to University rule changes
%   added an optional third reader to signature page
% Corrected a spacing problem with style changes 5/14/92 - Emma
% Modified by Emma Pease 1/10/95
% Modified for latex2e  5/17/95
%   changed \@xfloat and \@footnotetext to reflect latex2e changes
% Modified for latex2e 6/22/95 (Emma Pease)
%   changed singlespace environment so it would work (taken from doublespace.sty)
% Modified 9/8/95 (Emma Pease)
%   removed doublespace.sty commands and explicitely inputted
%   doublespace
% Modified 12/17/96 (Emma Pease)
%   added optional \coprincipaladvisor (\coprincipaladviser)
% Modified 5/29/98
%   replaced the required doublespace.sty by setspace.sty
% Modified 8/21/98
%   added a \businessthesis for the school of business
% Modified 8/22/98
%   added a \lawthesis
% Modified 8/23/98
%   spelling error in \businessthesis def corrected
% Modified 5/14/1999 by Emma Pease
%   'By' dropped from title page
% Modified 7/26/1999 by Emma Pease
%   copyright page fixed
% Modified 9/28/99 by Emma Pease
%   more copyright page fixings
% Modified 10/28/99 by Emma Pease
%   and more copyright page fixings plus a minor mod on bibliography
%   need to start thinking of overhauling to standard package format
% Modified 11/26/99
%   fixed copyrightyear so that all Fall quarter theses are next
%   year's copyright

% Modified 5/31/01 by Emma Pease
%   fixed certification statement.  Setup for twoside option.
% Modified 6/4/01 by Emma Pease
%   emphasized that it is unofficial
% Modified 8/3/01 by Emma Pease
%   setup so that on twoside if the intro material (page numbered with
%   roman numerals) ends on an odd page an extra blank page is included so
%   the main body (page numbered with arabic numbers) starts on an odd
%   absolute page  (explanation modified 5/28/02)
% Modified 5/28/02 by Emma Pease
%   made first and second reader optional (not that the first reader
%   should ever be missing but someone managed to avoid a second reader)
%   If they aren't defined, they won't appear
% Modified 7/13/2003 by Emma Pease
%   dropped signature line for ``Approved for the University Committee
%   on Graduate Studies'' on signature page.  Also made sure the next
%   section starts on an odd page if two sided.
% Modified 11/19/03 by Emma Pease
%   fixed the bibliography so the addcontentsline works correctly with
%   hyperref.  Thanks to Peter Sturdza for pointing out this
% Modified 2/14/04 by Emma Pease
%   Changed documentation on how to change line spacing
% Modified 6/29/04 by Emma Pease
%   Correction to humanitiesthesis definition

% Modified 11/9/2004 by Emma Pease
%   Reformatted Signature Page to fit requirements
%   Reformatted Title page to fit requirements

% Modified 8/26/2005 by Emma Pease
%   Modified \language to \languagemajor so as not to interfere with
%   babel.

% Modifed 10/31/2005 by Emma Pease
%   added an optional fourth reader to signature page (Biology)
%   added a length \signaturespace

% Modified 8/23/2006 by Emma Pease
%   added () around names on signature page

% Modified 5/7/2007 by Emma Pease
%   redefined \@endpart so that blank page after part has page number
%   as per thesis office requirements

% Modified 9/17/2008 by Emma Pease
%   changed copyright year calculations so September theses are summer

% November 2009 by Emma Pease
%   changing to use online or hardcopy options for the new online
%   submission possibility.

% Modified May 2010 by Emma Pease
%   added command \onlinesignature which creates a signatue page
%   for the online version.  This should be the last command before the
%   \end{document}

% Modified May 2014 by Emma Pease
%   fixed error in the signature page (Stanford University Committee not just University Committee)

%%%%%
%%%%%   PRELIMS
%%%%%

\ProvidesPackage{suthesis-2e}[2014/05/26]


%%\typeout{Document Style Option `suthesis' for latex2e <$Date: 9/17/2008 $>.}
\typeout{Note that this tries to fulfill the Stanford Thesis
  requirements but it is unofficial}

% First thing we do is make sure that report has been loaded.  A
% common error is to try to use suthesis as a documentstyle.
\@ifundefined{chapter}{\@latexerr{The `suthesis' option should be used
with the `report' document style}{You should probably read the
suthesis documentation.}}{}

%%%%%
%%%%%   SETUP DOUBLESPACING
%%%%%

% include doublespace.sty for some of the stuff below

\RequirePackage{setspace}

% default to hardcopy submission
\newif\ifonline
\onlinetrue
\DeclareOption{online}{\onlinetrue}
\DeclareOption{hardcopy}{\onlinefalse}
\ProcessOptions


% Use 1.3 times the normal baseline-to-baseline skip
\setstretch{1.3}


%%%%%
%%%%%   DOCUMENTATION
%%%%%

\long\def\comment#1{}
\comment{

  Example of use:
    \documentclass{report}

\usepackage{suthesis-2e}
\dept{Computer Science}


    \begin{document}
    \title{How to Write Theses\\
            With Two Line Titles}
    \author{John Henry Candidate}
    \principaladviser{John Parker}
    \firstreader{John Green}
    \secondreader{John BigBooty}
    \thirdreader{Jane Supernumerary} %if needed
    \fourthreader{Severus Snape} %if needed

    \beforepreface
    \prefacesection{Preface}
        This thesis tells you all you need to know about...
    \prefacesection{Acknowledgments}
        I would like to thank...
    \afterpreface

    \chapter{Introduction}
         ...
    \chapter{Conclusions}
         ...
    \appendix
    \chapter{A Long Proof}
         ...
    \bibliographystyle{plain}
    \bibliography{mybib}
    \end{document}

Documentation:
    This style file modifies the standard report style to follow the
    Graduate Degree Support Section of the Registrar's Office's
    "Directions for Preparing Doctoral Dissertations".  It sets the
    margins and interline spacing and disallows page breaks at
    hyphens.

    The \beforepreface command creates the title page, a copyright page
    (optionally), and a signature page.  Then the user should put
    preface section(s), using the \prefacesection{section title}
    command.  The \afterpreface command then produces the tables of
    contents, tables and figures, and sets things up to start
    the main body (on arabic page 1).

    The following commands can control what goes in the front matter
    material:

        \title{thesis title}
        \author{author's name}
        \dept{author's department}
                - Computer Science if omitted
The following switches allow for special title pages (not all are current)
        \committeethesis - for a thesis in a committee (no dept.)
                           use \dept{committee name}
        \programthesis - for a thesis in a program (no dept.)
                           use \dept{program name}
        \educationthesis - for the School of Education. \dept doesn't matter
        \businessthesis - for the GraduateSchool of Business. \dept doesn't matter
        \lawthesis - for the School of law. \dept doesn't matter
        \humanitiesthesis - for a thesis also submitted to the Graduate
                            Program in Humanities
        \specialthesis  - for a Graduate Special thesis
        \industrialthesis - for a thesis in Industrial Engineering
        \dualthesis     - for a thesis in a dual language department.
                          Also define \languagemajor{language}.
                          e.g., \dept{French and Italian}
                          \languagemajor{Italian}
         \principaladviser{the principal advisor's name}
           (or \principaladvisor, if you prefer advisor spelled with o)
        \coprincipaladviser{optional second principal advisor's name}
           (or \coprincipaladvisor, use only if you have two principal
           advisors only for the second one)
        \firstreader{the first reader's name}
        \secondreader{the second reader's name}
        \thirdreader{optional third reader's name}
        \fourthreader{optional fourth reader's name}
        \setlength{\signaturespace}{.5in}
                - default is .5in, can be adjusted to fit all
                signatures in one page
        \submitdate{month year in which submitted to GPO}
                - date LaTeX'd if omitted
        \copyrightyear{year degree conferred (next year if submitted
          in Fall quarter)}
                - year LaTeX'd (or next year, in December) if omitted
        \copyrighttrue or \copyrightfalse
                - produce or don't produce a copyright page (true by default)
        \thesiscopyrighttrue or \thesiscopyrightfalse
                - produces the style of copyright page listed by the
                Thesis Office or the style that everyone else uses
                (Thesis office by default).
        \figurespagetrue or \figurespagefalse
                - produce or don't produce a List of Figures page
                  (true by default)
        \tablespagetrue or \tablespagefalse
                - produce or don't produce a List of Tables page
                  (true by default)

This style uses interline spacing that is 1.3 times normal, except
in the figure and table environments where normal spacing is used.
That can be changed by doing:
    \setstretch{1.6}
(or whatever you want instead of 1.6)

This command should be put before the \begin{document} command but
after loading the packages

You can also set any particular section in singlespacing mode by using
the singlespace environment.  For example

\begin{quote}
\begin{singlespace}
...
\end{singlespace}
\end{quote}

makes the quote singlespaced.  See the documentation for setspace.sty
for more information.

The example at the beginning shows the 12pt substyle being used.  This
seems to give acceptable looking results, but it may be omitted to get
smaller print.

}


%%%%%
%%%%%   SETUP MARGINS AND PENALTIES NEEDED FOR STANFORD THESIS
%%%%%

% We need 1" margins except on the binding edge, where it is 1 1/2"
% Theses may be either single or double sided
  \if@twoside
     \setlength\oddsidemargin   {36.1\p@}
     \setlength\evensidemargin  {0\p@}
     \setlength\marginparwidth {40\p@}
  \else
     \setlength\oddsidemargin   {36.1\p@}
     \setlength\evensidemargin  {36.1\p@}
     \setlength\marginparwidth  {40\p@}
  \fi

\marginparsep 10pt
%\oddsidemargin 0.5in \evensidemargin 0in
%\marginparwidth 40pt


\topmargin 0pt \headsep .5in
\textheight 8.1in \textwidth 6in

% Disallow page breaks at hyphens (this will give some underfull vbox's,
% so an alternative is to use \brokenpenalty=100 and manually search
% for and fix such page breaks)
\brokenpenalty=10000

%%%%%
%%%%%   SETUP COMMANDS PECULIAR TO THESES
%%%%%

% \author, \title are defined in report; here are the rest of the
% front matter defining macros
\def\dept#1{\gdef\@dept{#1}}
\def\advis@r{Adviser} % default spelling
\def\principaladviser#1{\gdef\@principaladviser{#1}}
\def\principaladvisor#1{\gdef\@principaladviser{#1}\gdef\advis@r{Advisor}}
\def\coprincipaladvisor#1{\gdef\@coprincipaladviser{#1}\gdef\advis@r{Co-Advisor}}
\def\coprincipaladviser#1{\gdef\@coprincipaladviser{#1}\gdef\advis@r{Co-Adviser}}
\def\firstreader#1{\gdef\@firstreader{#1}}
\def\secondreader#1{\gdef\@secondreader{#1}}
\def\thirdreader#1{\gdef\@thirdreader{#1}}
\def\fourthreader#1{\gdef\@fourthreader{#1}}
\def\submitdate#1{\gdef\@submitdate{#1}}
\def\copyrightyear#1{\gdef\@copyrightyear{#1}} % \author, \title in report
% needed only for dual language departments to choose the language
\def\languagemajor#1{\gdef\@languagemajor{#1}} \def\@language{babel}
\def\jointprogram#1{\gdef\@jointprogram{#1}}
\def\@title{}\def\@author{}\def\@dept{computer science}
\def\@principaladviser{}\def\@firstreader{*}\def\@secondreader{*}
\def\@coprincipaladviser{*}
\def\@thirdreader{*}
\def\@fourthreader{*}
\def\@submitdate{\ifcase\the\month\or
  January\or February\or March\or April\or May\or June\or
  July\or August\or September\or October\or November\or December\fi
  \space \number\the\year}
% Stanford says that Fall quarter theses should have the next year as the
% copyright year
\ifnum\month>9
    \@tempcnta=\year \advance\@tempcnta by 1
    \edef\@copyrightyear{\number\the\@tempcnta}
\else
    \def\@copyrightyear{\number\the\year}
\fi
\newif\ifcopyright \newif\iffigurespage \newif\iftablespage
\newif\ifthesiscopyright


\copyrighttrue
\thesiscopyrighttrue

\figurespagetrue \tablespagetrue


\def\@standardsub{submitted to the department of \uppercase\expandafter{\@dept}\\
                and the committee on graduate studies}
\def\@standardend{}

\def\committeethesis{\let\@whichsub=\@committeesub}
\def\programthesis{\let\@whichsub=\@programsub}
\def\educationthesis{\let\@whichsub=\@educationsub}
\def\businessthesis{\let\@whichsub=\@businesssub}
\def\lawthesis{\let\@whichsub=\@lawsub}
\def\humanitiesthesis{\let\@whichsub=\@humanitiessub%
\let\@whichend=\@humanitiesend}
\def\specialthesis{\let\@whichsub=\@specialsub%
\let\@whichend=\@specialend}
\def\industrialthesis{\let\@whichsub=\@industrialsub%
\let\@whichend=\@industrialend}
\def\dualthesis{\let\@whichsub=\@dualsub%
\let\@whichend=\@dualend}


\def\@committeesub{SUBMITTED TO THE COMMITTEE ON \uppercase\expandafter{\@dept}\\
                AND THE COMMITTEE ON GRADUATE STUDIES}
\def\@programsub{SUBMITTED TO THE PROGRAM IN \uppercase\expandafter{\@dept}\\
                AND THE COMMITTEE ON GRADUATE STUDIES}
\def\@educationsub{SUBMITTED TO THE GRADUATE SCHOOL OF EDUCATION\\
                AND THE COMMITTEE ON GRADUATE STUDIES}
\def\@businesssub{SUBMITTED TO THE GRADUATE SCHOOL OF BUSINESS\\ AND THE
  COMMITTEE ON GRADUATE STUDIES}
\def\@lawsub{SUBMITTED TO THE GRADUATE SCHOOL OF LAW\\
                AND THE COMMITTEE ON GRADUATE STUDIES}

\def\@humanitiessub{SUBMITTED TO THE DEPARTMENT OF\\ \uppercase\expandafter{\@dept}
                                AND THE\\ COMMITTEE ON\\ GRADUATE STUDIES}
\def\@humanitiesend{\\IN\\ \uppercase\expandafter{\@jointprogram} AND HUMANITIES}

\def\@specialsub{SUBMITTED TO THE COMMITTEE ON GRADUATE STUDIES}
\def\@specialend{\\IN\\ \uppercase\expandafter{\@dept}}


\def\@dualsub{SUBMITTED TO THE DEPARTMENT OF \uppercase\expandafter{\@dept}\\
AND THE COMMITTEE ON GRADUATE STUDIES}
\def\@dualend{\\IN\\ \uppercase\expandafter{\@languagemajor}}


\let\@whichend=\@standardend
\let\@whichsub=\@standardsub


\def\titlep{%
        \thispagestyle{empty}%
        \null\vskip1in%
        \begin{center}
                \large\uppercase\expandafter{\@title}
        \end{center}
        \vfill
        \begin{center}
\large
%                \sc a dissertation\\
%                \lowercase\expandafter{\@whichsub}\\
%                of stanford university\\
%                in partial fulfillment of the requirements\\
%                for the degree of\\
%                doctor of philosophy \uppercase\expandafter{\@whichend}
                A DISSERTATION\\
                \uppercase\expandafter{\@whichsub}\\
                OF STANFORD UNIVERSITY\\
                IN PARTIAL FULFILLMENT OF THE REQUIREMENTS\\
                FOR THE DEGREE OF\\
                DOCTOR OF PHILOSOPHY \uppercase\expandafter{\@whichend}
        \end{center}
        \vfill
        \begin{center}
                \rm \@author\\
                \@submitdate\\
        \end{center}\vskip.5in\newpage}

\def\thesiscopyrightpage{%
        \null\vfill
        \begin{center}
                \large
                \copyright\ Copyright\ by \@author\ \@copyrightyear\\
                All Rights Reserved
        \end{center}
        \vfill\newpage}

\def\tradcopyrightpage{%
        \null\vfill
        \begin{center}
                \large
                Copyright\ \copyright\ \@copyrightyear\ by \@author\\
                All Rights Reserved
        \end{center}
        \vfill\newpage}


\newlength{\signaturespace}
\setlength{\signaturespace}{.5in}


\long\def\signature#1{%
\begin{flushright}
\begin{minipage}{5in}
\parindent=0pt
I certify that I have read this dissertation and that, in my opinion,
it is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
\par
\vspace{\signaturespace}
%\hbox to 4in{\hfil\shortstack{\vrule width 3in height 0.4pt\\ #1}}
\hbox to 5in{\hfil\begin{tabular}{@{}l@{}}\vrule width 3in height
    0.4pt depth 0pt\\ #1\end{tabular}}
\end{minipage}
\end{flushright}}

\long\def\ucgssignature{%
\begin{flushright}
\begin{minipage}{5in}
\parindent=0pt
\hfill Approved for the Stanford University Committee on Graduate Studies
\par
\vspace{\signaturespace}
\hbox to 5in{\hfil\begin{tabular}{@{}l@{}}\vrule width 3in height
    0.4pt depth 0pt\end{tabular}}
\end{minipage}
\end{flushright}}


\def\signaturepage{%
\ifonline
\setcounter{page}{0}
\def\thepage{}
\thispagestyle{myheadings}
\markboth{\rm \@author}{\rm \@author}\fi
\signature{(\@principaladviser)\quad Principal \advis@r}
  \vfill
% if second principal advisor
        \if*\@coprincipaladviser \else
        \signature{(\@coprincipaladviser)\quad Principal \advis@r}
        \vfill\fi
        \if*\@firstreader \else
        \signature{(\@firstreader)}
        \vfill\fi
        \if*\@secondreader \else
        \signature{(\@secondreader)}
        \vfill\fi
% if thirdreader then do \signature\@thirdreader \vfill
        \if*\@thirdreader \else
        \signature{(\@thirdreader)}
        \vfill\fi
% if fourthreader then do \signature\@fourthreader \vfill
        \if*\@fourthreader \else
        \signature{(\@fourthreader)}
        \vfill\fi
\ucgssignature
}

\def\onlinesignature{
\cleardoublepage
\@twosidetrue
\signaturepage
}

\def\beforepreface{
        \pagenumbering{roman}
        \pagestyle{plain}
        \titlep
% online version has no copyright or signature pages but page counter
% must be incremented
% signature page should come at end
        \ifonline\setcounter{page}{4}\else
        \ifcopyright\ifthesiscopyright\thesiscopyrightpage\else\tradcopyrightpage\fi\fi
        \signaturepage\fi
        \cleardoublepage}


\def\prefacesection#1{%
        \chapter*{#1}
        \addcontentsline{toc}{chapter}{#1}}

\def\afterpreface{\newpage
        \tableofcontents
        \iftablespage
                \listoftables
        \fi
        \iffigurespage
                \listoffigures
        \fi
        \cleardoublepage
        \pagenumbering{arabic}
        \pagestyle{headings}}

% Redefine \thebibliography to go to a new page and put an entry in the
% table of contents
\let\@ldthebibliography\thebibliography
\renewcommand{\thebibliography}[1]{\newpage
                \@ldthebibliography{#1}%
\addcontentsline{toc}{chapter}{\bibname}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                        PART                          %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\def\part{\cleardoublepage   % Starts new page.
   \thispagestyle{plain}%    % Page style of part page is 'plain'
  \if@twocolumn              % IF two-column style
     \onecolumn              %  THEN \onecolumn
     \@tempswatrue           %       @tempswa := true
    \else \@tempswafalse     %  ELSE @tempswa := false
  \fi                        % FI
  \hbox{}\vfil               % Add fil glue to center title
%%  \bgroup  \centering      % BEGIN centering %% Removed 19 Jan 88
  \secdef\@part\@spart}

\def\@part[#1]#2{\ifnum \c@secnumdepth >-2\relax  % IF secnumdepth > -2
        \refstepcounter{part}%                    %   THEN step
                                                  %         part counter
        \addcontentsline{toc}{part}{\thepart      %        add toc line
        \hspace{1em}#1}\else                      %   ELSE add
                                                  %         unnumb. line
        \addcontentsline{toc}{part}{#1}\fi        % FI
   \markboth{}{}%
   {\centering                       % %% added 19 Jan 88
    \interlinepenalty \@M            %% RmS added 11 Nov 91
    \ifnum \c@secnumdepth >-2\relax  % IF secnumdepth > -2
      \huge\bfseries \partname~\thepart    %   THEN Print '\partname' and
    \par                             %         number in \huge bold.
    \vskip 20\p@\fi                  %        Add space before title.
    \Huge \bfseries                        % FI
    #2\par}\@endpart}                % Print Title in \Huge bold.
                                     % Bug Fix 13 Nov 89: #1 -> #2

% redefine \@endpart so the blank page after part has a page number
\def\@endpart{\vfil\newpage
              \if@twoside
               \if@openright
                \null
                \thispagestyle{plain}%
                \newpage
               \fi
              \fi
              \if@tempswa
                \twocolumn
              \fi}


% Start out normal
\pagestyle{headings}


================================================
FILE: thesis.tex
================================================
\documentclass[12pt]{report}
\usepackage{suthesis}
%\documentstyle[12pt,suthesis]{report}

% -- Imports --
% (general libraries)
\usepackage{times,latexsym,amsfonts,amssymb,amsmath,graphicx,url,bbm,rotating}
\usepackage{multirow,hhline,stmaryrd,bussproofs,mathtools,siunitx}
\usepackage{booktabs,xcolor,csquotes,calligra}
% (custom libraries)
\usepackage{afterpage}
\usepackage{longtable}
\usepackage{fitch}
% (inline references)
\usepackage{natbib}
\usepackage{tabularx}
\usepackage[hidelinks]{hyperref}
\hypersetup{
    colorlinks=true,
    citecolor=.,
    linkcolor=.,
    urlcolor=blue
}

\usepackage{epigraph}
\renewcommand{\epigraphsize}{\normalsize}
\setlength{\epigraphwidth}{0.9\textwidth}

% (tikz)
\usepackage{soul}
\definecolor{light-yellow}{RGB}{255, 255, 153}
\sethlcolor{light-yellow}
\usepackage{tikz}
\usepackage{tikz-dependency,pifont}
\usetikzlibrary{shapes.arrows,chains,positioning,automata,trees,calc}
\usetikzlibrary{patterns,matrix}
\usetikzlibrary{decorations.pathmorphing,decorations.markings}
% (print algorithms)
\usepackage[ruled,lined,linesnumbered]{algorithm2e}
% (custom)
\input std-macros.tex
\input macros.tex

% (paper compilation hacks)
\def\newcite#1{\citet{#1}}
\def\cite#1{\citep{#1}}
%\def\newcite#1{\textcite{#1}}
%\def\cite#1{\autocite{#1}}
\definecolor{darkblue}{rgb}{0.0,0.0,0.4}


% Common hyphenations
\hyphenation{Text-Runner}
\hyphenation{Verb-Ocean}
\hyphenation{Raj-pur-kar}

%\bibliographystyle{plainnat}


% Comments
\usepackage{xspace}
\usepackage{xargs} % commandx
\usepackage[colorinlistoftodos,prependcaption,textsize=tiny]{todonotes}
\usepackage{marginnote}
\usepackage{color}
\definecolor{darkgreen}{RGB}{0,100,0}

% Inline comments useful for tables and figures.
\newcommandx{\icmtl}[2][1=]{\todo[inline]{DC: #2}\xspace}
\newcommandx{\icmtm}[2][1=]{\todo[inline]{CM: #2}\xspace}

% Comments for other places.
\newcommandx{\cmtl}[2][1=]{\todo[linecolor=blue,backgroundcolor=blue!10,bordercolor=blue,#1]{DC: #2}\xspace}
\newcommandx{\cmtm}[2][1=]{\todo[linecolor=red,backgroundcolor=red!10,bordercolor=red,#1]{CM: #2}\xspace}

\newcommand\cmb[1]{\marginpar{\tiny\raggedright\textcolor{blue}{\textsf{ DC\@: #1}}}}
\newcommand\cmm[1]{\marginpar{\tiny\raggedright\textcolor{red}{\textsf{\bfseries CM\@: #1}}}}

\usepackage{enumerate}

\setcounter{secnumdepth}{3}

\usepackage{footnote}
\makesavenoteenv{tabular}
\makesavenoteenv{table}

\usepackage{xpinyin}

% -- Document --
\begin{document}

% Title
\title{Neural Reading Comprehension and Beyond}
\author{Danqi Chen}
\principaladviser{Christopher D. Manning}
\firstreader{Dan Jurafsky}
\secondreader{Percy Liang}
\thirdreader{Luke Zettlemoyer}

% Preface
\beforepreface
\input preface.tex
\input ack.tex
\afterpreface
\hypersetup{linkcolor=magenta}


% -- Sections --
% Introduction
\chapter{Introduction}
\label{chapter:intro}
\input intro.tex

\part{Neural Reading Comprehension: Foundations}

\chapter{An Overview of Reading Comprehension}
\label{chapter:rc-overview}
\input chapters/rc_overview/intro.tex
\input chapters/rc_overview/history.tex
\input chapters/rc_overview/task.tex
\input chapters/rc_overview/discussions.tex

\chapter{Neural Reading Comprehension Models}
\label{chapter:rc-models}
\input chapters/rc_models/intro.tex
\input chapters/rc_models/feature_classifier.tex
\input chapters/rc_models/sar.tex
\input chapters/rc_models/experiments.tex
\input chapters/rc_models/advances.tex

\chapter{The Future of Reading Comprehension}
\label{chapter:rc-future}
\input chapters/rc_future/overview.tex
\input chapters/rc_future/datasets.tex
\input chapters/rc_future/models.tex
\input chapters/rc_future/questions.tex

\part{Neural Reading Comprehension: Applications}

\chapter{Open Domain Question Answering}
\label{chapter:openqa}
\input chapters/openqa/intro.tex
\input chapters/openqa/related_work.tex
\input chapters/openqa/system.tex
\input chapters/openqa/evaluation.tex
\input chapters/openqa/future.tex
% \input chapters/openqa/future.tex

\chapter{Conversational Question Answering}
\label{chapter:coqa}
\input chapters/coqa/intro.tex
\input chapters/coqa/related_work.tex
\input chapters/coqa/dataset.tex
\input chapters/coqa/models.tex
\input chapters/coqa/experiments.tex
\input chapters/coqa/discussions.tex

% Conclusion
\chapter{Conclusions}
\label{chapter:conclusions}
\input conclude.tex

% Bibliography
\bibliographystyle{acl_natbib_nourl}
\bibliography{ref}

\end{document}