Repository: danqi/thesis Branch: master Commit: 897881b16c98 Files: 43 Total size: 414.9 KB

Directory structure:
gitextract_ge445pyb/
├── .gitignore
├── Makefile
├── README.md
├── ack.tex
├── acl_natbib_nourl.bst
├── chapters/
│   ├── coqa/
│   │   ├── dataset.tex
│   │   ├── discussions.tex
│   │   ├── experiments.tex
│   │   ├── intro.tex
│   │   ├── models.tex
│   │   └── related_work.tex
│   ├── openqa/
│   │   ├── evaluation.tex
│   │   ├── future.tex
│   │   ├── intro.tex
│   │   ├── related_work.tex
│   │   └── system.tex
│   ├── rc_future/
│   │   ├── datasets.tex
│   │   ├── models.tex
│   │   ├── overview.tex
│   │   └── questions.tex
│   ├── rc_models/
│   │   ├── advances.tex
│   │   ├── experiments.tex
│   │   ├── feature_classifier.tex
│   │   ├── intro.tex
│   │   └── sar.tex
│   └── rc_overview/
│       ├── discussions.tex
│       ├── history.tex
│       ├── intro.tex
│       └── task.tex
├── conclude.tex
├── fitch.sty
├── img/
│   └── scripts/
│       ├── gen_cnn_analysis.py
│       ├── gen_qa_stat.py
│       ├── gen_squad_progress.py
│       ├── gen_timeline.py
│       └── squad_leaderboard.txt
├── intro.tex
├── macros.tex
├── preface.tex
├── ref.bib
├── std-macros.tex
├── suthesis.sty
└── thesis.tex

================================================ FILE CONTENTS ================================================

================================================ FILE: .gitignore ================================================
.DS_Store
pages/
*.fdb_latexmk
*.bbl
*.aux
*.out
*.toc
*.fls
*.blg
*.log
*.lot
*.lof
*.synctex.gz

================================================ FILE: Makefile ================================================
# All chapter sources live under chapters/<chapter>/; the wildcard below picks up all of them.
thesis.pdf: $(wildcard *.tex) $(wildcard chapters/*/*.tex) Makefile macros.tex std-macros.tex ref.bib
	@pdflatex thesis
	@bibtex thesis
	@pdflatex thesis
	@pdflatex thesis

clean:
	rm -f *.aux *.log *.bbl *.blg present.pdf *.bak *.ps *.dvi *.lot *.bcf thesis.pdf

dist: thesis.pdf
	@pdflatex -file-line-error thesis

default: thesis.pdf

================================================ FILE: README.md ================================================
## Danqi Chen's Thesis

### Reference

```
@phdthesis{chen2018neural,
  title={Neural Reading Comprehension and Beyond},
  author={Chen, Danqi},
  year={2018},
  school={Stanford University}
}
```

### Acknowledgement

This thesis is built on top of [Gabor Angeli's thesis template](https://github.com/gangeli/thesis).

### Contact

If you have any comments or questions about the thesis, please open a pull request or send me an email.

================================================ FILE: ack.tex ================================================
%!TEX root = thesis.tex
\prefacesection{Acknowledgments}

The past six years at Stanford have been an unforgettable and invaluable experience for me. When I first started my PhD in 2012, I could barely speak fluent English (I was required to take five English courses at Stanford), knew little about this country, and had never heard of the term ``natural language processing''. It is unbelievable that over the following years I have actually been doing research about language and training computer systems to understand human languages (English in most cases), as well as training myself to speak and write in English. 2012 was also the year that deep neural networks (also called deep learning) started to take off and came to dominate almost all of the AI applications we are seeing today.
I witnessed how fast Artificial Intelligence has been developing from the beginning of this journey, and I feel quite excited --- and occasionally panicked --- to be a part of this trend. I would not have been able to make this journey without the help and support of many, many people, and I feel deeply indebted to them.

First and foremost, my greatest thanks go to my advisor Christopher Manning. I really didn't know Chris when I first came to Stanford --- only after a couple of years of working with him and learning about NLP did I realize how privileged I am to work with one of the most brilliant minds in our field. He always has a very insightful, high-level view of the field, while he is also uncommonly detail-oriented and understands the nature of the problems very well. More importantly, Chris is an extremely kind, caring and supportive advisor --- I could not have asked for more. He is like an older friend of mine (if he doesn't mind me saying so) and I can talk with him about everything. He always believes in me, even though I am not always that confident about myself. I am forever grateful to him and I have already started to miss him.

I would like to thank Dan Jurafsky and Percy Liang --- the other two giants of the Stanford NLP group --- for being on my thesis committee and for a lot of guidance and help throughout my PhD studies. Dan is an extremely charming, enthusiastic and knowledgeable person, and I always feel my passion getting ignited after talking to him. Percy is a superman and a role model for all the NLP PhD students (at least for me). I have never understood how one person can accomplish so many things at the same time, and a big part of this dissertation is built on top of his research. I want to thank Chris, Dan and Percy for setting up the Stanford NLP Group, my home at Stanford, and I will always be proud to be a part of this family.

It is also my great honor to have Luke Zettlemoyer on my thesis committee. The work presented in this dissertation is very relevant to his research, and I have learned a lot from his papers. I look forward to working with him in the near future. I also would like to thank Yinyu Ye for his time chairing my thesis defense.

During my PhD, I did two wonderful internships, at Microsoft Research and Facebook AI Research. I thank my mentors at these two places: Kristina Toutanova, Antoine Bordes and Jason Weston. My internship project at Facebook eventually led to the \sys{DrQA} project and a part of this dissertation. I also would like to thank Microsoft and Facebook for providing me with fellowships.

Collaboration is a big lesson that I learned, and also a fun part of graduate school. I thank my fellow collaborators: Gabor Angeli, Jason Bolton, Arun Chaganty, Adam Fisch, Jon Gauthier, Shayne Longpre, Jesse Mu, Siva Reddy, Richard Socher, Yuhao Zhang, Victor Zhong, and others. In particular, Richard --- with him I finished my first paper in graduate school; he had a very clear sense of how to define an impactful research project, while I had little experience at the time. Adam and Siva --- with them I finished the \sys{DrQA} and \sys{CoQA} projects, respectively; not only am I proud of these two projects, but I also greatly enjoyed the collaborations, and we have since become good friends. The KBP team, especially Yuhao, Gabor and Arun --- I enjoyed the teamwork during those two summers. Jon, Victor, Shayne and Jesse were the younger students I got to work with, although I wish I could have done a better job for them.
I also want to thank the two teaching teams (7 and 25 people, respectively) for the NLP class that I worked on; it was a very unique and rewarding experience for me. I thank the whole Stanford NLP Group, especially Sida Wang, Will Monroe, Angel Chang, Gabor Angeli, Siva Reddy, Arun Chaganty, Yuhao Zhang, Peng Qi, Jacob Steinhardt, Jiwei Li, He He, Robin Jia and Ziang Xie, who gave me a lot of support at various times. I am not even sure whether there could be another research group in the world better than ours (I hope I can create a similar one in the future). The NLP retreat, the NLP BBQ and those paper swap nights are among my most vivid memories of graduate school.

Outside of the NLP group, I have been extremely lucky to be surrounded by many great friends. Just to name a few (and forgive me for not being able to list all of them): Yanting Zhao, my close friend of many years, who keeps pulling me out of my stressful PhD life and with whom I share a lot of joyous moments. Xueqing Liu, my classmate and roommate in college, who started her PhD at UIUC in the same year; she is the person I can keep talking to and exchanging feelings and thoughts with, especially on those bad days. Tao Lei, a brilliant NLP PhD and my algorithms ``teacher'' in high school; I keep learning from him and getting inspired by every discussion. Thanh-Vy Hua, my mentor and ``elder sister'', who always makes sure that I am still on the right track in my life and who taught me many meta-skills for surviving this journey (even though we have met only 3 times in the real world). And everyone in the ``\pinyin{cao3yu2}'' group --- I am so happy to have spent many Friday evenings with you.

During the past year, I visited a great number of U.S. universities seeking an academic job position. There are so many people I want to thank for assistance along the way --- I either received great help and advice from them, or I felt extremely welcomed during my visits --- including Sanjeev Arora, Yoav Artzi, Regina Barzilay, Chris Callison-Burch, Kai-Wei Chang, Kyunghyun Cho, William Cohen, Michael Collins, Chris Dyer, Jacob Eisenstein, Julia Hirschberg, Julia Hockenmaier, Tengyu Ma, Andrew McCallum, Kathy McKeown, Rada Mihalcea, Tom Mitchell, Ray Mooney, Karthik Narasimhan, Graham Neubig, Christos Papadimitriou, Nanyun Peng, Drago Radev, Sasha Rush, Fei Sha, Yulia Tsvetkov, Luke Zettlemoyer and many others. These people are a big part of the reason that I love our research community so much and want to follow their paths and dedicate myself to an academic career. I hope to continue to contribute to our research community in the future.

A special thanks to Andrew Chi-Chih Yao for creating the Special Pilot CS Class where I did my undergraduate studies. I am super proud of being a part of the ``Yao class'' family. I also thank Weizhu Chen, Qiang Yang and Haixun Wang, with whom I gained my very first research experience. With their support, I was very fortunate to have the opportunity to come to Stanford for my PhD.

I thank my parents: Zhi Chen and Hongmei Wang. Like most Chinese students of my generation, I am the only child in my family, and I have a very close relationship with my parents --- even though they live 16 (or 15) hours ahead of me and I can only spare 2--3 weeks to stay with them every year. My parents made me who I am today, and I do not know how I can ever pay them back. I hope that they are at least a little proud of me for what I have been through so far.
Lastly, I would like to thank Huacheng for his love and support (we got married 4 months before this dissertation was submitted). I was fifteen when I first met Huacheng, and we have been experiencing almost everything together since then: from high-school programming competitions to our wonderful college time at Tsinghua University, and we both made it to the Stanford CS PhD program in 2012. For over ten years, he has been not only my partner, my classmate and my best friend, but also the person I admire most, for his modesty, intelligence, concentration and hard work. Without him, I would not have come to Stanford. Without him, I would also not have taken the job at Princeton. I thank him for everything he has done for me.

\newpage
\begin{flushright}
To my parents and Huacheng, for their unconditional love.
\end{flushright}

================================================ FILE: acl_natbib_nourl.bst ================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% BibTeX style file acl_natbib_nourl.bst
%
% intended as input to urlbst script
%
% adapted from compling.bst
% in order to mimic the style files for ACL conferences prior to 2017
% by making the following three changes:
% - for @incollection, page numbers now follow volume title.
% - for @inproceedings, address now follows conference name.
%   (address is intended as location of conference,
%   not address of publisher.)
% - for papers with three authors, use et al. in citation
% Dan Gildea 2017/06/08
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% BibTeX style file compling.bst
%
% Intended for the journal Computational Linguistics (ACL/MIT Press)
% Created by Ron Artstein on 2005/08/22
% For use with natbib for author-year citations.
%
% I created this file in order to allow submissions to the journal
% Computational Linguistics using the natbib package for author-year
% citations, which offers a lot more flexibility than CL's
% official citation package. This file adheres strictly to the official
% style guide available from the MIT Press:
%
% http://mitpress.mit.edu/journals/coli/compling_style.pdf
%
% This includes all the various quirks of the style guide, for example:
% - a chapter from a monograph (@inbook) has no page numbers.
% - an article from an edited volume (@incollection) has page numbers
%   after the publisher and address.
% - an article from a proceedings volume (@inproceedings) has page
%   numbers before the publisher and address.
%
% Where the style guide was inconsistent or not specific enough I
% looked at actual published articles and exercised my own judgment.
% I noticed two inconsistencies in the style guide:
%
% - The style guide gives one example of an article from an edited
%   volume with the editor's name spelled out in full, and another
%   with the editors' names abbreviated. I chose to accept the first
%   one as correct, since the style guide generally shuns abbreviations,
%   and editors' names are also spelled out in some recently published
%   articles.
%
% - The style guide gives one example of a reference where the word
%   "and" between two authors is preceded by a comma. This is most
%   likely a typo, since in all other cases with just two authors or
%   editors there is no comma before the word "and".
%
% One case where the style guide is not being specific is the placement
% of the edition number, for which no example is given. I chose to put
% it immediately after the title, which I (subjectively) find natural,
% and is also the place of the edition in a few recently published
% articles.
%
% This file correctly reproduces all of the examples in the official
% style guide, except for the two inconsistencies noted above. I even
% managed to get it to correctly format the proceedings example which
% has an organization, a publisher, and two addresses (the conference
% location and the publisher's address), though I cheated a bit by
% putting the conference location and month as part of the title field;
% I feel that in this case the conference location and month can be
% considered as part of the title, and that adding a location field
% is not justified. Note also that a location field is not standard,
% so entries made with this field would not port nicely to other styles.
% However, if authors feel that there's a need for a location field
% then tell me and I'll see what I can do.
%
% The file also produces to my satisfaction all the bibliographical
% entries in my recent (joint) submission to CL (this was the original
% motivation for creating the file). I also tested it by running it
% on a larger set of entries and eyeballing the results. There may of
% course still be errors, especially with combinations of fields that
% are not that common, or with cross-references (which I seldom use).
% If you find such errors please write to me.
%
% I hope people find this file useful. Please email me with comments
% and suggestions.
%
% Ron Artstein
% artstein [at] essex.ac.uk
% August 22, 2005.
%
% Some technical notes.
%
% This file is based on a file generated with the custom-bib package
% by Patrick W. Daly (see selected options below), which was then
% manually customized to conform with certain CL requirements which
% cannot be met by custom-bib. Departures from the generated file
% include:
%
% Function inbook: moved publisher and address to the end; moved
% edition after title; replaced function format.chapter.pages by
% new function format.chapter to output chapter without pages.
%
% Function inproceedings: moved publisher and address to the end;
% replaced function format.in.ed.booktitle by new function
% format.in.booktitle to output the proceedings title without
% the editor.
%
% Functions book, incollection, manual: moved edition after title.
%
% Function mastersthesis: formatted title as for articles (unlike
% phdthesis which is formatted as book) and added month.
%
% Function proceedings: added new.sentence between organization and
% publisher when both are present.
%
% Function format.lab.names: modified so that it gives all the
% authors' surnames for in-text citations for one, two and three
% authors and only uses "et al." for works with four authors or more
% (thanks to Ken Shan for convincing me to go through the trouble of
% modifying this function rather than using unreliable hacks).
%
% Changes:
%
% 2006-10-27: Changed function reverse.pass so that the extra label is
% enclosed in parentheses when the year field ends in an uppercase or
% lowercase letter (change modeled after Uli Sauerland's modification
% of nals.bst). RA.
%
%
% The preamble of the generated file begins below:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% This is file `compling.bst',
%% generated with the docstrip utility.
%% %% The original source files were: %% %% merlin.mbs (with options: `ay,nat,vonx,nm-revv1,jnrlst,keyxyr,blkyear,dt-beg,yr-per,note-yr,num-xser,pre-pub,xedn,nfss') %% ---------------------------------------- %% *** Intended for the journal Computational Linguistics *** %% %% Copyright 1994-2002 Patrick W Daly % =============================================================== % IMPORTANT NOTICE: % This bibliographic style (bst) file has been generated from one or % more master bibliographic style (mbs) files, listed above. % % This generated file can be redistributed and/or modified under the terms % of the LaTeX Project Public License Distributed from CTAN % archives in directory macros/latex/base/lppl.txt; either % version 1 of the License, or any later version. % =============================================================== % Name and version information of the main mbs file: % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)] % For use with BibTeX version 0.99a or later %------------------------------------------------------------------- % This bibliography style file is intended for texts in ENGLISH % This is an author-year citation style bibliography. As such, it is % non-standard LaTeX, and requires a special package file to function properly. % Such a package is natbib.sty by Patrick W. Daly % The form of the \bibitem entries is % \bibitem[Jones et al.(1990)]{key}... % \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}... % The essential feature is that the label (the part in brackets) consists % of the author names, as they should appear in the citation, with the year % in parentheses following. There must be no space before the opening % parenthesis! % With natbib v5.3, a full list of authors may also follow the year. % In natbib.sty, it is possible to define the type of enclosures that is % really wanted (brackets or parentheses), but in either case, there must % be parentheses in the label. % The \cite command functions as follows: % \citet{key} ==>> Jones et al. (1990) % \citet*{key} ==>> Jones, Baker, and Smith (1990) % \citep{key} ==>> (Jones et al., 1990) % \citep*{key} ==>> (Jones, Baker, and Smith, 1990) % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2) % \citep[e.g.][]{key} ==>> (e.g. Jones et al., 1990) % \citep[e.g.][p. 32]{key} ==>> (e.g. Jones et al., p. 32) % \citeauthor{key} ==>> Jones et al. 
% \citeauthor*{key} ==>> Jones, Baker, and Smith % \citeyear{key} ==>> 1990 %--------------------------------------------------------------------- ENTRY { address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year } {} { label extra.label sort.label short.list } INTEGERS { output.state before.all mid.sentence after.sentence after.block } FUNCTION {init.state.consts} { #0 'before.all := #1 'mid.sentence := #2 'after.sentence := #3 'after.block := } STRINGS { s t} FUNCTION {output.nonnull} { 's := output.state mid.sentence = { ", " * write$ } { output.state after.block = { add.period$ write$ newline$ "\newblock " write$ } { output.state before.all = 'write$ { add.period$ " " * write$ } if$ } if$ mid.sentence 'output.state := } if$ s } FUNCTION {output} { duplicate$ empty$ 'pop$ 'output.nonnull if$ } FUNCTION {output.check} { 't := duplicate$ empty$ { pop$ "empty " t * " in " * cite$ * warning$ } 'output.nonnull if$ } FUNCTION {fin.entry} { add.period$ write$ newline$ } FUNCTION {new.block} { output.state before.all = 'skip$ { after.block 'output.state := } if$ } FUNCTION {new.sentence} { output.state after.block = 'skip$ { output.state before.all = 'skip$ { after.sentence 'output.state := } if$ } if$ } FUNCTION {add.blank} { " " * before.all 'output.state := } FUNCTION {date.block} { new.block } FUNCTION {not} { { #0 } { #1 } if$ } FUNCTION {and} { 'skip$ { pop$ #0 } if$ } FUNCTION {or} { { pop$ #1 } 'skip$ if$ } FUNCTION {new.block.checkb} { empty$ swap$ empty$ and 'skip$ 'new.block if$ } FUNCTION {field.or.null} { duplicate$ empty$ { pop$ "" } 'skip$ if$ } FUNCTION {emphasize} { duplicate$ empty$ { pop$ "" } { "\emph{" swap$ * "}" * } if$ } FUNCTION {tie.or.space.prefix} { duplicate$ text.length$ #3 < { "~" } { " " } if$ swap$ } FUNCTION {capitalize} { "u" change.case$ "t" change.case$ } FUNCTION {space.word} { " " swap$ * " " * } % Here are the language-specific definitions for explicit words. % Each function has a name bbl.xxx where xxx is the English word. % The language selected here is ENGLISH FUNCTION {bbl.and} { "and"} FUNCTION {bbl.etal} { "et~al." } FUNCTION {bbl.editors} { "editors" } FUNCTION {bbl.editor} { "editor" } FUNCTION {bbl.edby} { "edited by" } FUNCTION {bbl.edition} { "edition" } FUNCTION {bbl.volume} { "volume" } FUNCTION {bbl.of} { "of" } FUNCTION {bbl.number} { "number" } FUNCTION {bbl.nr} { "no." } FUNCTION {bbl.in} { "in" } FUNCTION {bbl.pages} { "pages" } FUNCTION {bbl.page} { "page" } FUNCTION {bbl.chapter} { "chapter" } FUNCTION {bbl.techrep} { "Technical Report" } FUNCTION {bbl.mthesis} { "Master's thesis" } FUNCTION {bbl.phdthesis} { "Ph.D. 
thesis" } MACRO {jan} {"January"} MACRO {feb} {"February"} MACRO {mar} {"March"} MACRO {apr} {"April"} MACRO {may} {"May"} MACRO {jun} {"June"} MACRO {jul} {"July"} MACRO {aug} {"August"} MACRO {sep} {"September"} MACRO {oct} {"October"} MACRO {nov} {"November"} MACRO {dec} {"December"} MACRO {acmcs} {"ACM Computing Surveys"} MACRO {acta} {"Acta Informatica"} MACRO {cacm} {"Communications of the ACM"} MACRO {ibmjrd} {"IBM Journal of Research and Development"} MACRO {ibmsj} {"IBM Systems Journal"} MACRO {ieeese} {"IEEE Transactions on Software Engineering"} MACRO {ieeetc} {"IEEE Transactions on Computers"} MACRO {ieeetcad} {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"} MACRO {ipl} {"Information Processing Letters"} MACRO {jacm} {"Journal of the ACM"} MACRO {jcss} {"Journal of Computer and System Sciences"} MACRO {scp} {"Science of Computer Programming"} MACRO {sicomp} {"SIAM Journal on Computing"} MACRO {tocs} {"ACM Transactions on Computer Systems"} MACRO {tods} {"ACM Transactions on Database Systems"} MACRO {tog} {"ACM Transactions on Graphics"} MACRO {toms} {"ACM Transactions on Mathematical Software"} MACRO {toois} {"ACM Transactions on Office Information Systems"} MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"} MACRO {tcs} {"Theoretical Computer Science"} FUNCTION {bibinfo.check} { swap$ duplicate$ missing$ { pop$ pop$ "" } { duplicate$ empty$ { swap$ pop$ } { swap$ pop$ } if$ } if$ } FUNCTION {bibinfo.warn} { swap$ duplicate$ missing$ { swap$ "missing " swap$ * " in " * cite$ * warning$ pop$ "" } { duplicate$ empty$ { swap$ "empty " swap$ * " in " * cite$ * warning$ } { swap$ pop$ } if$ } if$ } STRINGS { bibinfo} INTEGERS { nameptr namesleft numnames } FUNCTION {format.names} { 'bibinfo := duplicate$ empty$ 'skip$ { 's := "" 't := #1 'nameptr := s num.names$ 'numnames := numnames 'namesleft := { namesleft #0 > } { s nameptr duplicate$ #1 > { "{ff~}{vv~}{ll}{, jj}" } { "{ff~}{vv~}{ll}{, jj}" } % first name first for first author % { "{vv~}{ll}{, ff}{, jj}" } % last name first for first author if$ format.name$ bibinfo bibinfo.check 't := nameptr #1 > { namesleft #1 > { ", " * t * } { numnames #2 > { "," * } 'skip$ if$ s nameptr "{ll}" format.name$ duplicate$ "others" = { 't := } { pop$ } if$ t "others" = { " " * bbl.etal * } { bbl.and space.word * t * } if$ } if$ } 't if$ nameptr #1 + 'nameptr := namesleft #1 - 'namesleft := } while$ } if$ } FUNCTION {format.names.ed} { 'bibinfo := duplicate$ empty$ 'skip$ { 's := "" 't := #1 'nameptr := s num.names$ 'numnames := numnames 'namesleft := { namesleft #0 > } { s nameptr "{ff~}{vv~}{ll}{, jj}" format.name$ bibinfo bibinfo.check 't := nameptr #1 > { namesleft #1 > { ", " * t * } { numnames #2 > { "," * } 'skip$ if$ s nameptr "{ll}" format.name$ duplicate$ "others" = { 't := } { pop$ } if$ t "others" = { " " * bbl.etal * } { bbl.and space.word * t * } if$ } if$ } 't if$ nameptr #1 + 'nameptr := namesleft #1 - 'namesleft := } while$ } if$ } FUNCTION {format.key} { empty$ { key field.or.null } { "" } if$ } FUNCTION {format.authors} { author "author" format.names } FUNCTION {get.bbl.editor} { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ } FUNCTION {format.editors} { editor "editor" format.names duplicate$ empty$ 'skip$ { "," * " " * get.bbl.editor * } if$ } FUNCTION {format.note} { note empty$ { "" } { note #1 #1 substring$ duplicate$ "{" = 'skip$ { output.state mid.sentence = { "l" } { "u" } if$ change.case$ } if$ note #2 global.max$ substring$ * "note" bibinfo.check } if$ } FUNCTION 
{format.title} { title duplicate$ empty$ 'skip$ { "t" change.case$ } if$ "title" bibinfo.check } FUNCTION {format.full.names} {'s := "" 't := #1 'nameptr := s num.names$ 'numnames := numnames 'namesleft := { namesleft #0 > } { s nameptr "{vv~}{ll}" format.name$ 't := nameptr #1 > { namesleft #1 > { ", " * t * } { s nameptr "{ll}" format.name$ duplicate$ "others" = { 't := } { pop$ } if$ t "others" = { " " * bbl.etal * } { numnames #2 > { "," * } 'skip$ if$ bbl.and space.word * t * } if$ } if$ } 't if$ nameptr #1 + 'nameptr := namesleft #1 - 'namesleft := } while$ } FUNCTION {author.editor.key.full} { author empty$ { editor empty$ { key empty$ { cite$ #1 #3 substring$ } 'key if$ } { editor format.full.names } if$ } { author format.full.names } if$ } FUNCTION {author.key.full} { author empty$ { key empty$ { cite$ #1 #3 substring$ } 'key if$ } { author format.full.names } if$ } FUNCTION {editor.key.full} { editor empty$ { key empty$ { cite$ #1 #3 substring$ } 'key if$ } { editor format.full.names } if$ } FUNCTION {make.full.names} { type$ "book" = type$ "inbook" = or 'author.editor.key.full { type$ "proceedings" = 'editor.key.full 'author.key.full if$ } if$ } FUNCTION {output.bibitem} { newline$ "\bibitem[{" write$ label write$ ")" make.full.names duplicate$ short.list = { pop$ } { * } if$ "}]{" * write$ cite$ write$ "}" write$ newline$ "" before.all 'output.state := } FUNCTION {n.dashify} { 't := "" { t empty$ not } { t #1 #1 substring$ "-" = { t #1 #2 substring$ "--" = not { "--" * t #2 global.max$ substring$ 't := } { { t #1 #1 substring$ "-" = } { "-" * t #2 global.max$ substring$ 't := } while$ } if$ } { t #1 #1 substring$ * t #2 global.max$ substring$ 't := } if$ } while$ } FUNCTION {word.in} { bbl.in capitalize " " * } FUNCTION {format.date} { year "year" bibinfo.check duplicate$ empty$ { } 'skip$ if$ extra.label * before.all 'output.state := after.sentence 'output.state := } FUNCTION {format.btitle} { title "title" bibinfo.check duplicate$ empty$ 'skip$ { emphasize } if$ } FUNCTION {either.or.check} { empty$ 'pop$ { "can't use both " swap$ * " fields in " * cite$ * warning$ } if$ } FUNCTION {format.bvolume} { volume empty$ { "" } { bbl.volume volume tie.or.space.prefix "volume" bibinfo.check * * series "series" bibinfo.check duplicate$ empty$ 'pop$ { swap$ bbl.of space.word * swap$ emphasize * } if$ "volume and number" number either.or.check } if$ } FUNCTION {format.number.series} { volume empty$ { number empty$ { series field.or.null } { series empty$ { number "number" bibinfo.check } { output.state mid.sentence = { bbl.number } { bbl.number capitalize } if$ number tie.or.space.prefix "number" bibinfo.check * * bbl.in space.word * series "series" bibinfo.check * } if$ } if$ } { "" } if$ } FUNCTION {format.edition} { edition duplicate$ empty$ 'skip$ { output.state mid.sentence = { "l" } { "t" } if$ change.case$ "edition" bibinfo.check " " * bbl.edition * } if$ } INTEGERS { multiresult } FUNCTION {multi.page.check} { 't := #0 'multiresult := { multiresult not t empty$ not and } { t #1 #1 substring$ duplicate$ "-" = swap$ duplicate$ "," = swap$ "+" = or or { #1 'multiresult := } { t #2 global.max$ substring$ 't := } if$ } while$ multiresult } FUNCTION {format.pages} { pages duplicate$ empty$ 'skip$ { duplicate$ multi.page.check { bbl.pages swap$ n.dashify } { bbl.page swap$ } if$ tie.or.space.prefix "pages" bibinfo.check * * } if$ } FUNCTION {format.journal.pages} { pages duplicate$ empty$ 'pop$ { swap$ duplicate$ empty$ { pop$ pop$ format.pages } { ":" * swap$ n.dashify "pages" 
bibinfo.check * } if$ } if$ } FUNCTION {format.vol.num.pages} { volume field.or.null duplicate$ empty$ 'skip$ { "volume" bibinfo.check } if$ number "number" bibinfo.check duplicate$ empty$ 'skip$ { swap$ duplicate$ empty$ { "there's a number but no volume in " cite$ * warning$ } 'skip$ if$ swap$ "(" swap$ * ")" * } if$ * format.journal.pages } FUNCTION {format.chapter} { chapter empty$ 'skip$ { type empty$ { bbl.chapter } { type "l" change.case$ "type" bibinfo.check } if$ chapter tie.or.space.prefix "chapter" bibinfo.check * * } if$ } FUNCTION {format.chapter.pages} { chapter empty$ 'format.pages { type empty$ { bbl.chapter } { type "l" change.case$ "type" bibinfo.check } if$ chapter tie.or.space.prefix "chapter" bibinfo.check * * pages empty$ 'skip$ { ", " * format.pages * } if$ } if$ } FUNCTION {format.booktitle} { booktitle "booktitle" bibinfo.check emphasize } FUNCTION {format.in.booktitle} { format.booktitle duplicate$ empty$ 'skip$ { word.in swap$ * } if$ } FUNCTION {format.in.ed.booktitle} { format.booktitle duplicate$ empty$ 'skip$ { editor "editor" format.names.ed duplicate$ empty$ 'pop$ { "," * " " * get.bbl.editor ", " * * swap$ * } if$ word.in swap$ * } if$ } FUNCTION {format.thesis.type} { type duplicate$ empty$ 'pop$ { swap$ pop$ "t" change.case$ "type" bibinfo.check } if$ } FUNCTION {format.tr.number} { number "number" bibinfo.check type duplicate$ empty$ { pop$ bbl.techrep } 'skip$ if$ "type" bibinfo.check swap$ duplicate$ empty$ { pop$ "t" change.case$ } { tie.or.space.prefix * * } if$ } FUNCTION {format.article.crossref} { word.in " \cite{" * crossref * "}" * } FUNCTION {format.book.crossref} { volume duplicate$ empty$ { "empty volume in " cite$ * "'s crossref of " * crossref * warning$ pop$ word.in } { bbl.volume capitalize swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word * } if$ " \cite{" * crossref * "}" * } FUNCTION {format.incoll.inproc.crossref} { word.in " \cite{" * crossref * "}" * } FUNCTION {format.org.or.pub} { 't := "" address empty$ t empty$ and 'skip$ { t empty$ { address "address" bibinfo.check * } { t * address empty$ 'skip$ { ", " * address "address" bibinfo.check * } if$ } if$ } if$ } FUNCTION {format.publisher.address} { publisher "publisher" bibinfo.warn format.org.or.pub } FUNCTION {format.organization.address} { organization "organization" bibinfo.check format.org.or.pub } FUNCTION {article} { output.bibitem format.authors "author" output.check author format.key output format.date "year" output.check date.block format.title "title" output.check new.block crossref missing$ { journal "journal" bibinfo.check emphasize "journal" output.check format.vol.num.pages output } { format.article.crossref output.nonnull format.pages output } if$ new.block format.note output fin.entry } FUNCTION {book} { output.bibitem author empty$ { format.editors "author and editor" output.check editor format.key output } { format.authors output.nonnull crossref missing$ { "author and editor" editor either.or.check } 'skip$ if$ } if$ format.date "year" output.check date.block format.btitle "title" output.check format.edition output crossref missing$ { format.bvolume output new.block format.number.series output new.sentence format.publisher.address output } { new.block format.book.crossref output.nonnull } if$ new.block format.note output fin.entry } FUNCTION {booklet} { output.bibitem format.authors output author format.key output format.date "year" output.check date.block format.title "title" output.check new.block howpublished "howpublished" 
bibinfo.check output address "address" bibinfo.check output new.block format.note output fin.entry } FUNCTION {inbook} { output.bibitem author empty$ { format.editors "author and editor" output.check editor format.key output } { format.authors output.nonnull crossref missing$ { "author and editor" editor either.or.check } 'skip$ if$ } if$ format.date "year" output.check date.block format.btitle "title" output.check format.edition output crossref missing$ { format.bvolume output format.number.series output format.chapter "chapter" output.check new.sentence format.publisher.address output new.block } { format.chapter "chapter" output.check new.block format.book.crossref output.nonnull } if$ new.block format.note output fin.entry } FUNCTION {incollection} { output.bibitem format.authors "author" output.check author format.key output format.date "year" output.check date.block format.title "title" output.check new.block crossref missing$ { format.in.ed.booktitle "booktitle" output.check format.edition output format.bvolume output format.number.series output format.chapter.pages output new.sentence format.publisher.address output } { format.incoll.inproc.crossref output.nonnull format.chapter.pages output } if$ new.block format.note output fin.entry } FUNCTION {inproceedings} { output.bibitem format.authors "author" output.check author format.key output format.date "year" output.check date.block format.title "title" output.check new.block crossref missing$ { format.in.booktitle "booktitle" output.check format.bvolume output format.number.series output format.pages output address "address" bibinfo.check output new.sentence organization "organization" bibinfo.check output publisher "publisher" bibinfo.check output } { format.incoll.inproc.crossref output.nonnull format.pages output } if$ new.block format.note output fin.entry } FUNCTION {conference} { inproceedings } FUNCTION {manual} { output.bibitem format.authors output author format.key output format.date "year" output.check date.block format.btitle "title" output.check format.edition output organization address new.block.checkb organization "organization" bibinfo.check output address "address" bibinfo.check output new.block format.note output fin.entry } FUNCTION {mastersthesis} { output.bibitem format.authors "author" output.check author format.key output format.date "year" output.check date.block format.title "title" output.check new.block bbl.mthesis format.thesis.type output.nonnull school "school" bibinfo.warn output address "address" bibinfo.check output month "month" bibinfo.check output new.block format.note output fin.entry } FUNCTION {misc} { output.bibitem format.authors output author format.key output format.date "year" output.check date.block format.title output new.block howpublished "howpublished" bibinfo.check output new.block format.note output fin.entry } FUNCTION {phdthesis} { output.bibitem format.authors "author" output.check author format.key output format.date "year" output.check date.block format.btitle "title" output.check new.block bbl.phdthesis format.thesis.type output.nonnull school "school" bibinfo.warn output address "address" bibinfo.check output new.block format.note output fin.entry } FUNCTION {proceedings} { output.bibitem format.editors output editor format.key output format.date "year" output.check date.block format.btitle "title" output.check format.bvolume output format.number.series output new.sentence publisher empty$ { format.organization.address output } { organization "organization" bibinfo.check 
output new.sentence format.publisher.address output } if$ new.block format.note output fin.entry } FUNCTION {techreport} { output.bibitem format.authors "author" output.check author format.key output format.date "year" output.check date.block format.title "title" output.check new.block format.tr.number output.nonnull institution "institution" bibinfo.warn output address "address" bibinfo.check output new.block format.note output fin.entry } FUNCTION {unpublished} { output.bibitem format.authors "author" output.check author format.key output format.date "year" output.check date.block format.title "title" output.check new.block format.note "note" output.check fin.entry } FUNCTION {default.type} { misc } READ FUNCTION {sortify} { purify$ "l" change.case$ } INTEGERS { len } FUNCTION {chop.word} { 's := 'len := s #1 len substring$ = { s len #1 + global.max$ substring$ } 's if$ } FUNCTION {format.lab.names} { 's := "" 't := s #1 "{vv~}{ll}" format.name$ s num.names$ duplicate$ #2 > { pop$ " " * bbl.etal * } { #2 < 'skip$ { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" = { " " * bbl.etal * } { bbl.and space.word * s #2 "{vv~}{ll}" format.name$ * } if$ } if$ } if$ } FUNCTION {author.key.label} { author empty$ { key empty$ { cite$ #1 #3 substring$ } 'key if$ } { author format.lab.names } if$ } FUNCTION {author.editor.key.label} { author empty$ { editor empty$ { key empty$ { cite$ #1 #3 substring$ } 'key if$ } { editor format.lab.names } if$ } { author format.lab.names } if$ } FUNCTION {editor.key.label} { editor empty$ { key empty$ { cite$ #1 #3 substring$ } 'key if$ } { editor format.lab.names } if$ } FUNCTION {calc.short.authors} { type$ "book" = type$ "inbook" = or 'author.editor.key.label { type$ "proceedings" = 'editor.key.label 'author.key.label if$ } if$ 'short.list := } FUNCTION {calc.label} { calc.short.authors short.list "(" * year duplicate$ empty$ short.list key field.or.null = or { pop$ "" } 'skip$ if$ * 'label := } FUNCTION {sort.format.names} { 's := #1 'nameptr := "" s num.names$ 'numnames := numnames 'namesleft := { namesleft #0 > } { s nameptr "{ll{ }}{ ff{ }}{ jj{ }}" format.name$ 't := nameptr #1 > { " " * namesleft #1 = t "others" = and { "zzzzz" * } { t sortify * } if$ } { t sortify * } if$ nameptr #1 + 'nameptr := namesleft #1 - 'namesleft := } while$ } FUNCTION {sort.format.title} { 't := "A " #2 "An " #3 "The " #4 t chop.word chop.word chop.word sortify #1 global.max$ substring$ } FUNCTION {author.sort} { author empty$ { key empty$ { "to sort, need author or key in " cite$ * warning$ "" } { key sortify } if$ } { author sort.format.names } if$ } FUNCTION {author.editor.sort} { author empty$ { editor empty$ { key empty$ { "to sort, need author, editor, or key in " cite$ * warning$ "" } { key sortify } if$ } { editor sort.format.names } if$ } { author sort.format.names } if$ } FUNCTION {editor.sort} { editor empty$ { key empty$ { "to sort, need editor or key in " cite$ * warning$ "" } { key sortify } if$ } { editor sort.format.names } if$ } FUNCTION {presort} { calc.label label sortify " " * type$ "book" = type$ "inbook" = or 'author.editor.sort { type$ "proceedings" = 'editor.sort 'author.sort if$ } if$ #1 entry.max$ substring$ 'sort.label := sort.label * " " * title field.or.null sort.format.title * #1 entry.max$ substring$ 'sort.key$ := } ITERATE {presort} SORT STRINGS { last.label next.extra } INTEGERS { last.extra.num number.label } FUNCTION {initialize.extra.label.stuff} { #0 int.to.chr$ 'last.label := "" 'next.extra := #0 'last.extra.num := #0 'number.label := } 
FUNCTION {forward.pass} { last.label label = { last.extra.num #1 + 'last.extra.num := last.extra.num int.to.chr$ 'extra.label := } { "a" chr.to.int$ 'last.extra.num := "" 'extra.label := label 'last.label := } if$ number.label #1 + 'number.label := } FUNCTION {reverse.pass} { next.extra "b" = { "a" 'extra.label := } 'skip$ if$ extra.label 'next.extra := extra.label duplicate$ empty$ 'skip$ { year field.or.null #-1 #1 substring$ chr.to.int$ #65 < { "{\natexlab{" swap$ * "}}" * } { "{(\natexlab{" swap$ * "})}" * } if$ } if$ 'extra.label := label extra.label * 'label := } EXECUTE {initialize.extra.label.stuff} ITERATE {forward.pass} REVERSE {reverse.pass} FUNCTION {bib.sort.order} { sort.label " " * year field.or.null sortify * " " * title field.or.null sort.format.title * #1 entry.max$ substring$ 'sort.key$ := } ITERATE {bib.sort.order} SORT FUNCTION {begin.bib} { preamble$ empty$ 'skip$ { preamble$ write$ newline$ } if$ "\begin{thebibliography}{" number.label int.to.str$ * "}" * write$ newline$ "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi" write$ newline$ } EXECUTE {begin.bib} EXECUTE {init.state.consts} ITERATE {call.type$} FUNCTION {end.bib} { newline$ "\end{thebibliography}" write$ newline$ } EXECUTE {end.bib} %% End of customized bst file %% %% End of file `compling.bst'.

================================================ FILE: chapters/coqa/dataset.tex ================================================
%!TEX root = ../../thesis.tex

\section{\sys{CoQA}: A Conversational QA Challenge}
\label{sec:coqa-dataset}

In this section, we introduce \sys{CoQA}, a novel dataset for building \tf{Co}nversational \tf{Q}uestion \tf{A}nswering systems. We develop \sys{CoQA} with three main goals in mind.

The first concerns the nature of questions in a human conversation. In the example shown in Figure~\ref{fig:coqa-example}, every question after the first depends on the conversation history. At present, there are no large-scale reading comprehension datasets which contain questions that depend on a conversation history, and this is what \sys{CoQA} is mainly developed for.\footnote{Concurrent with our work, \newcite{choi2018quac} also created a conversational dataset with a similar goal, but it differs in many key design decisions. We will discuss it in Section~\ref{sec:coqa-future}.}

The second goal of \sys{CoQA} is to ensure the naturalness of answers in a conversation. As we discussed in the earlier chapters, most existing reading comprehension datasets either restrict answers to a contiguous span in a given passage, or allow free-form answers with low human agreement (e.g., \sys{NarrativeQA}). Our desiderata are: 1) the answers should not be restricted to spans, so that anything can be asked and the conversation can flow naturally --- for example, there is no extractive answer for $Q_4$ \ti{How many?} in Figure~\ref{fig:coqa-example}; and 2) the dataset should still support reliable automatic evaluation with strong human performance. Therefore, we propose that the answers can be free-form text (abstractive answers), while the extractive spans act as rationales for the actual answers. Under this formulation, the answer for $Q_4$ is simply \ti{Three}, while its rationale spans multiple sentences.

The third goal of \sys{CoQA} is to enable building QA systems that perform robustly across domains. Current reading comprehension datasets mainly focus on a single domain, which makes it hard to test the generalization ability of existing models.
Hence we collect our dataset from seven different domains --- children's stories, literature, middle and high school English exams, news, Wikipedia, science articles and Reddit. The last two are used for out-of-domain evaluation.

\subsection{Task Definition}
\label{sec:coqa-task}

\begin{figure}[!t]
\begin{tabular}{p{\columnwidth}}
\toprule
The Virginia governor's race, billed as the marquee battle of an otherwise anticlimactic 2013 election cycle, is shaping up to be a foregone conclusion. Democrat Terry McAuliffe, the longtime political fixer and moneyman, hasn't trailed in a poll since May. Barring a political miracle, Republican Ken Cuccinelli will be delivering a concession speech on Tuesday evening in Richmond. In recent ...\\
\\
$Q_1$: What are the candidates {\bf \color{magenta} running} for?\\
$A_1$: Governor\\
$R_1$: The Virginia governor's race\\
\vspace{0em}
$Q_2$: {\bf \color{magenta} Where}?\\
$A_2$: Virginia \\
$R_2$: The Virginia governor's race\\
\vspace{0em}
$Q_3$: Who is the democratic candidate?\\
\vspace{-0.6em}{\bf \color{blue} A$_3$}: {\bf \color{orange} Terry McAuliffe} \\
$R_3$: Democrat Terry McAuliffe\\
\vspace{0em}
$Q_4$: Who is {\bf \color{orange} his} opponent?\\
\vspace{-0.6em}{\bf \color{blue} A$_4$}: {\bf \color{red} Ken Cuccinelli} \\
$R_4$: Republican Ken Cuccinelli\\
\vspace{0em}
$Q_5$: What party does {\bf \color{red} he} belong to?\\
$A_5$: Republican \\
$R_5$: Republican Ken Cuccinelli\\
\vspace{0em}
$Q_6$: Which of {\bf \color{blue} them} is winning?\\
$A_6$: Terry McAuliffe \\
$R_6$: Democrat Terry McAuliffe, the longtime political fixer and moneyman, hasn't trailed in a poll since May\\
\bottomrule
\end{tabular}
\longcaption{Another example in \sys{CoQA} with entity-of-focus changes}{\label{fig:coqa-example2}A conversation showing coreference chains in colors. The entity of focus changes in $Q_4$, $Q_5$, $Q_6$.}
\end{figure}

We first define the task formally. Given a passage $P$, a conversation consists of $n$ turns, where each turn consists of $(Q_i, A_i, R_i)$, $i = 1, \ldots, n$; here $Q_i$ and $A_i$ denote the question and the answer in the $i$-th turn, and $R_i$ is the rationale which supports the answer $A_i$ and must be a single span of the passage. The task is to answer the next question $Q_i$ given the conversation so far: $Q_1, A_1, \ldots, Q_{i-1}, A_{i-1}$. It is worth noting that we collect the rationales $R_i$ in the hope that they can help us understand how answers are derived and improve the training of our models, but \ti{they are not provided during evaluation}.

For the example in Figure~\ref{fig:coqa-example2}, the conversation begins with question $Q_1$. We answer $Q_1$ with $A_1$ based on the evidence $R_1$ from the passage. In this example, the answerer wrote only \ti{Governor} as the answer but selected a longer rationale, \ti{The Virginia governor's race}. When we come to $Q_2$ \ti{Where?}, we must refer back to the conversation history, since otherwise its answer could be \ti{Virginia} or \ti{Richmond} or something else. In our task, conversation history is indispensable for answering many questions. We use the conversation history $Q_1$ and $A_1$ to answer $Q_2$ with $A_2$ based on the evidence $R_2$. For an unanswerable question, we give \ti{unknown} as the final answer and do not highlight any rationale.

In this example, we observe that the entity of focus changes as the conversation progresses. The questioner uses \ti{his} to refer to \ti{Terry} in $Q_4$ and \ti{he} to refer to \ti{Ken} in $Q_5$.
If these are not resolved correctly, we end up with incorrect answers. The conversational nature of the questions requires us to reason from multiple sentences (the current question and the previous questions or answers, and sentences from the passage). It is common that a single question may require a rationale that spans multiple sentences (e.g., $Q_1$, $Q_4$ and $Q_5$ in Figure~\ref{fig:coqa-example}). We describe additional question and answer types in Section~\ref{sec:coqa-data-analysis}.

\subsection{Dataset Collection}

We detail our dataset collection process as follows. For each conversation, we employ two annotators, a questioner and an answerer. This setup has several advantages over using a single annotator to act as both questioner and answerer: 1) when two annotators chat about a passage, their dialogue flow is natural compared to chatting with oneself; 2) when one annotator responds with a vague question or an incorrect answer, the other can raise a flag, which we use to identify bad workers; and 3) the two annotators can discuss guidelines (through a separate chat window) when they have disagreements. These measures help to prevent spam and to obtain high-agreement data.\footnote{Due to AMT terms of service, we allowed a single worker to act both as a questioner and an answerer after a minute of waiting. This constitutes around 12\% of the data.}

\begin{figure}[!t]
\center
\includegraphics[scale=0.18]{img/coqa_questioner.png}
\longcaption{The questioner interface of \sys{CoQA}}{\label{fig:coqa-questioner}The questioner interface of our \sys{CoQA} dataset.}
\end{figure}

\begin{figure}[!t]
\center
\includegraphics[scale=0.18]{img/coqa_answerer.png}
\longcaption{The answerer interface of \sys{CoQA}}{\label{fig:coqa-answerer}The answerer interface of our \sys{CoQA} dataset.}
\end{figure}

We use Amazon Mechanical Turk (AMT) to pair workers on a passage, for which we use the ParlAI MTurk API \cite{miller2017parlai}. On average, each passage costs 3.6 USD for conversation collection and another 4.5 USD for collecting three additional answers for the development and test data.

\paragraph{Collection interface.} We have different interfaces for the questioner and the answerer (Figure~\ref{fig:coqa-questioner} and Figure~\ref{fig:coqa-answerer}). A questioner's role is to ask questions, and an answerer's role is to answer questions in addition to highlighting rationales. We want questioners to avoid using exact words from the passage in order to increase lexical diversity. When they type a word that is already present in the passage, we alert them to paraphrase the question if possible. For the answers, we want answerers to stick to the vocabulary of the passage in order to limit the number of possible answers. We encourage this by automatically copying the highlighted text into the answer box and allowing them to edit the copied text in order to generate a natural answer. We found that 78\% of the answers had at least one edit, such as changing a word's case or adding punctuation.
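The paraphrasing alert described above amounts to a set intersection between the question's content words and the passage's vocabulary. Below is a minimal sketch of such a check; the tokenizer, the stopword list and the function names are illustrative assumptions, not the actual implementation behind the \sys{CoQA} interface.

```python
# A minimal sketch of the questioner-side alert: flag question words that
# already appear in the passage so the worker can paraphrase. The stopword
# list and tokenizer here are illustrative, not the real interface's.
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "was", "and", "what", "who"}

def content_words(text):
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS

def overlap_alert(question, passage):
    """Return question words already present in the passage (empty set = no alert)."""
    return content_words(question) & content_words(passage)

passage = "Democrat Terry McAuliffe hasn't trailed in a poll since May."
print(overlap_alert("Has McAuliffe trailed in any poll?", passage))
# {'mcauliffe', 'trailed', 'poll'} -> alert the questioner to paraphrase
```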
\paragraph{Passage selection.} We select passages from seven diverse domains: children's stories from MCTest \cite{richardson2013mctest}, literature from Project Gutenberg\footnote{Project Gutenberg \url{https://www.gutenberg.org}}, middle and high school English exams from RACE \cite{lai2017race}, news articles from CNN \cite{hermann2015teaching}, articles from Wikipedia, science articles from AI2 Science Questions \cite{welbl2017crowdsourcing} and Reddit articles from the Writing Prompts dataset \cite{fan2018hierarchical}.

Not all passages in these domains are equally good for generating interesting conversations. A passage with just one entity often results in questions that focus entirely on that entity. We therefore select passages with multiple entities, events and pronominal references using Stanford \sys{CoreNLP} \cite{manning2014stanford}. We truncate long articles to the first few paragraphs, resulting in around 200 words per passage.

Table~\ref{tab:coqa-domains} shows the distribution of domains. We reserve the Science and Reddit domains for out-of-domain evaluation. For each in-domain dataset, we split the data such that there are 100 passages in the development set, 100 passages in the test set, and the rest in the training set. In contrast, for each out-of-domain dataset, we just have 100 passages in the test set, without any passages in the training or development sets.

\begin{table}
\centering
\begin{tabular}{lrrrr}
\toprule
\tf{Domain} & \tf{\# Passages} & \tf{\# Q/A} & \tf{Passage} & \tf{\# Turns per} \\
 & & \tf{pairs} & \tf{length} & \tf{passage} \\
\midrule
Children's Stories & 750 & 10.5k & 211 & 14.0 \\
Literature & 1,815 & 25.5k & 284 & 15.6 \\
Mid/High School Exams & 1,911 & 28.6k & 306 & 15.0 \\
News & 1,902 & 28.7k & 268 & 15.1 \\
Wikipedia & 1,821 & 28.0k & 245 & 15.4 \\
\midrule
\multicolumn{5}{c}{Out of domain} \\
\midrule
Science & 100 & 1.5k & 251 & 15.3 \\
Reddit & 100 & 1.7k & 361 & 16.6 \\
\midrule
Total & 8,399 & 127k & 271 & 15.2 \\
\bottomrule
\end{tabular}
\longcaption{Distribution of domains in \sys{CoQA}}{\label{tab:coqa-domains} Distribution of domains in \sys{CoQA}.}
\end{table}

\paragraph{Collecting multiple answers.} Some questions in \sys{CoQA} may have multiple valid answers. For example, another answer for $Q_4$ in Figure~\ref{fig:coqa-example2} is \ti{A Republican candidate}. In order to account for answer variations, we collect three additional answers for all questions in the development and test data. Since our data is conversational, questions influence answers, which in turn influence the follow-up questions. In the previous example, if the original answer was \ti{A Republican candidate}, then the following question \ti{Which party does he belong to?} would not have occurred in the first place. When we show questions from an existing conversation to new answerers, it is likely they will deviate from the original answers, which makes the conversation incoherent. It is thus important to bring them to a common ground with the original answer.

We achieve this by turning the answer collection task into a game of predicting the original answers. First, we show a question to a new answerer, and when she answers it, we show the original answer and ask her to verify whether her answer matches the original. For the next question, we ask her to guess the original answer and verify again. We repeat this process until the conversation is complete. In our pilot experiment, the human F1 score increased by 5.4\% when we used this verification setup.
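With four reference answers per development and test question (the original plus the three additional ones), a prediction can be scored by word-level F1 against the closest reference, as in \sys{SQuAD}-style evaluation. Below is a minimal sketch of that metric; the official evaluation script additionally normalizes case, punctuation and articles, which is omitted here.

```python
# A minimal sketch of word-level F1 against multiple reference answers,
# in the style of SQuAD/CoQA evaluation scripts. Answer normalization
# (case, punctuation, articles) is omitted for brevity.
from collections import Counter

def f1(prediction, reference):
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def max_f1(prediction, references):
    # Score against each collected reference and keep the best match.
    return max(f1(prediction, ref) for ref in references)

print(max_f1("a republican candidate", ["Ken Cuccinelli", "A Republican candidate"]))  # 1.0
```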
\subsection{Dataset Analysis}
\label{sec:coqa-data-analysis}

What makes the \sys{CoQA} dataset conversational compared to existing reading comprehension datasets like \sys{SQuAD}? How does the conversation flow from one turn to the next? What linguistic phenomena do the questions in \sys{CoQA} exhibit? We answer these questions below.

\paragraph{Comparison with \sys{SQuAD 2.0}.}

\begin{figure}[ht]
\begin{center}
\includegraphics[height=8cm]{img/coqa_squad_comparison.pdf}
\end{center}
\longcaption{A comparison of questions in \sys{CoQA} and \sys{SQuAD 2.0}}{\label{fig:coqa-squad-comparison} Distribution of trigram prefixes of questions in \sys{SQuAD 2.0} and \sys{CoQA}.}
\end{figure}

In the following, we perform an in-depth comparison of \sys{CoQA} and \sys{SQuAD 2.0}~\cite{rajpurkar2018know}. Figure~\ref{fig:coqa-squad-comparison} shows the distribution of frequent trigram prefixes. While coreferences are non-existent in \sys{SQuAD 2.0}, almost every sector of \sys{CoQA} contains coreferences (\ti{he, him, she, it, they}), indicating that \sys{CoQA} is highly conversational. Because of the free-form nature of answers, we expect a richer variety of questions in \sys{CoQA} than in \sys{SQuAD 2.0}. While nearly half of the \sys{SQuAD 2.0} questions are \ti{what} questions, the distribution of \sys{CoQA} is spread across multiple question types. Several sectors, indicated by the prefixes \ti{did, was, is, does} and \ti{and}, are frequent in \sys{CoQA} but completely absent in \sys{SQuAD 2.0}.

Since a conversation is spread over multiple turns, we expect conversational questions and answers to be shorter than in a standalone interaction. In fact, questions in \sys{CoQA} can be made up of just one or two words (\ti{who?}, \ti{when?}, \ti{why?}). As seen in Table~\ref{tab:squad-coqa-length}, on average, a question in \sys{CoQA} is only 5.5 words long, compared to 10.1 for \sys{SQuAD 2.0}. The answers are also usually shorter in \sys{CoQA} than in \sys{SQuAD 2.0}.

Table~\ref{tab:squad-coqa-answers} provides insights into the types of answers in \sys{SQuAD 2.0} and \sys{CoQA}. While the original version of \sys{SQuAD} \cite{rajpurkar2016squad} does not have any unanswerable questions, \sys{SQuAD 2.0} \cite{rajpurkar2018know} focuses on collecting them, resulting in a higher frequency of unanswerable questions than in \sys{CoQA}. \sys{SQuAD 2.0} has 100\% extractive answers by design, whereas in \sys{CoQA}, 66.8\% of the answers can be classified as extractive after ignoring punctuation and case mismatches.\footnote{If punctuation and case are not ignored, only 37\% of the answers are extractive.} This is higher than we anticipated. Our conjecture is that human factors such as wage may have influenced workers to ask questions that elicit faster responses by selecting text. It is worth noting that \sys{CoQA} has 11.1\% and 8.7\% of questions with \ti{yes} or \ti{no} as answers, whereas \sys{SQuAD 2.0} has almost none. Both datasets have a high number of named entities and noun phrases as answers.
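The trigram-prefix statistics behind Figure~\ref{fig:coqa-squad-comparison} amount to simple counting over tokenized questions. The snippet below is a toy illustration of that computation, not the actual plotting script under img/scripts/; the tokenization is an assumption.

```python
# A toy illustration of counting trigram prefixes of questions, the
# statistic visualized in the question-distribution figure.
from collections import Counter

def trigram_prefix(question):
    tokens = question.rstrip("?").lower().split()
    return " ".join(tokens[:3])

questions = [
    "What are the candidates running for?",
    "Who is the democratic candidate?",
    "Who is his opponent?",
    "What party does he belong to?",
]
print(Counter(trigram_prefix(q) for q in questions).most_common())
# e.g. [('what are the', 1), ('who is the', 1), ('who is his', 1), ('what party does', 1)]
```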
\begin{table}[h]
\centering
\begin{tabular}{p{3cm} r r}
\toprule
 & \bf \sys{SQuAD 2.0} & \bf \sys{CoQA} \\
\midrule
Passage Length & 117 & 271 \\
Question Length & 10.1 & 5.5 \\
Answer Length & 3.2 & 2.7 \\
\bottomrule
\end{tabular}
\longcaption{Data statistics in \sys{SQuAD 2.0} and \sys{CoQA}}{\label{tab:squad-coqa-length} Average number of words in the passage, question and answer in \sys{SQuAD 2.0} and \sys{CoQA}.}
\end{table}

\begin{table}[h]
\centering
\begin{tabular}{p{3.5cm} r r}
\toprule
 & \bf \sys{SQuAD 2.0} & \bf \sys{CoQA} \\
\midrule
Answerable & 66.7\% & 98.7\% \\
Unanswerable & 33.3\% & 1.3\% \\
\midrule
Extractive & 100.0\% & 66.8\% \\
Abstractive & 0.0\% & 33.2\% \\
\midrule
Named Entity & 35.9\% & 28.7\% \\
Noun Phrase & 25.0\% & 19.6\% \\
Yes & 0.0\% & 11.1\% \\
No & 0.1\% & 8.7\% \\
Number & 16.5\% & 9.8\% \\
Date/Time & 7.1\% & 3.9\% \\
Other & 15.5\% & 18.1\% \\
\bottomrule
\end{tabular}
\longcaption{Distribution of answer types in \sys{SQuAD 2.0} and \sys{CoQA}}{\label{tab:squad-coqa-answers} Distribution of answer types in \sys{SQuAD 2.0} and \sys{CoQA}.}
\end{table}

\paragraph{Conversation flow.} A coherent conversation must have smooth transitions between turns. We expect the narrative structure of the passage to influence our conversation flow. We split the passage into 10 uniform chunks, and identify the chunk of interest for a given turn, and its transitions, based on rationale spans.

\begin{figure}[!t]
\begin{center}
\includegraphics[height=9cm]{img/coqa_conversation_flow.pdf}
\end{center}
\longcaption{Conversation flow in \sys{CoQA}}{\label{fig:coqa-conversation-flow} Chunks of interest as a conversation progresses. The x-axis indicates the turn number and the y-axis indicates the passage chunk containing the rationale. The height of a chunk indicates the concentration of conversation in that chunk. The width of the bands is proportional to the frequency of transition between chunks from one turn to the next.}
\end{figure}

Figure~\ref{fig:coqa-conversation-flow} portrays the conversation flow over the first 10 turns. The starting turns tend to focus on the first few chunks, and as the conversation advances, the focus shifts to the later chunks. Moreover, the turn transitions are smooth, with the focus often remaining in the same chunk or moving to a neighboring chunk. The most frequent transitions are to the first and the last chunks, and likewise these chunks have diverse outward transitions.

\paragraph{Linguistic phenomena.}

\begin{table}[!t]
\centering
\small
\begin{tabular}{lp{7cm}c}
\toprule
\bf Phenomenon & \bf Example & \bf Percentage \\
\midrule
\multicolumn{3}{c}{Relationship between a question and its passage} \\
\midrule
Lexical match & Q: Who had to rescue her? & 29.8\% \\
 & A: the coast guard \\
 & R: Outen was rescued by the coast guard \\
Paraphrasing & Q: Did the wild dog approach? & 43.0\% \\
 & A: Yes \\
 & R: he drew cautiously closer \\
Pragmatics & Q: Is Joey a male or female? & 27.2\% \\
 & A: Male \\
 & R: it looked like a stick man so she kept \textbf{him}. She named her new noodle friend Joey \\
\midrule
\multicolumn{3}{c}{Relationship between a question and its conversation history} \\
\midrule
No coreference & Q: What is IFL? & 30.5\% \\
Explicit coreference & Q: Who had Bashti forgotten? & 49.7\% \\
 & A: the puppy \\
 & Q: What was \textbf{his} name? \\
Implicit coreference & Q: When will Sirisena be sworn in?
& 19.8\% \\
 & A: 6 p.m local time \\
 & Q: \textbf{Where}?\\
\bottomrule
\end{tabular}
\longcaption{Linguistic phenomena in \sys{CoQA} questions}{\label{tab:ling-phenomena}Linguistic phenomena in \sys{CoQA} questions.}
\end{table}

We further analyze the questions for their relationship with the passages and the conversation history. We sample 150 questions in the development set and annotate various phenomena as shown in Table~\ref{tab:ling-phenomena}.

If a question contains at least one content word that appears in the passage, we classify it as \ti{lexical match}. These comprise around 29.8\% of the questions. If it has no lexical match but is a paraphrase of the rationale, we classify it as \ti{paraphrasing}. These questions contain phenomena such as synonymy, antonymy, hypernymy, hyponymy and negation. They constitute a large portion of the questions, around 43.0\%. The rest, 27.2\%, have no lexical cues, and we classify them under \ti{pragmatics}. These include phenomena like common sense and presupposition. For example, the question \ti{Was he loud and boisterous?} is not a direct paraphrase of the rationale \ti{he dropped his feet with the lithe softness of a cat}, but the rationale combined with world knowledge can answer this question.

As for the relationship between a question and its conversation history, we classify questions according to whether they are dependent on or independent of the conversation history and, if dependent, whether they contain an explicit coreference marker. As a result, around 30.5\% of the questions do not rely on coreference with the conversation history and are answerable on their own. Almost half of the questions (49.7\%) contain explicit coreference markers such as \ti{he, she, it}. These either refer to an entity or an event introduced in the conversation. The remaining 19.8\% do not have explicit coreference markers but refer to an entity or event implicitly.

================================================
FILE: chapters/coqa/discussions.tex
================================================
%!TEX root = ../../thesis.tex

\section{Discussion}
\label{sec:coqa-future}

So far, we have discussed the \sys{CoQA} dataset and several competitive baselines based on conversational models and reading comprehension models. We hope that our efforts can enable the first step toward building conversational QA agents. On the one hand, we think there is ample room for further improving performance on \sys{CoQA}: our hybrid system obtains an F1 score of 65.1\%, which is still 23.7 points behind the human performance (88.8\%). We encourage our research community to work on this dataset and push the limits of conversational question answering models. We think there are several directions for further improvement:
\begin{itemize}
\item All the baseline models we built only use the conversation history by simply concatenating the previous questions and answers with the current question. We think that there should be better ways to connect the history and the current question. For the questions in Table~\ref{tab:ling-phenomena}, we should build models which actually understand that \ti{his} in the question \ti{What was his name?} refers to \ti{the puppy}, and that the question \ti{Where?} means \ti{Where will Sirisena be sworn in?}. Indeed, a recent model, \sys{FlowQA}~\cite{huang2018flowqa}, proposed a solution to effectively stack single-turn models along the conversational flow and demonstrated state-of-the-art performance on \sys{CoQA}.
\item Our hybrid model aims to combine the advantages of the span prediction reading comprehension models and the pointer-generator network model to address the nature of abstractive answers. However, we implemented it as a pipeline model, so the performance of the second component depends on whether the reading comprehension model can extract the right piece of evidence from the passage. We think that it is desirable to build an end-to-end model which can extract rationales while also rewriting the rationale into the final answer.
\item We think the rationales that we collected can be better leveraged in training models.
\end{itemize}

On the other hand, \sys{CoQA} certainly has its limitations and we should explore more challenging and more useful datasets in the future. One clear limitation is that the conversations in \sys{CoQA} are only turns of question and answer pairs. That means the answerer is only responsible for answering questions; she cannot ask clarification questions of her own or otherwise communicate with the questioner. Another problem is that \sys{CoQA} has very few (1.3\%) unanswerable questions, which we think are crucial in practical conversational QA systems.

In parallel to our work, \newcite{choi2018quac} also created a dataset of conversations in the form of questions and answers on text passages. In our interface, we show a passage to both the questioner and the answerer, whereas their interface only shows a title to the questioner and the full passage to the answerer. Since their setup encourages the answerer to reveal more information for the following questions, their answers are as long as 15.1 words on average (ours is 2.7). While the human performance on our test set is 88.8 F1, theirs is 74.6 F1. Moreover, while \sys{CoQA}'s answers can be abstractive, their answers are restricted to only extractive text spans. Our dataset contains passages from seven diverse domains, whereas their dataset is built only from Wikipedia articles about people.

Also, concurrently, \newcite{saeidi2018interpretation} created a conversational QA dataset for regulatory text such as tax and visa regulations. Their answers are limited to \textit{yes} or \textit{no}, although a positive characteristic of their setup is that it permits asking clarification questions when a given question cannot be answered.

================================================
FILE: chapters/coqa/experiments.tex
================================================
%!TEX root = ../../thesis.tex

\section{Experiments}
\label{sec:coqa-experiments}

\subsection{Setup}

For the \sys{seq2seq} and \sys{PGNet} experiments, we use the \sys{OpenNMT} toolkit \cite{klein2017opennmt}. For the reading comprehension experiments, we use the same implementation that we used for \sys{SQuAD}~\cite{chen2017reading}. We tune the hyperparameters on the development data: the number of turns to use from the conversation history, the number of layers, the number of hidden units per layer, and the dropout rate. We initialize the word projection matrix with \sys{GloVe} \cite{pennington2014glove} for conversational models and \sys{fastText} \cite{bojanowski2017enriching} for reading comprehension models, based on empirical performance. We update the projection matrix during training in order to learn embeddings for delimiters such as $\mathrm{<}q\mathrm{>}$. For all the \sys{seq2seq} and \sys{PGNet} experiments, we use the default settings of \sys{OpenNMT}: 2 layers of LSTMs with $500$ hidden units for both the encoder and the decoder.
The models are optimized using SGD, with an initial learning rate of $1.0$ and a decay rate of $0.5$. A dropout rate of $0.3$ is applied to all layers. For all the reading comprehension experiments, the best configuration we find is 3 layers of LSTMs with $300$ hidden units for each layer. A dropout rate of $0.4$ is applied to all LSTM layers and a dropout rate of $0.5$ is applied to word embeddings.

\subsection{Experimental Results}

Table~\ref{tab:coqa-results} presents the results of the models on the development and the test data. Considering the results on the test set, the \sys{seq2seq} model performs the worst, generating frequently occurring answers irrespective of whether these answers appear in the passage or not, a well-known behavior of conversational models \cite{li2016diversity}. \sys{PGNet} alleviates the frequent-response problem by focusing on the vocabulary in the passage, and it outperforms \sys{seq2seq} by 17.8 points. However, it still lags behind \sys{Stanford Attentive Reader} by 8.5 points. A reason could be that \sys{PGNet} has to memorize the whole passage before answering a question, a huge overhead which \sys{Stanford Attentive Reader} avoids. But \sys{Stanford Attentive Reader} fails miserably in answering questions with free-form answers (see row \textit{Abstractive} in Table~\ref{tab:error-analysis}). When the output of \sys{Stanford Attentive Reader} is fed into \sys{PGNet}, we empower both \sys{Stanford Attentive Reader} and \sys{PGNet} --- \sys{Stanford Attentive Reader} in producing free-form answers; \sys{PGNet} in focusing on the rationale instead of the passage. This combination outperforms the \sys{PGNet} and \sys{Stanford Attentive Reader} models by 21.0 and 12.5 points respectively.

\begin{table}
\small
\centering
\begin{tabular}{l | c c c c c | c c | c}
\hline
& \multicolumn{5}{c|}{\tf{In-domain}} & \multicolumn{2}{c|}{\tf{Out-of-domain}} & \tf{Overall} \\
& Children & Literature & Exam & News & Wikipedia & Reddit & Science & \\
\hline
\multicolumn{9}{c}{\tf{Development data}}\\
\hline
\sys{seq2seq} & 30.6 & 26.7 & 28.3 & 26.3 & 26.1 & N/A & N/A & 27.5 \\
\sys{PGNet} & 49.7 & 42.4 & 44.8 & 45.5 & 45.0 & N/A & N/A & 45.4 \\
\sys{SAR} & 52.4 & 52.6 & 51.4 & 56.8 & 60.3 & N/A & N/A & 54.7 \\
\sys{Hybrid} & \bf 64.5 & \bf 62.0 & \bf 63.8 & \bf 68.0 & \bf 72.6 & N/A & N/A & \bf 66.2 \\
\sys{Human} & 90.7 & 88.3 & 89.1 & 89.9 & 90.9 & N/A & N/A & 89.8 \\
\hline
\multicolumn{9}{c}{\tf{Test data}}\\
\hline
\sys{seq2seq} & 32.8 & 25.6 & 28.0 & 27.0 & 25.3 & 25.6 & 20.1 & 26.3 \\
\sys{PGNet} & 49.0 & 43.3 & 47.5 & 47.5 & 45.1 & 38.6 & 38.1 & 44.1 \\
\sys{SAR} & 46.7 & 53.9 & 54.1 & 57.8 & 59.4 & 45.0 & 51.0 & 52.6 \\
\sys{Hybrid} & \bf 64.2 & \bf 63.7 & \bf 67.1 & \bf 68.3 & \bf 71.4 & \bf 57.8 & \bf 63.1 & \bf 65.1 \\
\sys{Human} & 90.2 & 88.4 & 89.8 & 88.6 & 89.9 & 86.7 & 88.1 & 88.8 \\
\hline
\end{tabular}
\longcaption{Models and human performance on \sys{CoQA}}{\label{tab:coqa-results}Models and human performance (F1 score) on the development and the test data. \sys{SAR}: \sys{Stanford Attentive Reader}.}
\end{table}

\paragraph{Models vs. Humans.} The human performance on the test data is 88.8 F1, a strong agreement indicating that the \sys{CoQA} questions have concrete answers. Our best model is 23.7 points behind humans, suggesting that the task is difficult to accomplish with current models. We anticipate that using a state-of-the-art reading comprehension model \cite{devlin2018bert} may improve the results by a few points.
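For reference, all the scores above are word-level F1, the standard metric for reading comprehension evaluation. The following is a minimal Python sketch of this metric; the official evaluation script additionally normalizes answers (lowercasing, stripping punctuation and articles) and takes the maximum over multiple human references, which we omit here:

\begin{verbatim}
from collections import Counter

def f1_score(prediction, gold):
    # word-level F1 between a predicted and a gold answer string
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# f1_score("she loved it", "loved it") == 0.8
\end{verbatim}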
\paragraph{In-domain~vs.~Out-of-domain.} All models perform worse on out-of-domain datasets compared to in-domain datasets. The best model drops by 6.6 points. For the in-domain results, both the best model and humans find the literature domain harder than the others, since the vocabulary of literature requires more advanced proficiency in English. For the out-of-domain results, the Reddit domain is apparently harder. This could be because Reddit requires reasoning over longer passages (see Table~\ref{tab:coqa-domains}). While humans achieve high performance on children's stories, models perform poorly, probably due to the fewer training examples in this domain compared to others.\footnote{We collect children's stories from MCTest which contains only 660 passages in total, of which we use 200 stories for development and test.} Both humans and models find Wikipedia easy.

\subsection{Error Analysis}

\begin{table}[!t]
\centering
\begin{tabular}{p{4cm}ccccc}
\toprule
\tf{Type} & \sys{seq2seq} & \sys{PGNet} & \sys{SAR} & \sys{Hybrid} & \sys{Human}\\
\midrule
\multicolumn{6}{c}{\tf{Answer Type}} \\
\midrule
Answerable & 27.5 & 45.4 & 54.7 & 66.3 & 89.9 \\
Unanswerable & 33.9 & 38.2 & 55.0 & 51.2 & 72.3 \\
\midrule
Extractive & 20.2 & 43.6 & 69.8 & 70.5 & 91.1 \\
Abstractive & 43.1 & 49.0 & 22.7 & 57.0 & 86.8 \\
\midrule
Named Entity & 21.9 & 43.0 & 72.6 & 72.2 & 92.2 \\
Noun Phrase & 17.2 & 37.2 & 64.9 & 64.1 & 88.6 \\
Yes & 69.6 & 69.9 & 7.9\; & 72.7 & 95.6 \\
No & 60.2 & 60.3 & 18.4 & 58.7 & 95.7 \\
Number & 15.0 & 48.6 & 66.3 & 71.7 & 91.2 \\
Date/Time & 13.7\; & 50.2 & 79.0 & 79.1 & 91.5 \\
Other & 14.1 & 33.7 & 53.5 & 55.2 & 80.8 \\
\midrule
\multicolumn{6}{c}{\tf{Question Type}} \\
\midrule
Lexical Matching & 20.7 & 40.7 & 57.2 & 65.7 & 91.7 \\
Paraphrasing & 23.7 & 33.9 & 46.9 & 64.4 & 88.8 \\
Pragmatics & 33.9 & 43.1 & 57.4 & 60.6 & 84.2 \\
\midrule
No coreference & 16.1 & 31.7 & 54.3 & 57.9 & 90.3 \\
Explicit coreference & 30.4 & 42.3 & 49.0 & 66.3 & 87.1 \\
Implicit coreference & 31.4 & 39.0 & 60.1 & 66.4 & 88.7 \\
\bottomrule
\end{tabular}
\longcaption{Error analysis on \sys{CoQA}}{\label{tab:error-analysis} Fine-grained results of different question and answer types in the development set. For the question type results, we only analyze 150 questions as described in Section~\ref{sec:coqa-data-analysis}.}
\end{table}

Table~\ref{tab:error-analysis} presents fine-grained results of models and humans on the development set. We observe that humans have the highest disagreement on the unanswerable questions. Sometimes, people guess an answer even when it is not present in the passage, e.g., one can guess the age of \textit{Annie} in Figure~\ref{fig:coqa-example} based on her \textit{grandmother}'s age. The human agreement on abstractive answers is lower than on extractive answers. This is expected because our evaluation metric is based on word overlap rather than on the meaning of words. For the question \textit{did Jenny like her new room?}, the human answers \textit{she loved it} and \textit{yes} are both accepted. Finding the perfect evaluation metric for abstractive responses is still a challenging problem \cite{liu2016not} and beyond the scope of our work. For our models' performance, \sys{seq2seq} and \sys{PGNet} perform well on the questions with abstractive answers, and \sys{Stanford Attentive Reader} performs well on the questions with extractive answers, due to their respective designs. The combined model improves on both categories.
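The fine-grained numbers in Table~\ref{tab:error-analysis} are simply per-type averages of the per-question F1 scores. A minimal sketch, assuming each development question carries a type annotation and an F1 score (the names are illustrative):

\begin{verbatim}
from collections import defaultdict

def fine_grained_f1(examples):
    # examples: iterable of (answer_type, f1) pairs, one per
    # development question, using the annotations described above
    sums, counts = defaultdict(float), defaultdict(int)
    for answer_type, f1 in examples:
        sums[answer_type] += f1
        counts[answer_type] += 1
    return {t: 100.0 * sums[t] / counts[t] for t in sums}
\end{verbatim}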
Among the lexical question types, humans find the questions with lexical matches the easiest, followed by paraphrasing, and the questions with pragmatics the hardest --- this is expected since questions with lexical matches and paraphrasing share some similarity with the passage, making them relatively easier to answer than pragmatic questions. The best model also follows the same trend. While humans find the questions without coreferences easier than those with coreferences (explicit or implicit), the models behave inconsistently. It is not clear why humans find implicit coreferences easier than explicit coreferences. A conjecture is that implicit coreferences depend directly on the previous turn, whereas explicit coreferences may have long-distance dependencies on the conversation.

\paragraph{Importance of conversation history.} Finally, we examine how important the conversation history is for the dataset. Table \ref{tab:ablations} presents the results with a varied number of previous turns used as conversation history. All models succeed at leveraging history, but only up to a history of one previous turn (except \sys{PGNet}). It is surprising that using more turns can decrease the performance.

We also perform an experiment on humans to measure the trade-off between their performance and the number of previous turns shown. Based on the heuristic that short questions likely depend on the conversation history, we sample 300 one- or two-word questions, and collect answers to these while varying the number of previous turns shown. When we do not show any history, human performance drops to 19.9 F1, as opposed to 86.4 F1 when the full history is shown. When the previous question and answer are shown, performance rises to 79.8 F1, suggesting that the previous turn plays an important role in making sense of the current question. If the last two questions and answers are shown, performance reaches 85.3 F1, close to the performance when the full history is shown. This suggests that most questions in a conversation have a limited dependency, within a bound of two turns.

\begin{table}[!t]
\centering
\begin{tabular}{ccccc}
\toprule
\tf{history size} & \sys{seq2seq} & \sys{PGNet} & \sys{SAR} & \sys{Hybrid} \\
\midrule
0 & 24.0 & 41.3 & 50.4 & 61.5 \\
1 & 27.5 & 43.9 & 54.7 & 66.2 \\
2 & 21.4 & 44.6 & 54.6 & 66.0 \\
all & 21.0 & 45.4 & 52.3 & 64.3 \\
\bottomrule
\end{tabular}
\longcaption{\sys{CoQA} results on the development set with different history sizes}{\label{tab:ablations} Results on the development set with different history sizes. History size indicates the number of previous turns prepended to the current question. Each turn contains a question and its answer. \sys{SAR}: \sys{Stanford Attentive Reader}. }
\end{table}

================================================
FILE: chapters/coqa/intro.tex
================================================
%!TEX root = ../../thesis.tex
% \section{Introduction}

In the last chapter, we discussed how we built a general-knowledge question-answering system from neural reading comprehension. However, most current QA systems are limited to answering isolated questions, i.e., every time we ask a question, the systems return an answer without the ability to consider any context. In this chapter, we set out to tackle another challenging problem: \ti{Conversational Question Answering}, where a machine has to understand a text passage and answer a series of questions that appear in a conversation.
Humans gather information by engaging in conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. Figure~\ref{fig:coqa-example} shows a conversation between two humans who are reading a passage, one acting as a questioner and the other as an answerer. In this conversation, every question after the first is dependent on the conversation history. For instance, $Q_5$ \ti{Who?} is only a single word and is impossible to answer without knowing what has already been said. Posing short questions is an effective human conversation strategy, but such questions are really difficult for machines to parse. Therefore, conversational question answering combines the challenges from both dialogue and reading comprehension. We believe that building systems which are able to answer such conversational questions will play a crucial role in our future conversational AI systems. To approach this problem, we need to build effective \ti{datasets} and conversational QA \ti{models} and we will describe both of them in this chapter. \begin{figure}[!t] \begin{tabular}{p{0.9\columnwidth}} \midrule Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80. Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well. Jessica had $\ldots$\\ \\ $Q_1$: Who had a birthday? \\ $A_1$: Jessica \\ $R_1$: Jessica went to sit in her rocking chair. Today was her birthday and she was turning 80.\\ \vspace{0em} $Q_2$: How old would she be?\\ $A_2$: 80 \\ $R_2$: she was turning 80 \\ \vspace{0em} $Q_3$: Did she plan to have any visitors?\\ $A_3$: Yes \\ $R_3$: Her granddaughter Annie was coming over \\ \vspace{0em} $Q_4$: How many?\\ $A_4$: Three \\ $R_4$: Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well. \\ \vspace{0em} $Q_5$: Who?\\ $A_5$: Annie, Melanie and Josh \\ $R_5$: Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well.\\ \bottomrule \end{tabular} \longcaption{A conversation from \sys{CoQA}}{\label{fig:coqa-example} A conversation from our \sys{CoQA} dataset. Each turn contains a question ($Q_i$), an answer ($A_i$) and a rationale ($R_i$) that supports the answer.} \end{figure} This chapter is organized as follows. We first discuss related work in Section~\ref{sec:coqa-rw} and then we introduce \sys{CoQA}~\cite{reddy2019coqa} in Section~\ref{sec:coqa-dataset}, a \textbf{Co}nversational \textbf{Q}uestion \textbf{A}nswering challenge for measuring the ability of machines to participate in a question-answering style conversation.\footnote{We launch \sys{CoQA} as a challenge to the community at \href{https://stanfordnlp.github.io/coqa/}{https://stanfordnlp.github.io/coqa/}.} Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. We define our task and describe the dataset collection process. We also analyze the dataset in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. 
Next, we describe several strong conversational and reading comprehension models we built for \sys{CoQA} in Section~\ref{sec:coqa-models} and present experimental results in Section~\ref{sec:coqa-experiments}. Finally, we discuss future work on conversational question answering (Section~\ref{sec:coqa-future}).

================================================
FILE: chapters/coqa/models.tex
================================================
%!TEX root = ../../thesis.tex

\section{Models}
\label{sec:coqa-models}

Given a passage $p$, the conversation history \{$q_1, a_1, \ldots q_{i-1}, a_{i-1}$\} and a question $q_i$, the task is to predict the answer ${a_i}$. Our task can be modeled as either a conversational response generation problem or a reading comprehension problem. We evaluate strong baselines from each class of models, as well as a combination of the two, on \sys{CoQA}.

\subsection{Conversational Models}

\begin{figure}[!t]
\begin{center}
\includegraphics[height=9.5cm]{img/coqa_pgnet.pdf}
\end{center}
\longcaption{The pointer-generator network used for conversational question answering}{\label{fig:coqa-pgnet} The pointer-generator network used for conversational question answering. The figure is adapted from \newcite{see2017get}.}
\end{figure}

The basic goal of conversational models is to predict the next utterance based on the conversation history. Sequence-to-sequence (seq2seq) models~\cite{sutskever2014sequence} have shown promising results for generating conversational responses \cite{vinyals2015neural,li2016diversity,zhang2018personalizing}. Motivated by their success, we use a standard sequence-to-sequence model with an attention mechanism for generating answers. We append the passage, the conversation history (the question/answer pairs in the last $n$ turns) and the current question as, $p\; \mathrm{<}q\mathrm{>}\; q_{i-n} \;\mathrm{<}a\mathrm{>}\; a_{i-n}\; \ldots$ $\mathrm{<}q\mathrm{>}\; q_{i-1} \;\mathrm{<}a\mathrm{>}\; a_{i-1}\;$ $\mathrm{<}q\mathrm{>}\;q_i$, and feed it into a bidirectional LSTM encoder, where $\mathrm{<}q\mathrm{>}$ and $\mathrm{<}a\mathrm{>}$ are special tokens used as delimiters. We then generate the answer using an LSTM decoder which attends to the encoder states.

Moreover, as the answer words are likely to appear in the original passage, we adopt a copy mechanism in the decoder, originally proposed for summarization tasks \cite{gu2016incorporating,see2017get}, which allows the decoder to (optionally) copy a word from the passage or the conversation history. We call this model the Pointer-Generator network~\cite{see2017get}, \sys{PGNet}. Figure~\ref{fig:coqa-pgnet} illustrates the full \sys{PGNet} model.

Formally, we denote the encoder hidden vectors by $\{\tilde{\mf{h}}_i\}$, the decoder state at timestep $t$ by $\mf{h}_t$, and the input vector by $\mf{x}_t$. An attention function is computed based on $\{\tilde{\mf{h}}_i\}$ and $\mf{h}_t$ as $\alpha_i$ (Equation~\ref{eq:attention}), and the context vector is computed as $\mf{c} = \sum_{i}{\alpha_i \tilde{\mf{h}}_i}$ (Equation~\ref{eq:context-vector}). The copy mechanism first computes the \ti{generation probability} $p_{\text{gen}} \in [0, 1]$, which controls the probability that it generates a word from the full vocabulary $\mathcal{V}$ (rather than copying a word), as:
\begin{equation}
p_{\text{gen}} = \sigma\left({\mf{w}^{(c)}}^{\intercal}\mf{c} + {\mf{w}^{(x)}}^{\intercal}\mf{x}_t + {\mf{w}^{(h)}}^{\intercal}\mf{h}_t + b\right).
\end{equation}
The final probability distribution for generating word $w$ is computed as:
\begin{equation}
P(w) = p_{\text{gen}}P_{\text{vocab}}(w) + (1 - p_{\text{gen}})\sum_{i: w_i = w}\alpha_i,
\end{equation}
where $P_{\text{vocab}}(w)$ is the original probability distribution (computed based on $\mf{c}$ and $\mf{h}_t$) and $\{w_i\}$ refers to all the words in the passage and the dialogue history. For more details, we refer readers to \cite{see2017get}.

\subsection{Reading Comprehension Models}

The second class of models we evaluate is neural reading comprehension models. In particular, models for the span prediction problem cannot be applied directly, as a large portion of the \sys{CoQA} questions do not have a single span in the passage as their answer, e.g., $Q_3$, $Q_4$ and $Q_5$ in Figure~\ref{fig:coqa-example}. Therefore, we modified the \sys{Stanford Attentive Reader} model we described in Section~\ref{sec:sar} for this problem. Since the model requires text spans as answers during training, we select the span which has the highest lexical overlap (F1 score) with the original answer as the gold answer. If the answer appears multiple times in the story, we use the rationale to find the correct one. If any answer word does not appear in the passage, we fall back to an additional \textit{unknown} token as the answer (about 17\%). We prepend each question with its past questions and answers to account for the conversation history, similar to the conversational models.

\subsection{A Hybrid Model}

The last model we build is a \ti{hybrid} model, which combines the advantages of the two aforementioned models. The reading comprehension models can predict a text span as an answer, but they cannot produce answers that do not overlap with the passage. Therefore, we combine \sys{Stanford Attentive Reader} with \sys{PGNet} to address this problem, since \sys{PGNet} can generate free-form answers effectively. In this hybrid model, we use the reading comprehension model to first point to the answer evidence in the text, and \sys{PGNet} then naturalizes the evidence into the final answer. For example, for Q$_5$ in Figure~\ref{fig:coqa-example}, we expect that the reading comprehension model first predicts the rationale R$_5$ \ti{Her granddaughter Annie was coming over in the afternoon and Jessica was very excited to see her. Her daughter Melanie and Melanie's husband Josh were coming as well.}, and that \sys{PGNet} then generates A$_5$ \ti{Annie, Melanie and Josh} from R$_5$.

We make a few changes to both models based on empirical performance. For the \sys{Stanford Attentive Reader} model, we only use rationales as answers for the questions with a non-extractive answer. For \sys{PGNet}, we only provide the current question and the span predictions from the \sys{Stanford Attentive Reader} model as input to the encoder. During training, we feed the oracle spans into \sys{PGNet}.

================================================
FILE: chapters/coqa/related_work.tex
================================================
%!TEX root = ../../thesis.tex

\section{Related Work}
\label{sec:coqa-rw}

Conversational question answering is directly related to \tf{dialogue}. Building conversational agents, or dialogue systems, to converse with humans in natural language is one of the major goals of natural language understanding. The two most common classes of dialogue systems are \ti{task-oriented} and \ti{chit-chat} (or \ti{chatbot}) dialogue agents.
Task-oriented dialogue systems are designed for a particular task and set up to have short conversations (e.g., booking a flight or making a restaurant reservation). They are evaluated based on task-completion rate or time to task completion. In contrast, chit-chat dialogue systems are designed for extended, casual conversations, without a specific goal. Usually, the longer the user engagement and interaction, the better these systems are.

Answering questions is also a core task of dialogue systems, because one of the most common reasons for humans to interact with dialogue agents is to seek information and ask questions about various topics. QA-based dialogue techniques have been developed extensively in automated personal assistant systems such as Amazon's \sys{Alexa}, Apple's \sys{Siri} or \sys{Google Assistant}, based on either structured knowledge bases or unstructured text collections. Modern dialogue systems are mostly built on top of deep neural networks. For a comprehensive survey of neural approaches to different types of dialogue systems, we refer readers to \cite{gao2018neural}.

\begin{figure}[!t]
\center
\includegraphics[scale=0.45]{img/other_coqa_tasks.pdf}
\longcaption{Other conversational question answering tasks on images and KBs}{\label{fig:other-coqa-tasks}Other conversational question answering tasks on images (left) and KBs (right). Images courtesy: \cite{das2017visual} and \cite{guo2018dialog} with modifications.}
\end{figure}

Our work is closely related to the \ti{Visual Dialog} task of \cite{das2017visual} and the \ti{Complex Sequential Question Answering} task of \cite{saha2018complex}, which perform conversational question answering on images and a knowledge graph (e.g., \sys{WikiData}) respectively, with the latter focusing on questions obtained by paraphrasing templates. Figure~\ref{fig:other-coqa-tasks} demonstrates an example from each task. We focus on conversations over a passage of text, a setting which requires the ability of reading comprehension.

Another related line of research is \ti{sequential question answering}~\cite{iyyer2017search,talmor2018web}, in which a complex question is decomposed into a sequence of simpler questions. For example, the question \ti{What super hero from Earth appeared most recently?} can be decomposed into the following three questions: 1) \ti{Who are all of the super heroes?}, 2) \ti{Which of them come from Earth?}, and 3) \ti{Of those, who appeared most recently?}. Therefore, their focus is on answering a complex question via sequential question answering, while we are more interested in natural conversations over a variety of topics, in which the questions can depend on the dialogue history.

================================================
FILE: chapters/openqa/evaluation.tex
================================================
%!TEX root = ../../thesis.tex

\section{Evaluation}
\label{sec:drqa-eval}

Having described all the basic elements of our \sys{DrQA} system, let us now take a look at its evaluation.

\subsection{Question Answering Datasets}

The first question is which question answering datasets we should evaluate on. As we discussed, \sys{SQuAD} is one of the largest general-purpose QA datasets currently available, but it is very different from the open-domain QA setting. We propose to train and evaluate our system on other datasets developed for open-domain QA that have been constructed in different ways.
We hence adopt the following three datasets:

\paragraph{TREC} This dataset is based on the benchmarks from the TREC QA tasks that have been curated by \newcite{baudivs2015modeling}. We use the large version, which contains a total of 2,180 questions extracted from the datasets from TREC 1999, 2000, 2001 and 2002.\footnote{This dataset is available at \url{https://github.com/brmson/dataset-factoid-curated}.} Note that for this dataset, all the answers are written as regular expressions; for example, the answer to the question \ti{When is Fashion week in NYC?} is \texttt{Sept(ember)?|Feb(ruary)?}, so the answers \ti{Sept}, \ti{September}, \ti{Feb}, and \ti{February} are all judged as correct.

\paragraph{WebQuestions} Introduced in \newcite{berant2013semantic}, this dataset is built to answer questions from the Freebase KB. It was created by crawling questions through the \sys{Google Suggest} API, and then obtaining answers using Amazon Mechanical Turk. We convert each answer to text by using entity names so that the dataset does not reference Freebase IDs and is purely made of plain-text question-answer pairs.

\paragraph{WikiMovies} This dataset, introduced in \newcite{miller2016key}, contains 96k question-answer pairs in the domain of movies. Originally created from the \sys{OMDb} and \sys{MovieLens} databases, the examples are built such that they can also be answered by using a subset of Wikipedia as the knowledge source (the title and the first section of articles from the movie domain).

We would like to emphasize that these datasets were not necessarily collected in the context of answering from Wikipedia. The \sys{TREC} dataset was designed for text-based question answering (the primary TREC document sets consist mostly of newswire articles), while \sys{WebQuestions} and \sys{WikiMovies} were mainly collected for knowledge-based question answering. We put all these resources in one unified framework, and test how well our system can answer all the questions --- hoping that it can reflect the performance of general-knowledge QA.

Table~\ref{tab:qa-data-stats} and Figure~\ref{fig:qa-data-stats} give detailed statistics of these QA datasets. As we can see, the distribution of \sys{SQuAD} examples is quite different from that of the other QA datasets. Due to the construction method, \sys{SQuAD} has longer questions (10.4 tokens vs 6.7--7.5 tokens on average). Also, all these datasets have short answers (although the answers in \sys{SQuAD} are slightly longer) and most of them are factoid. Note that there might be multiple answers to many of the questions in these QA datasets (see the \ti{\# answers} column of Table~\ref{tab:qa-data-stats}). For example, there are two valid answers, \ti{English} and \ti{Urdu}, to the question \ti{What language do people speak in Pakistan?} on \sys{WebQuestions}. As our system is designed to return one answer, our evaluation considers the prediction as correct if it gives any of the gold answers.

\begin{figure}[h]
\center
\includegraphics[scale=0.7]{img/qa_stat.png}
\longcaption{The average length of questions and answers in our QA datasets}{\label{fig:qa-data-stats}The average length of questions and answers in our QA datasets.
All the statistics are computed based on the training sets.}
\end{figure}

\begin{table}[t]
\begin{center}
\begin{tabular}{l | r r | r | r}
\toprule
\tf{Dataset} & \tf{\# Train} & \tf{\# DS Train} & \tf{\# Test} & \tf{\# answers} \\
\midrule
\sys{SQuAD} & 87,599 & 71,231 & N/A & 1.0 \\
\midrule
\sys{TREC} & 1,486$^{\dagger}$ & 3,464 & 694 & 3.2\footnote{As all the answer strings are regex expressions, it is difficult to estimate the \# of answers. We simply list the number of alternation symbols \texttt{|} in the answer.} \\
\sys{WebQuestions} & 3,778$^{\dagger}$ & 4,602 & 2,032 & 2.4 \\
\sys{WikiMovies} & 96,185$^{\dagger}$ & 36,301 & 9,952 & 1.9 \\
\bottomrule
\end{tabular}
\end{center}
\longcaption{Statistics of the QA datasets used for \sys{DrQA}.}{\label{tab:qa-data-stats} Statistics of the QA datasets used for \sys{DrQA}. DS Train: distantly supervised training data. $^{\dagger}$: These training sets are not used as is because no passage is associated with each question.}
\end{table}

\subsection{Implementation Details}

\subsubsection{Processing Wikipedia}

We use the 2016-12-21 dump\footnote{\url{https://dumps.wikimedia.org/enwiki/latest}} of English Wikipedia for all of our full-scale experiments as the knowledge source used to answer questions. For each page, only the plain text is extracted and all structured data sections such as lists and figures are stripped.\footnote{We use the WikiExtractor script: \url{https://github.com/attardi/wikiextractor}.} After discarding internal disambiguation, list, index, and outline pages, we retain 5,075,182 articles consisting of 9,008,962 unique uncased token types.

\subsubsection{Distantly-supervised data}

To build our distantly-supervised training examples, we use the following process for each question-answer pair from the training portion of each dataset. First, we run our \sys{Document Retriever} on the question to retrieve the top 5 Wikipedia articles. All paragraphs from those articles without an exact match of the known answer are directly discarded. All paragraphs shorter than 25 or longer than 1500 characters are also filtered out. If any named entities are detected in the question, we remove any paragraph that does not contain them at all. For every remaining paragraph in each retrieved page, we score all positions that match an answer using unigram and bigram overlap between the question and a 20-token window, keeping up to the top 5 paragraphs with the highest overlap. If there is no paragraph with non-zero overlap, the example is discarded; otherwise we add each found pair to our DS training dataset. Some examples are shown in Figure~\ref{fig:ds_examples} and the number of distantly supervised examples we created for training is given in Table~\ref{tab:qa-data-stats} (column \ti{\# DS Train}).

\begin{figure}
\begin{center}
\small
\begin{tabularx}{\textwidth}{l|p{4.5cm}|p{7cm}}
\hline
\bf Dataset & \bf Example & \bf Article / Paragraph \\
\hline
\sys{TREC} & {\bf Q}: What U.S. state's motto is ``Live free or Die''? \newline {\bf A}: New Hampshire & {\bf Article}: Live Free or Die \newline {\bf Paragraph}: ``Live Free or Die'' is the official motto of the U.S. state of \hl{New Hampshire}, adopted by the state in 1945.
It is possibly the best-known of all state mottos, partly because it conveys an assertive independence historically found in American political philosophy and partly because of its contrast to the milder sentiments found in other state mottos.\\
\hline
\sys{WebQuestions} & {\bf Q}: What part of the atom did Chadwick discover?$^\dagger$ \newline {\bf A}: neutron & {\bf Article}: Atom \newline {\bf Paragraph}: ... The atomic mass of these isotopes varied by integer amounts, called the whole number rule. The explanation for these different isotopes awaited the discovery of the \hl{neutron}, an uncharged particle with a mass similar to the proton, by the physicist James Chadwick in 1932. ... \\
\hline
\sys{WikiMovies} & {\bf Q}: Who wrote the film Gigli? \newline {\bf A}: Martin Brest & {\bf Article}: Gigli \newline {\bf Paragraph}: Gigli is a 2003 American romantic comedy film written and directed by \hl{Martin Brest} and starring Ben Affleck, Jennifer Lopez, Justin Bartha, Al Pacino, Christopher Walken, and Lainie Kazan. \\
\hline
\end{tabularx}
\end{center}
\longcaption{Examples of distantly-supervised examples from QA datasets}{\label{fig:ds_examples}Example training data from each QA dataset. In each case we show an associated paragraph where distant supervision (DS) correctly identified the answer within it, which is highlighted.}
\end{figure}

\Subsection{retrieval-eval}{Document Retriever Performance}

We first examine the performance of our retrieval module on all the QA datasets. Table~\ref{tab:ir-res} compares the performance of the two approaches described in Section~\ref{sec:doc-retriever} with that of the Wikipedia Search Engine\footnote{We use the Wikipedia Search API \url{https://www.mediawiki.org/wiki/API:Search}.} for the task of finding articles that contain the answer given a question. Specifically, we compute the ratio of questions for which the text span of any of their associated answers appears in at least one of the top 5 relevant pages returned by each system. Results on all datasets indicate that our simple approach outperforms Wikipedia Search, especially with bigram hashing. We also experiment with retrieval using Okapi BM25 and using cosine distance in the word embedding space (encoding questions and articles as bags of embeddings), both of which we find perform worse.

\begin{table}[t]
\begin{center}
\normalsize
\begin{tabular}{l r r r}
\toprule
\bf Dataset & \sys{Wiki. Search} & \multicolumn{2}{c}{\sys{Document Retriever}} \\
 & & unigram & bigram \\
\midrule
% SQuAD & 62.7 & 76.1 & \bf 77.8 \\
% %\curq & 82.8 & 84.2 & \bf 85.6 \\
\sys{TREC} & 81.0 & 85.2 & \bf 86.0 \\
\sys{WebQuestions} & 73.7 & \bf 75.5 & 74.4 \\
\sys{WikiMovies} & 61.7 & 54.4 & \bf 70.3 \\
\bottomrule
\end{tabular}
\end{center}
\longcaption{Document retrieval results}{\label{tab:ir-res} Document retrieval results. \% of questions for which the answer segment appears in one of the top 5 pages returned by the method. }
\end{table}

\subsection{Final Results}
\label{sec:drqa-final-results}

Finally, we assess the performance of our full system \sys{DrQA} for answering open-domain questions using all these datasets. We compare three versions of \sys{DrQA} which evaluate the impact of using distant supervision and multitask learning across the training sources provided to \sys{Document Reader} (\sys{Document Retriever} remains the same in each case):
\begin{itemize}
\item \sys{SQuAD}: A single \sys{Document Reader} model is trained on the \sys{SQuAD} training set only and used on all evaluation sets.
We used the model that we described in Section~\ref{sec:drqa} (the F1 score is 79.0\% on the test set of \sys{SQuAD}).
\item Fine-tune (DS): A \sys{Document Reader} model is pre-trained on \sys{SQuAD} and then fine-tuned for each dataset independently using its distant supervision (DS) training set.
\item Multitask (DS): A single \sys{Document Reader} model is jointly trained on the \sys{SQuAD} training set and all the distantly-supervised examples.
\end{itemize}

For the full Wikipedia setting we use a streamlined model that does not use the \sys{CoreNLP} parsed $f_{token}$ features or lemmas for $f_{exact\_match}$. We find that while these help for more exact paragraph reading in \sys{SQuAD}, they don't improve results in the full setting. Additionally, \sys{WebQuestions} and \sys{WikiMovies} provide a list of candidate answers (1.6 million \sys{Freebase} entity strings for \sys{WebQuestions} and 76k movie-related entities for \sys{WikiMovies}) and we require that the predicted answer span be in these lists during prediction.

Table~\ref{tab:drqa-full-results} presents the results. We only consider top-1, exact-match accuracy, which is the most restricted and challenging setting. In the original paper \cite{chen2017reading}, we also evaluated the question/answer pairs in \sys{SQuAD}. We omit them here because at least a third of these questions are context-dependent and not really suitable for open-domain QA.

\begin{table}[t]
\begin{center}
\begin{tabular}{l c ccc cc}
\toprule
\textbf{Dataset} & \tf{YodaQA} & \multicolumn{3}{c}{\tf{DrQA}} & \multicolumn{2}{c}{\tf{DrQA*}} \\
 & & {SQuAD} & {FT} & {MT} & {SQuAD} & {FT} \\
\midrule
\sys{TREC} & 31.3 & 19.7 & 25.7 & 25.4 & 21.3 & 28.8 \\
\sys{WebQuestions} & 38.9 & 11.8 & 19.5 & 20.7 & 14.2 & 24.3 \\
\sys{WikiMovies} & N/A & 24.5 & 34.3 & 36.5 & 31.9 & 46.0 \\
\bottomrule
\end{tabular}
\end{center}
\longcaption{Final performance of DrQA}{\label{tab:drqa-full-results} Full Wikipedia results. Top-1 exact-match accuracy (\%). \tf{FT}: Fine-tune (DS). \tf{MT}: Multitask (DS). The \sys{DrQA*} results are taken from \newcite{raison2018weaver}.}
\end{table}

Despite the difficulty of the task compared to the reading comprehension task (where the right paragraph is given) and unconstrained QA (using redundant resources), \sys{DrQA} still provides reasonable performance across all three datasets.

We are interested in a single, full system that can answer any question using Wikipedia. The single model trained only on \sys{SQuAD} is outperformed on all the datasets by the multitask model that uses distant supervision. However, performance when training on \sys{SQuAD} alone is not far behind, indicating that task transfer is occurring. The majority of the improvement from \sys{SQuAD} to Multitask (DS), however, is likely not from task transfer, as fine-tuning on each dataset alone using DS also gives improvements, showing that it is the introduction of extra data in the same domain that helps. Nevertheless, since a single best system is our overall goal, the Multitask (DS) system is our model of choice.
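To be precise about the metric, below is a minimal Python sketch of top-1 exact-match accuracy as used above: a prediction is correct if it matches any of the gold answers, and for \sys{TREC} the gold answers are regular expressions, so a regex match is used instead. The normalization shown is a simplification of the actual evaluation script:

\begin{verbatim}
import re

def normalize(text):
    # lowercase and collapse whitespace; the actual evaluation
    # script also strips punctuation and articles
    return " ".join(text.lower().split())

def exact_match(prediction, gold_answers, regex=False):
    if regex:  # TREC: gold answers are regexes,
               # e.g. r"Sept(ember)?|Feb(ruary)?"
        return any(re.fullmatch(g, prediction, re.IGNORECASE)
                   for g in gold_answers)
    return any(normalize(prediction) == normalize(g)
               for g in gold_answers)

def accuracy(predictions, gold_answer_lists, regex=False):
    return sum(exact_match(p, g, regex) for p, g in
               zip(predictions, gold_answer_lists)) / len(predictions)
\end{verbatim}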
We compare our system to \sys{YodaQA} \cite{baudivs2015yodaqa} (an unconstrained QA system using redundant resources), giving results which were previously reported on \sys{TREC} and \sys{WebQuestions}.\footnote{The results are extracted from \href{https://github.com/brmson/yodaqa/wiki/Benchmarks}{https://github.com/brmson/yodaqa/wiki/Benchmarks}.} Despite the increased difficulty of our task, it is reassuring that our performance is not too far behind on \sys{TREC} (31.3 vs 25.4). The gap is slightly bigger on \sys{WebQuestions}, likely because this dataset was created from the specific structure of \sys{Freebase} which \sys{YodaQA} uses directly. We also include the results from an enhancement of our model named \sys{DrQA*}, presented in \newcite{raison2018weaver}. The biggest change is that this reading comprehension model is trained and evaluated directly on the Wikipedia articles instead of paragraphs (documents are on average 40 times larger than individual paragraphs). As we can see, the performance has been improved consistently on all the datasets, and the gap from \sys{YodaQA} is hence further reduced. \clearpage \begin{longtable}{l l p{12cm}} \hline (a) & \tf{Question} & What is question answering? \\ & \tf{Answer} & a computer science discipline within the fields of information retrieval and natural language processing \\ & \tf{Wiki. article} & \href{https://en.wikipedia.org/wiki/Question_answering}{Question Answering} \\ & \tf{Passage} & {\small Question Answering (QA) is \hl{a computer science discipline within the fields of information retrieval and natural language processing} (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.} \\ \hline (b) & \tf{Question} & Which state is Stanford University located in? \\ & \tf{Answer} & California \\ & \tf{Wiki. article} & \href{https://en.wikipedia.org/wiki/Stanford_Memorial_Church}{Stanford Memorial Church} \\ & \tf{Passage} & {\small Stanford Memorial Church (also referred to informally as MemChu) is located on the Main Quad at the center of the Stanford University campus in Stanford, \hl{California}, United States. It was built during the American Renaissance by Jane Stanford as a memorial to her husband Leland. Designed by architect Charles A. Coolidge, a protégé of Henry Hobson Richardson, the church has been called "the University's architectural crown jewel".} \\ \hline (c) & \tf{Question} & Who invented LSTM? \\ & \tf{Answer} & Sepp Hochreiter \& J\"urgen Schmidhuber \\ & \tf{Wiki. article} & \href{https://en.wikipedia.org/wiki/Deep_learning}{Deep Learning} \\ & \tf{Passage} & {\small Today, however, many aspects of speech recognition have been taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by \hl{Sepp Hochreiter \& J\"urgen Schmidhuber} in 1997. LSTM RNNs avoid the vanishing gradient problem and can learn ``Very Deep Learning'' tasks that require memories of events that happened thousands of discrete time steps ago, which is important for speech. In 2003, LSTM started} \\ & & {\small to become competitive with traditional speech recognizers on certain tasks. Later it was combined with CTC in stacks of LSTM RNNs. 
In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49\% through CTC-trained LSTM, which is now available through Google Voice to all smartphone users, and has become a show case of deep learning.} \\
\hline
(d) & \tf{Question} & What is the answer to life, the universe, and everything? \\
 & \tf{Answer} & 42 \\
 & \tf{Wiki. article} & \href{https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy}{Phrases from The Hitchhiker's Guide to the Galaxy} \\
 & \tf{Passage} & {\small The number 42 and the phrase, "Life, the universe, and everything" have attained cult status on the Internet. "Life, the universe, and everything" is a common name for the off-topic section of an Internet forum and the phrase is invoked in similar ways to mean "anything at all". Many chatbots, when asked about the meaning of life, will answer "42". Several online calculators are also programmed with the Question. Google Calculator will give the result to "the answer to life the universe and everything" as 42, as will Wolfram's Computational Knowledge Engine. Similarly, DuckDuckGo also gives the result of "the answer to the ultimate question of life, the universe and everything" as \hl{42}. In the online community Second Life, there is a section on a sim called "42nd Life." It is devoted to this concept in the book series, and several attempts at recreating Milliways, the Restaurant at the End of the Universe, were made.} \\
\hline
\longcaption{Sample predictions of our \sys{DrQA} system}{\label{tab:drqa-output}Sample predictions of our \sys{DrQA} system.}
\end{longtable}

Lastly, our \sys{DrQA} system is open-sourced at \href{https://github.com/facebookresearch/DrQA}{https://github.com/facebookresearch/DrQA} (the Multitask (DS) system was deployed). Table~\ref{tab:drqa-output} lists some sample predictions that we tried ourselves (these questions are not in any of the datasets). As can be seen, our system is able to return a precise answer to all these factoid questions, and answering some of them is not trivial:
\begin{enumerate}[(a)]
\item It is not trivial to identify that \ti{a computer science discipline within the fields of information retrieval and natural language processing} is the complete noun phrase and the correct answer, although the question is pretty simple.
\item Our system finds the answer in another Wikipedia article, \ti{Stanford Memorial Church}, and gives exactly the correct answer \ti{California} as the \ti{state} (instead of \ti{Stanford} or \ti{United States}).
\item To get the correct answer, the system needs to understand the syntactic structure of both the question \ti{Who invented LSTM?} and the context \ti{a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter \& J\"urgen Schmidhuber in 1997}.
\end{enumerate}

Conceptually, our system is simple and elegant, and doesn't rely on any additional linguistic analysis, or external or hand-coded resources (e.g., dictionaries). We think this approach holds great promise for a new generation of open-domain question answering systems. In the next section, we discuss current limitations and possible directions for further improvement.
================================================
FILE: chapters/openqa/future.tex
================================================
%!TEX root = ../../thesis.tex

\section{Future Work}
\label{sec:openqa-future}

Our \sys{DrQA} demonstrates that combining information retrieval and neural reading comprehension is an effective approach for open-domain question answering. We hope that our work takes the first step in this research direction. However, our system is still at an early stage and many implementation details can be further improved. We think the following research directions will (greatly) improve our \sys{DrQA} system and should be pursued as future work. Indeed, some of these ideas were already implemented in the year following the publication of our \sys{DrQA} system, and we also describe them in detail in this section.

\paragraph{Aggregating evidence from multiple paragraphs.} Our system adopted the simplest and most straightforward approach: we took the argmax over the unnormalized scores of all the retrieved passages. This is not ideal because 1) it implies that each passage must contain the correct answer (as in the \sys{SQuAD} examples), so our system will output one and only one answer for each passage, which is indeed not the case for most retrieved passages; and 2) our current training paradigm doesn't guarantee that the scores from different passages are comparable, which causes a gap between the training and the evaluation process. Training on full Wikipedia articles is one solution to alleviate this problem (see the \sys{DrQA*} results in Table~\ref{tab:drqa-full-results}); however, these models run slowly and are difficult to parallelize. \newcite{clark2018simple} proposed to perform multi-paragraph training with modified training objectives, where the span start and end scores are normalized across all paragraphs sampled from the same context. They demonstrated that this works much better than training on individual passages independently. Similarly, \newcite{wang2018r} and \newcite{wang2018evidence} proposed to train an explicit passage re-ranking component on the retrieved articles: \newcite{wang2018r} implemented this in a reinforcement learning framework so the re-ranker and answer extraction components are jointly trained; \newcite{wang2018evidence} proposed a strength-based re-ranker and a coverage-based re-ranker which aggregate evidence from multiple paragraphs more directly.

\paragraph{Using more and better training data.} The second aspect which makes a big impact is the training data. Our \sys{DrQA} system only collected 44k distantly-supervised training examples from \sys{TREC}, \sys{WebQuestions} and \sys{WikiMovies}, and we demonstrated their effectiveness in Section~\ref{sec:drqa-final-results}. The system should be further improved if we can leverage more supervised training data --- either from \sys{TriviaQA}~\cite{joshi2017triviaqa} or by generating more data from other QA resources. Moreover, these distantly supervised examples inevitably suffer from noise (i.e., a paragraph may contain the answer string without actually answering the question), and \newcite{lin2018denoising} proposed a solution to de-noise these distantly supervised examples and demonstrated gains in an evaluation. We also believe that adding negative examples should improve the performance of our system substantially.
We can create negative examples using our full pipeline: leveraging the \sys{Document Retriever} module to find relevant paragraphs which do not contain the correct answer. We can also incorporate existing resources such as \sys{SQuAD 2.0}~\cite{rajpurkar2018know} into our training process, which contains curated, high-quality negative examples.

\paragraph{Making the \sys{Document Retriever} trainable.} A third promising direction that has not been fully studied yet is to employ a machine learning approach for the \sys{Document Retriever} module. Our system adopted a straightforward, non-machine-learning model, and further improvement in retrieval performance (Table~\ref{tab:ir-res}) should lead to an improvement in the full system. A training corpus for the \sys{Document Retriever} component can be collected either from other resources or from the QA data (e.g., using whether an article contains the answer to the question as a label). Joint training of the \sys{Document Retriever} and \sys{Document Reader} components will be a very desirable and promising direction for future work. Related to this, \newcite{clark2018simple} also built an open-domain question answering system\footnote{The demo is at \href{https://documentqa.allenai.org}{https://documentqa.allenai.org}.} on top of a search engine (Bing web search) and demonstrated superior performance compared to ours. We think the results are not directly comparable and the two approaches (using a commercial search engine or building an independent IR component) both have pros and cons. Building our own IR component removes the dependency on an external API call, can run faster, and can adapt easily to new domains.

\paragraph{Better \sys{Document Reader} module.} For our \sys{DrQA} system, we used the neural reading comprehension model which achieved an F1 of 79.0\% on the test set of \sys{SQuAD 1.1}. With the recent development of neural reading comprehension models (Section~\ref{sec:advances}), we are confident that if we replace our current \sys{Document Reader} model with the state-of-the-art models~\cite{devlin2018bert}, the performance of our full system will be improved as well.

\paragraph{More analysis is needed.} Another important piece of missing work is to conduct an in-depth analysis of our current systems: to understand which questions they can answer, and which they can't. We think it is important to compare our modern systems to the earlier TREC QA results under the same conditions. It will help us understand where we have made genuine progress and what techniques we can still use from the pre-deep-learning era, to build better question answering systems in the future.

Concurrent with our work, there are several works in a similar spirit to ours, including \sys{SearchQA}~\cite{dunn2017searchqa} and \sys{Quasar-T}~\cite{dhingra2017quasar}, which both collected relevant documents for trivia or \sys{Jeopardy!} questions --- the former retrieved documents from \sys{ClueWeb} using the \sys{Lucene} index and the latter used \sys{Google} search. \sys{TriviaQA}~\cite{joshi2017triviaqa} also has an open-domain setting where all the retrieved documents from Bing web search are kept. However, these datasets still focus on the task of question answering from the retrieved documents, while we are more interested in building an end-to-end QA system.
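As a concrete illustration of the multi-paragraph normalization idea discussed earlier in this section, below is a minimal Python sketch of a shared-normalization training loss, under simplifying assumptions (candidate-span logits are precomputed and there is a single gold span); this is a sketch of the idea, not the exact formulation of \newcite{clark2018simple}:

\begin{verbatim}
import numpy as np

def shared_norm_loss(span_logits, gold_paragraph, gold_span):
    # span_logits: list of 1-D arrays of candidate-span logits,
    #              one array per retrieved paragraph of a question
    # gold_paragraph, gold_span: indices of the correct answer span
    # The softmax is normalized over spans from *all* paragraphs,
    # so span scores become comparable across paragraphs.
    all_logits = np.concatenate(span_logits)
    m = all_logits.max()  # for numerical stability
    log_z = m + np.log(np.exp(all_logits - m).sum())
    gold_logit = span_logits[gold_paragraph][gold_span]
    return log_z - gold_logit  # negative log-likelihood
\end{verbatim}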
Related to this, \newcite{clark2018simple} also built an open-domain question answering system\footnote{The demo is at \href{https://documentqa.allenai.org}{https://documentqa.allenai.org}.} on top of a search engine (Bing web search) and demonstrated superior performance compared to ours. We think the results are not directly comparable, and the two approaches (using a commercial search engine or building an independent IR component) both have pros and cons. Building our own IR component removes the dependence on an external API, and the system can run faster and adapt more easily to new domains.

\paragraph{Better \sys{Document Reader} module.} For our \sys{DrQA} system, we used a neural reading comprehension model which achieved an F1 score of 79.0\% on the test set of \sys{SQuAD 1.1}. Given the recent development of neural reading comprehension models (Section~\ref{sec:advances}), we are confident that if we replace our current \sys{Document Reader} model with a state-of-the-art model~\cite{devlin2018bert}, the performance of the full system will improve as well.

\paragraph{More analysis is needed.} Another important piece of missing work is an in-depth analysis of our current systems: understanding which questions they can answer and which they cannot. We think it is important to compare our modern systems to the earlier TREC QA results under the same conditions. This will help us understand where we have made genuine progress and which techniques from the pre-deep-learning era we can still use to build better question answering systems in the future.

Concurrent to our work, there are several works in a similar spirit to ours, including \sys{SearchQA}~\cite{dunn2017searchqa} and \sys{Quasar-T}~\cite{dhingra2017quasar}, which both collected relevant documents for trivia or \sys{Jeopardy!} questions --- the former retrieved documents from \sys{ClueWeb} using the \sys{Lucene} index and the latter used \sys{Google} search. \sys{TriviaQA}~\cite{joshi2017triviaqa} also has an open-domain setting where all the documents retrieved from Bing web search are kept. However, these datasets still focus on the task of question answering from the retrieved documents, while we are more interested in building an end-to-end QA system.

================================================ FILE: chapters/openqa/intro.tex ================================================
%!TEX root = ../../thesis.tex

% \section{Introduction}

In \sys{Part I}, we described the task of reading comprehension: its formulation and development over recent years, the key components of neural reading comprehension systems, and future research directions. However, it remains unclear whether reading comprehension is merely a task for measuring language understanding abilities, or whether it can enable useful applications. In \sys{Part II}, we will answer this question and discuss our efforts at building applications which leverage neural reading comprehension as their core component.

In this chapter, we view \ti{open domain question answering} as an application of reading comprehension. Open domain question answering has been a long-standing problem in the history of NLP\@. Its goal is to build automated computer systems which are able to answer any sort of (factoid) question that humans might ask, based on a large collection of unstructured natural language documents, structured data (e.g., knowledge bases), semi-structured data (e.g., tables), or even other modalities such as images or videos. We are the first to test how well neural reading comprehension methods can perform in an open-domain QA framework. We believe that the high performance of these systems can be a key ingredient in building a new generation of open-domain question answering systems, when combined with effective information retrieval techniques.

This chapter is organized as follows. We first give a high-level overview of open domain question answering and some notable systems in its history (Section~\ref{sec:openqa-rw}). Next, we introduce an open-domain question answering system that we built called \sys{DrQA}, designed to answer questions from English Wikipedia (Section~\ref{sec:drqa}). It essentially combines an information retrieval module and the high-performing neural reading comprehension module that we described in Section~\ref{sec:sar}. We further discuss how we can improve the system by creating distantly-supervised training examples from the retrieval module. We then present a comprehensive evaluation on multiple question answering benchmarks (Section~\ref{sec:drqa-eval}). Finally, we discuss current limitations, follow-up work and future directions in Section~\ref{sec:openqa-future}.

================================================ FILE: chapters/openqa/related_work.tex ================================================
%!TEX root = ../../thesis.tex

\section{A Brief History of Open-domain QA}
\label{sec:openqa-rw}

Question answering has been one of the earliest tasks for NLP systems, dating back to the 1960s. One early system, which prefigures modern text-based question answering systems, was the \sys{Protosynthex} system of \cite{simmons1964indexing}. The system first formulated a query based on the content words in the question, retrieved candidate answer sentences based on the frequency-weighted term overlap with the question, and finally performed a dependency parse match to get the final answer. Another notable system, \sys{MURAX} \cite{kupiec1993murax}, was designed to answer general-knowledge questions over \sys{Grolier}'s on-line encyclopedia, using shallow linguistic processing and information retrieval (IR) techniques.
The interest in open domain question answering has increased since 1999, when the QA track was first included as part of the annual TREC competitions\footnote{\url{http://trec.nist.gov/data/qamain.html}}. The task was at first defined such that the systems were to retrieve small snippets of text that contained an answer to an open-domain question. It spurred a wide range of QA systems developed at the time, and the majority of these systems consisted of two stages: an IR system used to select the top $n$ documents or passages which match a query generated from the question, and a window-based word scoring system used to pinpoint likely answers. For more details, readers are referred to \cite{voorhees1999trec,moldovan2000structure}.

More recently, with the development of knowledge bases (KBs) such as \sys{Freebase}~\cite{bollacker2008freebase} and \sys{DBpedia}~\cite{auer2007dbpedia}, many innovations have occurred in the context of question answering from KBs, with the creation of resources like \sys{WebQuestions} \cite{berant2013semantic} and \sys{SimpleQuestions} \cite{bordes2015large} based on \sys{Freebase}, or on automatically extracted KBs, e.g., OpenIE triples and \sys{NELL} \cite{fader2014open}. A lot of progress has been made on knowledge-based question answering, and the major approaches are based on either semantic parsing or information extraction techniques~\cite{yao2014freebase}. However, KBs have inherent limitations (incompleteness and fixed schemas), which recently motivated researchers to return to the original setting of answering questions from raw text.

\begin{figure}[t]
\center
\includegraphics[scale=0.25]{img/deepqa.png}
\longcaption{The high-level architecture of IBM's \sys{DeepQA} used in \sys{Watson}.}{\label{fig:watson}The high-level architecture of IBM's \sys{DeepQA} used in \sys{Watson}. Image courtesy: \href{https://en.wikipedia.org/wiki/Watson_(computer)}{https://en.wikipedia.org/wiki/Watson\_(computer)}.}
\end{figure}

There are also a number of highly developed full-pipeline QA approaches using a myriad of resources, including both text collections (Web pages, Wikipedia, newswire articles) and structured knowledge bases (\sys{Freebase}, \sys{DBpedia}, etc.). A few notable systems include Microsoft's \sys{AskMSR} \cite{brill2002askmsr}, IBM's \sys{DeepQA} \cite{ferrucci2010building} and \sys{YodaQA} \cite{baudivs2015yodaqa} --- the latter of which is open source and hence reproducible for comparison purposes. \sys{AskMSR} is a search-engine-based QA system that relies on ``data redundancy rather than sophisticated linguistic analyses of either questions or candidate answers''. \sys{DeepQA} is the most representative modern question answering system, and its victory at the TV game show \sys{Jeopardy!} in 2011 received a great deal of attention. It is a very sophisticated system that consists of many different pieces in the pipeline, and it relies on unstructured information as well as structured data to generate candidate answers and to vote over evidence. A high-level architecture is illustrated in Figure~\ref{fig:watson}. \sys{YodaQA} is an open source system modeled after \sys{DeepQA}, similarly combining websites, databases and Wikipedia in particular. Comparing against these methods provides a useful datapoint for an ``upper bound'' benchmark on performance.
Finally, there are other types of question answering problems based on different types of resources, including Web tables~\cite{pasupat2015compositional}, images~\cite{antol2015vqa}, diagrams~\cite{kembhavi2017you} and even videos~\cite{tapaswi2016movieqa}. We do not go into further detail, as our work focuses on text-based question answering.

Our \sys{DrQA} system (Section~\ref{sec:drqa}) focuses on question answering using Wikipedia as the unique knowledge source, such as one does when looking for answers in an encyclopedia. QA using Wikipedia as a resource has been explored previously. \newcite{ryu2014open} perform open-domain QA using a Wikipedia-based knowledge model. They combine article content with multiple other answer-matching modules based on different types of semi-structured knowledge such as infoboxes, article structure, category structure, and definitions. Similarly, \newcite{Ahn2004using} also combine Wikipedia as a text resource with other resources, in this case with information retrieval over other documents. \newcite{buscaldi2006mining} also mine knowledge from Wikipedia for QA. Instead of using it as a resource for seeking answers to questions, they focus on validating answers returned by their QA system, and use Wikipedia categories to determine a set of patterns that should fit the expected answer. In our work, we consider the comprehension of text only, and use Wikipedia text documents as the sole resource in order to emphasize the task of reading comprehension. We believe that adding other knowledge sources or information will further improve the performance of our system.

================================================ FILE: chapters/openqa/system.tex ================================================
%!TEX root = ../../thesis.tex

\section{Our System: \sys{DrQA}}
\label{sec:drqa}

\subsection{An Overview}

In the following we describe our system \sys{DrQA}, which focuses on answering questions using English Wikipedia as the unique knowledge source for documents. We are interested in building a general-knowledge question answering system which can answer any sort of factoid question whose answer is contained in and can be extracted from Wikipedia. There are several reasons why we chose to use Wikipedia: 1) Wikipedia is a constantly evolving source of large-scale, rich, detailed information that could facilitate intelligent machines. Unlike knowledge bases (KBs) such as \sys{Freebase} or \sys{DBPedia}, which are easier for computers to process but too sparsely populated for open-domain question answering, Wikipedia contains up-to-date knowledge that humans are interested in. 2) Many reading comprehension datasets (e.g., \sys{SQuAD}) are built on Wikipedia, so we can easily leverage these resources, as we will describe soon. 3) Generally speaking, Wikipedia articles are clean, high-quality and well-formed, and thus they are highly useful resources for open domain question answering.

Using Wikipedia articles as the knowledge source causes the task of question answering (QA) to combine the challenges of both large-scale open-domain QA and machine comprehension of text. In order to answer any question, one must first retrieve the few relevant articles among more than 5 million items, and then scan them carefully to identify the answer.
This is reminiscent of how classical TREC QA systems worked, but we believe that neural reading comprehension models will play a crucial role in \ti{reading} the retrieved articles/passages to obtain the final answer. As shown in Figure \ref{fig:drqa-system}, our system essentially consists of two components: (1) the \sys{Document Retriever} module for finding relevant articles and (2) a reading comprehension model, \sys{Document Reader}, for extracting answers from a single document or a small collection of documents. Our system treats Wikipedia as a collection of articles and does not rely on its internal graph structure. As a result, our approach is generic and could be switched to other collections of documents, books, or even daily updated newspapers. We detail the two components next.

\subsection{Document Retriever}
\label{sec:doc-retriever}

Following classical QA systems, we use an efficient (non-machine learning) document retrieval system to first narrow our search space and focus on reading only articles that are likely to be relevant. A simple inverted index lookup followed by term vector model scoring performs quite well on this task for many question types, compared to the built-in ElasticSearch-based Wikipedia Search API \cite{gormley2015elasticsearch}. Articles and questions are compared as TF-IDF weighted bag-of-words vectors. We further improve our system by taking local word order into account with $n$-gram features. Our best performing system uses bigram counts while preserving speed and memory efficiency, by using the hashing trick of \cite{weinberger2009feature} to map the bigrams to $2^{24}$ bins with an unsigned \emph{murmur3} hash.

We use the \sys{Document Retriever} as the first part of our full model, by setting it to return 5 Wikipedia articles given any question. Those articles are then processed by the \sys{Document Reader}.

\begin{figure}[t]
\begin{center}
\includegraphics[height=8cm]{img/drqa_system.pdf}
\end{center}
\longcaption{An overview of DrQA system}{\label{fig:drqa-system} An overview of our question answering system DrQA.}
\end{figure}
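For illustration, the following sketch approximates this kind of retrieval model with off-the-shelf scikit-learn components, whose \texttt{HashingVectorizer} also hashes $n$-grams with MurmurHash3. It is a simplification for exposition under these assumptions, not our actual implementation:

\begin{verbatim}
from sklearn.feature_extraction.text import (HashingVectorizer,
                                             TfidfTransformer)
from sklearn.metrics.pairwise import linear_kernel

def build_index(articles, n_bins=2 ** 24):
    # Hashed unigram/bigram counts, followed by TF-IDF weighting.
    hasher = HashingVectorizer(ngram_range=(1, 2), n_features=n_bins,
                               norm=None, alternate_sign=False)
    counts = hasher.transform(articles)
    tfidf = TfidfTransformer().fit(counts)
    return hasher, tfidf, tfidf.transform(counts)

def retrieve(question, hasher, tfidf, index, k=5):
    q = tfidf.transform(hasher.transform([question]))
    scores = linear_kernel(q, index).ravel()  # dot products
    return scores.argsort()[::-1][:k]         # ids of top-k articles
\end{verbatim}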
\subsection{Document Reader}

The \sys{Document Reader} takes the top 5 Wikipedia articles and aims to read all the paragraphs and extract the possible answers from them. This is exactly the setup of the span-based reading comprehension problems we studied before, and the \sys{Stanford Attentive Reader} model that we described in Section~\ref{sec:sar} can be directly plugged into this pipeline.

We apply our trained \sys{Document Reader} to each paragraph that appears in the top 5 Wikipedia articles, and it predicts an answer span with a confidence score. To make scores comparable across paragraphs in one or several retrieved documents, we use the unnormalized exponential and take the argmax over all considered paragraph spans for our final prediction. This is just a very simple heuristic, and there are better ways to aggregate evidence over different paragraphs. We discuss future work in Section~\ref{sec:openqa-future}.

\subsection{Distant Supervision}

We have built a complete pipeline which integrates a classical retrieval module and our previous neural reading comprehension component. The remaining key question is how we can train this reading comprehension module for the open-domain question answering setting. The most direct approach is to simply reuse the \sys{SQuAD} dataset~\cite{rajpurkar2016squad} as the training corpus, which was also built on top of Wikipedia paragraphs. However, this approach is limited in the following ways:
\begin{itemize}
\item As we discussed earlier in Section~\ref{sec:future-datasets}, the questions in \sys{SQuAD} were crowdsourced after the annotators saw the paragraphs, to ensure that they can be answered by a span in the passage. This distribution is quite specific and differs from that of real-world question answering, where people have a question in mind first and then try to find the answers on the Web or from other sources.
\item Many \sys{SQuAD} questions are indeed context-dependent. For example, the question \ti{What individual is the school named after?} is posed on one passage of the Wikipedia article \ti{Harvard University}, and the question \ti{What did Luther call these donations?} is based on a passage that describes \ti{Martin Luther}. These questions cannot be understood by themselves and are thus useless for open-domain QA problems. \newcite{clark2018simple} estimated that around 32.6\% of the questions in \sys{SQuAD} are either document-dependent or passage-dependent.
\item Finally, the size of \sys{SQuAD} is rather small (80k training examples). The system performance should further improve if we can collect more training examples.
\end{itemize}

To overcome these problems, we propose a procedure to automatically create additional training examples from other question answering resources. The idea is to re-use the efficient information retrieval module that we built: if we already have a question-answer pair $(q, a)$, and the retrieval module can help us find a paragraph relevant to the question $q$ in which the answer segment $a$ appears, then we can create a \ti{distantly-supervised} training example in the form of a $(p, q, a)$ triple for training the reading comprehension models:
\begin{eqnarray}
& f: (q, a) \Longrightarrow (p, q, a) \\
& \text{ if } p \in \text{ Document\_Retriever }(q) \text{ and } a \text{ appears in } p \nonumber
\end{eqnarray}

This idea is similar in spirit to the popular approach of using distant supervision (DS) for relation extraction \cite{mintz2009distant}.\footnote{For relation extraction, the idea is to pair textual mentions which contain two entities that are known to hold a relation in an existing knowledge base.} Although these examples can be noisy to some extent, this offers a cheap solution for creating distantly supervised examples for open-domain question answering, and they are a useful addition to the \sys{SQuAD} examples. We will describe the effectiveness of these distantly supervised examples in Section~\ref{sec:drqa-eval}.
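In code, this procedure amounts to a simple filter over retrieved paragraphs. The sketch below is a simplified rendering of the equation above (tokenization, paragraph splitting and the filtering heuristics we use in practice are omitted; \texttt{document\_retriever} is a hypothetical callable); it also records the character offsets of the matched answer so that span-based models can be trained:

\begin{verbatim}
def create_ds_examples(qa_pairs, document_retriever, k=5):
    examples = []
    for question, answer in qa_pairs:
        for paragraph in document_retriever(question, k=k):
            start = paragraph.find(answer)   # does a appear in p?
            if start >= 0:
                span = (start, start + len(answer))
                examples.append((paragraph, question, span))
    return examples
\end{verbatim}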
================================================ FILE: chapters/rc_future/datasets.tex ================================================
%!TEX root = ../../thesis.tex

\section{Future Work: Datasets}
\label{sec:future-datasets}

We have mostly focused on \sys{CNN/Daily Mail} and \sys{SQuAD} and demonstrated that 1) neural models are able to achieve either super-human or ceiling performance on them; and 2) although these datasets are highly useful, most of the examples are rather simple and do not yet require much reasoning. What desired properties are still missing in these datasets? What kind of datasets should we work on next? And how can we collect better datasets?

We think that datasets like \sys{SQuAD} mainly have the following limitations:
\begin{itemize}
\item The questions are \ti{posed based on the passage}. That is, if a questioner is looking at the passage while they ask a question, they are quite likely to mirror the sentence structure and to reuse the same words. This eases the difficulty of answering the questions, as many question words overlap with the passage words.
\item It only allows questions that are \ti{answerable by a single span in the passage}. This not only implies that all the questions are answerable, but also excludes many possible questions from being posed, such as \ti{yes/no} and \ti{counting} questions. As we discussed earlier, most of the questions in \sys{SQuAD} are factoid questions and the answers are generally short (3.1 tokens on average). Therefore, there are also very few \ti{why} (cause and effect) and \ti{how} (procedure) questions in the dataset.
\item Most of the questions can be answered by \ti{a single supporting sentence} in the passage and don't require multiple-sentence reasoning. \newcite{rajpurkar2016squad} estimated that only $13.6\%$ of the examples need multiple-sentence reasoning. Among them, we think that most of the cases involve resolving coreference, which might be solved by a coreference system.
\end{itemize}

To address these limitations, a number of new datasets have been collected recently. They follow a paradigm similar to \sys{SQuAD}'s but are constructed in various ways. Table~\ref{tab:recent-datasets} gives an overview of a few representative datasets. As we can see, these datasets are of a similar order of magnitude (ranging from 33k to 529k training examples), and there is still a gap between the state of the art and human performance (though some gaps are bigger than others). In the following, we describe these datasets in detail and discuss how they tackle the aforementioned limitations, along with their advantages/disadvantages:
\begin{table}[t]
\centering
\small
\begin{tabular}{l | c c c | c | c c c}
\toprule
\tf{Dataset} & \tf{\#Train} & \tf{\#Dev} & \tf{\#Test} & \tf{Domain} & \tf{Metric} & \tf{Human} & \tf{SOTA} \\
\midrule
\sys{TriviaQA} (Web) & 528,979 & 68,621 & 65,059 & Web & F1 & N/A\footnote{\newcite{joshi2017triviaqa} provided oracle scores of \ti{exact match} accuracies of 82.8\% and 83.0\% for the Web and Wikipedia domains, respectively. These numbers measure the percentage of examples for which the answer can be found in the documents, and differ from human performance.} & 71.3 \\
\sys{TriviaQA} (Wiki.)\footnote{In contrast to the Web domain of \sys{TriviaQA}, the Wikipedia domain is evaluated over questions instead of documents.} & 61,888 & 9,951 & 9,509 & { Wikipedia} & F1 & N/A & 68.9 \\
\sys{RACE} & 87,866 & 4,887 & 4,934 & Exams & Accuracy & 100.0 & 59.0 \\
\sys{NarrativeQA}\footnote{We only list the setting where the summaries are given.} & 32,747 & 3,461 & 10,557 & Wikipedia & ROUGE-L & 57.0 & 36.3 \\
\sys{SQuAD 2.0} & 130,319 & 11,873 & 8,862 & Wikipedia & F1 & 89.5 & 83.1 \\
\sys{HotpotQA}\footnote{We only list the ``distractor'' setting.} & 90,564 & 7,405 & 7,405 & {Wikipedia} & F1 & 91.4 & 59.0 \\
\bottomrule
\end{tabular}
\longcaption{A summary of more recent reading comprehension datasets}{\label{tab:recent-datasets}A summary of more recent reading comprehension datasets. We only show the F1 results for span-prediction tasks and ROUGE-L for free-form answer tasks. The state-of-the-art results are taken from \newcite{clark2018simple} for \sys{TriviaQA}~\cite{joshi2017triviaqa}, \newcite{radford2018improving} for \sys{RACE}~\cite{lai2017race}, \newcite{kovcisky2018narrativeqa} for \sys{NarrativeQA}, \newcite{devlin2018bert} for \sys{SQuAD 2.0}~\cite{rajpurkar2018know} and \newcite{yang2018hotpotqa} for \sys{HotpotQA}.}
\end{table}

\paragraph{TriviaQA~\cite{joshi2017triviaqa}.} The key idea of this dataset is that question/answer pairs were collected \ti{before} constructing the corresponding passages. More specifically, the authors gathered 95k question-answer pairs from trivia and quiz-league websites and collected textual evidence containing the answer from either Web search results or the Wikipedia pages of the entities mentioned in the question. As a result, they collected 650k (passage, question, answer) triples in total. This paradigm effectively solves the problem of questions being dependent on the passage, and also makes it easier to construct a large dataset cheaply. It is worth noting that the passages used in this dataset are mostly long documents (the average document length is 2,895 words, 20 times longer than that of \sys{SQuAD}), which poses a scalability challenge for existing models. However, it has a problem similar to that of the \sys{CNN/Daily Mail} dataset --- as the dataset was curated heuristically, there is no guarantee that the passage really provides the answer to the question, which affects the quality of the training data.

\paragraph{RACE~\cite{lai2017race}.} Humans' standardized tests are a proper way to evaluate machines' reading comprehension abilities. \sys{RACE} is a multiple-choice dataset collected from the English exams for middle-school and high-school Chinese students within the 12--18 age range. All the questions and answer options were created by experts. As a result, the dataset is more difficult than most existing datasets, and it was estimated that 26\% of the questions require multiple-sentence reasoning. The state-of-the-art performance is only 59\% so far (each question has 4 candidate answers).

\paragraph{NarrativeQA~\cite{kovcisky2018narrativeqa}.} This is a challenging dataset in which crowdworkers were asked to pose questions based on the plot summary of a book or a movie from Wikipedia.
The answers are free-form human-generated text; in particular, the annotators were encouraged to use their own words, and copying from the passage was not allowed in the interface. The plot summaries usually contain more characters and events and are more complex to follow than news articles or Wikipedia paragraphs. The dataset consists of two settings: one is to answer questions based on the summary (659 tokens on average), which is more similar to \sys{SQuAD}; the other is to answer questions based on the full book or movie script (62,528 tokens on average). The second setting is especially difficult, as it requires IR components to locate relevant information in the long documents. One problem with this dataset is that human agreement is low due to the free-form answers, and thus it is difficult to evaluate.

\paragraph{SQuAD 2.0~\cite{rajpurkar2018know}.} \sys{SQuAD 2.0} proposed adding 53,775 negative examples to the original \sys{SQuAD} dataset. These questions are not answerable from the passage, but they look similar to the positive ones (they are relevant, and the passage contains a plausible answer). To work well on this dataset, systems need to not only answer questions but also determine when no answer is supported by the paragraph and abstain from answering. This is an important aspect of practical applications, but it had been omitted in previous datasets.

\paragraph{HotpotQA~\cite{yang2018hotpotqa}.} This dataset aims to construct questions which require multiple supporting documents to answer. To achieve this, the crowdworkers were asked to pose questions based on two relevant Wikipedia paragraphs (where there is a hyperlink from the first paragraph of one article to the other). It also offers a new type of factoid comparison question, for which systems need to compare two entities on some shared properties. The dataset consists of two settings for evaluation --- one is called the \ti{distractor} setting, in which each question is provided with 10 passages, including the two passages used for constructing the question and 8 distractor passages retrieved from Wikipedia; the second setting uses the full Wikipedia to answer the question.

Compared to \sys{SQuAD}, these datasets either require more complex reasoning across sentences or documents, need to handle longer documents, need to generate free-form answers instead of extracting a single span, or need to predict when there is no answer in the passage. They pose new challenges, and many are still beyond the scope of existing models. We believe that these datasets will further inspire a series of modeling innovations in the future. Once our models reach the next level of performance, we will need to set out to construct even more difficult datasets to solve.

================================================ FILE: chapters/rc_future/models.tex ================================================
%!TEX root = ../../thesis.tex

\section{Future Work: Models}
\label{sec:future-models}

Next, we turn to the research directions for models in future work. We first describe the desiderata of reading comprehension models. Most of the existing work only focuses on \ti{accuracy} --- given a standard training/development/testing split of a dataset, the major goal is to get the best accuracy score on the testing set. However, we argue that there are other important aspects which have been overlooked and which we will need to work on in the future, including \ti{speed and scalability}, \ti{robustness} and \ti{interpretability}.
Lastly, we discuss what important elements are still missing from the current models for solving more difficult reading comprehension problems.

\subsection{Desiderata}

Besides \ti{accuracy} (achieving a better performance score on a standard dataset), the following desiderata are also very important for future work:

\paragraph{Speed and Scalability.} How to build faster models (for both training and inference) and how to scale to longer documents are important directions to pursue. Building faster models for training can lead to a lower turnaround time for experimentation and also enables us to train on bigger datasets. Building faster models for inference is highly useful when we deploy the models in practical applications. Also, it is unrealistic to encode a very long document (e.g., \sys{TriviaQA}) or even a full book (e.g., \sys{NarrativeQA}) using an RNN, and this still remains a severe challenge. For example, the average document length of \sys{TriviaQA} is 2,895 tokens, and the authors truncated the documents to the first 800 tokens for the sake of scalability. The average document length of \sys{NarrativeQA} is 62,528 tokens, and the authors had to first retrieve a small number of relevant passages from the story using an IR system. Existing solutions to these problems include:
\begin{itemize}
\item Replacing LSTMs with non-recurrent models such as the \sys{Transformer}~\cite{vaswani2017attention} or lighter recurrent units such as \sys{SRU}~\cite{lei2018simple}, as we discussed in Section~\ref{sec:alt-lstms}.
\item Training models which learn to skip parts of the documents, so they don't need to read all of the content. These models can run much faster while still retaining similar performance. Representative works in this line include \newcite{yu2017learning} and \newcite{seo2018neural}.
\item The choice of optimization algorithms can also greatly affect the convergence speed. Multi-GPU training and hardware performance are also important aspects to consider, but they are beyond the scope of this thesis. \newcite{coleman2017dawnbench} provide a benchmark\footnote{\href{https://dawn.cs.stanford.edu/benchmark/}{https://dawn.cs.stanford.edu/benchmark/}} which measures the end-to-end training and inference time needed to achieve a state-of-the-art accuracy level for a wide range of tasks, including \sys{SQuAD}.
\end{itemize}

\paragraph{Robustness.} We discussed in Section~\ref{sec:squad-errors} that existing models are very brittle to adversarial examples, which will become a severe problem when we deploy these models in the real world. Moreover, most of the current works follow the standard paradigm of training and evaluating on the splits of one dataset. It is known that if we train our models on one dataset and evaluate them on another, the performance will drop dramatically due to the datasets' different text sources and construction methods. For future work, we need to consider:
\begin{itemize}
\item How to create better adversarial training examples and incorporate them into the training process.
\item More research on transfer learning and multi-task learning, so that we can build models with high performance across various datasets.
\item We might need to break the standard paradigm of supervised learning and think about how to create better ways of evaluating our current models, for the sake of building more robust models.
\end{itemize}

\paragraph{Interpretability.} The last important aspect is \ti{interpretability}, which has been mostly ignored in current systems.
Our future systems should not only be able to provide the final answers, but also provide the rationales behind their predictions, so that users can decide whether they can trust the outputs and leverage them. Neural networks are especially notorious for the fact that the end-to-end training paradigm makes these models a black box, and it is hard to interpret their predictions. This is especially crucial if we want to apply these systems to medical or legal domains.

Interpretability can have different definitions. In our context, we think there are several ways to approach it:
\begin{itemize}
\item The easiest way is to require the models to learn to extract pieces of the input documents as supporting evidence. This has been studied before (e.g., \cite{lei2016rationalizing}) for sentence classification problems, but not yet for reading comprehension problems.
\item A more complex way is for the models to actually generate rationales. Instead of only highlighting the relevant pieces of information in the passage, the models would need to explain how these pieces are connected and lead to the answer. Taking Figure~\ref{fig:sar-squad-errors} (c) as an example, the system would need to explain that the two cities are the two largest and that, since 3.7 million is bigger than 1.3 million, the city with 1.3 million people is the second largest. We think this desideratum is very important but far beyond the scope of current models.
\item Finally, another important aspect to consider is what training resources we can obtain to approach this level of interpretability. Inferring rationales from the final answers is feasible but quite difficult. We should consider collecting human explanations as supervision for training rationales in the future.
\end{itemize}

\subsection{Structures and Modules}

In this section, we discuss which elements are missing from the current models if we want to solve more difficult reading comprehension problems.

First of all, current models either are built on sequence models or tackle all pairs of words symmetrically (e.g., the \sys{Transformer}), and they omit the inherent structure of language. On the one hand, this forces our models to learn all the relevant linguistic information from scratch, which makes learning more difficult. On the other hand, the NLP community has put a lot of effort into studying linguistic representation tasks (e.g., syntactic parsing, coreference) and has built many linguistic resources and tools over the years. Language encodes meaning in terms of hierarchical, nested structures on sequences of words. Would it still be useful to encode linguistic structures more explicitly in our reading comprehension models? Figure~\ref{fig:corenlp-output} illustrates the \sys{CoreNLP}~\cite{manning2014stanford} output for several examples in \sys{SQuAD}. We believe that this structural information would be useful as follows:
\begin{enumerate}[(a)]
\item The information that \ti{2,400} is a \ti{numeric modifier} of \ti{professors} should help answer the question \ti{What is the total number of professors, instructors, and lecturers at Harvard?} (We have seen this example as an error case in Figure~\ref{fig:sar-squad-errors}).
\item The coreference information that \ti{it} refers to \ti{Harvard} should help answer the question \ti{Starting in what year has Harvard topped the Academic Rankings of World Universities?}.
\end{enumerate}
Therefore, we think that such linguistic knowledge and structures would still be a useful addition to the current models.
The remaining questions that we need to answer are: 1) What are the best ways to incorporate these structures into sequence models? 2) Do we want to model the structures as latent variables or rely on off-the-shelf linguistic tools? In the latter case, are the current tools good enough that the models can benefit more than they suffer from noise? And can we further improve the performance of these representation tasks?

\begin{figure}[t]
\center
(a) \includegraphics[scale=0.20]{img/dep_example.png}
(b) \includegraphics[scale=0.42]{img/coref_example.png}
\longcaption{Example output of \sys{CoreNLP}: dependencies and coreference}{\label{fig:corenlp-output} Example output of \sys{CoreNLP}: (a) dependencies and (b) coreference. The image is taken from \href{http://corenlp.run}{http://corenlp.run}.}
\end{figure}

Another aspect we think is still missing from most existing models is \ti{modules}. The task of reading comprehension is inherently very complex, and different types of examples require different types of reasoning capabilities. It remains a grand challenge to learn everything through one giant neural network (this is reminiscent of why the attention mechanism was proposed: we don't want to squash the meaning of a sentence or a paragraph into a single vector!). We believe that, if we want to approach a deeper level of reading comprehension, our future models will be more structured and more modularized: solving one comprehensive task can be decomposed into many subproblems, and we can tackle each smaller subproblem (e.g., each reasoning type) separately and combine all of them in the end.

The idea of \ti{modules} has been implemented before in \sys{Neural Module Networks (NMN)} \cite{andreas2016learning}. They first perform a dependency parse of the question, and then decompose the question answering problem into several ``modules'' based on the parse structure. One example they used for a visual question answering (VQA) task is that the question ``What color is the bird?'' can be decomposed into two modules: one module is used to detect the bird in the given image, and the other is used to detect the color of the found region (the bird). We believe that this sort of approach holds promise for answering questions such as \ti{What is the population of the second largest city in California?} (Figure~\ref{fig:sar-squad-errors} (c)). However, \sys{NMN} has only been studied on visual question answering or small knowledge-base question answering problems so far, and applying it to reading comprehension problems can be more challenging due to the flexibility of language variations and question types.

================================================ FILE: chapters/rc_future/overview.tex ================================================
%!TEX root = ../../thesis.tex

% \section{Introduction}

In the previous chapter, we described how neural reading comprehension models have succeeded on current reading comprehension benchmarks, along with their key insights. Despite this rapid progress, there is still a long way to go towards genuine human-level reading comprehension. In this chapter, we will discuss future work and open questions. We first examine the error cases of existing models in Section~\ref{sec:squad-errors}, and conclude that they still fail on ``easy'' or ``trivial'' cases despite their high accuracies on average.
As we discussed earlier, the success of recent reading comprehension is attributed to both the creation of large-scale datasets and the development of neural reading comprehension models. In the future, we believe both components will remain equally important. We discuss future work on datasets and models in Sections~\ref{sec:future-datasets} and \ref{sec:future-models}, respectively. What is still missing in the existing datasets and models? How can we approach that? Finally, we review several important research questions in this field in Section~\ref{sec:research-questions}.

\section{Is SQuAD Solved Yet?}
\label{sec:squad-errors}

Although we have already achieved super-human performance on the \sys{SQuAD} dataset, does this indicate that our reading comprehension models are capable of solving all the \sys{SQuAD} examples, or any examples with the same level of difficulty? Figure~\ref{fig:sar-squad-errors} shows some failure cases of our \sys{Stanford Attentive Reader} model described in Section \ref{sec:sar}. As we can see, the model predicts the answer type perfectly for all these examples: it predicts a number for the questions \ti{what is the total number of \ldots ?} and \ti{what is the population \ldots ?}, and a team name for the question \ti{Which team won Super Bowl 50?}. However, the model fails to understand the subtleties expressed in the text and can't distinguish among the candidate answers. In more detail:
\begin{enumerate}[(a)]
\item The number \ti{2,400} modifies \ti{professors, lecturers, and instructors} while \ti{7,200} modifies \ti{undergraduates}. However, the system failed to identify that, and we believe that linguistic structures (e.g., syntactic parsing) can help resolve this case.
\item Both teams, the \ti{Denver Broncos} and the \ti{Carolina Panthers}, are modified by the word \ti{champion}, but the system failed to infer that since ``X defeated Y'', ``X won''.
\item The system predicted \ti{100,000}, probably because it is closer to the word \ti{population}. However, to answer the question correctly, the system has to identify that \ti{3.7 million} is the population of \ti{Los Angeles} and \ti{1.3 million} is the population of \ti{San Diego}, compare the two numbers, and infer that \ti{1.3 million} is the answer because San Diego is the \ti{second largest} city. This is a difficult example and probably beyond the scope of all the existing systems.
\end{enumerate}

\begin{figure}[p]
\centering
\begin{tabular}{l | p{13.5cm}}
\hline
(a) &\tf{Question}: What is the total number of professors, instructors, and lecturers at Harvard? \\
& \tf{Passage}: Harvard's \blue{2,400} professors, lecturers, and instructors instruct \red{7,200} undergraduates and 14,000 graduate students. The school color is crimson, which is also the name of the Harvard sports teams and the daily newspaper, The Harvard Crimson. The color was unofficially adopted (in preference to magenta) by an 1875 vote of the student body, although the association with some form of red can be traced back to 1858, when Charles William Eliot, a young graduate student who would later become Harvard's 21st and longest-serving president (1869--1909), bought red bandanas for his crew so they could more easily be distinguished by spectators at a regatta. \\
& \tf{Gold answer}: 2,400 \\
& \tf{Predicted answer}: 7,200 \\
\hline
(b) & \tf{Question}: Which team won Super Bowl 50? \\
& \tf{Passage}: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season.
The American Football Conference (AFC) champion \blue{Denver Broncos} defeated the National Football Conference (NFC) champion \red{Carolina Panthers} 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the ``golden anniversary'' with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as ``Super Bowl L''), so that the logo could prominently feature the Arabic numerals 50. \\
& \tf{Gold answer}: Denver Broncos \\
& \tf{Predicted answer}: Carolina Panthers \\
\hline
(c) & \tf{Question}: What is the population of the second largest city in California? \\
& \tf{Passage}: Los Angeles (at 3.7 million people) and San Diego (at \blue{1.3 million} people), both in southern California, are the two largest cities in all of California (and two of the eight largest cities in the United States). In southern California there are also twelve cities with more than 200,000 residents and 34 cities over \red{100,000} in population. Many of southern California's most developed cities lie along or in close proximity to the coast, with the exception of San Bernardino and Riverside. \\
& \tf{Gold answer}: 1.3 million \\
& \tf{Predicted answer}: 100,000 \\
\hline
\end{tabular}
\longcaption{Failure cases of our model on SQuAD}{\label{fig:sar-squad-errors}Several failure cases of our model on \sys{SQuAD}. Gold answers are marked in \blue{blue} and predicted answers are marked in \red{red}.}
\end{figure}

\begin{figure}[p]
\centering
\small
\begin{tabular}{l | p{13.5cm}}
\hline
(d) &\tf{Question}: What is the least number of members a board of trustees can have? \\
& \tf{Passage}: The Book of Discipline is the guidebook for local churches and pastors and describes in considerable detail the organizational structure of local United Methodist churches. All UM churches must have a board of trustees with at least \blue{three} members and no more than \red{nine} members and it is recommended that no gender should hold more than a 2/3 majority. All churches must also have a nominations committee, a finance committee and a church council or administrative council. Other committees are suggested but not required such as a missions committee, or evangelism or worship committee. Term limits are set for some committees but not for all. The church conference is an annual meeting of all the officers of the church and any interested members. This committee has the exclusive power to set pastors' salaries (compensation packages for tax purposes) and to elect officers to the committees. \\
& \tf{Gold answer}: three \\
& \tf{Predicted answer}: nine \\
\hline
(e) & \tf{Question}: Where does centripetal force go? \\
& \tf{Passage}: where is the mass of the object, is the velocity of the object and is the distance to the center of the circular path and is the unit vector pointing in the radial direction outwards from the center. This means that the unbalanced centripetal force felt by any object is always directed toward \blue{the center of the curving path}. Such forces act perpendicular to the velocity vector associated with the motion of an object, and therefore do not change the speed of the object (magnitude of the velocity), but only the direction of the velocity vector.
The unbalanced force that accelerates an object can be resolved into a component that is perpendicular to the path, and one that is tangential to the path. This yields both the tangential force, which accelerates the object by either slowing it down or speeding it up, and the radial (centripetal) force, which \red{changes its direction}. \\
& \tf{Gold answer}: the center of the curving path \\
& \tf{Predicted answer}: changes its direction \\
\hline
(f) & \tf{Question}: How many times have the Panthers been in the Super Bowl? \\
& \tf{Passage}: The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP). They defeated the Arizona Cardinals 49–15 in the NFC Championship Game and advanced to their \blue{second} Super Bowl appearance since the franchise was founded in 1995. The Broncos finished the regular season with a 12–4 record, and denied the New England Patriots a chance to defend their title from Super Bowl XLIX by defeating them 20–18 in the AFC Championship Game. They joined the Patriots, Dallas Cowboys, and Pittsburgh Steelers as one of four teams that have made \red{eight} appearances in the Super Bowl. \\
& \tf{Gold answer}: second \\
& \tf{Predicted answer}: eight \\
\hline
\end{tabular}
\longcaption{Failure cases of the currently best model (\sys{BERT} ensemble) on SQuAD}{\label{fig:bert-squad-errors}Several failure cases of the currently best model (a \sys{BERT} ensemble) on \sys{SQuAD}. Gold answers are marked in \blue{blue} and predicted answers are marked in \red{red}.}
\end{figure}

\begin{figure}[!h]
\centering
\begin{tabular}{p{13.5cm}}
\hline
\tf{Question}: What is the name of the quarterback who was 38 in Super Bowl XXXIII? \\
\tf{Passage}: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by \blue{John Elway}, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager. \ti{Quarterback \red{Jeff Dean} had jersey number 37 in Champ Bowl XXXIV.} \\
\hline
\end{tabular}
\longcaption{An adversarial example used in~\cite{jia2017adversarial}}{\label{fig:adversarial-example}An adversarial example used in~\cite{jia2017adversarial}, where a distracting sentence is added to the end of the passage (italicized). \blue{Blue}: the correct answer; \red{red}: the predicted answer.}
\end{figure}

We also took a closer look at the predictions of the best \sys{SQuAD} model so far --- an ensemble of 7 \sys{BERT} models \cite{devlin2018bert}. As demonstrated in Figure~\ref{fig:bert-squad-errors}, this strong model still makes simple mistakes that humans would hardly make. It is fair to conjecture that these models have been doing very sophisticated matching of text while still having difficulty understanding the inherent structure of the entities and events expressed in the text.

Lastly, \newcite{jia2017adversarial} find that if we add a distracting sentence to the end of the passage (see an example in Figure~\ref{fig:adversarial-example}), the average performance of current reading comprehension systems drops drastically from 75.4\% to 36.4\%. These distracting sentences have word overlap with the question, but they do not actually contradict the correct answer and do not mislead human understanding.
The performance is even worse if the distracting sentence is allowed to be an ungrammatical sequence of words. These results suggest that: 1) the current models strongly depend on the lexical cues between the passage and the question, which is why the distracting sentences can be so destructive; and 2) even though the models achieve high accuracy on the original development set, they are really not robust to adversarial examples. This is a critical problem of the standard supervised learning paradigm, and it makes existing models difficult to deploy in the real world. We will discuss the property of robustness more in Section~\ref{sec:future-models}.

To sum up, we believe that, although very high accuracy has already been obtained on the \sys{SQuAD} dataset, the current models only focus on the surface-level information of the text and still make simple errors when it comes to a (slightly) deeper level of understanding. On the other hand, the high accuracies also indicate that most of the \sys{SQuAD} examples are rather easy and require little understanding. There are difficult examples which require complex reasoning in \sys{SQuAD} (for example, (c) in Figure~\ref{fig:sar-squad-errors}), but due to their scarcity, their accuracies aren't really reflected in the averaged metric. Furthermore, the high accuracies only hold when the training and development data come from the same distribution, and generalizing across distributions remains a severe problem. In the next two sections, we discuss the possibilities of creating more challenging datasets and building more effective models.

================================================ FILE: chapters/rc_future/questions.tex ================================================
%!TEX root = ../../thesis.tex

\section{Research Questions}
\label{sec:research-questions}

In this last section, we discuss a few central research questions in this field, which remain open and are yet to be answered in the future.

\subsection{How to Measure Progress?}

The first question is: \ti{How can we measure the progress of this field?} The evaluation metrics are certainly clear indicators of progress on our reading comprehension benchmarks. But does this indicate that we are making real progress on reading comprehension in general? How can we tell whether progress on one benchmark generalizes to others? What if model $A$ works better than model $B$ on one dataset, while model $B$ works better on another dataset? And how can we tell how far these computer systems still are from genuine human-level reading comprehension?

On the one hand, we think that taking humans' standardized tests could be a good strategy for evaluating the performance of machine reading comprehension systems. These questions are usually carefully curated and designed to test human reading comprehension abilities at different levels. Getting computer systems aligned with human measurements is a sensible way to build natural language understanding systems.
% {\red{TODO: Not always correct --- some questions are easy for humans to answer but difficult for machines}}.

On the other hand, we believe that it would be desirable to integrate many reading comprehension datasets as a testing suite for evaluation in the future, instead of testing on one single dataset only. This will help us better distinguish what is genuine progress on reading comprehension and what might just be overfitting to one specific dataset.
More importantly, we need to understand our existing datasets better: characterizing their quality and what skills are required to answer the questions. This will be a crucial step for building more challenging datasets and analyzing the behavior of our models.

Besides our work on analyzing the \sys{CNN/Daily Mail} examples in \newcite{chen2016thorough}, \newcite{sugawara2017evaluation} attempted to separate reading comprehension skills into two disjoint sets: \ti{prerequisite skills} and \ti{readability}. Prerequisite skills measure the different types of reasoning and knowledge required to answer the question, and 13 skills are defined: object tracking, mathematical reasoning, coreference resolution, logical reasoning, analogy, causal relation, spatiotemporal relation, ellipsis, bridging, elaboration, meta-knowledge, schematic clause relation and punctuation. Readability measures the ``text ease of processing'', and a wide range of linguistic features/human readability measurements are used. The authors concluded that these two sets are weakly correlated, and that it is possible to design difficult questions from contexts that are easy to read. These studies suggest that we could construct datasets and develop models based on these properties separately.

In addition, \newcite{sugawara2018what} designed a few simple filtering heuristics and divided the examples of many existing datasets into a hard subset and an easy subset, based on 1) whether the question can be answered using only the first few words; and 2) whether the answer is contained in the passage sentence most similar to the question. They observed that the baseline performances on the hard subsets degrade remarkably compared to those on the entire datasets.
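As a concrete illustration, the second heuristic can be implemented in a few lines. The sketch below is our own rendering, not the authors' code; it labels an example as ``easy'' when the answer occurs in the passage sentence with the highest word overlap with the question:

\begin{verbatim}
def is_easy(question, sentences, answer):
    # sentences: the passage split into sentence strings.
    q_words = set(question.lower().split())
    def overlap(sentence):
        return len(q_words & set(sentence.lower().split()))
    most_similar = max(sentences, key=overlap)
    return answer.lower() in most_similar.lower()
\end{verbatim}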
Moreover, \newcite{kaushik2018how} analyzed the performance of existing models using passage-only or question-only information, and found that these models sometimes work surprisingly well; hence, annotation artifacts exist in some of the existing datasets.

In conclusion, we believe that if we want to make steady progress on reading comprehension in the future, we will have to answer these basic questions about the difficulty of examples first. Understanding what the datasets require, and what our current systems can and cannot do, will help us identify the challenges we are facing and measure the progress.

\subsection{Representations vs. Architecture: Which is More Important?}
\label{sec:rep-vs-arch}

\begin{figure}[!t]
\center
\includegraphics[scale=0.45]{img/rep_vs_arch.pdf}
\longcaption{A comparison of a complex architecture vs. a simple architecture with pre-training}{\label{fig:rep-vs-arch}A comparison of a complex architecture (left) vs. a simple architecture with pre-training (right). The parameters in the dashed box can be pre-trained from unlabeled text, while all the remaining parameters are initialized randomly and learned from the reading comprehension datasets.}
\end{figure}

The second important question is to understand the roles of representations and architectures in the performance of reading comprehension models. Since \sys{SQuAD} was created, there has been a trend of increasing the complexity of neural architectures. In particular, more and more complex attention mechanisms have been proposed to capture the similarity between the passage and the question (Section~\ref{sec:attention-mechanisms}). However, recent works~\cite{radford2018improving,devlin2018bert} showed that if we can pretrain a deep language model on large text corpora, a simple model which takes the concatenation of the question and the passage, without modeling any direct interactions between the two, can work extremely well on reading comprehension datasets such as \sys{SQuAD} and \sys{RACE} (see Table~\ref{tab:squad-results} and Table~\ref{tab:recent-datasets}).

As illustrated in Figure~\ref{fig:rep-vs-arch}, the first class of models (left) builds only on top of word embeddings (each word type has a vector representation) pre-trained from unlabeled text, while all the remaining parameters (including all the weights used to compute the various attention functions) need to be learned from the limited training data. The second class of models (right) makes the model architecture very simple: it models the question and passage as a single sequence. The whole model is pre-trained and all the parameters are kept. Only a few new parameters are added (e.g., the parameters for predicting the start and end positions for \sys{SQuAD}), and all the parameters are then fine-tuned on the training set of the reading comprehension task.

We think these two classes of models represent two extremes. On the one hand, this certainly demonstrates the incredible power of unsupervised representations. When we have a powerful language model pre-trained on a huge amount of text, the model already encodes a great deal of the properties of language, and a simple model which concatenates the passage and the question is sufficient to learn the dependency between the two. On the other hand, when only word embeddings are given, it seems that modeling the interactions between the passage and the question carefully (or giving the model more prior knowledge) helps. In the future, we suspect that we will need to combine the two, as a model like \sys{BERT} may be too coarse to handle the examples which require complex reasoning.

\subsection{How Many Training Examples Are Needed?}

The third question is: \ti{how many training examples are actually needed?} We have discussed many times that the success of neural reading comprehension has been driven by large-scale supervised datasets. All the datasets that we have been actively working on contain at least 50,000 examples. Can we always embrace data abundance and further improve the performance of our systems? Is it possible to train a neural reading comprehension model with only hundreds of annotated examples today?

We think there isn't a clear answer yet. On the one hand, there is clear evidence that having more data helps. \newcite{bajgar2016embracing} demonstrated that inflating the cloze-style training data constructed from books available through Project Gutenberg can provide a boost of 7.4\%--14.8\% on the \sys{Children Book Test (CBT)} dataset~\cite{hill2016goldilocks} using the same model. We discussed before that using data augmentation techniques~\cite{yu2018qanet} or augmenting the training data with \sys{TriviaQA} can help improve the performance on \sys{SQuAD} (\# training examples = 87,599). On the other hand, pre-trained (language) models~\cite{radford2018improving,devlin2018bert} can help us reduce the dependence on large-scale datasets. In these models, most of the parameters are already pretrained on abundant unlabeled data and are only fine-tuned during training. In the future, we should encourage more research on unsupervised learning and transfer learning.
Leveraging unlabeled data (e.g., text) or other cheap resources or supervision (e.g., datasets like \sys{CNN/Daily Mail}) will relieve us of the need to collect expensive annotated data. We should also seek better and cheaper ways of collecting supervised datasets.

================================================
FILE: chapters/rc_models/advances.tex
================================================

%!TEX root = ../../thesis.tex

\section{Further Advances}
\label{sec:advances}

In this section, we summarize recent advances in neural reading comprehension. We divide them into the following four categories: {word representations}, {attention mechanisms}, {alternatives to LSTMs}, and {others} (such as training objectives and data augmentation). We give a summary and discuss their importance at the end.

\subsection{Word Representations}
The first category is better word representations for question and passage words, so that the neural models are built on a better foundation. Learning better distributed word representations from text or finding the best set of word embeddings for specific tasks still remains an active research topic --- for example, \newcite{mikolov2017advances} find that replacing \sys{GloVe} pre-trained vectors with the new \sys{fastText} vectors~\cite{bojanowski2017enriching} in our model brings about 1 point of improvement on \sys{SQuAD}. More than that, there are two key ideas which have proven (highly) useful:

\subsubsection*{Character embeddings}
The first idea is to use character-level embeddings to represent words, which are especially helpful for rare or out-of-vocabulary words. Most of the existing works employ a \sys{convolutional neural network} (CNN), which can usefully exploit the surface patterns of character $n$-grams. More concretely, let $\mathcal{C}$ be the vocabulary of characters; each word type $x$ can be represented as a sequence of characters $(c_1, \ldots, c_{|x|}), \forall c_i \in \mathcal{C}$. We first map each character in $\mathcal{C}$ into a $d_c$-dimensional vector, so word $x$ can be represented as $\mf{c}_1, \ldots, \mf{c}_{|x|}$. Next we apply a convolution layer with a filter $\mf{w} \in \R^{d_c \times w}$ of width $w$, and we denote $\mf{c}_{i:i+j}$ as the concatenation of $\mf{c}_i, \mf{c}_{i + 1}, \ldots, \mf{c}_{i + j}$.
Therefore, for $i = 1, \ldots, |x| - w + 1$, we apply the filter $\mf{w}$, add a bias $b$, and apply a nonlinearity $\tanh$ as follows:
\begin{equation}
f_i = \tanh\left(\mf{w}^{\intercal} \mf{c}_{i:i+w-1} + b \right).
\end{equation}
Finally, we apply a \ti{max-pooling} operation over $f_1, \ldots, f_{|x| - w + 1}$ and obtain one scalar feature:
\begin{equation}
f = \max_{i}{\{f_i\}}
\end{equation}
This feature essentially picks out a character $n$-gram, where the size of the $n$-gram corresponds to the filter width $w$. We can repeat the above process with $d^*$ different filters $\mf{w}_1, \ldots, \mf{w}_{d^*}$. As a result, we obtain a character-based word representation $\mf{E}_c(x) \in \R^{d^*}$ for each word type. All the character embeddings, filter weights $\{\mf{w}\}$ and biases $\{b\}$ are learned during training. More details can be found in \newcite{kim2014convolutional}. In practice, the dimension of character embeddings $d_c$ usually takes a small value (e.g., 20), the width $w$ usually takes $3$--$5$, while $100$ is a typical value for $d^*$.

\subsubsection*{Contextualized word embeddings}
Another important idea is \ti{contextualized word embeddings}. Different from traditional word embeddings, in which each word type is mapped to one single vector, contextualized word embeddings assign each word a vector as a function of the entire input sentence. These word embeddings can better model the complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., polysemy). A concrete implementation is \sys{ELMo}, detailed in \newcite{peters2018deep}: their contextualized word embeddings are learned functions of the internal states of a deep bidirectional language model, which is pre-trained on a large text corpus. Basically, given a sequence of words $(x_1, x_2, \ldots, x_n)$, they run an $L$-layer forward LSTM and model the sequence probability as:
\begin{equation}
P(x_1, x_2, \ldots, x_n) = \prod_{k = 1}^{n}P(x_k \mid x_1, \ldots, x_{k - 1})
\end{equation}
Only the top layer of the LSTM $\overrightarrow{\mf{h}}^{(L)}_k$ is used to predict the next token $x_{k + 1}$. Similarly, another $L$-layer LSTM is run backward and $\overleftarrow{\mf{h}}^{(L)}_k$ is used to predict the token $x_{k - 1}$. The overall training objective is to maximize the log-likelihood in both directions:
\begin{equation}
\small
\sum_{k=1}^{n}\left({\log P (x_k \mid  x_1, \ldots, x_{k-1}; {\Theta}_x, \overrightarrow{{\Theta}}_{\text{LSTM}}, {\Theta}_s ) + \log P (x_k \mid x_{k+1}, \ldots, x_{n}; {\Theta}_x, \overleftarrow{{\Theta}}_{\text{LSTM}}, {\Theta}_s )}\right),
\end{equation}
where $\Theta_x$ and $\Theta_s$ are the word embedding and softmax parameters, shared by both LSTMs. The final contextualized word embeddings are computed as a linear combination of the input word embeddings and all the biLSTM layers, scaled by a scalar parameter:
\begin{equation}
\sys{ELMo}(x_k) = \gamma \left(s_0 \mf{x}_k + \sum_{j=1}^{L}{\overrightarrow{s}_{j} \overrightarrow{\mf{h}}^{(j)}_k} + \sum_{j=1}^{L}{\overleftarrow{s}_{j} \overleftarrow{\mf{h}}^{(j)}_k} \right)
\end{equation}
All the weights $\gamma, s_0, \overrightarrow{s}_{j}, \overleftarrow{s}_{j}$ are task-specific and learned during the training process. These contextualized word embeddings are usually used in conjunction with traditional word type embeddings and character embeddings.
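To make the combination above concrete, the following is a minimal Python/PyTorch sketch of the scalar mix. It is purely illustrative (not the released \sys{ELMo} code); the function and tensor names are our own, and we softmax-normalize the weights $s$, mirroring common practice:
\begin{verbatim}
import torch

def elmo_scalar_mix(x, fwd_states, bwd_states, s, gamma):
    """x: (n, d) input word embeddings x_k.
    fwd_states / bwd_states: lists of L tensors of shape (n, d),
    the forward / backward language-model hidden states h_k^(j).
    s: raw mixing weights of length 2L + 1; gamma: a scalar."""
    w = torch.softmax(s, dim=0)   # normalized task-specific weights
    mix = w[0] * x                # s_0 * x_k
    for j, (fh, bh) in enumerate(zip(fwd_states, bwd_states)):
        mix = mix + w[1 + 2*j] * fh + w[2 + 2*j] * bh
    return gamma * mix            # ELMo(x_k), shape (n, d)
\end{verbatim}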
It turns out that contextualized word embeddings of this sort, pre-trained on very large text corpora (e.g., the 1B Word Benchmark~\cite{chelba2014one}), are highly effective. \newcite{peters2018deep} demonstrated that adding ELMo embeddings ($L = 2$ biLSTM layers with $4096$ units and $512$-dimensional projections) to an existing competitive model can bring the F1 score on \sys{SQuAD} from $81.1$ to $85.8$ directly, a $4.7$-point absolute improvement. Earlier than \sys{ELMo}, \newcite{mccann2017learned} proposed \sys{CoVe}, which learned contextualized word embeddings in a neural machine translation framework; the resulting encoder can be used in a similar way, as an addition to the word embeddings. They also demonstrated a $4.3$-point absolute improvement on \sys{SQuAD}.

Very recently, \newcite{radford2018improving} and \newcite{devlin2018bert} found that these contextualized word embeddings can not only be used as word representation features in a task-specific neural architecture (a reading comprehension model in our context), but that we can also fine-tune the deep language models directly, with minimal modifications, to perform downstream tasks. This is a very striking result at the time of writing this thesis; we will discuss it more in Section~\ref{sec:rep-vs-arch}, and many open questions remain to be answered in the future. Additionally, \newcite{devlin2018bert} proposed a clever way to train bidirectional language models: instead of always stacking LSTMs in one direction and predicting the next word,\footnote{To make it clear, although ELMo adopts a biLSTM, it essentially uses two unidirectional LSTMs to predict the next word in each direction.} they mask out some words at random at the input layer, stack bidirectional layers, and predict these masked words at the top layer. They find this training strategy extremely useful empirically.

\subsection{Attention Mechanisms}
\label{sec:attention-mechanisms}

A multitude of attention variants have been proposed for neural reading comprehension models; they aim to capture the semantic similarity between the question and the passage at different levels, at multiple granularities, or in a hierarchical way. A typical complex example in this direction can be found in \cite{huang2018fusionnet}. To the best of our knowledge, there is no conclusion yet as to whether any single variant stands out. Our \sys{Stanford Attentive Reader} (Section~\ref{sec:sar}) takes the simplest possible form of attention (Figure~\ref{fig:att-overview} illustrates an overview of different layers of attention). Beyond that, we think there are two ideas which can generally further improve the performance of these systems:

\begin{figure}[t]
\centering
\vspace{1em}
\includegraphics[scale=0.25]{img/gen_fusion.pdf}
\vspace{1em}
\begin{tabular}{l|ccccc}
\hline
\bf Architectures & \bf (1) & \bf (2) & \bf (2') & \bf (3) & \bf (3') \\
\hline
Match-LSTM \citep{wang2017machine} & & \checkmark & & & \\
DCN \citep{xiong2017dynamic} & & \checkmark & & & \checkmark \\
BiDAF \citep{seo2017bidirectional} & & \checkmark & & & \checkmark \\
RaSoR \citep{lee2016learning} & \checkmark & & \checkmark & & \\
R-net \citep{wang2017gated} & & \checkmark & & \checkmark & \\
\hline
Our model & \checkmark & & & & \\
\hline
\end{tabular}
\longcaption{A summary of different layers of attention.}{\label{fig:att-overview} A summary of different layers of attention.
Image courtesy of \cite{huang2018fusionnet}, with minimal modifications.}
\end{figure}

\subsubsection*{Bidirectional attention}
\newcite{seo2017bidirectional} first introduced the idea of \ti{bidirectional attention}. In addition to what we already have, the key difference is that they add \ti{question-to-passage} attention, which signifies which passage words have the closest similarity to each of the question words. In practice, this can be implemented as follows: for each word in the question, we compute an attention map over all the passage words, similarly to Equations~\ref{eq:aligned_question} and \ref{eq:aligned_question_attention}, but in the opposite direction:
\begin{equation}
f_{q\_align}(q_i) = \sum_j{b_{i, j} \mf{E}(p_j)}.
\end{equation}
After this, we can simply feed $f_{q\_align}(q_i)$ into the input layer of the question encoding (Section~\ref{sec:question-encoding}). The attention mechanism in \newcite{seo2017bidirectional} is a bit more complex, but we think it is similar in spirit. We also argue that the attention function in this direction is less useful, as also demonstrated in \newcite{seo2017bidirectional}. This is because questions are generally short (10--20 words on average) and using a single LSTM for question encoding (without extra attention) is usually sufficient.

\subsubsection*{Self-attention over passage}
The second idea is \ti{self-attention} over the passage words, first introduced in \newcite{wang2017gated}.\footnote{They called it a ``self-matching attention mechanism'' in the paper.} The intuition is that the passage words can be aligned to the other passage words, with the hope that this can address coreference problems and aggregate information (about the same entity) from multiple places in the passage. In detail, \newcite{wang2017gated} first compute the hidden vectors for the passage: $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p}$ (Equation~\ref{eq:passage-lstm}), and then, for each $\mf{p}_i$, they apply an attention function over $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p}$ via one hidden layer of MLP (Equation~\ref{eq:mlp-att}):
\begin{eqnarray}
a_{i, j} & =& \frac{\exp\left(g_{\text{MLP}}(\mf{p}_i, \mf{p}_j)\right)}{\sum_{j'}\exp\left(g_{\text{MLP}}(\mf{p}_i, \mf{p}_{j'})\right)} \\
\mf{c}_i & = & \sum_{j}{a_{i, j}\mf{p}_j}
\end{eqnarray}
Later, $\mf{c}_i$ and $\mf{p}_i$ are concatenated and fed into another BiLSTM: $\mf{h}^{(p)}_i = \text{BiLSTM}(\mf{h}^{(p)}_{i-1}, [\mf{p}_i, \mf{c}_i])$, and the result can be used as the final passage representations.

\subsection{Alternatives to LSTMs}
\label{sec:alt-lstms}

All the models we have discussed so far are based on recurrent neural networks (RNNs), in particular LSTMs. It is well known that increasing the depth of neural networks can improve the capacity of models and bring gains in performance~\cite{he2016deep}. We also discussed earlier that deep BiLSTMs of $3$ or $4$ layers usually perform better than a single layer of BiLSTM (Section~\ref{sec:imp-details}). However, we face two challenges as we further increase the depth of LSTM models: 1) optimization becomes more difficult due to the vanishing gradient problem; and 2) scalability becomes an issue, as the training/inference time increases linearly with the number of layers. It is well known that LSTMs are difficult to parallelize, and thus scale poorly, due to their sequential nature.
On the one hand, some works attempt to add highway connections~\cite{srivastava2015training} or residual connections~\cite{he2016deep} between layers, which eases optimization and enables training more layers of LSTMs. On the other hand, people have set out to find replacements for LSTMs which get rid of recurrent structures while still performing similarly or even better. The most notable work in this line is the \sys{Transformer} model proposed by Google researchers~\cite{vaswani2017attention}. The \sys{Transformer} builds only on top of word embeddings and simple positional encodings, with stacked self-attention layers and position-wise fully connected layers. With residual connections, this model can be trained quickly even with many layers. It first demonstrated superior performance on a machine translation task with $L = 6$ layers (each layer consists of a self-attention sublayer and a fully connected feedforward network), and was later adapted by \newcite{yu2018qanet} for reading comprehension. Their model, called \sys{QANet}~\cite{yu2018qanet}, stacks multiple convolutional layers followed by self-attention and a fully connected layer as a building block, for both question and passage encoding, with a few more such blocks stacked before the final prediction. The model demonstrated state-of-the-art performance at the time (Table~\ref{tab:squad-results}) while showing significant speed-ups.

Another line of work, by \newcite{lei2018simple}, proposed a lightweight recurrent unit called the \sys{Simple Recurrent Unit} (SRU), which simplifies the LSTM formulation while enabling CUDA-level optimizations for high parallelization. Their results suggest that simplified recurrence retains strong modeling capacity through layer stacking. They also demonstrate that replacing the LSTMs in our model with their \sys{SRU} units can improve the F1 score by 2 points while being faster for both training and inference.

\subsection{Others}
\paragraph{Training objectives.} It is also possible to make further progress by improving the training objectives. It is usually straightforward to employ a cross-entropy or max-margin loss for cloze style or multiple choice problems. However, for span prediction problems, \newcite{xiong2018dcn+} point out that there is a discrepancy between the cross-entropy loss of predicting the two endpoints of the answer and the final evaluation metric, which measures the word overlap between the predicted answer and the ground truth. Consider the following example:
\begin{displayquote}
\tf{passage}: Some believe that the Golden State Warriors team of 2017 is one of the greatest teams in NBA history \ldots \\
\tf{question}: Which team is considered to be one of the greatest teams in NBA history? \\
\tf{ground truth answer}: the Golden State Warriors team of 2017
\end{displayquote}
The span ``Warriors'' is also a correct answer; however, from the perspective of cross-entropy based training, it is no better than the span ``history''. \newcite{xiong2018dcn+} propose to use a mixed training objective which combines the cross-entropy loss over positions with the word-overlap measure trained with reinforcement learning. Basically, they use $P^{(\text{start})}(i)$ and $P^{(\text{end})}(i)$, trained with the cross-entropy loss, to sample the start and end positions of the answer, and then use the F1 score as the reward function.
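As a rough illustration, the following Python sketch combines the two terms in the spirit of this mixed objective. It is our own simplification, not the formulation of \newcite{xiong2018dcn+}: the variable names, the set-based F1, and the unweighted sum of the two terms are all assumptions made for brevity.
\begin{verbatim}
import torch

def span_f1(pred, gold):
    """Word-overlap F1 between two token lists (set-based for brevity)."""
    common = set(pred) & set(gold)
    if not common:
        return 0.0
    p, r = len(common) / len(pred), len(common) / len(gold)
    return 2 * p * r / (p + r)

def mixed_loss(p_start, p_end, gs, ge, tokens):
    """p_start, p_end: probability vectors over positions;
    (gs, ge): gold start/end indices; tokens: the passage tokens."""
    # Supervised term: cross-entropy on the two gold endpoints.
    ce = -(torch.log(p_start[gs]) + torch.log(p_end[ge]))
    # RL term (REINFORCE): sample a span, reward it by its F1.
    s = torch.multinomial(p_start, 1).item()
    e = torch.multinomial(p_end, 1).item()
    s, e = min(s, e), max(s, e)
    reward = span_f1(tokens[s:e + 1], tokens[gs:ge + 1])
    rl = -reward * (torch.log(p_start[s]) + torch.log(p_end[e]))
    return ce + rl
\end{verbatim}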
For reading comprehension problems with free-form answers, there have been many recent advances in training better \sys{seq2seq} models, especially in the context of neural machine translation, such as sentence-level training~\cite{ranzato2016sequence} and minimum risk training~\cite{shen2016minimum}. However, we have not yet seen many such applications to reading comprehension problems.

\paragraph{Data augmentation.} Data augmentation has been a very successful approach for image recognition, while it is less explored for NLP problems. \newcite{yu2018qanet} proposed a simple technique for creating more training data for reading comprehension models. The technique is called \ti{backtranslation}: basically, they leverage two state-of-the-art neural machine translation models, one from English to French and the other from French to English, and paraphrase each sentence in the passage by running it through the two models (with some modifications to the answer if needed). They obtained a gain of about 2 points in F1 by doing this on \sys{SQuAD}. \newcite{devlin2018bert} also found that joint training on \sys{SQuAD} and \sys{TriviaQA}~\cite{joshi2017triviaqa} can modestly improve the performance on \sys{SQuAD}.

\subsection{Summary}
So far, we have discussed recent advances in different aspects, which, in sum, contribute to the latest progress on current reading comprehension benchmarks (especially \sys{SQuAD}). Which components are more important than others? Do we need to add up all of these? Are these recent advances able to generalize to other reading comprehension tasks? How are they correlated with different capacities of language understanding? We think there isn't a clear answer to most of these questions yet, and they still require a lot of investigation.

\begin{table}[!t]
\centering
\begin{tabular}{p{6cm} | c l}
\hline
\tf{Components} & \tf{F1 improvement} & \tf{References} \\
\hline
\sys{GloVe}$\Rightarrow$\sys{fastText} & 78.9 $\Rightarrow$ 79.8: $+0.9$ & \cite{mikolov2017advances} \\
Character embeddings & 75.4 $\Rightarrow$ 77.3: $+1.9$ & \cite{seo2017bidirectional} \\
{\small Contextualized embeddings: \sys{ELMo}} & 81.1 $\Rightarrow$ 85.8: $+4.7$ & \cite{peters2018deep} \\
\hline
Question-to-passage attention & 73.7 $\Rightarrow$ 77.3: $+3.6$ & \cite{seo2017bidirectional} \\
Self-attention over passage & 76.7 $\Rightarrow$ 79.5: $+2.8$ & \cite{wang2017gated} \\
\hline
3-layer LSTMs $\Rightarrow$ 6-layer SRUs & 78.8 $\Rightarrow$ 80.2: $+1.4$ & \cite{lei2018simple} \\
\hline
Mixed training objective & 82.1 $\Rightarrow$ 83.1: $+1.0$ & \cite{xiong2018dcn+} \\
Data augmentation & 82.7 $\Rightarrow$ 83.8: $+1.1$ & \cite{yu2018qanet} \\
\hline
\end{tabular}
\longcaption{A summary of recent advances on \sys{SQuAD}}{\label{tab:impr-squad} A summary of recent advances on \sys{SQuAD}. The numbers are taken from the corresponding papers, on the development set of \sys{SQuAD}.}
\end{table}

We compiled the improvements from the different components on \sys{SQuAD} in Table~\ref{tab:impr-squad}. We would like to caution readers that these numbers are not directly comparable, as they are built on different model architectures and different implementations. We hope that this table at least gives some idea of the importance of these components on the \sys{SQuAD} dataset. As can be seen, all these components contribute to the final performance, more or less.
The most important innovation is probably the use of contextualized word embeddings (e.g., \sys{ELMo}), while the formulation of the attention functions is also crucial. It will be important to investigate whether these advances can generalize to other reading comprehension tasks in the future.

================================================
FILE: chapters/rc_models/experiments.tex
================================================

%!TEX root = ../../thesis.tex

\section{Experiments}
\label{sec:sar-experiments}

\subsection{Datasets}
We evaluate our model on \sys{CNN/Daily Mail}~\cite{hermann2015teaching} and \sys{SQuAD}~\cite{rajpurkar2016squad}, the two most popular and competitive reading comprehension datasets. We described them before in Section~\ref{sec:deep-learning-era}, regarding their importance in the development of neural reading comprehension and the way the datasets were constructed. Here we give a brief review of these datasets and their statistics.
\begin{itemize}
\item The \sys{CNN/Daily Mail} datasets were made from articles on the news websites CNN and Daily Mail, utilizing articles and their bullet point summaries. One bullet point is converted to a question with one entity replaced by a placeholder, and the answer is this entity. The text has been run through a Google NLP pipeline: it is tokenized and lowercased, and named entity recognition and coreference resolution have been run. For each coreference chain containing at least one named entity, all items in the chain are replaced by an @entity$n$ marker, for a distinct index $n$ (Table~\ref{tab:rc-examples} (a)). On average, an article in both \sys{CNN} and \sys{Daily Mail} contains 26.2 different entities. The training, development, and testing examples were collected from news articles at different times. Accuracy (the percentage of examples predicting the correct entity) is used for evaluation.
\item The \sys{SQuAD} dataset was collected from Wikipedia articles. 536 high-quality Wikipedia articles were sampled, and crowdworkers created questions based on each individual paragraph (paragraphs shorter than 500 characters were discarded), with the requirement that the answer be highlighted in the paragraph (Table~\ref{tab:rc-examples} (c)). The training/development/testing splits were made randomly based on articles (80\% vs. 10\% vs. 10\%). To estimate human performance and also make evaluation more reliable, a few additional answers were collected for each question (each question in the development set has 3.3 answers on average). Exact match and macro-averaged F1 scores are used for evaluation, as we discussed in Section~\ref{sec:evaluation}. Note that \sys{SQuAD} 2.0~\cite{rajpurkar2018know} was proposed more recently, which added 53,775 unanswerable questions to the original dataset; we will discuss it in Section~\ref{sec:future-datasets}. For most of this thesis, \sys{SQuAD} refers to \sys{SQuAD} 1.1 unless stated otherwise.
\end{itemize}

\begin{table}[t]
\centering
\begin{tabular}{l | r r | r }
\hline
& \multicolumn{2}{c|}{cloze style} & span prediction \\
& \tf{CNN} & \tf{Daily Mail} & \tf{SQuAD} \\
\hline
\#Train & 380,298 & 879,450 & 87,599 \\
\#Dev & 3,924 & 64,835 & 10,570 \\
\#Test & 3,198 & 53,182 & 9,533 \\
\hline
Passage: avg. tokens & 761.8 & 813.1 & 134.4 \\
Question: avg. tokens & 12.5 & 14.3 & 11.3 \\
Answer: avg.
tokens & 1.0 & 1.0 & 3.1 \\
\hline
\end{tabular}
\longcaption{Data statistics of \sys{CNN/Daily Mail} and \sys{SQuAD}}{\label{tab:data-statistics}Data statistics of \sys{CNN/Daily Mail} and \sys{SQuAD}. The average numbers of tokens are computed based on the training set.}
\end{table}

Table~\ref{tab:data-statistics} gives more detailed statistics of the datasets. As shown, the \sys{CNN/Daily Mail} datasets are much larger than \sys{SQuAD} (almost one order of magnitude bigger) due to the way they were constructed. The passages used in \sys{CNN/Daily Mail} are also much longer: 761.8 and 813.1 tokens on average for \sys{CNN} and \sys{Daily Mail} respectively, versus 134.4 tokens for \sys{SQuAD}. Finally, the answers in \sys{SQuAD} consist of only 3.1 tokens on average, which reflects the fact that most \sys{SQuAD} questions are factoid and a large portion of the answers are common nouns or named entities.

\subsection{Implementation Details}
\label{sec:imp-details}

Besides model architecture design, implementation details also play a crucial role in the final performance of these neural reading comprehension systems. In the following, we highlight a few important aspects that we haven't covered yet, and finally give the model specifications that we used on the two datasets.

\paragraph{Stacked BiLSTMs.} One simple idea is to increase the depth of the bidirectional LSTMs for question and passage encoding: we compute $\mf{h}_t = [\overrightarrow{\mf{h}}_t; \overleftarrow{\mf{h}}_t] \in \R^{2h}$, regard $\mf{h}_t$ as the input $\mf{x}_t$ of the next layer, pass it into another BiLSTM, and so on. We generally find that stacked BiLSTMs work better than a one-layer BiLSTM, and we used $3$ layers for the \sys{SQuAD} experiments.\footnote{We only used a shallow one-layer BiLSTM for the CNN/Daily Mail experiments in 2016 though.}

\paragraph{Dropout.} Dropout is an effective and widely used approach to regularization in neural networks. Simply put, dropout refers to masking out some units at random during the training process. For our model, dropout can be added to the word embeddings, and to the input vectors and hidden vectors of every LSTM layer. Finally, the variational dropout approach \cite{gal2016theoretically} has been demonstrated to work better than standard dropout for regularizing RNNs. The idea is to apply the same dropout mask at each time step for the inputs, outputs and recurrent layers, i.e., the same units are dropped at each time step. We suggest that readers use this variant in practice.\footnote{We didn't include variational dropout in our published paper results but later found it useful.}

\paragraph{Handling word embeddings.} One common way (and also our default choice) to handle word embeddings is to keep the most frequent $K$ (e.g., $K = 500,000$) word types in the training set, map all other words to an \texttt{<unk>} token, and then use pre-trained word embeddings to initialize the $K$ words. Typically, when the training set is large enough, we fine-tune all the word embeddings; when the training set is relatively small (e.g., \sys{SQuAD}), we usually keep all the word embeddings fixed as static features. In \newcite{chen2017reading}, we found that it helps to fine-tune the most frequent question words, because the representations of key words such as \ti{what}, \ti{how} and \ti{which} could be crucial for reading comprehension systems.
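A minimal Python/PyTorch sketch of this partial fine-tuning is shown below. It is illustrative rather than the released \sys{DrQA} implementation; the gradient-mask trick and all names are our own assumptions:
\begin{verbatim}
import torch
import torch.nn as nn

def build_embedding(pretrained, tune_ids):
    """pretrained: a (K, d) tensor of pre-trained word vectors;
    tune_ids: indices of the frequent question words to fine-tune."""
    emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
    mask = torch.zeros(pretrained.size(0), 1)
    mask[list(tune_ids)] = 1.0
    # Zero the gradient for every row except the tuned ones, so all
    # other word embeddings stay fixed as static features.
    emb.weight.register_hook(lambda grad: grad * mask)
    return emb
\end{verbatim}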
Some studies, such as \cite{dhingra2017comparative}, demonstrated that the choice of pre-trained embeddings and the way out-of-vocabulary words are handled have a large impact on the performance of reading comprehension tasks.

\paragraph{Model specifications.} For all the experiments which require linguistic annotations (lemmas, part-of-speech tags, named entity tags, dependency parses), we use the Stanford CoreNLP toolkit~\cite{manning2014stanford} for preprocessing. For training all the neural models, we sort all the examples by the length of their passages, and randomly sample a mini-batch of size 32 for each update.

For the results on \sys{CNN/Daily Mail}, we use the lowercased, 100-dimensional pre-trained \sys{GloVe} word embeddings~\cite{pennington2014glove} trained on Wikipedia and Gigaword for initialization. The attention and output parameters are initialized from a uniform distribution over $(-0.01, 0.01)$, and the LSTM weights are initialized from a Gaussian distribution $\mathcal{N}(0, 0.1)$. We use a 1-layer BiLSTM with hidden size $h = 128$ for \sys{CNN} and $h = 256$ for \sys{Daily Mail}. Optimization is carried out using vanilla stochastic gradient descent (SGD), with a fixed learning rate of $0.1$. We also apply dropout with probability $0.2$ to the embedding layer, and clip the gradients when their norm exceeds $10$.

For the results on \sys{SQuAD}, we use 3-layer BiLSTMs with $h = 128$ hidden units for both paragraph and question encoding. We use \sys{Adamax} for optimization, as described in \cite{kingma2014adam}. Dropout with probability $0.3$ is applied to the word embeddings and all the hidden units of the LSTMs. We used the $300$-dimensional \sys{GloVe} word embeddings trained on 840B Web crawl data for initialization, and only fine-tune the 1,000 most frequent question words.

Other implementation details can be found in the following two Github repositories:
\begin{itemize}
\item \href{https://github.com/danqi/rc-cnn-dailymail}{https://github.com/danqi/rc-cnn-dailymail} for our experiments in \newcite{chen2016thorough}.
\item \href{https://github.com/facebookresearch/DrQA}{https://github.com/facebookresearch/DrQA} for our experiments in \newcite{chen2017reading}.
\end{itemize}

We also would like to caution readers that our experimental results were published in two papers (2016 and 2017) and they differ in various places. A key difference is that our results on \sys{CNN/Daily Mail} didn't include the manual features $f_{token}(p_i)$, the exact match features $f_{exact\_match}(p_i)$, or the aligned question embeddings $f_{align}(p_i)$; $\tilde{\mf{p}}_i$ just takes the word embedding $\mf{E}(p_i)$. Another difference is that we didn't have the attention layer in the question encoding before, but simply concatenated the last hidden vectors from the LSTMs in both directions. We believe that these additions would be useful on \sys{CNN/Daily Mail} and other cloze style tasks as well, but we didn't investigate this further.
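To make the optimization recipe concrete, here is a hedged Python/PyTorch sketch of the \sys{CNN/Daily Mail} setup above (initialization, SGD with learning rate 0.1, and gradient clipping at norm 10). It is a sketch under our own naming assumptions, not the code in the repositories above; in particular, we read $\mathcal{N}(0, 0.1)$ as a standard deviation of 0.1.
\begin{verbatim}
import torch
from torch import nn

def init_parameters(model):
    """Uniform init for attention/output weights, Gaussian for LSTMs."""
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue                               # leave biases alone
        if "lstm" in name.lower():
            nn.init.normal_(p, mean=0.0, std=0.1)  # N(0, 0.1)
        else:
            nn.init.uniform_(p, -0.01, 0.01)

def training_step(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()

# e.g., optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
\end{verbatim}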
\subsection{Experimental Results}
\subsubsection{Results on \sys{CNN/Daily Mail}}

\begin{table}[t]
\centering
\begin{tabular}{l c c c c}
\toprule
\multirow{2}{*}{\tf{Model}} & \multicolumn{2}{c}{\sys{CNN}} & \multicolumn{2}{c}{\sys{Daily Mail}} \\
& \tf{Dev} & \tf{Test} & \tf{Dev} & \tf{Test} \\
\midrule
Frame-semantic model $^\dagger$ & 36.3 & 40.2 & 35.5 & 35.5 \\
Word distance model $^\dagger$ & 50.5 & 50.9 & 56.4 & 55.5 \\
Deep LSTM Reader $^\dagger$ & 55.0 & 57.0 & 63.3 & 62.2 \\
Attentive Reader $^\dagger$ & 61.6 & 63.0 & 70.5 & 69.0 \\
Impatient Reader $^\dagger$ & 61.8 & 63.8 & 69.0 & 68.0 \\
\midrule
MemNNs (window memory) $^\ddagger$ & 58.0 & 60.6 & N/A & N/A \\
MemNNs (window memory + self-sup.) $^\ddagger$ & 63.4 & 66.8 & N/A & N/A\\
MemNNs (ensemble) $^\ddagger$ & 66.2\rlap{$^*$} & 69.4\rlap{$^*$} & N/A & N/A \\
\midrule
Our feature-based classifier & 67.1 & 67.9 & 69.1 & 68.3 \\
\midrule
Stanford Attentive Reader & 72.5 & 72.7 & 76.9 & 76.0 \\
Stanford Attentive Reader (ensemble) & 76.2\rlap{$^*$} & 76.5\rlap{$^*$} & 79.5\rlap{$^*$} & 78.7\rlap{$^*$} \\
\bottomrule
\end{tabular}
\longcaption{Evaluation results on CNN/Daily Mail}{\label{tab:cnn-dm-results}Accuracy of all models on the \sys{CNN} and \sys{Daily Mail} datasets. Results marked $^\dagger$ are from \newcite{hermann2015teaching} and results marked $^\ddagger$ are from \newcite{hill2016goldilocks}. The numbers marked with $^*$ indicate that the results are from ensemble models.}
\end{table}

Table~\ref{tab:cnn-dm-results} presents the results that we reported in \newcite{chen2016thorough}. We ran our neural models 5 times independently with different random seeds and report the average performance across the runs. We also report ensemble results which average the prediction probabilities of the 5 models. In addition, we present the results of the feature-based classifier that we described in Section~\ref{sec:feature-models}.

\paragraph{Baselines.} We were among the earliest groups to study this first large-scale reading comprehension dataset. At the time, \newcite{hermann2015teaching} and \newcite{hill2016goldilocks} had proposed a few baselines, both symbolic approaches and neural models, for this task. The baselines include:
\begin{itemize}
\item A \sys{frame-semantic} model in \newcite{hermann2015teaching}, in which they run a state-of-the-art semantic parser, extract entity-predicate triples denoted as $(e_1, V, e_2)$ from both the question and the passage, and attempt to match the correct entity using a number of heuristic rules.
\item A \sys{word distance} model in \newcite{hermann2015teaching}, in which they align the placeholder of the question with each possible entity, and compute a distance measure between the question and the passage around the aligned entity.
\item Several LSTM-based neural models proposed in \newcite{hermann2015teaching}, named the \sys{Deep LSTM Reader}, the \sys{Attentive Reader} and the \sys{Impatient Reader}. The \sys{Deep LSTM Reader} just processes the question and the passage as one sequence using a deep LSTM (without an attention mechanism), and makes a prediction at the end. The \sys{Attentive Reader} is similar in spirit to ours, as it computes an attention function between the question vector and all the passage vectors, while the \sys{Impatient Reader} computes an attention function for all the question words and recurrently accumulates information as the model reads each question word.
\item The \sys{window-based memory network} proposed by \newcite{hill2016goldilocks} is based on the memory network architecture \cite{weston2015memory}. We think this model is also similar to ours; the biggest difference is the way it encodes passages: it only uses a 5-word context window when evaluating a candidate entity, and it uses a positional unigram approach to encode the contextual embeddings. If a window consists of $5$ words $x_1, x_2, \ldots, x_5$, then it is encoded as $\sum{\mf{E}_i(x_i)}$, resulting in $5$ separate embedding matrices to learn. They encode the $5$-word window surrounding the placeholder in a similar way, and all the other words in the question text are ignored. In addition, they simply use a dot product to compute the ``relevance'' between the question and a contextual embedding.
\end{itemize}

As seen in Table~\ref{tab:cnn-dm-results}, our feature-based classifier obtains 67.9\% accuracy on the \sys{CNN} test set and 68.3\% accuracy on the \sys{Daily Mail} test set. It significantly outperforms all of the symbolic approaches reported in \newcite{hermann2015teaching}. We feel that their frame-semantic model is not suitable for these tasks, due to the poor coverage of the parser, and is not representative of what a straightforward NLP system can achieve. Indeed, the frame-semantic model is markedly inferior even to the word distance model. To our surprise, our feature-based classifier even outperforms all the neural network systems in \newcite{hermann2015teaching} and the best single-system result reported in \newcite{hill2016goldilocks}. Moreover, our single-model neural network surpasses the previous results by a large margin (over 5\%), pushing the state-of-the-art accuracies up to 72.7\% and 76.0\% respectively. The ensembles of 5 models consistently bring further 2--4\% gains.

\subsubsection{Results on \sys{SQuAD}}

\begin{table}[t]
\begin{center}
\begin{tabular}{p{8.5cm} c c c c}
\hline
\bf Method & \multicolumn{2}{c}{\bf Dev} & \multicolumn{2}{c}{\bf Test} \\
& \tf{EM} & \tf{F1} & \tf{EM} & \tf{F1} \\
\hline
Logistic regression \cite{rajpurkar2016squad} & 40.0 & 51.0 & 40.4 & 51.0 \\
\hline
Match-LSTM~\cite{wang2017machine} & 64.1 & 73.9 & 64.7 & 73.7 \\
RaSoR~\cite{lee2016learning} & 66.4 & 74.9 & 67.4 & 75.5 \\
DCN~\cite{xiong2017dynamic} & 65.4 & 75.6 & 66.2 & 75.9 \\
BiDAF~\cite{seo2017bidirectional} & 67.7 & 77.3 & 68.0 & 77.3 \\
\hline
\tf{Our model}~\cite{chen2017reading} & 69.5 & 78.8 & 70.0 & 79.0\\
\hline
R-NET~\cite{wang2017gated} & 71.1 & 79.5 & 71.3 & 79.7 \\
BiDAF + self-attention~\cite{peters2018deep} & N/A & N/A & 72.1 & 81.1 \\
FusionNet~\cite{huang2018fusionnet} & N/A & N/A & 76.0 & 83.9 \\
QANet~\cite{yu2018qanet} & 73.6 & 82.7 & N/A & N/A \\
SAN~\cite{liu2018stochastic} & 76.2 & 84.1 & 76.8 & 84.4 \\
{\small BiDAF + self-attention + ELMo}~\cite{peters2018deep} & N/A & N/A & 78.6 & 85.8 \\
BERT~\cite{devlin2018bert} & 84.1 & 90.9 & N/A & N/A \\
\hline
Human performance \cite{rajpurkar2016squad} & 80.3 & 90.5 & 82.3 & 91.2 \\
\hline
\end{tabular}
\end{center}
\longcaption{Evaluation results on SQuAD}{\label{tab:squad-results} Evaluation results on the SQuAD dataset (single model only). The results below ``our model'' were released after we finished the paper in February 2017. We only list representative models and report the results from the published papers.
For a fair comparison, we didn't include results which use other training resources (e.g., TriviaQA) or data augmentation techniques, except for pre-trained language models; we discuss them in Section~\ref{sec:advances}.
}
\end{table}

Table~\ref{tab:squad-results} presents our evaluation results on both the development and testing sets. \sys{SQuAD} has been a very competitive benchmark since it was created, and we only list a few representative models and their single-model performance. It is well known that ensemble models can further improve the performance by a few points. We also include the results of the logistic regression baseline (i.e., a feature-based classifier) created by the original authors \cite{rajpurkar2016squad}.

Our system achieves 70.0\% exact match and 79.0\% F1 scores on the test set, which surpassed all the published results and matched the top performance on the SQuAD leaderboard\footnote{\href{https://stanford-qa.com}{https://stanford-qa.com}.} at the time we wrote the paper~\cite{chen2017reading}. Additionally, we think that our model is conceptually simpler than most of the existing systems. Compared to the logistic regression baseline, which gets $\text{F1} = 51.0$, our model achieves close to a 30\% absolute improvement, a big win for neural models. Since then, \sys{SQuAD} has received tremendous attention and great progress has been made on this dataset, as seen in Table~\ref{tab:squad-results}. Recent advances include pre-trained language models for initialization, more fine-grained attention mechanisms, data augmentation techniques, and even better training objectives. We discuss them in Section~\ref{sec:advances}.

\subsubsection{Ablation studies}

\begin{table}[h]
\begin{center}
\begin{tabular}{l | l}
\hline
\bf Features & \bf F1\\
\hline
Full & 78.8 \\
\hline
No $f_{token}$ & 78.0 (-0.8)\\
No $f_{exact\_match}$ & 77.3 (-1.5)\\
No $f_{aligned}$ & 77.3 (-1.5)\\
No $f_{aligned}$ and $f_{exact\_match}$ & 59.4 (-19.4) \\
\hline
\end{tabular}
\end{center}
\longcaption{Feature ablation analysis on SQuAD}{\label{tab:feature-ablation}Feature ablation analysis of the paragraph representations of our model. Results are reported on the SQuAD development set.}
\end{table}

In \newcite{chen2017reading}, we conducted an ablation analysis of the components of the passage representations. As shown in Table~\ref{tab:feature-ablation}, all the components contribute to the performance of our final system. We find that, without the aligned question embeddings (i.e., with only word embeddings and a few manual features), our system is still able to achieve an F1 score over 77\%. The effectiveness of the exact match features $f_{exact\_match}$ also indicates that there is a lot of word overlap between the passage and the question in this dataset. More interestingly, if we remove both $f_{aligned}$ and $f_{exact\_match}$, the performance drops dramatically, so we conclude that these two features play a similar but complementary role in the feature representation, like hard and soft alignments between the question and passage words.

\subsection{Analysis: What Have the Models Learned?}

In \newcite{chen2016thorough}, we attempted to better understand what these models have actually learned, and what depth of language understanding is needed to solve these problems. We approached this with a careful hand-analysis of 100 randomly sampled examples from the development set of the \sys{CNN} dataset.
We roughly classified them into the following categories (if an example satisfies more than one category, we classify it into the earliest one):
\begin{description}
\item[\tf{Exact match}] The nearest words around the placeholder are also found in the passage surrounding an entity marker; the answer is self-evident.
\item[\tf{Sentence-level paraphrasing}] The question text is entailed\slash rephrased by \ti{exactly} one sentence in the passage, so the answer can definitely be identified from that sentence.
\item[\tf{Partial clue}] In many cases, even though we cannot find a complete semantic match between the question text and some sentence, we are still able to infer the answer through partial clues, such as some word/concept overlap.
\item[\tf{Multiple sentences}] Multiple sentences must be processed to infer the correct answer.
\item[\tf{Coreference errors}] It is unavoidable that there are many coreference errors in the dataset. This category includes examples with critical coreference errors for the answer entity or for key entities appearing in the question. Basically, we treat this category as ``not answerable''.
\item[\tf{Ambiguous or hard}] This category includes examples for which we think humans are not able to obtain the correct answer (confidently).
\end{description}

Table~\ref{tab:cnn-ex-breakdown} provides our estimate of the percentage of each category, and Figure~\ref{fig:cnn-examples} presents one representative example from each category. We observe that \ti{paraphrasing} accounts for 41\% of the examples and 19\% of the examples fall into the \ti{partial clue} category. Adding the simplest \ti{exact match} category, we hypothesize that a large portion (73\% in this subset) of the examples can be answered by identifying the most relevant (single) sentence and inferring the answer from it. Additionally, only 2 examples require multiple sentences for inference. This is a lower rate than we expected, and it suggests that the dataset requires less reasoning than previously thought. To our surprise, the ``coreference errors'' and ``ambiguous/hard'' cases account for 25\% of this sample set, based on our manual analysis, and this will certainly be a barrier to training models with an accuracy much above 75\% (although, of course, a model can sometimes make a lucky guess). In fact, our ensemble neural network model is already able to achieve 76.5\% on the development set, and we think that the prospect of further improvement on this dataset is small.

\begin{figure}[p]
\centering
\begin{tabular}{l p{4.5cm} p{6.5cm}}
\toprule
Category & Question & Passage \\
\midrule
Exact Match & \ti{it 's clear @entity0 is leaning toward} {\tf{@placeholder}} , says an expert who monitors @entity0 & \ldots @entity116 , who follows @entity0 's operations and propaganda closely , recently told @entity3 , \ti{it 's clear @entity0 is leaning toward} \tf{@entity60} in terms of doctrine , ideology and an emphasis on holding territory after operations . \ldots \\
\midrule
Paraphrasing & {\tf{@placeholder} says he understands why @entity0 wo n't play at his tournament} & \ldots @entity0 called me personally to let me know that he would n't be playing here at @entity23 , " \tf{@entity3} said on his @entity21 event 's website . \ldots \\
\midrule
Partial clue & a tv movie based on @entity2 's book \tf{@placeholder} casts a @entity76 actor as @entity5 & \ldots to @entity12 @entity2 professed that his \tf{@entity11} is not a religious book . \ldots \\
\midrule
Multiple sent.
& he 's doing a his - and - her duet all by himself , @entity6 said of \tf{@placeholder} & \ldots we got some groundbreaking performances , here too , tonight , @entity6 said . we got \tf{@entity17} , who will be doing some musical performances . he 's doing a his - and - her duet all by himself . \ldots \\
\midrule
Coref. Error & rapper \tf{@placeholder} " disgusted , " cancels upcoming show for @entity280 & \ldots with hip - hop star \tf{@entity246} saying on @entity247 that he was canceling an upcoming show for the @entity249 . \ldots (but @entity249 = @entity280 = SAEs)\\
\midrule
Hard & pilot error and snow were reasons stated for \tf{@placeholder} plane crash & \ldots a small aircraft carrying \tf{@entity5} , @entity6 and @entity7 the @entity12 @entity3 crashed a few miles from @entity9 , near @entity10 , @entity11 . \ldots \\
\bottomrule
\end{tabular}
\longcaption{Some representative examples from each category}{\label{fig:cnn-examples}Some representative examples from each category on the \sys{CNN} dataset.}
\end{figure}

\begin{table}[!t]
\centering
\begin{tabular}{l l r}
\toprule
\tf{\#} & \tf{Category} & \tf{\%} \\
\midrule
1 & Exact match & 13\% \\
2 & Paraphrasing & 41\% \\
3 & Partial clue & 19\% \\
4 & Multiple sentences & 2\% \\
\midrule
5 & Coreference errors & 8\% \\
6 & Ambiguous / hard & 17\% \\
\bottomrule
\end{tabular}
\longcaption{An estimate of the breakdown of \sys{CNN} examples}{\label{tab:cnn-ex-breakdown}An estimate of the breakdown of the dataset into classes, based on the analysis of our sampled 100 examples from the \sys{CNN} dataset.}
\end{table}

\begin{figure}[!t]
\center
\includegraphics[scale=0.6]{img/cnn_analysis.png}
\longcaption{The per-category performance of our two systems}{\label{fig:category-performance} The per-category performance of our two systems: the \sys{Stanford Attentive Reader} and the feature-based classifier, on the sampled 100 examples of the \sys{CNN} dataset.}
\end{figure}

Let us take a closer look at the per-category performance of our neural network and the feature-based classifier, based on the above categorization. As shown in Figure~\ref{fig:category-performance}, we make the following observations: (i)~The exact-match cases are quite simple and both systems get 100\% correct. (ii)~For the ambiguous\slash hard and entity-linking-error cases, as expected, both systems perform poorly. (iii)~The two systems mainly differ in the paraphrasing cases, and in some of the ``partial clue'' cases.
This clearly shows that neural networks are more capable of learning semantic matches involving paraphrasing or lexical variation between two sentences. (iv)~We believe that the neural network model already achieves near-optimal performance on all the single-sentence and unambiguous cases.

To sum up, we find that neural networks are certainly more powerful than conventional feature-based models at recognizing lexical matches and paraphrases; however, it is still unclear whether they also win out on the examples which require more complex textual reasoning, as the current datasets are still quite limited in that respect.

================================================
FILE: chapters/rc_models/feature_classifier.tex
================================================

%!TEX root = ../../thesis.tex

\section{Previous Approaches: Feature-based Models}
\label{sec:feature-models}

We first describe a strong feature-based model that we built in \newcite{chen2016thorough} for cloze style problems, in particular the \sys{CNN/Daily Mail} dataset~\cite{hermann2015teaching}. We then discuss similar models built for multiple choice and span prediction problems.

For cloze style problems, the task is formulated as predicting the correct entity $a \in \mathcal{E}$ that can fill in the blank of the question $q$ based on reading the passage $p$ (an example can be found in Table~\ref{tab:rc-examples}), where $\mathcal{E}$ denotes the candidate set of entities. Conventional linear, feature-based classifiers usually need to construct a feature vector $f_{{p}, {q}}(e) \in \R^d$ for each candidate entity $e \in \mathcal{E}$, and to learn a weight vector $\mf{w} \in \R^d$ such that the correct answer $a$ is expected to rank higher than all the other candidate entities:
\begin{equation}
\mf{w}^{\intercal}f_{p, q}(a) > \mf{w}^{\intercal}f_{{p}, {q}}(e), \forall e \in \mathcal{E} \setminus \{{a}\}.
\end{equation}
After the feature vectors are constructed for each entity $e$, we can apply any popular machine learning algorithm (e.g., logistic regression or SVM). In \newcite{chen2016thorough}, we chose to use \sys{LambdaMART}~\cite{wu2010adapting}, as it is naturally a ranking problem and forests of boosted decision trees have been very successful lately.

\begin{table}[t]
\centering
\begin{tabular}{l p{14cm}}
\toprule
\tf{\#} & \tf{Feature} \\
\midrule
1 & Whether entity $e$ occurs in the passage. \\
2 & Whether entity $e$ occurs in the question. \\
3 & The \tf{frequency} of entity $e$ in the passage. \\
4 & The \tf{first position} of occurrence of entity $e$ in the passage. \\
5 & \tf{Word distance}: we align the placeholder with each occurrence of entity $e$, and compute the average minimum distance of each non-stop question word from the entity in the passage. \\
6 & \tf{Sentence co-occurrence}: whether entity $e$ co-occurs with another entity or verb that appears in the question, in some sentence of the passage. \\
7 & \tf{$n$-gram exact match}: whether there is an exact match between the text surrounding the placeholder and the text surrounding entity $e$. We have features for all combinations of matching left and/or right one or two words.
\\
8 & \tf{Dependency parse match}: we dependency parse both the question and all the sentences in the passage, and extract an indicator feature of whether $w \xrightarrow{r} \text{@placeholder}$ and $w \xrightarrow{r} e$ are both found; similar features are constructed for $\text{@placeholder} \xrightarrow{r} w$ and $e \xrightarrow{r} w$. \\
\bottomrule
\end{tabular}
\longcaption{Features used in our entity-centric classifier}{\label{tab:classifier-features}Features used in our entity-centric classifier in \newcite{chen2016thorough}.}
\end{table}

The key question left is how we can build useful feature vectors from the passage $p$, the question $q$ and each entity $e$. Table~\ref{tab:classifier-features} lists the 8 sets of features that we proposed for the \sys{CNN/Daily Mail} task. As shown in the table, these features are carefully designed and characterize information about the entity (e.g., its frequency, its position, and whether it is a question/passage word) and how it aligns with the passage/question (e.g., co-occurrence, distance, linear and syntactic matching). Some features (\#6 and \#8) also rely on linguistic tools such as dependency parsing and part-of-speech tagging (for deciding whether a word is a verb or not). Generally speaking, for non-neural models, how to construct a useful set of features always remains a challenge. Useful features need to be informative and well-tailored to specific tasks, while not being too sparse to generalize well from the training set. We argued before in Section~\ref{sec:ml-approaches} that this is a common problem for most feature-based models. Also, using off-the-shelf linguistic tools makes the model more expensive, and the final performance depends on the accuracy of these annotations.

\newcite{rajpurkar2016squad} and \newcite{joshi2017triviaqa} also attempted to build feature-based models for the \sys{SQuAD} and \sys{TriviaQA} datasets respectively. The models are similar in spirit to ours, except that for these span prediction tasks, they need to first determine a set of possible answers. For \sys{SQuAD}, \newcite{rajpurkar2016squad} consider all the constituents in parses generated by Stanford CoreNLP~\cite{manning2014stanford} as candidate answers, while for \sys{TriviaQA}, \newcite{joshi2017triviaqa} consider all $n$-grams ($1 \leq n \leq 5$) that occur in the sentences which contain at least one word in common with the question. They also tried to add more lexicalized features and labels from constituency parses. Other attempts have been made for multiple choice problems, such as \cite{wang2015machine} for the \sys{MCTest} dataset, in which a rich set of features is used, including semantic frames, word embeddings and coreference resolution. We will demonstrate the empirical results of these feature-based classifiers and compare them to the neural models in Section~\ref{sec:sar-experiments}.

================================================
FILE: chapters/rc_models/intro.tex
================================================

%!TEX root = ../../thesis.tex

% \section{Introduction}

In this chapter, we will cover the essence of neural network models: from the basic building blocks to more recent advances. Before delving into the details of neural models, we give a brief introduction to non-neural, feature-based models for reading comprehension in Section~\ref{sec:feature-models}. In particular, we describe a model that we built in \newcite{chen2016thorough}.
We hope this will give readers a better sense of how these two approaches differ fundamentally.

In Section~\ref{sec:sar}, we present a neural approach to reading comprehension called \sys{The Stanford Attentive Reader}, which we first proposed in \newcite{chen2016thorough} for cloze style reading comprehension tasks, and later adapted to the span prediction problems of \sys{SQuAD} \cite{chen2017reading}. We first briefly review the basic building blocks of modern neural NLP models, and then describe how our model is built on top of them. We discuss its extensions to the other types of reading comprehension problems at the end.

Next, we present the empirical results of our model on the \sys{CNN/Daily Mail} and \sys{SQuAD} datasets, and provide more implementation details, in Section~\ref{sec:sar-experiments}. We further conduct careful error analyses to help us better understand: 1) which components are most important for the final performance; and 2) where the neural models excel empirically compared to non-neural, feature-based models. Finally, we summarize recent advances in neural reading comprehension in Section~\ref{sec:advances}.

================================================
FILE: chapters/rc_models/sar.tex
================================================

%!TEX root = ../../thesis.tex

\section{A Neural Approach: The Stanford Attentive Reader}
\label{sec:sar}

\subsection{Preliminaries}
In the following, we outline a minimal set of elements and key ideas which form the basis of modern neural NLP models. For more details, we refer readers to \cite{cho2015natural,goldberg2017neural}.

\subsubsection*{Word embeddings}
The first key idea is to represent words as low-dimensional (e.g., 300-dimensional), real-valued vectors. Before the deep learning era, it was common to represent a word as an index into the vocabulary, which is a notational variant of using one-hot word vectors: each word is represented as a high-dimensional, sparse vector where only the entry of that word is 1 and all the other entries are 0's:
\begin{eqnarray*}
\mf{v}_{\text{car}} = [0, 0, \ldots, 0, 0, 1, 0, \ldots, 0]^{\intercal} \\
\mf{v}_{\text{vehicle}} = [0, 1, \ldots, 0, 0, 0, 0, \ldots, 0]^{\intercal}
\end{eqnarray*}
The biggest problem with these sparse vectors is that they don't encode any semantic similarity between words: for any pair of distinct words $a, b$, $\cos(\mf{v}_a, \mf{v}_b) = 0$. Low-dimensional word embeddings effectively alleviate this problem, as similar words can be encoded as similar vectors in the geometric space: $\cos(\mf{v}_{\text{car}}, \mf{v}_{\text{vehicle}}) > \cos(\mf{v}_{\text{car}}, \mf{v}_{\text{man}})$.
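This inequality is easy to verify numerically. The following two-line Python check uses toy vectors standing in for real pre-trained embeddings (the numbers are arbitrary assumptions, chosen only so that ``car'' and ``vehicle'' point in similar directions):
\begin{verbatim}
import torch
cos = torch.nn.functional.cosine_similarity
car = torch.tensor([0.8, 0.1, 0.3])      # toy embeddings, not GloVe
vehicle = torch.tensor([0.7, 0.2, 0.4])
man = torch.tensor([0.1, 0.9, 0.2])
assert cos(car, vehicle, dim=0) > cos(car, man, dim=0)
\end{verbatim}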
Such word embeddings can be learned effectively from large unlabeled text corpora, based on the assumption that words occurring in similar contexts tend to have similar meanings (a.k.a.\ the \ti{distributional hypothesis}). Indeed, learning word embeddings from text has a long history, and was finally popularized by recent scalable algorithms and released sets of pre-trained word embeddings such as \sys{word2vec}~\cite{mikolov2013distributed}, \sys{glove}~\cite{pennington2014glove} and \sys{fasttext}~\cite{bojanowski2017enriching}. They have become a mainstay of modern NLP systems.

\subsubsection*{Recurrent neural networks}

The second important idea is the use of recurrent neural networks (RNNs) to model sentences or paragraphs in NLP. \ti{Recurrent neural networks} are a class of neural networks suitable for handling sequences of variable length. More concretely, they apply a parameterized function recursively on a sequence $\mf{x}_1, \ldots, \mf{x}_n$:
\begin{equation}
\mf{h}_t = f(\mf{h}_{t-1}, \mf{x}_t; \Theta)
\end{equation}
For NLP applications, we represent a sentence or a paragraph as a sequence of words, where each word is transformed into a vector (usually through pre-trained word embeddings): $\mf{x} = \mf{x}_1, \mf{x}_2, \ldots, \mf{x}_n \in \R^d$, and $\mf{h}_t \in \R^h$ can be used to model the contextual information of $\mf{x}_{1:t}$.

Vanilla RNNs take the form of
\begin{equation}
\mf{h}_t = \tanh(\mf{W}^{hh}\mf{h}_{t-1} + \mf{W}^{hx}\mf{x}_t + \mf{b}),
\end{equation}
where $\mf{W}^{hh} \in \R^{h \times h}, \mf{W}^{hx} \in \R^{h\times d}$, $\mf{b} \in \R^h$ are the parameters to be learned. To ease optimization, many variants of RNNs have been proposed. Among them, long short-term memory networks (LSTMs)~\cite{hochreiter1997} and gated recurrent units (GRUs)~\cite{cho2014learning} are the most commonly used. Arguably, the LSTM is still the most competitive RNN variant for NLP applications today, and it is also our default choice for the neural models that we will describe. Mathematically, LSTMs can be formulated as follows:
\begin{eqnarray}
\mf{i}_t & = & \sigma(\mf{W}^{ih}\mf{h}_{t-1} + \mf{W}^{ix}\mf{x}_t + \mf{b}^{i}) \\
\mf{f}_t & = & \sigma(\mf{W}^{fh}\mf{h}_{t-1} + \mf{W}^{fx}\mf{x}_t + \mf{b}^{f}) \\
\mf{o}_t & = & \sigma(\mf{W}^{oh}\mf{h}_{t-1} + \mf{W}^{ox}\mf{x}_t + \mf{b}^{o}) \\
\mf{g}_t & = & \tanh(\mf{W}^{gh}\mf{h}_{t-1} + \mf{W}^{gx}\mf{x}_t + \mf{b}^{g}) \\
\mf{c}_t & = & \mf{f}_t \odot \mf{c}_{t-1} + \mf{i}_t \odot \mf{g}_t \\
\mf{h}_t & = & \mf{o}_t \odot \tanh(\mf{c}_t),
\end{eqnarray}
where $\mf{W}^{ih}, \mf{W}^{fh}, \mf{W}^{oh}, \mf{W}^{gh} \in \R^{h \times h}$, $\mf{W}^{ix}, \mf{W}^{fx}, \mf{W}^{ox}, \mf{W}^{gx} \in \R^{h \times d}$ and $\mf{b}^{i}, \mf{b}^{f}, \mf{b}^{o}, \mf{b}^{g} \in \R^h$ are the parameters to be learned.

Finally, a useful elaboration of an RNN is a \ti{bidirectional RNN}. The idea is simple: for a sentence or a paragraph $\mf{x} = \mf{x}_1, \ldots, \mf{x}_n$, a forward RNN is run from left to right, and then another backward RNN is run from right to left:
\begin{eqnarray}
\overrightarrow{\mf{h}}_t & = & f(\overrightarrow{\mf{h}}_{t-1}, \mf{x}_t; \overrightarrow{\Theta}), \quad t = 1, \ldots, n\\
\overleftarrow{\mf{h}}_t & = & f(\overleftarrow{\mf{h}}_{t+1}, \mf{x}_t; \overleftarrow{\Theta}), \quad t = n, \ldots, 1
\end{eqnarray}
We define $\mf{h}_t = [\overrightarrow{\mf{h}}_t; \overleftarrow{\mf{h}}_t] \in \R^{2h}$, the concatenation of the hidden vectors from the RNNs in both directions.
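As a quick sanity check of the dimensions involved, here is a minimal PyTorch sketch of a bidirectional LSTM encoder (all sizes are illustrative):

\begin{verbatim}
# Encode a sequence of word embeddings with a bidirectional LSTM.
import torch
import torch.nn as nn

d, h, n = 300, 128, 20              # embedding dim, hidden dim, length
x = torch.randn(1, n, d)            # one sequence x_1, ..., x_n

bilstm = nn.LSTM(input_size=d, hidden_size=h,
                 bidirectional=True, batch_first=True)
out, _ = bilstm(x)                  # (1, n, 2h)
# out[0, t] is the concatenation [h_t(forward); h_t(backward)].
print(out.shape)                    # torch.Size([1, 20, 256])
\end{verbatim}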
Representations of this kind usefully encode information from both the left context and the right context, and serve as a general-purpose, trainable feature extractor for many NLP tasks.

\subsubsection*{Attention mechanism}

The third important component is the attention mechanism. It was first introduced in sequence-to-sequence (seq2seq) models \cite{sutskever2014sequence} for neural machine translation \cite{bahdanau2015neural,luong2015effective}, and has later been extended to other NLP tasks. The key idea is as follows: to predict the sentiment of a sentence, or to translate a sentence from one language to another, we usually apply a recurrent neural network to encode the sentence (or the source sentence for machine translation) into $\mf{h}_1, \mf{h}_2, \ldots, \mf{h}_n$, and use the last time step $\mf{h}_n$ to predict the sentiment label or the first word in the target language:
\begin{equation}
P(Y = y) = \frac{\exp(\mf{W}_y\mf{h}_n)}{\sum_{y'}{\exp\left(\mf{W}_{y'}\mf{h}_n\right)}}
\end{equation}
This requires the model to compress all the necessary information of a sentence into a fixed-length vector, which creates an information bottleneck that limits performance. An attention mechanism is designed to solve this problem: instead of squashing all the information into the last hidden vector, it looks at the hidden vectors at all time steps and aggregates them adaptively:
\begin{eqnarray}
\alpha_i & = & \frac{\exp\left(g(\mf{h}_i, \mf{w}; \Theta_g)\right)}{\sum_{i'=1}^{n}\exp\left(g(\mf{h}_{i'}, \mf{w}; \Theta_g)\right)} \label{eq:attention} \\
\mf{c} & = & \sum_{i=1}^{n}{\alpha_i \mf{h}_i} \label{eq:context-vector}
\end{eqnarray}
Here $\mf{w}$ can be a task-specific vector learned during training, or the current target hidden state in machine translation, and $g$ is a parametric function which can be chosen in various ways, such as a dot product, a bilinear product, or one hidden layer of an MLP:
\begin{eqnarray}
g_{\text{dot}}(\mf{h}_i, \mf{w}) &=& {\mf{h}_i}^{\intercal}\mf{w} \\
g_{\text{bilinear}}(\mf{h}_i, \mf{w}) &=& {\mf{h}_i}^\intercal\mf{W}\mf{w} \\
g_{\text{MLP}}(\mf{h}_i, \mf{w}) &=& {\mf{v}}^\intercal\tanh(\mf{W}^h\mf{h}_i + \mf{W}^w\mf{w}) \label{eq:mlp-att}
\end{eqnarray}
Roughly, an attention mechanism computes a similarity score for each $\mf{h}_i$, and then applies a softmax function which returns a discrete probability distribution over all the time steps. Thus $\alpha$ essentially captures which parts of the sentence are relevant, while $\mf{c}$ aggregates over all the time steps with a weighted sum and can be used for the final prediction. We will not go into more detail here; interested readers are referred to \newcite{bahdanau2015neural} and \newcite{luong2015effective}.

Attention mechanisms have proven widely effective in numerous applications and have become an integral part of neural NLP models. Recently, \newcite{parikh2016decomposable} and \newcite{vaswani2017attention} showed that attention mechanisms do not have to be used in conjunction with recurrent neural networks, and can be built purely on top of word embeddings and feed-forward networks, provided only minimal sequence information. This class of models usually requires fewer parameters and is more parallelizable and scalable --- in particular, the \sys{Transformer} model proposed in \newcite{vaswani2017attention} has become a recent trend, and we will discuss it further in Section~\ref{sec:alt-lstms}.
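The computation in Equations~\ref{eq:attention} and~\ref{eq:context-vector} is only a few lines of code. Below is a minimal NumPy sketch using the dot-product and bilinear scoring functions (sizes are illustrative):

\begin{verbatim}
# Score each hidden state against w, softmax-normalize, and take a
# weighted sum to obtain the context vector c.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(H, w, W=None):
    """H: (n, h) hidden states; w: (h,) query vector."""
    scores = H @ w if W is None else H @ (W @ w)   # g_dot / g_bilinear
    alpha = softmax(scores)                        # attention weights
    return alpha, alpha @ H                        # (alpha, c)

H, w = np.random.randn(10, 64), np.random.randn(64)
alpha, c = attend(H, w)         # alpha sums to 1; c summarizes H
\end{verbatim}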
\subsection{The Model}

At this point, we are equipped with all the building blocks. How can we build effective neural models out of them for reading comprehension? What are the key ingredients? Next, we introduce our model: the \sys{Stanford Attentive Reader}. Our model is inspired by the \sys{Attentive Reader} described in \newcite{hermann2015teaching} and other concurrent work, with the goal of making the model simple yet powerful. We first describe its full form for span prediction problems, as introduced in \newcite{chen2017reading}, and later discuss its other variants.

\begin{figure}[t]
\begin{center}
\includegraphics[height=8cm]{img/drqa_reader.pdf}
\end{center}
\longcaption{A full model of \sys{Stanford Attentive Reader}}{\label{fig:sar} A full model of \sys{Stanford Attentive Reader}. Image courtesy: \\ \href{https://web.stanford.edu/~jurafsky/slp3/23.pdf}{https://web.stanford.edu/~jurafsky/slp3/23.pdf}.}
\end{figure}

Let us first recap the setting of span-based reading comprehension: given a single passage $p$ consisting of $l_p$ tokens $(p_1, p_2, \ldots, p_{l_p})$ and a question $q$ consisting of $l_q$ tokens $(q_1, q_2, \ldots, q_{l_q})$, the goal is to predict a span $(a_{\text{start}}, a_{\text{end}})$, where $1 \leq a_{\text{start}} \leq a_{\text{end}} \leq l_p$, such that the corresponding string $p_{a_{\text{start}}}, p_{a_{\text{start}} + 1}, \ldots, p_{a_{\text{end}}}$ gives the answer to the question.

The full model is illustrated in Figure~\ref{fig:sar}. At a high level, the model first builds a vector representation for the question and a vector representation for each token in the passage. It then computes a similarity score between the question and each passage word in context, and uses these question-passage similarity scores to predict the starting and ending positions of the answer span. The model builds on top of low-dimensional, pre-trained word embeddings for each word in the passage and question (optionally augmented with linguistic annotations). All the parameters for passage/question encoding and the similarity functions are optimized jointly for the final answer prediction. Let us go into the details of each component.

\subsubsection*{Question encoding}
\label{sec:question-encoding}

The question encoding is relatively simple: we first map each question word $q_i$ to its word embedding $\mf{E}(q_i) \in \R^d$, and then apply a bidirectional LSTM on top of them, finally obtaining:
\begin{equation}
\mf{q}_{1}, \mf{q}_2, \ldots, \mf{q}_{l_q} = \text{BiLSTM}(\mf{E}(q_1), \mf{E}(q_2), \ldots, \mf{E}(q_{l_q}); \Theta^{(q)}) \in \R^{h}
\end{equation}
We then aggregate these hidden units into one single vector through an attention layer:
\begin{eqnarray}
b_j & = & \frac{\exp({\mf{w}^{q}}^\intercal \mf{q}_j)}{\sum_{j'}{\exp({\mf{w}^{q}}^\intercal \mf{q}_{j'})}} \\
\mf{q} & = & \sum_j{b_j \mf{q}_j}
\end{eqnarray}
Here, $b_j$ measures the importance of each question word and $\mf{w}^{q} \in \R^h$ is a weight vector to be learned, so $\mf{q} \in \R^h$ is the final vector representation of the question. A simpler (and also common) alternative is to represent $\mf{q}$ as the concatenation of the last hidden vectors from the LSTMs in both directions. However, we find empirically that adding this attention layer helps consistently, as it puts more weight on the more relevant question words.
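A minimal PyTorch sketch of this question encoder follows (a BiLSTM plus attentive pooling; all sizes are illustrative, and the BiLSTM output dimension is written explicitly as $2h$ here):

\begin{verbatim}
# Question encoder: BiLSTM over word embeddings, then an attention
# layer with a learned vector w^q pools the states into one vector q.
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, d=300, h=64):
        super().__init__()
        self.bilstm = nn.LSTM(d, h, bidirectional=True, batch_first=True)
        self.w_q = nn.Parameter(torch.randn(2 * h))      # w^q

    def forward(self, emb):                     # emb: (1, l_q, d)
        states, _ = self.bilstm(emb)            # (1, l_q, 2h)
        b = torch.softmax(states @ self.w_q, dim=-1)     # b_j
        return (b.unsqueeze(-1) * states).sum(dim=1)     # q: (1, 2h)

q = QuestionEncoder()(torch.randn(1, 12, 300))  # a 12-word question
\end{verbatim}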
\subsubsection*{Passage encoding}

Passage encoding is similar: we also first form an input representation $\tilde{\mf{p}}_i \in \R^{\tilde{d}}$ for each word in the passage and pass them through another bidirectional LSTM:
\begin{equation}
\label{eq:passage-lstm}
\mf{p}_{1}, \mf{p}_2, \ldots, \mf{p}_{l_p} = \text{BiLSTM}\left(\tilde{\mf{p}}_1, \tilde{\mf{p}}_2, \ldots, \tilde{\mf{p}}_{l_p}; \Theta^{(p)}\right) \in \R^{h}
\end{equation}
The input representation $\tilde{\mf{p}}_i$ can be divided into two categories: one encodes \ti{the properties of each word itself}, and the other encodes \ti{its relevance with respect to the question}.

For the first category, in addition to the word embedding $f_{emb}(p_i) = \mf{E}(p_i) \in \R^d$, we also add some manual features which reflect the properties of word $p_i$ in its context, namely its part-of-speech (POS) and named entity recognition (NER) tags and its (normalized) term frequency (TF): $f_{token}(p_i) = \left(\text{POS}(p_i), \text{NER}(p_i), \text{TF}(p_i)\right)$. For the POS and NER tags, we run off-the-shelf tools and convert them into one-hot representations, as the sets of tags are small. The TF feature is a real-valued number: the number of times the word appears in the passage, divided by the total number of words.

For the second category, we consider two types of representations:
\begin{itemize}
\item \tf{Exact match}: $f_{exact\_match}(p_i) = \mathbb{I}(p_i \in q) \in \R$. In practice, we use three simple binary features, indicating whether $p_i$ can be exactly matched to a question word in $q$, either in its original, lowercase or lemma form.
\item \tf{Aligned question embeddings}: The exact match features encode a hard alignment between question words and passage words. Aligned question embeddings aim to encode a soft notion of alignment between words in the word embedding space, so that similar (but non-identical) words, e.g., \textit{car} and \textit{vehicle}, can be well aligned. Concretely, we use
\begin{equation}
\label{eq:aligned_question}
f_{align}(p_i) = \sum_j{a_{i, j} \mf{E}(q_j)}
\end{equation}
where $a_{i, j}$ are attention weights which capture the similarity between $p_i$ and each question word $q_j$, and $\mf{E}(q_j) \in \R^d$ is the word embedding of each question word. $a_{i, j}$ is computed by the dot product between nonlinear mappings of word embeddings:
\begin{equation}
\label{eq:aligned_question_attention}
a_{i, j} = \frac{\exp\left(\text{MLP}(\mf{E}(p_i))^{\intercal} \text{MLP}(\mf{E}(q_{j}))\right)}{\sum_{j'}{\exp\left(\text{MLP}(\mf{E}(p_i)) ^{\intercal} \text{MLP}(\mf{E}(q_{j'}))\right)}},
\end{equation}
where $\text{MLP}(\mf{x}) = \max(0, \mf{W}_{\text{MLP}}\mf{x} + \mf{b}_{\text{MLP}})$ is a single dense layer with a ReLU nonlinearity, $\mf{W}_{\text{MLP}} \in \R^{d \times d}$ and $\mf{b}_{\text{MLP}} \in \R^d$.
\end{itemize}
Finally, we simply concatenate the four components to form the input representation:
\begin{equation}
\tilde{\mf{p}}_i = (f_{emb}(p_i), f_{token}(p_i), f_{exact\_match}(p_i), f_{align}(p_i)) \in \R^{\tilde{d}}
\end{equation}
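Of these components, the aligned question embedding (Equation~\ref{eq:aligned_question}) is the least standard, so here is a minimal PyTorch sketch of it (dimensions illustrative):

\begin{verbatim}
# Aligned question embeddings: attend from each passage word over the
# question words, using a shared one-layer MLP with ReLU.
import torch
import torch.nn as nn

d = 300
mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # MLP(x) = max(0, Wx + b)

def f_align(E_p, E_q):
    """E_p: (l_p, d) passage embeddings; E_q: (l_q, d) question embeddings."""
    scores = mlp(E_p) @ mlp(E_q).T       # (l_p, l_q) similarity logits
    a = torch.softmax(scores, dim=1)     # a_{i,j}, normalized over j
    return a @ E_q                       # (l_p, d), one vector per p_i

aligned = f_align(torch.randn(30, d), torch.randn(12, d))
\end{verbatim}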
\subsubsection*{Answer prediction}

We now have vector representations for both the passage, $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p} \in \R^h$, and the question, $\mf{q} \in \R^h$, and the goal is to predict the span that is most likely to be the correct answer. We employ the idea of the attention mechanism again and train two separate classifiers: one predicts the start position of the span and the other predicts the end position. More specifically, we use a bilinear product to capture the similarity between $\mf{p}_i$ and $\mf{q}$:
\begin{eqnarray}
P^{(\text{start})}(i) & = & \frac{\exp\left({\mf{p}_i}^\intercal \mf{W}^{(\text{start})} \mf{q}\right)}{\sum_{i'}\exp\left({\mf{p}_{i'}}^\intercal \mf{W}^{(\text{start})} \mf{q}\right)} \\
P^{(\text{end})}(i) & = & \frac{\exp\left({\mf{p}_i}^\intercal \mf{W}^{(\text{end})} \mf{q}\right)}{\sum_{i'}\exp\left({\mf{p}_{i'}}^\intercal \mf{W}^{(\text{end})} \mf{q}\right)},
\end{eqnarray}
where $\mf{W}^{(\text{start})}, \mf{W}^{(\text{end})} \in \R^{h \times h}$ are additional parameters to be learned. This differs slightly from the standard formulation of attention, as we do not take a weighted sum of the vector representations; instead, we use the normalized weights directly to make predictions. We use bilinear products because we find them to work well empirically.

\subsubsection*{Training and inference}

The final training objective is to minimize the cross-entropy loss, summed over all training examples:
\begin{equation}
\mathcal{L} = - \sum \log{P^{(\text{start})}(a_{\text{start}})} - \sum \log{P^{(\text{end})}(a_{\text{end}})},
\end{equation}
and all the parameters $\Theta = \Theta^{(p)}, \Theta^{(q)}, \mf{w}^{q}, \mf{W}_{\text{MLP}}, \mf{b}_{\text{MLP}}, \mf{W}^{(\text{start})}, \mf{W}^{(\text{end})}$ are optimized jointly with stochastic gradient methods.\footnote{We exclude word embeddings here, but it is also common to treat all or a subset of the word embeddings as parameters and fine-tune them during training.}

During inference, we choose the span $p_i, \ldots, p_{i'}$ such that $i \leq i' \leq i + max\_len$ and $P^{(\text{start})}(i) \times P^{(\text{end})}(i')$ is maximized, where $max\_len$ is a pre-defined constant (e.g., 15) which controls the maximum length of the answer.
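A minimal NumPy sketch of this prediction step, from the bilinear scores to the length-constrained span search (sizes illustrative):

\begin{verbatim}
# Span prediction: bilinear start/end distributions, then the best
# span (i, i') with i <= i' <= i + max_len.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def predict_span(P, q, W_start, W_end, max_len=15):
    """P: (l_p, h) passage vectors; q: (h,) question vector."""
    p_start = softmax(P @ (W_start @ q))    # P^(start)(i)
    p_end = softmax(P @ (W_end @ q))        # P^(end)(i')
    best, span = -1.0, (0, 0)
    for i in range(len(P)):
        for j in range(i, min(i + max_len + 1, len(P))):
            if p_start[i] * p_end[j] > best:
                best, span = p_start[i] * p_end[j], (i, j)
    return span

l_p, h = 40, 128
span = predict_span(np.random.randn(l_p, h), np.random.randn(h),
                    np.random.randn(h, h), np.random.randn(h, h))
\end{verbatim}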
\subsection{Extensions}

In the following, we give a few variants of the \sys{Stanford Attentive Reader} for the other types of reading comprehension problems. All these models follow the same passage encoding and question encoding steps as described above, so we again have $\mf{p}_1, \mf{p}_2, \ldots, \mf{p}_{l_p} \in \R^h$ and $\mf{q} \in \R^h$; we only discuss the answer prediction component and the training objectives.

\paragraph{\tf{Cloze style.}} Similarly, we can compute attention weights using a bilinear product between the question and all the words in the passage, and then compute an output vector $\mf{o}$ as a weighted sum of all the passage representations:
\begin{eqnarray}
\alpha_i & = & \frac{\exp\left({\mf{p}_i}^\intercal \mf{W} \mf{q}\right)}{\sum_{i'}\exp\left({\mf{p}_{i'}}^\intercal \mf{W} \mf{q}\right)} \\
\mf{o} & = & \sum_{i}{\alpha_i \mf{p}_i} \label{eqn:output_vector}
\end{eqnarray}
The output vector $\mf{o}$ can then be used to predict the missing word or entity:
\begin{equation}
P(Y = e \mid p, q) = \frac{\exp(\mf{W}^{(a)}_e \mf{o})}{\sum_{e' \in \mathcal{E}}\exp\left(\mf{W}^{(a)}_{e'} \mf{o}\right)},
\end{equation}
where $\mathcal{E}$ denotes the candidate set of entities or words. It is straightforward to adopt a negative log-likelihood objective for training, and to choose the $e \in \mathcal{E}$ which maximizes $\mf{W}^{(a)}_{e} \mf{o}$ during prediction. This model has been studied in our earlier paper \cite{chen2016thorough} for the \sys{CNN/Daily Mail} dataset and in \cite{onishi2016did} for the \sys{Who-Did-What} dataset.

\paragraph{\tf{Multiple choice.}} In this setting, we are given $k$ hypothesized answers $\mathcal{A} = \{a_1, \ldots, a_k\}$, and we encode each of them into a vector $\mf{a}_i$ by applying a third BiLSTM, similar to the question encoding step. We can then compute the output vector $\mf{o}$ as in Equation~\ref{eqn:output_vector} and compare it with each hypothesized answer vector $\mf{a}_i$ through another bilinear similarity function:
\begin{equation}
P(Y = i \mid p, q) = \frac{\exp({\mf{a}_i}^\intercal \mf{W}^{(a)} \mf{o})}{\sum_{i'=1, \ldots, k}\exp\left({\mf{a}_{i'}}^\intercal \mf{W}^{(a)} \mf{o}\right)}
\end{equation}
The cross-entropy loss is again used for training. This model has been studied in \newcite{lai2017race} for the \sys{RACE} dataset.

\paragraph{\tf{Free-form answer.}} For this type of problem, the answer is not restricted to a single entity or a span in the passage, and can be any sequence of words. The most common solution is to incorporate an LSTM sequence decoder into the current framework. In more detail, assume the answer string is $a = (a_1, a_2, \ldots, a_{l_a})$, where a special ``end-of-sequence'' token $\langle\text{eos}\rangle$ is added to the end of each answer. We can again compute the output vector $\mf{o}$ as in Equation~\ref{eqn:output_vector}. The decoder then generates one word at a time, so the conditional probability can be decomposed as:
\begin{equation}
P(a \mid p, q) = P(a \mid \mf{o}) = \prod_{j = 1}^{l_a}P(a_j \mid a_{<j}, \mf{o})
\end{equation}
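A minimal PyTorch sketch of such a decoder follows (greedy decoding; the start-of-sequence and $\langle\text{eos}\rangle$ token ids and all sizes are illustrative assumptions):

\begin{verbatim}
# Autoregressive answer decoding: an LSTM cell initialized with o
# emits one token at a time until <eos>.
import torch
import torch.nn as nn

V, d, h = 10000, 300, 128                  # vocab / embedding / hidden sizes
emb = nn.Embedding(V, d)
cell = nn.LSTMCell(d, h)
proj = nn.Linear(h, V)                     # logits over the vocabulary

def decode(o, sos=1, eos=2, max_steps=20):
    """o: (1, h) output vector; returns a list of token ids."""
    hx, cx = o, torch.zeros(1, h)          # initialize the decoder with o
    token, answer = torch.tensor([sos]), []
    for _ in range(max_steps):
        hx, cx = cell(emb(token), (hx, cx))
        token = proj(hx).argmax(dim=-1)    # greedy choice of a_j
        if token.item() == eos:
            break
        answer.append(token.item())
    return answer

answer_ids = decode(torch.randn(1, h))
\end{verbatim}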