[
  {
    "path": "README.md",
    "content": "# Dakshina Dataset\n\nThe Dakshina dataset is a collection of text in both Latin and native scripts\nfor 12 South Asian languages. For each language, the dataset includes a large\ncollection of native script Wikipedia text, a romanization lexicon which\nconsists of words in the native script with attested romanizations, and some\nfull sentence parallel data in both a native script of the language and the\nbasic Latin alphabet.\n\nDataset URL:\n[https://github.com/google-research-datasets/dakshina](https://github.com/google-research-datasets/dakshina)\n\nIf you use or discuss this dataset in your work, please cite our paper (bibtex\ncitation below).  A PDF link for the paper can be found at\n[https://www.aclweb.org/anthology/2020.lrec-1.294](https://www.aclweb.org/anthology/2020.lrec-1.294).\n\n```\n@inproceedings{roark-etal-2020-processing,\n    title = \"Processing {South} {Asian} Languages Written in the {Latin} Script:\n    the {Dakshina} Dataset\",\n    author = \"Roark, Brian and\n      Wolf-Sonkin, Lawrence and\n      Kirov, Christo and\n      Mielke, Sabrina J. 
and\n      Johny, Cibu and\n      Demir{\c{s}}ahin, I{\c{s}}in and\n      Hall, Keith\",\n    booktitle = \"Proceedings of The 12th Language Resources and Evaluation Conference (LREC)\",\n    year = \"2020\",\n    url = \"https://www.aclweb.org/anthology/2020.lrec-1.294\",\n    pages = \"2413--2423\"\n}\n```\n\n## Data links\n\nFile | Download | Version | Date | Notes\n---- | :------: | :-------: | :--------: | :------\n**dakshina_dataset_v1.0.tar** | [link](https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar) | 1.0 | 05/27/2020 | Initial data release\n\n\n## Data Organization\n\nThere are 12 languages represented in the dataset: Bangla (`bn`), Gujarati\n(`gu`), Hindi (`hi`), Kannada (`kn`), Malayalam (`ml`), Marathi (`mr`), Punjabi\n(`pa`), Sindhi (`sd`), Sinhala (`si`), Tamil (`ta`), Telugu (`te`) and Urdu\n(`ur`).\n\nAll data is derived from Wikipedia text. Each language has its own\ndirectory, in which there are three subdirectories:\n\n### Native Script Wikipedia {#native}\n\nIn the `native_script_wikipedia` subdirectories there are native script text\nstrings from Wikipedia. The scripts are:\n\n*   For `bn`, `gu`, `kn`, `ml`, `si`, `ta` and `te`, the scripts are named the\n    same as the language,\n*   `hi` and `mr` are in the Devanagari script,\n*   `pa` is in the Gurmukhi script, and\n*   `ur` and `sd` are in Perso-Arabic scripts.\n\nAll of the scripts other than the Perso-Arabic scripts are Brahmic. This data\nconsists of Wikipedia strings that have been filtered (see\n[below](#native-preprocessing)) to include only strings primarily in the\nUnicode codeblock for the script, plus whitespace and, in some cases, commonly\nused ASCII punctuation and digits. The pages from which the strings come\nhave been split into training and validation sets, so that no strings in the\ntraining partition come from Wikipedia pages from which validation strings are\nextracted. 
Files have been gzipped, and have accompanying information that\npermits linking strings back to their original Wikipedia pages. For example, the\nfirst line of `mr/native_script_wikipedia/mr.wiki-filt.train.text.shuf.txt.gz`\ncontains:\n\n```\nकोल्हापुरात मिळणारा तांबडा पांढरा रस्सा कुठेच मिळत नाही.\n```\n\n### Lexicons {#lexicons}\n\nIn the `lexicons` subdirectories there are lexicons of words in the native\nscript of each language alongside human-annotated possible romanizations for each\nword. The words in the lexicons are all sampled from words that occurred more\nthan once in the Wikipedia training sets, in the `native_script_wikipedia`\nsubdirectories, and most received romanizations from more than one annotator,\nthough different annotators' romanizations may agree. These are in a format similar to\npronunciation lexicons, i.e., a single (word, romanization) pair per line in a TSV\nfile, with an additional column indicating the number of attestations for the\npair. For example, the first two lines of the file\n`pa/lexicons/pa.translit.sampled.train.tsv` contain:\n\n<!-- mdformat off(Don't turn TSV tabs into spaces) -->\n\n```tsv\nਅਂਦਾਜਾ\tandaaja\t1\nਅਂਦਾਜਾ\tandaja\t2\n```\n\n<!-- mdformat on -->\n\ni.e., two different possible romanizations for the Punjabi word `ਅਂਦਾਜਾ`, one\npossible romanization (`andaaja`) attested once, the other (`andaja`) twice. For\nconvenience, each lexicon has been partitioned into training, development and\ntesting sets, with partitioning by native script word, so that words in the\ntraining set do not occur in the development or testing sets. In addition, we\n<!-- TODO(wolfsonkin): Describe how we identified lemmata and link to it. -->\nused some automated methods to identify lemmata (see below) in each word, and\nensured that lemmata in words in the development and test sets were unobserved\nin the training set. 
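
A lexicon in this format can be loaded with a few lines of Python. This is an illustrative sketch, not a tool shipped with the dataset; it relies only on the TSV layout described above (fields contain no internal spaces, so a plain whitespace split suffices):

```python
from collections import defaultdict

def read_lexicon(lines):
    # Map each native script word to its attested romanizations and
    # their counts, e.g. {'ਅਂਦਾਜਾ': {'andaaja': 1, 'andaja': 2}}.
    lexicon = defaultdict(dict)
    for line in lines:
        word, roman, count = line.split()
        lexicon[word][roman] = int(count)
    return lexicon
```

For example, `read_lexicon(open('pa/lexicons/pa.translit.sampled.train.tsv', encoding='utf-8'))` yields a dictionary whose values record the attestation count of each romanization.
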
All native script characters -- specifically, all native\nscript Unicode codepoints -- in the development and test sets are found in the\ntraining set. See below for further details on [data elicitation](#annotation)\nand [preparation](#native-preprocessing). For each language there are\n`*.train.tsv`, `*.dev.tsv` and `*.test.tsv` files in the subdirectory. For all\nlanguages except Sindhi (`sd`), there are 25,000 (native script) word types\nin the training lexicon, and 2,500 in each of the dev and test lexicons. Sindhi\nalso has 2,500 native script word types in the dev and test lexicons, but just\n15,000 in the training lexicon.\n\n### Romanized {#romanized}\n\nIn the `romanized` subdirectory, we have manually romanized full strings,\nalongside the original native script prompts for the examples. The native script\nprompts were selected from the validation sets in the `native_script_wikipedia`\nsubdirectories (see description of preprocessing\n[below](#native-preprocessing)). 10,000 strings from each native script\nvalidation set were randomly chosen to be romanized by native speaker\nannotators. For long sentences (more than 30 words), the sentences were\nsegmented into shorter fragments (by splitting in half until fragments are < 30\nwords), and each fragment romanized independently, for ease of annotation. From\nthis process, there are `*.split.tsv` and `*.rejoined.tsv`, which contain native\nscript and romanized strings in two (tab delimited) fields. (Files with\n'split' are the versions in which strings of more than 30 words were segmented;\nthose with 'rejoined' are not length segmented.) 
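
The splitting-in-half rule can be sketched as follows; this is an illustrative reconstruction of the rule as stated (halve sentences until every fragment is under 30 words), not the released annotation tooling:

```python
def segment(words, max_len=30):
    # Recursively halve a token list until every fragment has
    # fewer than max_len tokens.
    if len(words) < max_len:
        return [words]
    mid = len(words) // 2
    return segment(words[:mid], max_len) + segment(words[mid:], max_len)
```

Halving keeps fragment sizes balanced: a 70-word sentence becomes four fragments of 17 or 18 words, rather than two fragments of 30 and a remainder of 10.
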
For example, the first line of\n`hi/romanized/hi.romanized.rejoined.tsv` contains:\n\n<!-- mdformat off(Don't turn TSV tabs into spaces) -->\n\n```tsv\nजबकि यह जैनों से कम है।\tJabki yah Jainon se km hai.\n```\n\n<!-- mdformat on -->\n\nAdditionally, for convenience, we performed an automatic (white space)\ntoken-level alignment of the strings, with one aligned token per line, as well\nas an end-of-string marker `</s>`. In the case that the tokenization is not 1-1,\nmultiple tokens are left on the same line. These alignments are also provided\nwith the Latin script de-cased and punctuation removed, e.g., the first seven\nlines of the file `hi/romanized/hi.romanized.rejoined.aligned.cased_nopunct.tsv`\nare:\n\n<!-- mdformat off(Don't turn TSV tabs into spaces) -->\n\n```tsv\nजबकि\tjabki\nयह\tyah\nजैनों\tjainon\nसे\tse\nकम\tkm\nहै\thai\n</s>\t</s>\n```\n\n<!-- mdformat on -->\n\nWe also performed a validation of the romanizations by requesting that\ndifferent annotators transcribe the romanized strings into the native script of\neach language respectively (see details [below](#round-trip-validation)). The\nresulting native script transcriptions are provided\n(`*.split.validation.native.txt`) for each language, along with a file\n(`*.split.validation.edits.txt`) that provides counts of (1) the total number of\nreference characters (in the original native-script strings), (2) substitutions,\n(3) deletions and (4) insertions in the validation transcriptions. For example,\nthe first two lines of the file\n`bn/romanized/bn.romanized.split.validation.edits.txt` are:\n\n```\n LINE REF SUB DEL INS\n    1 126   3   3   0\n```\n\nwhich indicates that the first native script string in\n`bn/romanized/bn.romanized.split.tsv` has 126 characters, and there were 3\nsubstitutions, 3 deletions and 0 insertions in the native script string\ntranscribed by annotators during the validation phase. 
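
Per-string character error rates follow directly from these columns; a minimal sketch, assuming the usual CER definition of (SUB + DEL + INS) / REF:

```python
def cer_per_line(lines):
    # lines: rows of a *.validation.edits.txt file, with whitespace
    # separated columns LINE REF SUB DEL INS.
    rates = {}
    for row in lines:
        fields = row.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header row
        line_no, ref, sub, dele, ins = (int(x) for x in fields)
        rates[line_no] = (sub + dele + ins) / ref
    return rates
```

On the two example lines above, the first string's character error rate is (3 + 3 + 0) / 126, roughly 4.8%.
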
Note that the comparison\ninvolved some script normalization of visually identical sequences to minimize\nspurious errors, as described in more detail [below](#native-preprocessing). All\nlanguages fell between 3.5 and 8.5 percent character error rate on the\nvalidation text. See [below](#round-trip-validation) for further details on this\nvalidation process.\n\nFinally, for convenience, we randomly shuffled this set and divided it into\ndevelopment and test sets, each of which is broken into native and Latin script\ntext files. Thus the first line in the file\n`si/romanized/si.romanized.rejoined.dev.native.txt` is:\n\n```\nවැව්වල ඇළෙවිලි වැව ඉහත්තාව, වේල්ල ආරක්ෂා කිරිමට එකල සියල්ලෝම බැදි සිටියෝය.\n```\n\nand the first line of `si/romanized/si.romanized.rejoined.dev.roman.txt` is:\n\n```\nvevvala eleveli, veva ihatthava, vella araksha kirimata ekala siyalloma bendi sitiyaya.\n```\n\nNote that several hundred strings from the Urdu Wikipedia sample (and one from\nSindhi) were not from those languages, but rather from other languages using a\nPerso-Arabic script, e.g., Arabic, Punjabi or others. Those were excluded from\nthose sets, leaving fewer than 10,000 romanized strings.\n\n## Native script data preprocessing {#native-preprocessing}\n\nLet `$L` be the language code, one of `bn`, `gu`, `hi`, `kn`, `ml`, `mr`, `pa`,\n`sd`, `si`, `ta`, `te`, or `ur`. The native script files are in\n`$L/native_script_wikipedia`. All URLs of Wikipedia pages are included in\n`$L.wiki-full.urls.tsv.gz`. This tab delimited file includes four fields: page\nID, revision ID, base URL, and URL with revision ID.\n\nWe omitted whole pages that were any of the following:\n\n1.  redirected pages.\n2.  pages with infoboxes about settlements or jurisdictions.\n3.  pages with `state=collapsed` or `expanded` or `autocollapse`.\n4.  pages referring to `censusindia` or `en.wikipedia.org`.\n5.  pages with `wikitable`.\n6.  
pages with lists containing more than 7 items.\n\nIndices of pages omitted are given in `$L.wiki-full.omit_pages.txt.gz`.\n\nFor pages that are not omitted, we extract text and info files:\n\n*   `$L.wiki-full.text.sorted.tsv.gz`\n*   `$L.wiki-full.info.sorted.tsv.gz`\n\nText is organized by page and section within page. We then:\n\n1. Split section text by newline (leading to multiple strings per section).\n2. NFC normalize.\n3. Sentence segment using ICU sentence segmentation (leading to multiple\n   sentences per string).  The ICU sentence segmenter is initialized with\n   the locale associated with the specific language being segmented.\n\nBoth tab delimited files share the same initial 6 fields: `page_id`,\n`section_index`, `string_index` (in section), `sentence_index` (in string),\n`include_bool`, and `text_freq`, where the `include_bool` indicates whether to\ninclude the string or not (see below), and `text_freq` is the count of the full\nsection text string in the whole collection. The latter enables us to find\nrepeated strings, as a means of identifying boilerplate sections and other\ncontent to exclude.\n\nBoth files are sorted numerically (descending) by the first three fields.\n\nThe final (7th) field of `$L.wiki-full.text.sorted.tsv.gz` is the text.\n\nThe remaining fields of `$L.wiki-full.info.sorted.tsv.gz` are: (7) depth of the\nsection in the page; (8) the heading level of the section; (9) the section index\nof the parent section; (10) the number of words in the text; (11) the number of\nUnicode codepoints in the text; (12) the percentage of Unicode codepoints\nfalling in category A (described below); (13) the percentage of Unicode\ncodepoints falling in category B (described below); and (14) the section title.\n\nFor a given native script Unicode block, we define categories A and B as\nfollows.  
First, we identify a subset of codepoints as special symbols, which we call\nnon-letter symbols: non-letter ASCII codepoints; Arabic full stop;\nDevanagari Danda; any codepoint in the General Punctuation block; and any\ndigits in the current native script Unicode block. Category A consists of those\ncodepoints that (1) are outside of the native script Unicode block; and\n(2) are not in the non-letter subset of codepoints. Category B consists of all\ncodepoints within the native script Unicode block.\n\nThe above-mentioned `include_bool` is set to true (when filtering) if: the\npercentage of category A is below a threshold; the percentage of category B is\nabove a second threshold; and, finally, the percentage of whitespace-delimited\nwords that contain at least one codepoint from the current native script Unicode\nblock (and not in the non-letter subset) is above the same threshold as category\nB codepoints.\n\nFor each non-empty section title, we calculate the total number of Unicode\ncodepoints, the total number of category A codepoints, and the fraction of\ncodepoints that are category A, for all sections with that title. These\nstatistics are stored in `$L.wiki-full.nonblock.sections.tsv.gz`, which is\nsorted in descending order by total category A codepoints. Thus, the first line\nof `hi.wiki-full.nonblock.sections.tsv.gz` shows the section title with the most\ncategory A characters:\n\n<!-- mdformat off(Don't turn TSV tabs into spaces) -->\n\n```sh\n$ gzip -cd hi.wiki-full.nonblock.sections.tsv.gz | head -1\n6387096.000260\t12141241\t0.526066\tसन्दर्भ\n```\n\n<!-- mdformat on -->\n\nIt's unsurprising that a section titled `सन्दर्भ` ('references') would have so\nmuch non-codeblock text (mainly ASCII). It also illustrates why we track this\nstatistic, since we do not want to include references in the text that we are\nextracting. 
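
The category definitions and the `include_bool` test can be sketched as below. This is illustrative, not the pipeline's actual code: the Devanagari block stands in for the native script Unicode block, the 10%/85% thresholds are the figures given below for the second extraction round, and the percentages are assumed to be computed over codepoints outside the non-letter set (otherwise whitespace alone would push ordinary sentences under an 85% category B threshold):

```python
NATIVE_BLOCK = range(0x0900, 0x0980)   # Devanagari, as an example
GENERAL_PUNCT = range(0x2000, 0x2070)
NATIVE_DIGITS = range(0x0966, 0x0970)  # digits of the native block

def is_non_letter(cp):
    # Non-letter symbols: non-letter ASCII; Arabic full stop; Devanagari
    # Danda; General Punctuation; digits of the native script block.
    if cp < 128 and not chr(cp).isalpha():
        return True
    return cp in (0x06D4, 0x0964) or cp in GENERAL_PUNCT or cp in NATIVE_DIGITS

def category_fractions(text):
    # Assumes text contains at least one codepoint outside the non-letter set.
    letters = [ord(c) for c in text if not is_non_letter(ord(c))]
    frac_a = sum(cp not in NATIVE_BLOCK for cp in letters) / len(letters)
    frac_b = sum(cp in NATIVE_BLOCK for cp in letters) / len(letters)
    return frac_a, frac_b

def include_bool(text, max_a=0.10, min_b=0.85):
    frac_a, frac_b = category_fractions(text)
    words = text.split()
    native = sum(
        any(ord(c) in NATIVE_BLOCK and not is_non_letter(ord(c)) for c in w)
        for w in words)
    return frac_a <= max_a and frac_b >= min_b and native / len(words) >= min_b
```

Under these assumptions, a Devanagari sentence such as the Hindi example earlier passes all three tests, while a fully ASCII string fails both category thresholds.
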
To avoid such sections, we create a list of sections where the\naggregate percentage of category A codepoints in sections with that title is\ngreater than 20%. These omitted section titles are in\n`$L.wiki-full.nonblock.sections.list.txt.gz`.\n\nA second round of text extraction then occurs, omitting text occurring in\nthe aforementioned sections and including only individual strings that\nconsist of at least 85% category B codepoints, at most 10% category A\ncodepoints, and at least 85% of white-space delimited words containing a\nwithin-codeblock (and not non-letter) codepoint.\n\nAll text that is extracted from a given Wikipedia page is collectively\nplaced in either a training or a validation set, i.e., no strings in the\nvalidation set share a Wikipedia page with any string in the training set.\nBetween 23 and 29 thousand strings are placed in each validation set, which\nrepresents a minimum of 2.25% of the data and a maximum of 26% of the data.\n\nThe data from this second iteration of extraction is present in:\n\n*   `$L.wiki-filt.train.info.sorted.tsv.gz`\n*   `$L.wiki-filt.train.text.sorted.tsv.gz`\n*   `$L.wiki-filt.train.text.shuf.txt.gz`\n*   `$L.wiki-filt.valid.info.sorted.tsv.gz`\n*   `$L.wiki-filt.valid.text.sorted.tsv.gz`\n*   `$L.wiki-filt.valid.text.shuf.txt.gz`\n\nThe first three files are training set files; the final three are validation\nset files. The `info.sorted` and `text.sorted` files have index-sorted data\nalong the lines described above for the full set, for both training and\nvalidation sets. We additionally randomly shuffled the text from both sets; the\nshuffled text is found in the `text.shuf` files.\n\n## Annotation\n\n### Validation string selection criteria for romanization\n\nWe randomly selected 10,000 strings from the validation set detailed\n[above](#native-preprocessing) for romanization by annotators. 
As detailed\nearlier, strings with more than 30 words were segmented into shorter\nfragments for ease of annotation.\n\n### Round-trip romanization validation {#round-trip-validation}\n\nAfter eliciting manual romanizations for each of the 10,000 strings in the\nselected validation sets in each language, we validated the resulting\nromanizations via a second round of annotations, where the romanized\nstrings were provided to annotators and they were tasked with producing\nthe strings in the native script, which were then compared with the original\nstrings. To compare the strings, we performed a visual normalization of both\nthe original and validation native script strings, so that visually identical\nstrings were encoded with the same codepoints for comparison. We then\ncalculated the number of character substitutions, deletions and insertions\nfor each string in the Viterbi alignment between the visually normalized\noriginal and validation strings, including whitespace and punctuation. This,\nalong with the count of characters in the reference (visually normalized\noriginal string), allows for the calculation of character error rates.\n\n<!-- TODO(wolfsonkin): Maybe split up sections more granularly in romanized. -->\n\nAs stated [above](#romanized), the languages all fell between 3.5 and 8.5\npercent character error rate. 
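
The substitution, deletion and insertion counts come from a minimum edit distance alignment. A standard dynamic programming sketch that produces such counts over already normalized strings (illustrative, not the project's implementation):

```python
def edit_counts(ref, hyp):
    # dp[i][j] holds (total, sub, del, ins) for aligning ref[:i]
    # with hyp[:j], minimizing the total number of edits.
    m, n = len(ref), len(hyp)
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):
        t = dp[i - 1][0]
        dp[i][0] = (t[0] + 1, t[1], t[2] + 1, t[3])  # delete from ref
    for j in range(1, n + 1):
        t = dp[0][j - 1]
        dp[0][j] = (t[0] + 1, t[1], t[2], t[3] + 1)  # insert into ref
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d = dp[i - 1][j - 1]
            best = (d[0] + cost, d[1] + cost, d[2], d[3])  # match or substitute
            d = dp[i - 1][j]
            cand = (d[0] + 1, d[1], d[2] + 1, d[3])  # deletion
            if cand[0] < best[0]:
                best = cand
            d = dp[i][j - 1]
            cand = (d[0] + 1, d[1], d[2], d[3] + 1)  # insertion
            if cand[0] < best[0]:
                best = cand
            dp[i][j] = best
    total, sub, dele, ins = dp[m][n]
    return sub, dele, ins
```

For instance, `edit_counts('kitten', 'sitting')` gives 2 substitutions, 0 deletions and 1 insertion; dividing the sum of these counts by the reference length yields the character error rate reported in the edits files.
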
The error rate could not have been 0 for a variety\nof reasons:\n\n*   Some non-codeblock text was allowed in the original native script strings,\n    e.g., individual words in the Latin script, something that annotators with\n    access only to the romanizations could not recover;\n*   Digit strings are variously realized with either Latin or native script\n    digits, which is also not recoverable;\n*   In the Perso-Arabic script in particular, tokenization in native and Latin\n    scripts may be different, leading to whitespace character mismatch;\n    similarly, punctuation placement sometimes leads to different tokenization;\n*   Errors in either the original Wikipedia strings or the validation strings;\n*   Visually identical strings can be encoded with different Unicode codepoint\n    sequences, something we controlled to some extent with visual normalization,\n    but other such correspondences may remain; and\n*   Valid spelling variation exists in the languages, e.g., for English\n    loanwords, but also for common words such as \"Hindi\" in Devanagari, which\n    can be equally well realized as either `हिन्दी` or `हिंदी`.\n\nWe provide the validation strings and character edits per string to permit\nusers of the resource to potentially explore methods that take such\ninformation into account, e.g., for model evaluation.\n\n## License\nThe dataset is licensed under\n[CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/).\nAny third party content or data is provided \"As Is\" without any warranty,\nexpress or implied.\n\n## Contacts\n* roark [at] google.com\n* ckirov [at] google.com\n* wolfsonkin [at] google.com\n"
  }
]