[
  {
    "path": ".gitignore",
    "content": "data\n.idea\ndata.zip\nbert_data\nblue_plus_data\n\n### JetBrains template\n# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm\n# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839\n\n# User-specific stuff\n.idea/**/workspace.xml\n.idea/**/tasks.xml\n.idea/**/dictionaries\n.idea/**/shelf\n\n# Sensitive or high-churn files\n.idea/**/dataSources/\n.idea/**/dataSources.ids\n.idea/**/dataSources.local.xml\n.idea/**/sqlDataSources.xml\n.idea/**/dynamic.xml\n.idea/**/uiDesigner.xml\n\n# Gradle\n.idea/**/gradle.xml\n.idea/**/libraries\n\n# CMake\ncmake-build-debug/\ncmake-build-release/\n\n# Mongo Explorer plugin\n.idea/**/mongoSettings.xml\n\n# File-based project format\n*.iws\n\n# IntelliJ\nout/\n\n# mpeltonen/sbt-idea plugin\n.idea_modules/\n\n# JIRA plugin\natlassian-ide-plugin.xml\n\n# Cursive Clojure plugin\n.idea/replstate.xml\n\n# Crashlytics plugin (for Android Studio and IntelliJ)\ncom_crashlytics_export_strings.xml\ncrashlytics.properties\ncrashlytics-build.properties\nfabric.properties\n\n# Editor-based Rest Client\n.idea/httpRequests\n### Python template\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\n\n# Flask 
stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# pyenv\n.python-version\n\n# celery beat schedule file\ncelerybeat-schedule\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n\n"
  },
  {
    "path": "LICENSE.txt",
    "content": "                          PUBLIC DOMAIN NOTICE\n              National Center for Biotechnology Information\n\nThis software/database is a \"United States Government Work\" under the terms of\nthe United States Copyright Act.  It was written as part of the author's\nofficial duties as a United States Government employee and thus cannot be\ncopyrighted.  This software/database is freely available to the public for use.\nThe National Library of Medicine and the U.S. Government have not placed any\nrestriction on its use or reproduction.\n\nAlthough all reasonable efforts have been taken to ensure the accuracy and\nreliability of the software and data, the NLM and the U.S. Government do not and\ncannot warrant the performance or results that may be obtained by using this\nsoftware or data. The NLM and the U.S. Government disclaim all warranties,\nexpress or implied, including warranties of performance, merchantability or\nfitness for any particular purpose.\n\nPlease cite the author in any work or product based on this material:\n\nPeng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language \nProcessing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets.\nIn Proceedings of the 2019 Workshop on Biomedical Natural Language Processing \n(BioNLP 2019). 2019:58-65.\n"
  },
  {
    "path": "README.md",
    "content": "# BLUE, the Biomedical Language Understanding Evaluation benchmark\r\n\r\n**\\*\\*\\*\\*\\* New Aug 13th, 2019: Change DDI metric from micro-F1 to macro-F1 \\*\\*\\*\\*\\***\r\n\r\n**\\*\\*\\*\\*\\* New July 11th, 2019: preprocessed PubMed texts \\*\\*\\*\\*\\***\r\n\r\nWe uploaded the [preprocessed PubMed texts](https://github.com/ncbi-nlp/ncbi_bluebert/blob/master/README.md#pubmed)  that were used to pre-train the NCBI_BERT models.\r\n\r\n**\\*\\*\\*\\*\\* New June 17th, 2019: data in BERT format \\*\\*\\*\\*\\***\r\n\r\nWe uploaded some [datasets](https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1) that are ready to be used with the [NCBI BlueBERT codes](https://github.com/ncbi-nlp/ncbi_bluebert).\r\n\r\n## Introduction\r\n\r\nBLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora.\r\nHere, we rely on preexisting datasets because they have been widely used by the BioNLP community as shared tasks.\r\nThese tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.\r\n\r\n## Tasks\r\n\r\n| Corpus          | Train |  Dev | Test | Task                    | Metrics             | Domain     |\r\n|-----------------|------:|-----:|-----:|-------------------------|---------------------|------------|\r\n| MedSTS          |   675 |   75 |  318 | Sentence similarity     | Pearson             | Clinical   |\r\n| BIOSSES         |    64 |   16 |   20 | Sentence similarity     | Pearson             | Biomedical |\r\n| BC5CDR-disease  |  4182 | 4244 | 4424 | NER                     | F1                  | Biomedical |\r\n| BC5CDR-chemical |  5203 | 5347 | 5385 | NER                     | F1                  | Biomedical |\r\n| ShARe/CLEFE     |  4628 | 1075 | 5195 | NER                     | F1                  | Clinical   |\r\n| DDI             |  2937 | 1004 |  979 | 
Relation extraction     | macro F1            | Biomedical |\r\n| ChemProt        |  4154 | 2416 | 3458 | Relation extraction     | micro F1            | Biomedical |\r\n| i2b2-2010       |  3110 |   11 | 6293 | Relation extraction     | F1                  | Clinical   |\r\n| HoC             |  1108 |  157 |  315 | Document classification | F1                  | Biomedical |\r\n| MedNLI          | 11232 | 1395 | 1422 | Inference               | accuracy            | Clinical   |\r\n\r\n\r\n### Sentence similarity\r\n\r\n[BIOSSES](http://tabilab.cmpe.boun.edu.tr/BIOSSES/) is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain.\r\nHere, we randomly select 80% for training and 20% for testing because there is no standard splits in the released data.\r\n\r\n[MedSTS](https://mayoclinic.pure.elsevier.com/en/publications/medsts-a-resource-for-clinical-semantic-textual-similarity) is a corpus of sentence pairs selected from Mayo Clinics clinical data warehouse.\r\nPlease visit the website to obtain a copy of the dataset.\r\nWe use the standard training and testing sets in the shared task.\r\n\r\n### Named entity recognition\r\n\r\n[BC5CDR](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/) is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task\r\nWe use the standard training and test set in the BC5CDR shared task\r\n\r\n[ShARe/CLEF](https://physionet.org/works/ShAReCLEFeHealth2013/) eHealth Task 1 Corpus is a collection of 299 deidentified clinical free-text notes from the MIMIC II database\r\nPlease visit the website to obtain a copy of the dataset.\r\nWe use the standard training and test set in the ShARe/CLEF eHealth Tasks 1.\r\n\r\n### Relation extraction\r\n\r\n[DDI](http://labda.inf.uc3m.es/ddicorpus) extraction 2013 corpus is a collection of 792 texts selected from the 
DrugBank database and other 233 Medline abstracts\r\nIn our benchmark, we use 624 train files and 191 test files to evaluate the performance and report the macro-average F1-score of the four DDI types.\r\n\r\n[ChemProt](https://biocreative.bioinformatics.udel.edu/news/corpora/) consists of 1,820 PubMed abstracts with chemical-protein interactions and was used in the BioCreative VI text mining chemical-protein interactions shared task\r\nWe use the standard training and test sets in the ChemProt shared task and evaluate the same five classes: CPR:3, CPR:4, CPR:5, CPR:6, and CPR:9.\r\n\r\n[i2b2 2010](https://www.i2b2.org/NLP/DataSets/) shared task collection consists of 170 documents for training and 256 documents for testing, which is the subset of the original dataset. \r\nThe dataset was collected from three different hospitals and was annotated by medical practitioners for eight types of relations between problems and treatments.\r\n\r\n### Document multilabel classification\r\n\r\n[HoC](https://www.cl.cam.ac.uk/~sb895/HoC.html) (the Hallmarks of Cancers corpus) consists of 1,580 PubMed abstracts annotated with ten currently known hallmarks of cancer\r\nWe use 315 (~20%) abstracts for testing and the remaining abstracts for training. For the HoC task, we followed the common practice and reported the example-based F1-score on the abstract level\r\n\r\n### Inference task\r\n\r\n[MedNLI](https://physionet.org/physiotools/mimic-code/mednli/) is a collection of sentence pairs selected from MIMIC-III. 
We use the same training, development,\r\nand test sets in [Romanov and Shivade](https://www.aclweb.org/anthology/D18-1187)\r\n\r\n### Datasets\r\n\r\nSome datasets can be downloaded at [https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1](https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1)\r\n\r\n## Baselines\r\n\r\n| Corpus          | Metrics | SOTA* | ELMo | BioBERT | NCBI_BERT(base) (P) | NCBI_BERT(base) (P+M) | NCBI_BERT(large) (P) | NCBI_BERT(large) (P+M) |\r\n|-----------------|--------:|------:|-----:|--------:|--------------------:|----------------------:|---------------------:|-----------------------:|\r\n| MedSTS          | Pearson |  83.6 | 68.6 |    84.5 |                84.5 |                  84.8 |                 84.6 |                   83.2 |\r\n| BIOSSES         | Pearson |  84.8 | 60.2 |    82.7 |                89.3 |                  91.6 |                 86.3 |                   75.1 |\r\n| BC5CDR-disease  |       F |  84.1 | 83.9 |    85.9 |                86.6 |                  85.4 |                 82.9 |                   83.8 |\r\n| BC5CDR-chemical |       F |  93.3 | 91.5 |    93.0 |                93.5 |                  92.4 |                 91.7 |                   91.1 |\r\n| ShARe/CLEFE     |       F |  70.0 | 75.6 |    72.8 |                75.4 |                  77.1 |                 72.7 |                   74.4 |\r\n| DDI             |       F |  72.9 | 62.0 |    78.8 |                78.1 |                  79.4 |                 79.9 |                   76.3 |\r\n| ChemProt        |       F |  64.1 | 66.6 |    71.3 |                72.5 |                  69.2 |                 74.4 |                   65.1 |\r\n| i2b2 2010       |       F |  73.7 | 71.2 |    72.2 |                74.4 |                  76.4 |                 73.3 |                   73.9 |\r\n| HoC             |       F |  81.5 | 80.0 |    82.9 |                85.3 |                  83.1 |                 87.3 |                   
85.3 |\r\n| MedNLI          |     acc |  73.5 | 71.4 |    80.5 |                82.2 |                  84.0 |                 81.5 |                   83.8 |\r\n\r\n**P**: PubMed, **P+M**: PubMed + MIMIC-III\r\n\r\nSOTA, state-of-the-art as of April 2019, to the best of our knowledge\r\n\r\n* **MedSTS, BIOSSES**: Chen et al. 2019. [BioSentVec: creating sentence embeddings for biomedical texts](https://arxiv.org/abs/1810.09302v2). In Proceedings of the 7th IEEE International Conference on Healthcare Informatics.\r\n* **BC5CDR-disease, BC5CDR-chem**: Yoon et al. 2018. [CollaboNet: collaboration of deep neural networks for biomedical named entity recognition](https://arxiv.org/abs/1809.07950v1). arXiv preprint arXiv:1809.07950.\r\n* **ShARe/CLEFE**: Leaman et al. 2015. [Challenges in clinical natural language processing for automated disorder normalization](https://www.sciencedirect.com/science/article/pii/S1532046415001501?via%3Dihub). Journal of biomedical informatics, 57:28–37.\r\n* **DDI**: Zhang et al. 2018. [Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths](https://academic.oup.com/bioinformatics/article/34/5/828/4565590). Bioinformatics (Oxford, England), 34:828–835.\r\n* **Chem-Prot**: Peng et al. 2018. [Extracting chemical-protein relations with ensembles of SVM and deep learning models](https://academic.oup.com/database/article/doi/10.1093/database/bay073/5055578). Database: the journal of biological\r\ndatabases and curation, 2018.\r\n* **i2b2 2010**: Rink et al. 2011. [Automatic extraction of relations between medical concepts in clinical texts](https://academic.oup.com/jamia/article/18/5/594/833364). Journal of the American Medical Informatics Association, 18:594–600.\r\n* **HoC**: Du et al. 2019. [ML-Net: multilabel classification of biomedical texts with deep neural networks](https://arxiv.org/abs/1811.05475v2). 
Journal of the American Medical Informatics Association (JAMIA).\r\n* **MedNLI**: Romanov et al. 2018. [Lessons from natural language inference in the clinical domain](https://www.aclweb.org/anthology/D18-1187). In Proceedings of EMNLP, pages 1586–1596.\r\n\r\n\r\n\r\n### Fine-tuning with ELMo\r\n\r\nWe adopted the ELMo model pre-trained on PubMed abstracts to accomplish the BLUE tasks.\r\nThe output of ELMo embeddings of each token is used as input for the fine-tuning model. \r\nWe retrieved the output states of both layers in ELMo and concatenated them into one vector for each word. We used the maximum sequence length 128 for padding. \r\nThe learning rate was set to 0.001 with an Adam optimizer.\r\nWe iterated the training process for 20 epochs with batch size 64 and early stopped if the training loss did not decrease.\r\n\r\n### Fine-tuning with BERT\r\n\r\nPlease see [https://github.com/ncbi-nlp/ncbi_bluebert](https://github.com/ncbi-nlp/ncbi_bluebert).\r\n\r\n\r\n## Citing BLUE\r\n\r\n*  Peng Y, Yan S, Lu Z. [Transfer Learning in Biomedical Natural Language Processing: An\r\nEvaluation of BERT and ELMo on Ten Benchmarking Datasets](https://arxiv.org/abs/1906.05474). In *Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP)*. 2019.\r\n\r\n```\r\n@InProceedings{peng2019transfer,\r\n  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},\r\n  title     = {Transfer Learning in Biomedical Natural Language Processing: \r\n               An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},\r\n  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},\r\n  year      = {2019},\r\n}\r\n```\r\n\r\n## Acknowledgments\r\n\r\nThis work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of\r\nMedicine and Clinical Center. 
This work was supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.\r\n\r\nWe are also grateful to the authors of BERT and ELMo to make the data and codes publicly available. We would like to thank Geeticka Chauhan for providing thoughtful comments.\r\n\r\n## Disclaimer\r\n\r\nThis tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced\r\non this website is not intended for direct diagnostic use or medical decision-making without review and oversight\r\nby a clinical professional. Individuals should not change their health behavior solely on the basis of information\r\nproduced on this website. NIH does not independently verify the validity or utility of the information produced\r\nby this tool. If you have questions about the information produced on this website, please see a health care\r\nprofessional. More information about NCBI's disclaimer policy is available.\r\n"
  },
  {
    "path": "blue/__init__.py",
    "content": ""
  },
  {
    "path": "blue/bert/__init__.py",
    "content": ""
  },
  {
    "path": "blue/bert/create_cdr_bert.py",
    "content": "import fire\nimport tqdm\n\nfrom blue.ext import pubtator\nfrom blue.ext.preprocessing import tokenize_text, print_ner_debug, write_bert_ner_file\n\n\ndef _find_toks(sentences, start, end):\n    toks = []\n    for sentence in sentences:\n        for ann in sentence.annotations:\n            span = ann.total_span\n            if start <= span.offset and span.offset + span.length <= end:\n                toks.append(ann)\n            elif span.offset <= start and end <= span.offset + span.length:\n                toks.append(ann)\n    return toks\n\n\ndef convert(src, dest, entity_type, validate_mentions=None):\n    with open(src) as fp:\n        docs = pubtator.load(fp)\n\n    total_sentences = []\n    for doc in tqdm.tqdm(docs):\n        text = doc.title + ' ' + doc.abstract\n        sentences = tokenize_text(text, doc.pmid)\n        total_sentences.extend(sentences)\n\n        for ann in doc.annotations:\n            if ann.type == entity_type:\n                anns = _find_toks(sentences, ann.start, ann.end)\n                if len(anns) == 0:\n                    print(f'Cannot find {doc.pmid}: {ann}')\n                    print_ner_debug(sentences, ann.start, ann.end)\n                    exit(1)\n                has_first = False\n                for ann in anns:\n                    if not has_first:\n                        ann.infons['NE_label'] = 'B'\n                        has_first = True\n                    else:\n                        ann.infons['NE_label'] = 'I'\n\n    cnt = write_bert_ner_file(dest, total_sentences)\n    if validate_mentions is not None and validate_mentions != cnt:\n        print(f'Should have {validate_mentions}, but have {cnt} {entity_type} mentions')\n    else:\n        print(f'Have {cnt} mentions')\n\n\nclass Commend:\n    def chemical(self, input, output):\n        convert(input, output, 'Chemical')\n\n    def disease(self, input, output):\n        convert(input, output, 'Disease')\n\n\nif __name__ == 
'__main__':\n    fire.Fire(Commend)\n"
  },
  {
    "path": "blue/bert/create_chemprot_bert.py",
    "content": "import collections\nimport csv\nimport itertools\nfrom pathlib import Path\n\nimport fire\nimport tqdm\n\nfrom blue.ext import pstring\nfrom blue.ext.preprocessing import tokenize_text\n\n\ndef find_entities(sentence, entities, entity_type):\n    es = []\n    for e in entities:\n        if e['type'] != entity_type:\n            continue\n        if sentence.offset <= e['start'] and e['end'] <= sentence.offset + len(sentence.text):\n            es.append(e)\n    return es\n\n\ndef find_relations(relations, chem, prot):\n    labels = []\n    for i in range(len(relations) - 1, -1, -1):\n        r = relations[i]\n        if r['Arg1'] == chem['id'] and r['Arg2'] == prot['id']:\n            del relations[i]\n            labels.append(r['label'])\n    return labels\n\n\ndef replace_text(text, offset, ann1, ann2):\n    ann1_start = ann1['start'] - offset\n    ann2_start = ann2['start'] - offset\n    ann1_end = ann1['end'] - offset\n    ann2_end = ann2['end'] - offset\n\n    if ann1_start <= ann2_start <= ann1_end \\\n            or ann1_start <= ann2_end <= ann1_end \\\n            or ann2_start <= ann1_start <= ann2_end \\\n            or ann2_start <= ann1_end <= ann2_end:\n        start = min(ann1_start, ann2_start)\n        end = max(ann1_end, ann2_end)\n        before = text[:start]\n        after = text[end:]\n        return before + '@CHEM-GENE$' + after\n\n    if ann1_start > ann2_start:\n        ann1_start, ann1_end, ann2_start, ann2_end = ann2_start, ann2_end, ann1_start, ann1_end\n\n    before = text[:ann1_start]\n    middle = text[ann1_end:ann2_start]\n    after = text[ann2_end:]\n\n    if ann1['type'] in ('GENE-N', 'GENE-Y'):\n        ann1['type'] = 'GENE'\n    if ann2['type'] in ('GENE-N', 'GENE-Y'):\n        ann2['type'] = 'GENE'\n\n    return before + f'@{ann1[\"type\"]}$' + middle + f'@{ann2[\"type\"]}$' + after\n\n\ndef print_rel_debug(sentences, entities, id1, id2):\n    e1 = None\n    e2 = None\n    for e in entities:\n        if e['id'] 
== id1:\n            e1 = e\n        if e['id'] == id2:\n            e2 = e\n    assert e1 is not None and e2 is not None\n    ss = [s for s in sentences\n          if s.offset <= e1['start'] <= s.offset + len(s.text)\n          or s.offset <= e2['start'] <= s.offset + len(s.text)]\n    if len(ss) != 0:\n        for s in ss:\n            print(s.offset, s.text)\n    else:\n        for s in sentences:\n            print(s.offset, s.text)\n\n\ndef merge_sentences(sentences):\n    if len(sentences) == 0:\n        return sentences\n\n    new_sentences = []\n    last_one = sentences[0]\n    for s in sentences[1:]:\n        if last_one.text[-1] in \"\"\".?!\"\"\" and last_one.text[-4:] != 'i.v.' and s.text[0].isupper():\n            new_sentences.append(last_one)\n            last_one = s\n        else:\n            last_one.text += ' ' * (s.offset - len(last_one.text) - last_one.offset)\n            last_one.text += s.text\n    new_sentences.append(last_one)\n    return new_sentences\n\n\ndef convert(abstract_file, entities_file, relation_file, output):\n    # abstract\n    total_sentences = collections.OrderedDict()\n    with open(abstract_file, encoding='utf8') as fp:\n        for line in tqdm.tqdm(fp, desc=abstract_file.stem):\n            toks = line.strip().split('\\t')\n            text = toks[1] + ' ' + toks[2]\n            text = pstring.printable(text, greeklish=True)\n            sentences = tokenize_text(text, toks[0])\n            sentences = merge_sentences(sentences)\n            total_sentences[toks[0]] = sentences\n    # entities\n    entities = collections.defaultdict(list)\n    with open(entities_file, encoding='utf8') as fp:\n        for line in tqdm.tqdm(fp, desc=entities_file.stem):\n            toks = line.strip().split('\\t')\n            entities[toks[0]].append({\n                'docid': toks[0],\n                'start': int(toks[3]),\n                'end': int(toks[4]),\n                'type': toks[2],\n                'id': toks[1],\n      
          'text': toks[5]})\n    # relations\n    relations = collections.defaultdict(list)\n    with open(relation_file, encoding='utf8') as fp:\n        for line in tqdm.tqdm(fp, desc=relation_file.stem):\n            toks = line.strip().split('\\t')\n            relations[toks[0]].append({\n                'docid': toks[0],\n                'label': toks[1],\n                'Arg1': toks[2][toks[2].find(':') + 1:],\n                'Arg2': toks[3][toks[3].find(':') + 1:],\n                'toks': toks\n            })\n\n    with open(output, 'w') as fp:\n        writer = csv.writer(fp, delimiter='\\t', lineterminator='\\n')\n        writer.writerow(['index', 'sentence', 'label'])\n        cnt = 0\n        for docid, sentences in tqdm.tqdm(total_sentences.items(), total=len(total_sentences)):\n            for sentence in sentences:\n                # find chemical\n                chemicals = find_entities(sentence, entities[docid], 'CHEMICAL')\n                # find prot\n                genes = find_entities(sentence, entities[docid], 'GENE-N') \\\n                        + find_entities(sentence, entities[docid], 'GENE-Y')\n                for i, (chem, gene) in enumerate(itertools.product(chemicals, genes)):\n                    text = replace_text(sentence.text, sentence.offset, chem, gene)\n                    labels = find_relations(relations[docid], chem, gene)\n                    if len(labels) == 0:\n                        writer.writerow([f'{docid}.{chem[\"id\"]}.{gene[\"id\"]}', text, 'false'])\n                    else:\n                        for l in labels:\n                            writer.writerow([f'{docid}.{chem[\"id\"]}.{gene[\"id\"]}', text, l])\n                            cnt += 1\n\n    # print('-' * 80)\n    # for docid, rs in relations.items():\n    #     if len(rs) > 0:\n    #         for r in rs:\n    #             print('\\t'.join(r['toks']))\n    #             print_rel_debug(total_sentences[r['docid']], 
entities[r['docid']],\n    #                             r['Arg1'], r['Arg2'])\n    #             print('-' * 80)\n\n\ndef create_chemprot_bert(data_dir, output_dir):\n    data_dir = Path(data_dir)\n    output_dir = Path(output_dir)\n    convert(data_dir / 'chemprot_training/chemprot_training_abstracts.tsv',\n            data_dir / 'chemprot_training/chemprot_training_entities.tsv',\n            data_dir / 'chemprot_training/chemprot_training_gold_standard.tsv',\n            output_dir / 'train.tsv')\n    # convert(data_dir / 'chemprot_development/chemprot_development_abstracts.tsv',\n    #         data_dir / 'chemprot_development/chemprot_development_entities.tsv',\n    #         data_dir / 'chemprot_development/chemprot_development_gold_standard.tsv',\n    #         output_dir / 'dev.tsv')\n    # convert(data_dir / 'chemprot_test_gs/chemprot_test_abstracts_gs.tsv',\n    #         data_dir / 'chemprot_test_gs/chemprot_test_entities_gs.tsv',\n    #         data_dir / 'chemprot_test_gs/chemprot_test_gold_standard.tsv',\n    #         output_dir / 'train.tsv')\n\n\nif __name__ == '__main__':\n    fire.Fire(create_chemprot_bert)\n"
  },
  {
    "path": "blue/bert/create_clefe_bert.py",
    "content": "import functools\nimport os\nimport re\nimport shutil\nfrom pathlib import Path\n\nimport fire\nimport tqdm\nfrom lxml import etree\n\nfrom blue.ext.preprocessing import tokenize_text, print_ner_debug, write_bert_ner_file\n\n\ndef pattern_repl(matchobj, prefix):\n    \"\"\"\n    Replace [**Patterns**] with prefix+spaces.\n    \"\"\"\n    s = matchobj.group(0).lower()\n    return prefix.rjust(len(s))\n\n\ndef _find_toks(sentences, start, end):\n    toks = []\n    for sentence in sentences:\n        for ann in sentence.annotations:\n            span = ann.total_span\n            if start <= span.offset and span.offset + span.length <= end:\n                toks.append(ann)\n            elif span.offset <= start and end <= span.offset + span.length:\n                toks.append(ann)\n    return toks\n\n\ndef read_text(pathname):\n    with open(pathname) as fp:\n        text = fp.read()\n    text = re.sub(r'\\[\\*\\*.*?\\*\\*\\]', functools.partial(pattern_repl, prefix='PATTERN'), text)\n    text = re.sub(r'(\\|{4})|___|~~', functools.partial(pattern_repl, prefix=''), text)\n    sentences = tokenize_text(text, pathname.stem)\n\n    # sentences = _cleanupSentences2(sentences)\n    # sentences = _cleanupSentences1(sentences)\n    # sentences = _normalize_sentences(sentences)\n    # sentences = _tokenize_sentences(sentences)\n    return sentences\n\n\ndef map_anns(sentences, ann_file):\n    with open(ann_file) as fp:\n        for line in fp:\n            line = line.strip()\n            toks = line.split('||')\n            has_first = False\n            for i in range(3, len(toks), 2):\n                start = int(toks[i])\n                end = int(toks[i + 1])\n                anns = _find_toks(sentences, start, end)\n                if len(anns) == 0:\n                    print(f'Cannot find {ann_file}: {line}')\n                    print_ner_debug(sentences, start, end)\n                    exit(1)\n                for ann in anns:\n                    
if not has_first:\n                        ann.infons['NE_label'] = 'B'\n                        has_first = True\n                    else:\n                        ann.infons['NE_label'] = 'I'\n    return sentences\n\n\ndef convert(text_dir, ann_dir, dest, validate_mentions=None):\n    total_sentences = []\n    with os.scandir(text_dir) as it:\n        for entry in tqdm.tqdm(it):\n            text_file = Path(entry)\n            ann_file = ann_dir / text_file.name\n            if not ann_file.exists():\n                print('Cannot find ann file:', ann_file)\n                continue\n            sentences = read_text(text_file)\n            sentences = map_anns(sentences, ann_file)\n            total_sentences.extend(sentences)\n\n    # print(len(total_sentences))\n\n    cnt = write_bert_ner_file(dest, total_sentences)\n    if validate_mentions is not None and validate_mentions != cnt:\n        print(f'Should have {validate_mentions}, but have {cnt} mentions')\n    else:\n        print(f'Have {cnt} mentions')\n\n\ndef convert_train_gs_to_text(src_dir, dest_dir):\n    def _one_file(src_file, dest_file):\n        # annotation\n        with open(src_file) as fp:\n            tree = etree.parse(fp)\n\n        stringSlotMentions = {}\n        for atag in tree.xpath('stringSlotMention'):\n            stringSlotMentions[atag.get('id')] = atag.xpath('stringSlotMentionValue')[0].get(\n                'value')\n\n        classMentions = {}\n        for atag in tree.xpath('classMention'):\n            classMentions[atag.get('id')] = (atag.xpath('hasSlotMention')[0].get('id'),\n                                             atag.xpath('mentionClass')[0].get('id'))\n\n        with open(dest_file, 'w') as fp:\n            for atag in tree.xpath('annotation'):\n                id = atag.xpath('mention')[0].get('id')\n                mentionClass = classMentions[id][1]\n                try:\n                    stringSlotMentionValue = stringSlotMentions[classMentions[id][0]]\n  
              except:\n                    stringSlotMentionValue = 'CUI-less'\n\n                fp.write(f'{dest_file.name}||{mentionClass}||{stringSlotMentionValue}')\n                for stag in atag.xpath('span'):\n                    start = stag.get('start')\n                    end = stag.get('end')\n                    fp.write(f'||{start}||{end}')\n                fp.write('\\n')\n\n    with os.scandir(src_dir) as it:\n        for entry in tqdm.tqdm(it):\n            path = Path(entry)\n            basename = path.stem[:path.stem.find('.')]\n            _one_file(path, dest_dir / f'{basename}.txt')\n\n\ndef split_development(data_path, devel_docids_pathname):\n    with open(devel_docids_pathname) as fp:\n        devel_docids = set(line.strip() for line in fp)\n\n    os.mkdir(data_path / 'TRAIN_REPORTS')\n    os.mkdir(data_path / 'DEV_REPORTS')\n\n    with os.scandir(data_path / 'ALLREPORTS') as it:\n        for entry in tqdm.tqdm(it):\n            text_file = Path(entry)\n            if text_file.stem in devel_docids:\n                dest = data_path / 'DEV_REPORTS' / text_file.name\n            else:\n                dest = data_path / 'TRAIN_REPORTS' / text_file.name\n            shutil.copy(text_file, dest)\n\n\ndef create_clefe_bert(gold_directory, output_directory):\n    data_path = Path(gold_directory)\n    dest_path = Path(output_directory)\n\n    convert(data_path / 'Task1TrainSetCorpus199/TRAIN_REPORTS',\n            data_path / 'Task1TrainSetGOLD199knowtatorehost/Task1Gold',\n            dest_path / 'Training.tsv')\n\n    convert(data_path / 'Task1TrainSetCorpus199/DEV_REPORTS',\n            data_path / 'Task1TrainSetGOLD199knowtatorehost/Task1Gold',\n            dest_path / 'Development.tsv')\n\n    convert(data_path / 'Task1TestSetCorpus100/ALLREPORTS',\n            data_path / 'Task1Gold_SN2012/Gold_SN2012',\n            dest_path / 'Test.tsv')\n\n\nif __name__ == '__main__':\n    fire.Fire(create_clefe_bert)\n"
  },
  {
    "path": "blue/bert/create_ddi_bert.py",
    "content": "import csv\nimport logging\nimport os\nimport re\n\nimport bioc\nimport fire\nfrom lxml import etree\n\n\ndef get_ann(arg, obj):\n    for ann in obj['annotations']:\n        if ann['id'] == arg:\n            return ann\n    raise ValueError\n\n\ndef replace_text(text, offset, ann1, ann2):\n    ann1_start = ann1['start'] - offset\n    ann2_start = ann2['start'] - offset\n    ann1_end = ann1['end'] - offset\n    ann2_end = ann2['end'] - offset\n\n    if ann1_start <= ann2_start <= ann1_end \\\n            or ann1_start <= ann2_end <= ann1_end \\\n            or ann2_start <= ann1_start <= ann2_end \\\n            or ann2_start <= ann1_end <= ann2_end:\n        start = min(ann1_start, ann2_start)\n        end = max(ann1_end, ann2_end)\n        before = text[:start]\n        after = text[end:]\n        return before + f'@{ann1[\"type\"]}-{ann2[\"type\"]}$' + after\n\n    if ann1_start > ann2_start:\n        ann1_start, ann1_end, ann2_start, ann2_end = ann2_start, ann2_end, ann1_start, ann1_end\n\n    before = text[:ann1_start]\n    middle = text[ann1_end:ann2_start]\n    after = text[ann2_end:]\n\n    return before + f'@{ann1[\"type\"]}$' + middle + f'@{ann2[\"type\"]}$' + after\n\n\ndef create_ddi_bert(gold_directory, output):\n    fp = open(output, 'w')\n    writer = csv.writer(fp, delimiter='\\t', lineterminator='\\n')\n    writer.writerow(['index', 'sentence', 'label'])\n    cnt = 0\n    for root, dirs, files in os.walk(gold_directory):\n        for name in files:\n            pathname = os.path.join(root, name)\n            tree = etree.parse(pathname)\n            for stag in tree.xpath('/document/sentence'):\n                sentence = bioc.BioCSentence()\n                sentence.offset = 0\n                sentence.text = stag.get('text')\n\n                entities = {}\n                for etag in stag.xpath('entity'):\n                    id = etag.get('id')\n                    m = re.match('(\\d+)-(\\d+)', etag.get('charOffset'))\n         
           if m is None:\n                        logging.warning('{}:{}: charOffset does not match. {}'.format(\n                        output, id, etag.get('charOffset')))\n                        continue\n                    start = int(m.group(1))\n                    end = int(m.group(2)) + 1\n                    expected_text = etag.get('text')\n                    actual_text = sentence.text[start:end]\n                    if expected_text != actual_text:\n                        logging.warning('{}:{}: Text does not match. Expected {}. Actual {}'.format(\n                            output, id, repr(expected_text), repr(actual_text)))\n                    entities[id] = {\n                        'start': start,\n                        'end': end,\n                        'type': etag.get('type'),\n                        'id': id,\n                        'text': actual_text\n                    }\n                for rtag in stag.xpath('pair'):\n                    if rtag.get('ddi') == 'false':\n                        label = 'DDI-false'\n                    else:\n                        label = 'DDI-{}'.format(rtag.get('type'))\n                        cnt += 1\n                    e1 = entities.get(rtag.get('e1'))\n                    e2 = entities.get(rtag.get('e2'))\n                    text = replace_text(sentence.text, sentence.offset, e1, e2)\n                    writer.writerow([f'{rtag.get(\"id\")}', text, label])\n\n    print(f'Have {cnt} relations')\n\n\nif __name__ == '__main__':\n    fire.Fire(create_ddi_bert)\n\n"
  },
  {
    "path": "blue/bert/create_i2b2_bert.py",
    "content": "import csv\nimport itertools\nimport os\nimport re\nfrom pathlib import Path\nfrom typing import Match\n\nimport bioc\nimport fire\nimport pandas as pd\nimport tqdm\n\nfrom blue.bert.create_chemprot_bert import print_rel_debug\nfrom blue.bert.create_ddi_bert import replace_text\n\nlabels = ['PIP', 'TeCP', 'TeRP', 'TrAP', 'TrCP', 'TrIP', 'TrNAP', 'TrWP', 'false']\n\n\ndef read_text(pathname):\n    with open(pathname) as fp:\n        text = fp.read()\n    sentences = []\n    offset = 0\n    for sent in text.split('\\n'):\n        sentence = bioc.BioCSentence()\n        sentence.infons['filename'] = pathname.stem\n        sentence.offset = offset\n        sentence.text = sent\n        sentences.append(sentence)\n        i = 0\n        for m in re.finditer('\\S+', sent):\n            if i == 0 and m.start() != 0:\n                # add fake\n                ann = bioc.BioCAnnotation()\n                ann.id = f'a{i}'\n                ann.text = ''\n                ann.add_location(bioc.BioCLocation(offset, 0))\n                sentence.add_annotation(ann)\n                i += 1\n            ann = bioc.BioCAnnotation()\n            ann.id = f'a{i}'\n            ann.text = m.group()\n            ann.add_location(bioc.BioCLocation(m.start() + offset, len(m.group())))\n            sentence.add_annotation(ann)\n            i += 1\n        offset += len(sent) + 1\n    return sentences\n\n\ndef _get_ann_offset(sentences, match_obj: Match,\n                    start_line_group, start_token_group,\n                    end_line_group, end_token_group,\n                    text_group):\n    assert match_obj.group(start_line_group) == match_obj.group(end_line_group)\n    sentence = sentences[int(match_obj.group(start_line_group)) - 1]\n\n    start_token_idx = int(match_obj.group(start_token_group))\n    end_token_idx = int(match_obj.group(end_token_group))\n    start = sentence.annotations[start_token_idx].total_span.offset\n    end = 
sentence.annotations[end_token_idx].total_span.end\n    text = match_obj.group(text_group)\n\n    actual = sentence.text[start - sentence.offset:end - sentence.offset].lower()\n    expected = text.lower()\n    assert actual == expected, 'Cannot match at %s:\\n%s\\n%s\\nFind: %r, Matched: %r' \\\n                               % (\n                               sentence.infons['filename'], sentence.text, match_obj.string, actual,\n                               expected)\n    return start, end, text\n\n\ndef read_annotations(pathname, sentences):\n    anns = []\n    pattern = re.compile(r'c=\"(.*?)\" (\\d+):(\\d+) (\\d+):(\\d+)\\|\\|t=\"(.*?)\"(\\|\\|a=\"(.*?)\")?')\n    with open(pathname) as fp:\n        for i, line in enumerate(fp):\n            line = line.strip()\n            m = pattern.match(line)\n            assert m is not None\n\n            start, end, text = _get_ann_offset(sentences, m, 2, 3, 4, 5, 1)\n            ann = {\n                'start': start,\n                'end': end,\n                'type': m.group(6),\n                'a': m.group(7),\n                'text': text,\n                'line': int(m.group(2)) - 1,\n                'id': f'{pathname.name}.l{i}'\n            }\n            if len(m.groups()) == 9:\n                ann['a'] = m.group(8)\n            anns.append(ann)\n    return anns\n\n\ndef _find_anns(anns, start, end):\n    for ann in anns:\n        if ann['start'] == start and ann['end'] == end:\n            return ann\n    raise ValueError\n\n\ndef read_relations(pathname, sentences, cons):\n    pattern = re.compile(\n        r'c=\"(.*?)\" (\\d+):(\\d+) (\\d+):(\\d+)\\|\\|r=\"(.*?)\"\\|\\|c=\"(.*?)\" (\\d+):(\\d+) (\\d+):(\\d+)')\n\n    relations = []\n    with open(pathname) as fp:\n        for line in fp:\n            line = line.strip()\n            m = pattern.match(line)\n            assert m is not None\n\n            start, end, text = _get_ann_offset(sentences, m, 2, 3, 4, 5, 1)\n            ann1 = 
_find_anns(cons, start, end)\n            start, end, text = _get_ann_offset(sentences, m, 8, 9, 10, 11, 7)\n            ann2 = _find_anns(cons, start, end)\n            relations.append({\n                'docid': pathname.stem,\n                'label': m.group(6),\n                'Arg1': ann1['id'],\n                'Arg2': ann2['id'],\n                'string': line\n            })\n    return relations\n\n\ndef find_relations(relations, ann1, ann2):\n    labels = []\n    for i in range(len(relations) - 1, -1, -1):\n        r = relations[i]\n        if (r['Arg1'] == ann1['id'] and r['Arg2'] == ann2['id']) \\\n                or (r['Arg1'] == ann2['id'] and r['Arg2'] == ann1['id']):\n            del relations[i]\n            labels.append(r['label'])\n    return labels\n\n\ndef convert(top_dir, dest):\n    fp = open(dest, 'w')\n    writer = csv.writer(fp, delimiter='\\t', lineterminator='\\n')\n    writer.writerow(['index', 'sentence', 'label'])\n    with os.scandir(top_dir / 'txt') as it:\n        for entry in tqdm.tqdm(it):\n            if not entry.name.endswith('.txt'):\n                continue\n            text_pathname = Path(entry.path)\n            docid = text_pathname.stem\n\n            sentences = read_text(text_pathname)\n            # read assertions\n            cons = read_annotations(top_dir / 'concept' / f'{text_pathname.stem}.con',\n                                    sentences)\n            # read relations\n            relations = read_relations(top_dir / 'rel' / f'{text_pathname.stem}.rel',\n                                       sentences, cons)\n            for i, (con1, con2) in enumerate(itertools.combinations(cons, 2)):\n                if con1['line'] != con2['line']:\n                    continue\n                # if con['type'] != 'treatment' or ast['type'] != 'problem':\n                #     continue\n                sentence = sentences[con1['line']]\n                text = replace_text(sentence.text, sentence.offset, con1, 
con2)\n                labels = find_relations(relations, con1, con2)\n                if len(labels) == 0:\n                    writer.writerow([f'{docid}.{con1[\"id\"]}.{con2[\"id\"]}', text, 'false'])\n                else:\n                    for l in labels:\n                        writer.writerow([f'{docid}.{con1[\"id\"]}.{con2[\"id\"]}', text, l])\n\n            if len(relations) != 0:\n                for r in relations:\n                    print(r['string'])\n                    print_rel_debug(sentences, cons, r['Arg1'], r['Arg2'])\n                    print('-' * 80)\n    fp.close()\n\n\ndef split_doc(train1, train2, dev_docids, dest_dir):\n    train1_df = pd.read_csv(train1, sep='\\t')\n    train2_df = pd.read_csv(train2, sep='\\t')\n    train_df = pd.concat([train1_df, train2_df])\n\n    with open(dev_docids) as fp:\n        # strip trailing newlines so the docid membership test below can match\n        dev_docids = [line.strip() for line in fp]\n\n    with open(dest_dir / 'train.tsv', 'w') as tfp, open(dest_dir / 'dev.tsv', 'w') as dfp:\n        twriter = csv.writer(tfp, delimiter='\\t', lineterminator='\\n')\n        twriter.writerow(['index', 'sentence', 'label'])\n        dwriter = csv.writer(dfp, delimiter='\\t', lineterminator='\\n')\n        dwriter.writerow(['index', 'sentence', 'label'])\n        for i, row in train_df.iterrows():\n            if row[0][:row[0].find('.')] in dev_docids:\n                dwriter.writerow(row)\n            else:\n                twriter.writerow(row)\n\n\ndef create_i2b2_bert(gold_directory, output_directory):\n    data_path = Path(gold_directory)\n    dest_path = Path(output_directory)\n    convert(data_path / 'original/reference_standard_for_test_data',\n            dest_path / 'test.tsv')\n    convert(data_path / 'original/concept_assertion_relation_training_data/beth',\n            dest_path / 'train-beth.tsv')\n    convert(data_path / 'original/concept_assertion_relation_training_data/partners',\n            dest_path / 'train-partners.tsv')\n    split_doc(dest_path / 'train-beth.tsv',\n           
   dest_path / 'train-partners.tsv',\n              data_path / 'dev-docids.txt',\n              dest_path)\n\n\nif __name__ == '__main__':\n    fire.Fire(create_i2b2_bert)\n"
  },
  {
    "path": "blue/bert/create_mednli_bert.py",
    "content": "import csv\nimport json\n\nimport fire\nimport tqdm\nfrom pathlib import Path\n\nfrom blue.ext import pstring\n\n\ndef convert(src, dest):\n    with open(src, encoding='utf8') as fin, open(dest, 'w', encoding='utf8') as fout:\n        writer = csv.writer(fout, delimiter='\\t', lineterminator='\\n')\n        writer.writerow(['index', 'sentence1', 'sentence2', 'label'])\n        for line in tqdm.tqdm(fin):\n            line = pstring.printable(line, greeklish=True)\n            obj = json.loads(line)\n            writer.writerow([obj['pairID'], obj['sentence1'], obj['sentence2'], obj['gold_label']])\n\n\ndef create_mednli(input_dir, output_dir):\n    mednli_dir = Path(input_dir)\n    output_dir = Path(output_dir)\n    for src_name, dst_name in zip(['mli_train_v1.jsonl', 'mli_dev_v1.jsonl', 'mli_test_v1.jsonl'],\n                                  ['train.tsv', 'dev.tsv', 'test.tsv']):\n        source = mednli_dir / src_name\n        dest = output_dir / dst_name\n        convert(source, dest)\n\n\nif __name__ == '__main__':\n    fire.Fire(create_mednli)\n"
  },
  {
    "path": "blue/create_bert.sh",
    "content": "#!/usr/bin/env bash\npython blue/bert/create_mednli_bert.py data\\mednli\\Original data\\mednli\npython blue/bert/create_chemprot_bert.py data\\ChemProt\\original data\\ChemProt\\\npython blue/bert/create_ddi_bert.py data\\ddi2013-type\\original\\Test data\\ddi2013-type\\test.tsv"
  },
  {
    "path": "blue/create_gs.sh",
    "content": "#!/usr/bin/env bash\npython blue/gs/create_cdr_test_gs.py \\\n    --input data\\BC5CDR\\Original\\CDR_TestSet.PubTator.txt \\\n    --output data\\BC5CDR\\CDR_TestSet.chem.jsonl \\\n    --type Chemical\n\npython blue/gs/create_cdr_test_gs.py \\\n    --input data\\BC5CDR\\Original\\CDR_TestSet.PubTator.txt \\\n    --output data\\BC5CDR\\CDR_TestSet.disease.jsonl \\\n    --type Disease\n\npython blue/gs/create_clefe_test_gs.py \\\n    --reports_dir data\\ShAReCLEFEHealthCorpus\\Origin\\Task1TestSetCorpus100\\ALLREPORTS \\\n    --anns_dir data\\ShAReCLEFEHealthCorpus\\Origin\\Task1Gold_SN2012\\Gold_SN2012 \\\n    --output data\\ShAReCLEFEHealthCorpus\\Task1TestSetCorpus100_test_gs.jsonl\n\npython blue/gs/create_chemprot_test_gs.py \\\n    --entities data\\ChemProt\\original\\chemprot_test_gs\\chemprot_test_entities_gs.tsv \\\n    --relations data\\ChemProt\\original\\chemprot_test_gs\\chemprot_test_gold_standard.tsv \\\n    --output data\\ChemProt\\chemprot_test_gs.tsv\n\npython blue\\gs\\create_ddi_test_gs.py \\\n    --input_dir data\\ddi2013-type\\original\\Test \\\n    --output data\\ddi2013-type\\test_gs.tsv\n\npython blue\\gs\\create_i2b2_test_gs.py \\\n    --input_dir data\\i2b2-2010\\Original\\reference_standard_for_test_data \\\n    --output_dir data\\i2b2-2010\\\n\npython blue\\gs\\create_mednli_test_gs.py \\\n    --input data\\mednli\\Orignial\\mli_test_v1.jsonl\n    --output data\\mednli\\test_gs.tsv\n\n# eval\npython blue/eval_rel.py data\\ChemProt\\chemprot_test_gs.tsv data\\ChemProt\\chemprot_test_gs.tsv"
  },
  {
    "path": "blue/eval_hoc.py",
    "content": "import fire\nimport pandas as pd\nimport numpy as np\n\nfrom blue.ext.pmetrics import divide\n\nLABELS = ['activating invasion and metastasis', 'avoiding immune destruction',\n          'cellular energetics', 'enabling replicative immortality', 'evading growth suppressors',\n          'genomic instability and mutation', 'inducing angiogenesis', 'resisting cell death',\n          'sustaining proliferative signaling', 'tumor promoting inflammation']\n\n\ndef get_p_r_f_arrary(test_predict_label, test_true_label):\n    num, cat = test_predict_label.shape\n    acc_list = []\n    prc_list = []\n    rec_list = []\n    f_score_list = []\n    for i in range(num):\n        label_pred_set = set()\n        label_gold_set = set()\n\n        for j in range(cat):\n            if test_predict_label[i, j] == 1:\n                label_pred_set.add(j)\n            if test_true_label[i, j] == 1:\n                label_gold_set.add(j)\n\n        uni_set = label_gold_set.union(label_pred_set)\n        intersec_set = label_gold_set.intersection(label_pred_set)\n\n        tt = len(intersec_set)\n        if len(label_pred_set) == 0:\n            prc = 0\n        else:\n            prc = tt / len(label_pred_set)\n\n        acc = tt / len(uni_set)\n\n        rec = tt / len(label_gold_set)\n\n        if prc == 0 and rec == 0:\n            f_score = 0\n        else:\n            f_score = 2 * prc * rec / (prc + rec)\n\n        acc_list.append(acc)\n        prc_list.append(prc)\n        rec_list.append(rec)\n        f_score_list.append(f_score)\n\n    mean_prc = np.mean(prc_list)\n    mean_rec = np.mean(rec_list)\n    f_score = divide(2 * mean_prc * mean_rec, (mean_prc + mean_rec))\n    return mean_prc, mean_rec, f_score\n\n\ndef eval_hoc(true_file, pred_file):\n    data = {}\n\n    true_df = pd.read_csv(true_file, sep='\\t')\n    pred_df = pd.read_csv(pred_file, sep='\\t')\n    assert len(true_df) == len(pred_df), \\\n        f'Gold line no {len(true_df)} vs Prediction line no 
{len(pred_df)}'\n\n    for i in range(len(true_df)):\n        true_row = true_df.iloc[i]\n        pred_row = pred_df.iloc[i]\n        assert true_row['index'] == pred_row['index'], \\\n            'Index does not match @{}: {} vs {}'.format(i, true_row['index'], pred_row['index'])\n\n        key = true_row['index'][:true_row['index'].find('_')]\n        if key not in data:\n            data[key] = (set(), set())\n\n        if not pd.isna(true_row['labels']):\n            for l in true_row['labels'].split(','):\n                data[key][0].add(LABELS.index(l))\n\n        if not pd.isna(pred_row['labels']):\n            for l in pred_row['labels'].split(','):\n                data[key][1].add(LABELS.index(l))\n\n    assert len(data) == 315, 'There are 315 documents in the test set: %d' % len(data)\n\n    y_test = []\n    y_pred = []\n    for k, (true, pred) in data.items():\n        t = [0] * len(LABELS)\n        for i in true:\n            t[i] = 1\n\n        p = [0] * len(LABELS)\n        for i in pred:\n            p[i] = 1\n\n        y_test.append(t)\n        y_pred.append(p)\n\n    y_test = np.array(y_test)\n    y_pred = np.array(y_pred)\n\n    # get_p_r_f_arrary returns (precision, recall, f-score); unpack in that order\n    p, r, f1 = get_p_r_f_arrary(y_pred, y_test)\n    print('Precision: {:.1f}'.format(p*100))\n    print('Recall   : {:.1f}'.format(r*100))\n    print('F1       : {:.1f}'.format(f1*100))\n\n\nif __name__ == '__main__':\n    fire.Fire(eval_hoc)\n"
  },
  {
    "path": "blue/eval_mednli.py",
    "content": "import fire\nimport pandas as pd\n\nfrom blue.ext import pmetrics\n\nlabels = ['contradiction', 'entailment', 'neutral']\n\n\ndef eval_mednli(gold_file, pred_file):\n    true_df = pd.read_csv(gold_file, sep='\\t')\n    pred_df = pd.read_csv(pred_file, sep='\\t')\n    assert len(true_df) == len(pred_df), \\\n        f'Gold line no {len(true_df)} vs Prediction line no {len(pred_df)}'\n\n    y_test = []\n    y_pred = []\n    for i in range(len(true_df)):\n        true_row = true_df.iloc[i]\n        pred_row = pred_df.iloc[i]\n        assert true_row['index'] == pred_row['index'], \\\n            'Index does not match @{}: {} vs {}'.format(i, true_row['index'], pred_row['index'])\n        y_test.append(labels.index(true_row['label']))\n        y_pred.append(labels.index(pred_row['label']))\n    result = pmetrics.classification_report(y_test, y_pred, classes_=labels, macro=False, micro=True)\n    print(result.report)\n\n\nif __name__ == '__main__':\n    fire.Fire(eval_mednli)\n"
  },
  {
    "path": "blue/eval_ner.py",
    "content": "from typing import List\n\nimport fire\n\nfrom ext import pmetrics\nfrom ext.data_structure import read_annotations, Annotation\n\n\ndef has_strict(target: Annotation, lst: List[Annotation]):\n    for x in lst:\n        if target.strict_equal(x):\n            return True\n    return False\n\n\ndef eval_cdr(gold_file, pred_file):\n    golds = read_annotations(gold_file)\n    preds = read_annotations(pred_file)\n\n    # tp\n    tps = []\n    fns = []\n    fps = []\n    for g in golds:\n        if has_strict(g, preds):\n            tps.append(g)\n        else:\n            fns.append(g)\n    tps2 = []\n    for p in preds:\n        if has_strict(p, golds):\n            tps2.append(p)\n        else:\n            fps.append(p)\n\n    tp = len(tps)\n    fp = len(fps)\n    fn = len(fns)\n    tp2 = len(tps2)\n\n    if tp != tp2:\n        print(f'TP: {tp} vs TPs: {tp2}')\n\n    TPR = pmetrics.tpr(tp, 0, fp, fn)\n    PPV = pmetrics.ppv(tp, 0, fp, fn)\n    F1 = pmetrics.f1(PPV, TPR)\n    print('tp:      {}'.format(tp))\n    print('fp:      {}'.format(fp))\n    print('fn:      {}'.format(fn))\n    print('pre:     {:.1f}'.format(PPV * 100))\n    print('rec:     {:.1f}'.format(TPR * 100))\n    print('f1:      {:.1f}'.format(F1 * 100))\n    print('support: {}'.format(tp + fn))\n\n\nif __name__ == '__main__':\n    fire.Fire(eval_cdr)\n"
  },
  {
    "path": "blue/eval_rel.py",
    "content": "import json\nimport logging\n\nimport fire\nimport pandas as pd\n\nfrom blue.ext import pmetrics\n\nall_labels = set()\n\n\ndef _read_relations(pathname):\n    objs = []\n    df = pd.read_csv(pathname, sep='\\t')\n    for i, row in df.iterrows():\n        obj = {'docid': row['docid'], 'id': row['id'], 'arg1': row['arg1'],\n               'arg2': row['arg2'], 'label': row['label']}\n        objs.append(obj)\n        all_labels.add(obj['label'])\n    return objs\n\n\ndef eval_chemprot(gold_file, pred_file):\n    trues = _read_relations(gold_file)\n    preds = _read_relations(pred_file)\n    if len(trues) != len(preds):\n        logging.error('%s-%s: Unmatched line no %s vs %s',\n                      gold_file, pred_file, len(trues), len(preds))\n        exit(1)\n\n    labels = list(sorted(all_labels))\n\n    y_test = []\n    y_pred = []\n    for i, (t, p) in enumerate(zip(trues, preds)):\n        if t['docid'] != p['docid'] or t['arg1'] != p['arg1'] or t['arg2'] != p['arg2']:\n            logging.warning('%s:%s-%s:%s: Cannot match %s vs %s',\n                            gold_file, i, pred_file, i, t, p)\n            continue\n        y_test.append(labels.index(t['label']))\n        y_pred.append(labels.index(p['label']))\n\n    result = pmetrics.classification_report(y_test, y_pred, macro=False,\n                                            micro=True, classes_=labels)\n    print(result.report)\n    print()\n\n    subindex = [i for i in range(len(labels)) if labels[i] != 'false']\n    result = result.sub_report(subindex, macro=False, micro=True)\n    print(result.report)\n\n\nif __name__ == '__main__':\n    fire.Fire(eval_chemprot)\n"
  },
  {
    "path": "blue/eval_sts.py",
    "content": "import fire\r\nimport pandas as pd\r\n\r\n\r\ndef eval_sts(true_file, pred_file):\r\n    true_df = pd.read_csv(true_file, sep='\\t')\r\n    pred_df = pd.read_csv(pred_file, sep='\\t')\r\n    assert len(true_df) == len(pred_df), \\\r\n        f'Gold line no {len(true_df)} vs Prediction line no {len(pred_df)}'\r\n    for i in range(len(true_df)):\r\n        true_row = true_df.iloc[i]\r\n        pred_row = pred_df.iloc[i]\r\n        assert true_row['index'] == pred_row['index'], \\\r\n            'Index does not match @{}: {} vs {}'.format(i, true_row['index'], pred_row['index'])\r\n    print('Pearson correlation: {}'.format(true_df['score'].corr(pred_df['score'])))\r\n\r\n\r\nif __name__ == '__main__':\r\n    fire.Fire(eval_sts)\r\n"
  },
  {
    "path": "blue/ext/__init__.py",
    "content": ""
  },
  {
    "path": "blue/ext/data_structure.py",
    "content": "import json\nfrom typing import List, Any, Dict\n\n\nclass Span:\n    def __init__(self, start: int, end: int, text: str):\n        self.start = start\n        self.end = end\n        self.text = text\n\n    def __str__(self):\n        return f'[start={self.start}, end={self.end}, text={self.text}]'\n\n    def __repr__(self):\n        return str(self)\n\n\nclass Annotation:\n    def __init__(self, id: str, docid: str, spans: List[Span], type: str):\n        self.spans = spans\n        self.id = id\n        self.docid = docid\n        self.type = type\n\n    def __str__(self):\n        return f'docid={self.docid}, spans={self.spans}'\n\n    def __repr__(self):\n        return str(self)\n\n    def strict_equal(self, another: 'Annotation') -> bool:\n        if self.docid != another.docid:\n            return False\n        if len(self.spans) != len(another.spans):\n            return False\n        for s1, s2 in zip(self.spans, another.spans):\n            if s1.start != s2.start:\n                return False\n            if s1.end != s2.end:\n                return False\n        return True\n\n    def relaxed_equal(self, another: 'Annotation') -> bool:\n        if self.docid != another.docid:\n            return False\n        for s1 in self.spans:\n            for s2 in another.spans:\n                if s2.start >= s1.end or s1.start >= s2.end:\n                    continue\n                return True\n        return False\n\n    def to_obj(self) -> Dict:\n        return {\n            'id': self.id,\n            'docid': self.docid,\n            'locations': [{'start': s.start, 'end': s.end, 'text': s.text} for s in self.spans],\n            'type': self.type\n        }\n\n    @staticmethod\n    def from_obj(obj: Any) -> \"Annotation\":\n        return Annotation(obj['id'], obj['docid'],\n                          [Span(o['start'], o['end'], o['text']) for o in obj['locations']],\n                          obj['type'])\n\n\ndef 
read_annotations(pathname) -> List[Annotation]:\n    anns = []\n    with open(pathname) as fp:\n        for line in fp:\n            obj = json.loads(line)\n            anns.append(Annotation.from_obj(obj))\n    return anns\n"
  },
  {
    "path": "blue/ext/pmetrics.py",
    "content": "\"\"\"\r\nCopyright (c) 2019, Yifan Peng\r\nAll rights reserved.\r\n\r\nRedistribution and use in source and binary forms, with or without modification,\r\nare permitted provided that the following conditions are met:\r\n\r\n* Redistributions of source code must retain the above copyright notice, this\r\n  list of conditions and the following disclaimer.\r\n\r\n* Redistributions in binary form must reproduce the above copyright notice, this\r\n  list of conditions and the following disclaimer in the documentation and/or\r\n  other materials provided with the distribution.\r\n\r\n* Neither the name of the copyright holder nor the names of its\r\n  contributors may be used to endorse or promote products derived from\r\n  this software without specific prior written permission.\r\n\r\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\r\nANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\r\nWARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\r\nDISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR\r\nANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\r\n(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\r\nLOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON\r\nANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\r\n(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\r\nSOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\r\n\"\"\"\r\n\r\nimport numpy as np\r\nimport pandas as pd\r\nimport tabulate\r\nfrom sklearn import metrics\r\nfrom sklearn.metrics import precision_recall_fscore_support\r\n\r\n\r\nclass Report(object):\r\n    def __init__(self):\r\n        self.report = None\r\n        self.table = None\r\n        self.overall_acc = None\r\n        self.kappa = None\r\n        self.weighted_acc = None\r\n        self.confusion = None\r\n\r\n    def sensitivity(self, targetclass):\r\n        return self.table.iloc[targetclass, 9]\r\n\r\n    def specificity(self, targetclass):\r\n        return self.table.iloc[targetclass, 10]\r\n\r\n    def precision(self, targetclass):\r\n        return self.table.iloc[targetclass, 5]\r\n\r\n    def recall(self, targetclass):\r\n        return self.table.iloc[targetclass, 6]\r\n\r\n    def f1(self, targetclass):\r\n        return self.table.iloc[targetclass, 7]\r\n\r\n    def sub_report(self, targetclasses, *_, **kwargs) -> 'Report':\r\n        digits = kwargs.pop('digits', 3)\r\n        macro = kwargs.pop('macro', False)\r\n        has_micro = kwargs.pop('micro', False)\r\n\r\n        TP = np.zeros(len(targetclasses))\r\n        TN = np.zeros(len(targetclasses))\r\n        FP = np.zeros(len(targetclasses))\r\n        FN = np.zeros(len(targetclasses))\r\n        for i, targetclass in enumerate(targetclasses):\r\n            TP[i] = self.table.iloc[targetclass, 1]\r\n            TN[i] = self.table.iloc[targetclass, 2]\r\n   
         FP[i] = self.table.iloc[targetclass, 3]\r\n            FN[i] = self.table.iloc[targetclass, 4]\r\n\r\n        TPR = tpr(TP, TN, FP, FN)\r\n        TNR = tnr(TP, TN, FP, FN)\r\n        PPV = ppv(TP, TN, FP, FN)\r\n        NPV = npv(TP, TN, FP, FN)\r\n        ACC = accuracy(TP, TN, FP, FN)\r\n        F1 = f1(PPV, TPR)\r\n\r\n        headings = ['Class', 'TP', 'TN', 'FP', 'FN',\r\n                    'Precision', 'Recall', 'F-score',\r\n                    'Accuracy', 'Sensitivity', 'Specificity', 'PPV', 'NPV',\r\n                    'Support']\r\n        tables = []\r\n\r\n        for i, targetclass in enumerate(targetclasses):\r\n            row = [t[i] for t in (TP, TN, FP, FN, PPV, TPR, F1, ACC, TPR, TNR, PPV, NPV, TP + FN)]\r\n            tables.append([self.table.iloc[targetclass, 0]] + row)\r\n\r\n        if has_micro:\r\n            row = list(micro(TP, TN, FP, FN))\r\n            tables.append(['micro'] + row)\r\n\r\n        if macro:\r\n            row = [np.nan] * 4\r\n            row += [np.average(t) for t in [PPV, TPR, F1, ACC, TPR, TNR, PPV, NPV]]\r\n            row += [np.nan]\r\n            tables.append(['macro'] + row)\r\n\r\n        df = pd.DataFrame(tables, columns=headings)\r\n        float_formatter = ['g'] * 5 + ['.{}f'.format(digits)] * 8 + ['g']\r\n        rtn = Report()\r\n        rtn.report = tabulate.tabulate(df, showindex=False, headers=df.columns,\r\n                                       tablefmt=\"plain\", floatfmt=float_formatter)\r\n        rtn.table = df\r\n        rtn.overall_acc = overall_acc(TP, FN, FP, FN)\r\n        rtn.weighted_acc = weighted_acc(TP, FN, FP, FN)\r\n        return rtn\r\n\r\n\r\ndef divide(x, y):\r\n    # use the builtin float dtype; the np.float alias is removed in modern NumPy\r\n    return np.true_divide(x, y, out=np.zeros_like(x, dtype=float), where=y != 0)\r\n\r\n\r\ndef tpr(tp, tn, fp, fn):\r\n    \"\"\"Sensitivity, hit rate, recall, or true positive rate\"\"\"\r\n    return divide(tp, tp + fn)\r\n\r\n\r\ndef tnr(tp, tn, fp, fn):\r\n    \"\"\"Specificity or true negative 
rate\"\"\"\r\n    return divide(tn, tn + fp)\r\n\r\n\r\ndef tp_tn_fp_fn(confusion_matrix):\r\n    FP = np.sum(confusion_matrix, axis=0) - np.diag(confusion_matrix)\r\n    FN = np.sum(confusion_matrix, axis=1) - np.diag(confusion_matrix)\r\n    TP = np.diag(confusion_matrix)\r\n    TN = np.sum(confusion_matrix) - (FP + FN + TP)\r\n    return TP, TN, FP, FN\r\n\r\n\r\ndef ppv(tp, tn, fp, fn):\r\n    \"\"\"Precision or positive predictive value\"\"\"\r\n    return divide(tp, tp + fp)\r\n\r\n\r\ndef npv(tp, tn, fp, fn):\r\n    \"\"\"Negative predictive value\"\"\"\r\n    return divide(tn, tn + fn)\r\n\r\n\r\ndef fpr(tp, tn, fp, fn):\r\n    \"\"\"Fall out or false positive rate\"\"\"\r\n    return divide(fp, fp + tn)\r\n\r\n\r\ndef fnr(tp, tn, fp, fn):\r\n    \"\"\"False negative rate\"\"\"\r\n    return divide(fn, tp + fn)\r\n\r\n\r\ndef fdr(tp, tn, fp, fn):\r\n    \"\"\"False discovery rate\"\"\"\r\n    return divide(fp, tp + fp)\r\n\r\n\r\ndef accuracy(tp, tn, fp, fn):\r\n    \"\"\"tp / N, same as recall\"\"\"\r\n    return divide(tp, tp + fn)\r\n\r\n\r\ndef f1(precision, recall):\r\n    return divide(2 * precision * recall, precision + recall)\r\n\r\n\r\ndef cohen_kappa(confusion, weights=None):\r\n    n_classes = confusion.shape[0]\r\n    sum0 = np.sum(confusion, axis=0)\r\n    sum1 = np.sum(confusion, axis=1)\r\n    expected = divide(np.outer(sum0, sum1), np.sum(sum0))\r\n\r\n    if weights is None:\r\n        w_mat = np.ones([n_classes, n_classes], dtype=int)\r\n        w_mat.flat[:: n_classes + 1] = 0\r\n    elif weights == \"linear\" or weights == \"quadratic\":\r\n        w_mat = np.zeros([n_classes, n_classes], dtype=int)\r\n        w_mat += np.arange(n_classes)\r\n        if weights == \"linear\":\r\n            w_mat = np.abs(w_mat - w_mat.T)\r\n        else:\r\n            w_mat = (w_mat - w_mat.T) ** 2\r\n    else:\r\n        raise ValueError(\"Unknown kappa weighting type.\")\r\n\r\n    k = divide(np.sum(w_mat * confusion), np.sum(w_mat * expected))\r\n    return 1 - k\r\n
\r\n\r\ndef micro(tp, tn, fp, fn):\r\n    \"\"\"Returns tp, tn, fp, fn, ppv, tpr, f1, acc, tpr, tnr, ppv, npv, support\"\"\"\r\n    TP, TN, FP, FN = [np.sum(t) for t in [tp, tn, fp, fn]]\r\n    TPR = tpr(TP, TN, FP, FN)\r\n    TNR = tnr(TP, TN, FP, FN)\r\n    PPV = ppv(TP, TN, FP, FN)\r\n    NPV = npv(TP, TN, FP, FN)\r\n    FPR = fpr(TP, TN, FP, FN)\r\n    FNR = fnr(TP, TN, FP, FN)\r\n    FDR = fdr(TP, TN, FP, FN)\r\n    F1 = f1(PPV, TPR)\r\n    return TP, TN, FP, FN, PPV, TPR, F1, np.nan, TPR, TNR, PPV, NPV, TP + FN\r\n\r\n\r\ndef overall_acc(tp, tn, fp, fn):\r\n    \"\"\"Same as micro recall.\"\"\"\r\n    return divide(np.sum(tp), np.sum(tp + fn))\r\n\r\n\r\ndef weighted_acc(tp, tn, fp, fn):\r\n    weights = tp + fn\r\n    portion = divide(weights, np.sum(weights))\r\n    acc = accuracy(tp, tn, fp, fn)\r\n    return np.average(acc, weights=portion)\r\n\r\n\r\ndef micro_weighted(tp, tn, fp, fn):\r\n    weights = tp + fn\r\n    portion = divide(weights, np.sum(weights))\r\n    # print(portion)\r\n    TP, TN, FP, FN = [np.average(t, weights=portion) for t in [tp, tn, fp, fn]]\r\n    TPR = tpr(TP, TN, FP, FN)\r\n    TNR = tnr(TP, TN, FP, FN)\r\n    PPV = ppv(TP, TN, FP, FN)\r\n    NPV = npv(TP, TN, FP, FN)\r\n    FPR = fpr(TP, TN, FP, FN)\r\n    FNR = fnr(TP, TN, FP, FN)\r\n    FDR = fdr(TP, TN, FP, FN)\r\n    # ACC = accuracy(TP, TN, FP, FN)\r\n    F1 = f1(PPV, TPR)\r\n    return TP, TN, FP, FN, PPV, TPR, F1, np.nan, TPR, TNR, PPV, NPV, TP + FN\r\n
\r\n\r\ndef confusion_matrix_report(confusion_matrix, *_, **kwargs) -> 'Report':\r\n    classes_ = kwargs.get('classes_', None)\r\n    digits = kwargs.pop('digits', 3)\r\n    macro = kwargs.pop('macro', False)\r\n    has_micro = kwargs.pop('micro', False)\r\n    kappa_weights = kwargs.pop('kappa', None)\r\n\r\n    TP, TN, FP, FN = tp_tn_fp_fn(confusion_matrix)\r\n    TPR = tpr(TP, TN, FP, FN)\r\n    TNR = tnr(TP, TN, FP, FN)\r\n    PPV = ppv(TP, TN, FP, FN)\r\n    NPV = npv(TP, TN, FP, FN)\r\n    FPR = fpr(TP, TN, FP, FN)\r\n    FNR = fnr(TP, TN, FP, FN)\r\n    FDR = fdr(TP, TN, FP, FN)\r\n    ACC = accuracy(TP, TN, FP, FN)\r\n    F1 = f1(PPV, TPR)\r\n\r\n    if classes_ is None:\r\n        classes_ = [str(i) for i in range(confusion_matrix.shape[0])]\r\n\r\n    headings = ['Class', 'TP', 'TN', 'FP', 'FN',\r\n                'Precision', 'Recall', 'F-score',\r\n                'Accuracy', 'Sensitivity', 'Specificity', 'PPV', 'NPV',\r\n                'Support']\r\n    tables = []\r\n\r\n    for i, c in enumerate(classes_):\r\n        row = [t[i] for t in (TP, TN, FP, FN, PPV, TPR, F1, ACC, TPR, TNR, PPV, NPV, TP + FN)]\r\n        tables.append([str(c)] + row)\r\n\r\n    if has_micro:\r\n        row = list(micro(TP, TN, FP, FN))\r\n        tables.append(['micro'] + row)\r\n\r\n    if macro:\r\n        row = [np.nan] * 4\r\n        row += [np.average(t) for t in [PPV, TPR, F1, ACC, TPR, TNR, PPV, NPV]]\r\n        row += [np.nan]\r\n        tables.append(['macro'] + row)\r\n\r\n    df = pd.DataFrame(tables, columns=headings)\r\n    float_formatter = ['g'] * 5 + ['.{}f'.format(digits)] * 8 + ['g']\r\n    rtn = Report()\r\n    rtn.report = tabulate.tabulate(df, showindex=False, headers=df.columns,\r\n                                   tablefmt=\"plain\", floatfmt=float_formatter)\r\n    rtn.table = df\r\n    rtn.kappa = cohen_kappa(confusion_matrix, weights=kappa_weights)\r\n    rtn.overall_acc = overall_acc(TP, TN, FP, FN)\r\n    rtn.weighted_acc = weighted_acc(TP, TN, FP, FN)\r\n    rtn.confusion = pd.DataFrame(confusion_matrix)\r\n    return rtn\r\n
\r\n\r\ndef auc(y_true, y_score, y_column: int = 1):\r\n    \"\"\"Compute Area Under the Curve (AUC).\r\n\r\n    Args:\r\n        y_true: (n_sample, )\r\n        y_score: (n_sample, n_classes)\r\n        y_column: column of y\r\n    \"\"\"\r\n    fpr, tpr, _ = metrics.roc_curve(y_true, y_score[:, y_column], pos_label=1)\r\n    roc_auc = metrics.auc(fpr, tpr)\r\n    return roc_auc, fpr, tpr\r\n\r\n\r\ndef multi_class_auc(y_true, y_score):\r\n    \"\"\"Compute Area Under the Curve (AUC).\r\n\r\n    Args:\r\n        y_true: (n_sample, n_classes)\r\n        y_score: (n_sample, n_classes)\r\n    \"\"\"\r\n    assert y_score.shape[1] == y_true.shape[1]\r\n\r\n    fpr = dict()\r\n    tpr = dict()\r\n    roc_auc = dict()\r\n    n_classes = y_score.shape[1]\r\n    for i in range(n_classes):\r\n        fpr[i], tpr[i], _ = metrics.roc_curve(y_true[:, i], y_score[:, i])\r\n        roc_auc[i] = metrics.auc(fpr[i], tpr[i])\r\n\r\n    fpr[\"micro\"], tpr[\"micro\"], _ = metrics.roc_curve(y_true.ravel(), y_score.ravel())\r\n    roc_auc[\"micro\"] = metrics.auc(fpr[\"micro\"], tpr[\"micro\"])\r\n    return roc_auc, fpr, tpr\r\n
\r\n\r\ndef classification_report(y_true, y_pred, *_, **kwargs) -> 'Report':\r\n    \"\"\"\r\n    Args:\r\n        y_true: (n_sample, )\r\n        y_pred: (n_sample, )\r\n    \"\"\"\r\n    m = metrics.confusion_matrix(y_true, y_pred)\r\n    report = confusion_matrix_report(m, **kwargs)\r\n\r\n    confusion = pd.DataFrame(m)\r\n    if 'classes_' in kwargs:\r\n        confusion.index = kwargs['classes_']\r\n        confusion.columns = kwargs['classes_']\r\n    report.confusion = confusion\r\n    return report\r\n\r\n\r\ndef precision_recall_fscore_multilabel(y_true, y_pred, *_, **kwargs):\r\n    \"\"\"\r\n    Args:\r\n        y_true: (n_sample, n_classes)\r\n        y_pred: (n_sample, n_classes)\r\n    \"\"\"\r\n    example_based = kwargs.pop('example_based', False)\r\n    if example_based:\r\n        rs = []\r\n        ps = []\r\n        for yt, yp in zip(y_true, y_pred):\r\n            p, r, _, _ = precision_recall_fscore_support(y_true=yt, y_pred=yp,\r\n                                                         pos_label=1, average='binary')\r\n            rs.append(r)\r\n            ps.append(p)\r\n        r = np.average(rs)\r\n        p = np.average(ps)\r\n        f1 = divide(2 * r * p, r + p)\r\n    else:\r\n        raise NotImplementedError\r\n    return r, p, f1\r\n
\r\n\r\n\"\"\"\r\nTest cases\r\n\"\"\"\r\n\r\n\r\ndef test_cm1():\r\n    cm = np.asarray([[20, 5], [10, 15]])\r\n    k = cohen_kappa(cm)\r\n    assert np.isclose(k, 0.4, rtol=1e-01)\r\n\r\n    k = cohen_kappa(cm, weights='linear')\r\n    assert np.isclose(k, 0.4, rtol=1e-01)\r\n\r\n\r\ndef test_cm2():\r\n    cm = np.array([\r\n        [236, 29, 7, 4, 8, 5, 3, 3, 1, 0, 5, 6, 1],\r\n        [45, 3724, 547, 101, 102, 16, 0, 0, 2, 0, 0, 11, 0],\r\n        [5, 251, 520, 132, 158, 11, 2, 1, 4, 0, 0, 4, 0],\r\n        [0, 9, 71, 78, 63, 14, 2, 0, 0, 0, 0, 1, 0],\r\n        [8, 37, 152, 144, 501, 200, 71, 11, 30, 3, 0, 18, 0],\r\n        [5, 6, 6, 24, 144, 178, 136, 34, 30, 1, 0, 20, 0],\r\n        [5, 2, 2, 3, 53, 115, 333, 106, 69, 4, 0, 36, 0],\r\n        [2, 0, 0, 0, 1, 9, 99, 247, 119, 8, 0, 26, 0],\r\n        [3, 2, 4, 7, 30, 54, 113, 124, 309, 78, 22, 72, 6],\r\n        [1, 0, 0, 0, 1, 0, 2, 0, 5, 46, 17, 25, 0],\r\n        [1, 0, 0, 0, 0, 0, 0, 0, 7, 53, 229, 28, 34],\r\n        [18, 16, 5, 5, 16, 10, 11, 25, 29, 38, 70, 1202, 99],\r\n        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 9, 11]\r\n    ]).transpose()\r\n    k = cohen_kappa(cm)\r\n    assert np.isclose(k, 0.5547, rtol=1e-04)\r\n\r\n    k = cohen_kappa(cm, weights='quadratic')\r\n    assert np.isclose(k, 0.9214, rtol=1e-04)\r\n\r\n    k = cohen_kappa(cm, weights='linear')\r\n    assert np.isclose(k, 0.8332, rtol=1e-04)\r\n\r\n\r\ndef test_kappa():\r\n    \"\"\"\r\n    Example 10.52\r\n\r\n    Bernard Rosner, Fundamentals of Biostatistics (8th ed). Cengage Learning. 2016. p.434\r\n    \"\"\"\r\n    cm = [[136, 92], [69, 240]]\r\n    k = cohen_kappa(np.array(cm))\r\n    assert np.isclose(k, 0.378, rtol=1e-03)\r\n
\r\n\r\ndef test_precision_recall_fscore_multilabel():\r\n    y_true = np.array([[0, 0, 1, 0]])\r\n    y_pred = np.array([[0, 1, 1, 0]])\r\n    r, p, f1 = precision_recall_fscore_multilabel(y_true, y_pred, example_based=True)\r\n    assert r == 1\r\n    assert p == 0.5\r\n\r\n    y_true = np.array([[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]])\r\n    y_pred = np.array([[1, 0, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]])\r\n    r, p, f1 = precision_recall_fscore_multilabel(y_true, y_pred, example_based=True)\r\n    assert r == 1\r\n    assert np.isclose(p, 0.888, rtol=1e-02)\r\n\r\n\r\nif __name__ == '__main__':\r\n    test_precision_recall_fscore_multilabel()\r\n"
  },
  {
    "path": "blue/ext/preprocessing.py",
    "content": "import csv\nimport re\n\nimport bioc\nimport en_core_web_sm\n\nnlp = en_core_web_sm.load()\n\n\ndef split_punct(text, start):\n    for m in re.finditer(r\"\"\"[\\w']+|[!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~]\"\"\", text):\n        yield m.group(), m.start() + start, m.end() + start\n\n\ndef tokenize_text(text, id):\n    sentences = []\n    doc = nlp(text)\n    for sent in doc.sents:\n        sentence = bioc.BioCSentence()\n        sentence.infons['filename'] = id\n        sentence.offset = sent.start_char\n        sentence.text = text[sent.start_char:sent.end_char]\n        sentences.append(sentence)\n        i = 0\n        for token in sent:\n            for t, start, end in split_punct(token.text, token.idx):\n                ann = bioc.BioCAnnotation()\n                ann.id = f'a{i}'\n                ann.text = t\n                ann.add_location(bioc.BioCLocation(start, end-start))\n                sentence.add_annotation(ann)\n                i += 1\n    return sentences\n\n\ndef print_ner_debug(sentences, start, end):\n    anns = []\n    for sentence in sentences:\n        for ann in sentence.annotations:\n            span = ann.total_span\n            if start <= span.offset <= end \\\n                    or start <= span.offset + span.length <= end:\n                anns.append(ann)\n    print('-' * 80)\n    if len(anns) != 0:\n        for ann in anns:\n            print(ann)\n    print('-' * 80)\n    ss = [s for s in sentences if s.offset <= start <= s.offset + len(s.text)]\n    if len(ss) != 0:\n        for s in ss:\n            print(s.offset, s.text)\n    else:\n        for s in sentences:\n            print(s.offset, s.text)\n\n\ndef write_bert_ner_file(dest, total_sentences):\n    cnt = 0\n    with open(dest, 'w') as fp:\n        writer = csv.writer(fp, delimiter='\\t', lineterminator='\\n')\n        for sentence in total_sentences:\n            for i, ann in enumerate(sentence.annotations):\n                if 'NE_label' not in 
ann.infons:\n                    ann.infons['NE_label'] = 'O'\n                elif ann.infons['NE_label'] == 'B':\n                    cnt += 1\n                if i == 0:\n                    writer.writerow([ann.text, sentence.infons['filename'],\n                                     ann.total_span.offset, ann.infons['NE_label']])\n                else:\n                    writer.writerow([ann.text, '-',\n                                     ann.total_span.offset, ann.infons['NE_label']])\n            fp.write('\\n')\n    return cnt\n\n\n"
  },
  {
    "path": "blue/ext/pstring.py",
    "content": "# -*- coding: utf-8 -*-\r\n\"\"\"\r\nCopyright (c) 2019, Yifan Peng\r\nAll rights reserved.\r\n\r\nRedistribution and use in source and binary forms, with or without modification,\r\nare permitted provided that the following conditions are met:\r\n\r\n* Redistributions of source code must retain the above copyright notice, this\r\n  list of conditions and the following disclaimer.\r\n\r\n* Redistributions in binary form must reproduce the above copyright notice, this\r\n  list of conditions and the following disclaimer in the documentation and/or\r\n  other materials provided with the distribution.\r\n\r\n* Neither the name of the copyright holder nor the names of its\r\n  contributors may be used to endorse or promote products derived from\r\n  this software without specific prior written permission.\r\n\r\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\r\nANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\r\nWARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\r\nDISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR\r\nANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\r\n(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\r\nLOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON\r\nANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\r\n(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\r\nSOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\r\n\"\"\"\r\nimport logging\r\nimport string\r\n\r\nimport sympy\r\n\r\nACCENTS = {\r\n    u'ά': u'a', u'Ά': u'Α',\r\n    u'έ': u'e', u'Έ': u'Ε',\r\n    u'ή': u'h', u'Ή': u'H',\r\n    u'ί': u'e', u'Ί': u'Ι',\r\n    u'ύ': u'u', u'Ύ': u'Y',\r\n    u'ό': u'o', u'Ό': u'O',\r\n    u'ώ': u'w', u'Ώ': u'w',\r\n    u'Ã': u'A', u'Å': u'A',\r\n    u'ç': u'c', u'ï': 'i',\r\n}\r\n\r\n# The possible string conversions for each case.\r\nGREEK_CONVERT_STRINGS = {\r\n    u\"αι\": [u\"ai\", u\"e\"],\r\n    u\"Αι\": [u\"Ai\", u\"E\"],\r\n    u\"ΑΙ\": [u\"AI\", u\"E\"],\r\n    u\"ει\": [u\"ei\", u\"i\"],\r\n    u\"Ει\": [u\"Ei\", u\"I\"],\r\n    u\"ΕΙ\": [u\"EI\", u\"I\"],\r\n    u\"οι\": [u\"oi\", u\"i\"],\r\n    u\"Οι\": [u\"Oi\", u\"I\"],\r\n    u\"ΟΙ\": [u\"OI\", u\"I\"],\r\n    u\"ου\": [u\"ou\", u\"oy\", u\"u\"],\r\n    u\"Ου\": [u\"Ou\", u\"Oy\", u\"U\"],\r\n    u\"ΟΥ\": [u\"OU\", u\"OY\", u\"U\"],\r\n    u\"ευ\": [u\"eu\", u\"ef\", u\"ev\", u\"ey\"],\r\n    u\"Ευ\": [u\"Eu\", u\"Ef\", u\"Ev\", u\"Ey\"],\r\n    u\"ΕΥ\": [u\"EU\", u\"EF\", u\"EV\", u\"EY\"],\r\n    u\"αυ\": [u\"au\", u\"af\", u\"av\", u\"ay\"],\r\n    u\"Αυ\": [u\"Au\", u\"Af\", u\"Av\", u\"Ay\"],\r\n    u\"ΑΥ\": [u\"AU\", u\"AF\", u\"av\", u\"AY\"],\r\n    u\"μπ\": [u\"mp\", u\"b\"],\r\n    u\"Μπ\": [u\"Mp\", u\"B\"],\r\n    u\"ΜΠ\": [u\"MP\", u\"B\"],\r\n    u\"γγ\": [u\"gg\", u\"g\"],\r\n    u\"Γγ\": [u\"Gg\", u\"G\"],\r\n    u\"ΓΓ\": [u\"GG\", u\"G\"],\r\n    u\"γκ\": [u\"gk\", 
u\"g\"],\r\n    u\"Γκ\": [u\"Gk\", u\"G\"],\r\n    u\"ΓΚ\": [u\"GK\", u\"G\"],\r\n    u\"ντ\": [u\"nt\", u\"d\"],\r\n    u\"Ντ\": [u\"Nt\", u\"D\"],\r\n    u\"ΝΤ\": [u\"NT\", u\"D\"],\r\n    u\"α\": [u\"a\"],\r\n    u\"Α\": [u\"A\"],\r\n    u\"β\": [u\"b\", u\"v\"],\r\n    u\"Β\": [u\"B\", u\"V\"],\r\n    u\"γ\": [u\"g\"],\r\n    u\"Γ\": [u\"G\"],\r\n    u\"δ\": [u\"d\"],\r\n    u\"Δ\": [u\"D\"],\r\n    u\"ε\": [u\"e\"],\r\n    u\"Ε\": [u\"E\"],\r\n    u\"ζ\": [u\"z\"],\r\n    u\"Ζ\": [u\"Z\"],\r\n    u\"η\": [u\"h\", u\"i\"],\r\n    u\"Η\": [u\"H\", u\"I\"],\r\n    u\"θ\": [u\"th\", u\"8\"],\r\n    u\"Θ\": [u\"TH\", u\"8\"],\r\n    u\"ι\": [u\"i\"],\r\n    u\"Ι\": [u\"I\"],\r\n    u\"κ\": [u\"k\"],\r\n    u\"Κ\": [u\"K\"],\r\n    u\"λ\": [u\"l\"],\r\n    u\"Λ\": [u\"L\"],\r\n    u\"μ\": [u\"m\"],\r\n    u\"Μ\": [u\"M\"],\r\n    u\"ν\": [u\"n\"],\r\n    u\"Ν\": [u\"N\"],\r\n    u\"ξ\": [u\"x\", u\"ks\"],\r\n    u\"Ξ\": [u\"X\", u\"KS\"],\r\n    u\"ο\": [u\"o\"],\r\n    u\"Ο\": [u\"O\"],\r\n    u\"π\": [u\"p\"],\r\n    u\"Π\": [u\"P\"],\r\n    u\"ρ\": [u\"r\"],\r\n    u\"Ρ\": [u\"R\"],\r\n    u\"σ\": [u\"s\"],\r\n    u\"Σ\": [u\"S\"],\r\n    u\"ς\": [u\"s\"],\r\n    u\"τ\": [u\"t\"],\r\n    u\"Τ\": [u\"T\"],\r\n    u\"υ\": [u\"y\", u\"u\", u\"i\"],\r\n    u\"Υ\": [u\"Y\", u\"U\", u\"I\"],\r\n    u\"φ\": [u\"f\", u\"ph\"],\r\n    u\"Φ\": [u\"F\", u\"PH\"],\r\n    u\"χ\": [u\"x\", u\"h\", u\"ch\"],\r\n    u\"Χ\": [u\"X\", u\"H\", u\"CH\"],\r\n    u\"ψ\": [u\"ps\"],\r\n    u\"Ψ\": [u\"PS\"],\r\n    u\"ω\": [u\"w\", u\"o\", u\"v\"],\r\n    u\"Ω\": [u\"w\", u\"O\", u\"V\"],\r\n}\r\n\r\nOTHERS = {\r\n    u'\\xb7': '*',  # MIDDLE DOT\r\n    u'\\xb1': '+',  # PLUS-MINUS SIGN\r\n    u'\\xae': 'r',  # REGISTERED SIGN\r\n    u'\\u2002': ' ',  # EN SPACE\r\n    u'\\xa9': 'c',  # COPYRIGHT SIGN\r\n    u'\\xa0': ' ',  # NO-BREAK SPACE\r\n    u'\\u2009': ' ',  # THIN SPACE\r\n    u'\\u025b': 'e',  # LATIN SMALL LETTER OPEN E\r\n    u'\\u0303': '~',  # COMBINING TILDE\r\n    
u'\\u043a': 'k',  # CYRILLIC SMALL LETTER KA\r\n    u'\\u2005': ' ',  # FOUR-PER-EM SPACE\r\n    u'\\u200a': ' ',  # HAIR SPACE\r\n    u'\\u2026': '.',  # HORIZONTAL ELLIPSIS\r\n    u'\\u2033': '\"',  # DOUBLE PRIME\r\n    u'\\u2034': '\"',  # TRIPLE PRIME\r\n    u'\\u2075': '5',  # SUPERSCRIPT FIVE\r\n    u'\\u2077': '7',  # SUPERSCRIPT SEVEN\r\n    u'\\u2079': '9',  # SUPERSCRIPT NINE\r\n    u'\\u207a': '+',  # SUPERSCRIPT PLUS SIGN\r\n    u'\\u207b': '-',  # SUPERSCRIPT MINUS\r\n    u'\\u2080': '0',  # SUBSCRIPT ZERO\r\n    u'\\u2081': '1',  # SUBSCRIPT ONE\r\n    u'\\u2082': '2',  # SUBSCRIPT TWO\r\n    u'\\u2083': '3',  # SUBSCRIPT THREE\r\n    u'\\u2084': '4',  # SUBSCRIPT FOUR\r\n    u'\\u2085': '5',  # SUBSCRIPT FIVE\r\n    u'\\u2122': 'T',  # TRADE MARK SIGN\r\n    u'\\u2192': '>',  # RIGHTWARDS ARROW\r\n    u'\\u2217': '*',  # STERISK OPERATOR\r\n    u'\\u223c': '~',  # TILDE OPERATOR\r\n    u'\\u2248': '=',  # ALMOST EQUAL TO\r\n    u'\\u2264': '<',  # LESS-THAN OR EQUAL TO\r\n    u'\\u2265': '>',  # GREATER-THAN OR EQUAL TO\r\n    u'\\u22c5': '*',  # DOT OPERATOR\r\n    u'\\ue232': 'x',  #\r\n    u'\\ue2f6': 'x',  # Chinese character\r\n    u'\\xb0': '*',  # DEGREE SIGN\r\n    u'\\xb2': '2',  # SUPERSCRIPT TWO\r\n    u'\\xb3': '3',  # SUPERSCRIPT THREE\r\n    u'\\xb4': '\\'',  # ACUTE ACCENT\r\n    u'\\xb5': 'm',  # MICRO SIGN\r\n    u'\\xb9': '1',  # SUPERSCRIPT ONE\r\n    u'\\xc3': 'A',  # LATIN CAPITAL LETTER A WITH TILDE\r\n    u'\\xc5': 'A',  # LATIN CAPITAL LETTER A WITH RING ABOVE\r\n    u'\\xd7': '*',  # MULTIPLICATION SIGN\r\n    u'\\xe7': 'c',  # LATIN SMALL LETTER C WITH CEDILLA\r\n    u'\\xef': 'i',  # LATIN SMALL LETTER I WITH DIAERESIS\r\n    u'\\xf8': 'm',  # LATIN SMALL LETTER O WITH STROKE\r\n    u'\\xfc': 'u',  # LATIN SMALL LETTER U WITH DIAERESIS\r\n    u'\\xf6': 'o',  # LATIN SMALL LETTER O WITH DIAERESIS\r\n    u'\\u2194': '<',  # LEFT RIGHT ARROW\r\n    u'\\xe1': 'a',  # LATIN SMALL LETTER A WITH ACUTE\r\n    u'\\u221e': '~',  # 
INFINITY\r\n    u'\\u2193': '<',  # DOWNWARDS ARROW\r\n    u'\\u2022': '*',  # BULLET\r\n    u'\\u2211': 'E',  # N-ARY SUMMATION\r\n    u'\\xdf': 'b',  # LATIN SMALL LETTER SHARP S\r\n    u'\\xff': 'y',  # LATIN SMALL LETTER Y WITH DIAERESIS\r\n    u'\\u2550': '=',  # BOX DRAWINGS DOUBLE HORIZONTAL\r\n    u'\\u208b': '-',  # SUBSCRIPT MINUS\r\n    u'\\u226b': '>',  # MUCH GREATER-THAN\r\n    u'\\u2a7e': '>',  # GREATER-THAN OR SLANTED EQUAL TO\r\n    u'\\uf8ff': '*',  # Private Use, Last\r\n    u'\\xe9': 'e',  # LATIN SMALL LETTER E WITH ACUTE\r\n    u'\\u0192': 'f',  # LATIN SMALL LETTER F WITH HOOK\r\n    u'\\u3008': '(',  # LEFT ANGLE BRACKET\r\n    u'\\u3009': ')',  # RIGHT ANGLE BRACKET\r\n    u'\\u0153': 'o',  # LATIN SMALL LIGATURE OE\r\n    u'\\u2a7d': '<',  # LESS-THAN OR SLANTED EQUAL TO\r\n    u'\\u2243': '=',  # ASYMPTOTICALLY EQUAL TO\r\n    u'\\u226a': '<',  # MUCH LESS-THAN\r\n}\r\n
\r\n\r\ndef printable(s: str, greeklish=False, verbose=False, replacement=' ') -> str:\r\n    \"\"\"\r\n    Return an ASCII string containing only printable characters.\r\n    \"\"\"\r\n    out = ''\r\n    for c in s:\r\n        if c in string.printable:\r\n            out += c\r\n        else:\r\n            if greeklish:\r\n                if c in ACCENTS:\r\n                    out += ACCENTS[c]\r\n                elif c in GREEK_CONVERT_STRINGS:\r\n                    out += GREEK_CONVERT_STRINGS[c][0]\r\n                elif c in OTHERS:\r\n                    out += OTHERS[c]\r\n                else:\r\n                    if verbose:\r\n                        logging.warning('Unknown char: %r', sympy.pretty(c))\r\n                    out += replacement\r\n            else:\r\n                if verbose:\r\n                    logging.warning('Cannot convert char: %s', c)\r\n                out += replacement\r\n    return out\r\n"
  },
  {
    "path": "blue/ext/pubtator.py",
    "content": "\"\"\"\nLoads str/file-obj to a list of Pubtator objects\n\"\"\"\nimport logging\nimport re\nfrom typing import List\n\n\nclass Pubtator:\n\n    def __init__(self, pmid: str=None, title: str=None, abstract: str=None):\n        self.pmid = pmid\n        self.title = title\n        self.abstract = abstract\n        self.annotations = []  # type: List[PubtatorAnn]\n        self.relations = []  # type: List[PubtatorRel]\n\n    def __str__(self):\n        text = self.pmid + '|t|' + self.title + '\\n'\n        if self.abstract:\n            text += self.pmid + '|a|' + self.abstract + '\\n'\n        for ann in self.annotations:\n            text += '{}\\n'.format(ann)\n        for rel in self.relations:\n            text += '{}\\n'.format(rel)\n        return text\n\n    def __iter__(self):\n        yield 'pmid', self.pmid\n        yield 'title', self.title\n        yield 'abstract', self.abstract\n        yield 'annotations', [dict(a) for a in self.annotations]\n        yield 'relations', [dict(a) for a in self.relations]\n\n    @property\n    def text(self):\n        \"\"\"\n        str: text\n        \"\"\"\n        text = self.title\n        if self.abstract:\n            text += '\\n' + self.abstract\n        return text\n\n\nclass PubtatorAnn:\n    def __init__(self, pmid, start, end, text, type, id):\n        self.pmid = pmid\n        self.start = start\n        self.end = end\n        self.text = text\n        self.type = type\n        self.id = id\n        self.line = None\n\n    def __str__(self):\n        return f'{self.pmid}\\t{self.start}\\t{self.end}\\t{self.text}\\t{self.type}\\t{self.id}'\n\n    def __iter__(self):\n        yield 'pmid', self.pmid\n        yield 'start', self.start\n        yield 'end', self.end\n        yield 'text', self.text\n        yield 'type', self.type\n        yield 'id', self.id\n\n\nclass PubtatorRel:\n    def __init__(self, pmid, type, id1, id2):\n        self.pmid = pmid\n        self.type = type\n        
self.id1 = id1\n        self.id2 = id2\n        self.line = None\n\n    def __str__(self):\n        return '{self.pmid}\\t{self.type}\\t{self.id1}\\t{self.id2}'.format(self=self)\n\n    def __iter__(self):\n        yield 'pmid', self.pmid\n        yield 'type', self.type\n        yield 'id1', self.id1\n        yield 'id2', self.id2\n\n\nABSTRACT_PATTERN = re.compile(r'(.*?)\\|a\\|(.*)')\nTITLE_PATTERN = re.compile(r'(.*?)\\|t\\|(.*)')\n\n\ndef loads(s: str) -> List[Pubtator]:\n    \"\"\"\n    Parse s (a str) to a list of Pubtator documents\n\n    Returns:\n        list: a list of PubTator documents\n    \"\"\"\n    return list(__iterparse(s.splitlines()))\n\n\ndef load(fp) -> List[Pubtator]:\n    \"\"\"\n    Parse file-like object to a list of Pubtator documents\n\n    Args:\n        fp: file-like object\n\n    Returns:\n        list: a list of PubTator documents\n    \"\"\"\n    return loads(fp.read())\n\n\ndef __iterparse(line_iterator):\n    \"\"\"\n    Iterative parse each line\n    \"\"\"\n    doc = Pubtator()\n    i = 0\n    for i, line in enumerate(line_iterator, 1):\n        if i % 100000 == 0:\n            logging.debug('Read %d lines', i)\n        line = line.strip()\n        if not line:\n            if doc.pmid and (doc.title or doc.abstract):\n                yield doc\n            doc = Pubtator()\n            continue\n        matcher = TITLE_PATTERN.match(line)\n        if matcher:\n            doc.pmid = matcher.group(1)\n            doc.title = matcher.group(2)\n            continue\n        matcher = ABSTRACT_PATTERN.match(line)\n        if matcher:\n            doc.pmid = matcher.group(1)\n            doc.abstract = matcher.group(2)\n            continue\n        toks = line.split('\\t')\n        if len(toks) >= 6:\n            annotation = PubtatorAnn(toks[0], int(toks[1]), int(toks[2]), toks[3],\n                                     toks[4], toks[5])\n            annotation.line = i\n            doc.annotations.append(annotation)\n        if 
len(toks) == 4:\n            relation = PubtatorRel(toks[0], toks[1], toks[2], toks[3])\n            relation.line = i\n            doc.relations.append(relation)\n\n    if doc.pmid and (doc.title or doc.abstract):\n        yield doc\n    logging.debug('Read %d lines', i)\n"
  },
  {
    "path": "blue/gs/__init__.py",
    "content": ""
  },
  {
    "path": "blue/gs/create_cdr_test_gs.py",
    "content": "import fire\nimport jsonlines\nimport tqdm\nimport logging\nfrom blue.ext import pubtator\nfrom ext.data_structure import Span, Annotation\n\n\ndef create_test_gs(input, output, type):\n    assert type in ('Chemical', 'Disease'), \\\n        'entity_type has to be Chemical or Disease'\n\n    with open(input) as fp:\n        docs = pubtator.load(fp)\n\n    with jsonlines.open(output, 'w') as writer:\n        for doc in tqdm.tqdm(docs):\n            for i, ann in enumerate(doc.annotations):\n                if ann.type != type:\n                    continue\n                expected_text = ann.text\n                actual_text = doc.text[ann.start:ann.end]\n                if expected_text != actual_text:\n                    logging.warning('{}:{}: Text does not match. Expected {}. Actual {}'.format(\n                        output, ann.line, repr(expected_text), repr(actual_text)))\n                    continue\n                a = Annotation(ann.pmid + f'.T{i}', doc.pmid,\n                               [Span(ann.start, ann.end, ann.text)],\n                               ann.type)\n                writer.write(a.to_obj())\n\n\nif __name__ == '__main__':\n    fire.Fire(create_test_gs)\n"
  },
  {
    "path": "blue/gs/create_chemprot_test_gs.py",
"content": "import collections\nimport csv\n\nimport fire\n\n\ndef _read_entities(pathname):\n    d = collections.defaultdict(list)\n    with open(pathname, encoding='utf8') as fp:\n        for line in fp:\n            toks = line.strip().split()\n            d[toks[0]].append(toks)\n    return d\n\n\ndef _read_relations(pathname):\n    d = collections.defaultdict(list)\n    with open(pathname, encoding='utf8') as fp:\n        for line in fp:\n            toks = line.strip().split()\n            arg1 = toks[2][toks[2].find(':') + 1:]\n            arg2 = toks[3][toks[3].find(':') + 1:]\n            d[toks[0], arg1, arg2].append(toks)\n    return d\n
\n\ndef create_test_gs(entities, relations, output):\n    entities = _read_entities(entities)\n    relations = _read_relations(relations)\n\n    counter = collections.Counter()\n    with open(output, 'w') as fp:\n        writer = csv.writer(fp, delimiter='\\t', lineterminator='\\n')\n        writer.writerow(['id', 'docid', 'arg1', 'arg2', 'label'])\n        for docid, ents in entities.items():\n            chemicals = [e for e in ents if e[2] == 'CHEMICAL']\n            genes = [e for e in ents if e[2] != 'CHEMICAL']\n            i = 0\n            for c in chemicals:\n                for g in genes:\n                    k = (docid, c[1], g[1])\n                    if k in relations:\n                        for l in relations[k]:\n                            label = l[1]\n                            writer.writerow([f'{docid}.R{i}', docid, k[1], k[2], label])\n                            counter[label] += 1\n                            i += 1\n                    else:\n                        writer.writerow([f'{docid}.R{i}', docid, k[1], k[2], 'false'])\n                        i += 1\n    for k, v in counter.items():\n        print(k, v)\n\n\nif __name__ == '__main__':\n    fire.Fire(create_test_gs)\n"
  },
  {
    "path": "blue/gs/create_clefe_test_gs.py",
"content": "import functools\nimport logging\nimport os\nimport re\nfrom pathlib import Path\n\nimport jsonlines\nimport tqdm\nimport fire\n\nfrom ext.data_structure import Span, Annotation\n\n\ndef pattern_repl(matchobj, prefix):\n    \"\"\"\n    Replace [**Patterns**] with prefix+spaces.\n    \"\"\"\n    s = matchobj.group(0).lower()\n    return prefix.rjust(len(s))\n\n\ndef _preprocess_text(text):\n    # noinspection PyTypeChecker\n    text = re.sub(r'\\[\\*\\*.*?\\*\\*\\]', functools.partial(pattern_repl, prefix='PATTERN'),\n                  text)\n    # noinspection PyTypeChecker\n    text = re.sub(r'(\\|{4})|___|~~', functools.partial(pattern_repl, prefix=''), text)\n    return text\n
\n\ndef create_test_gs(reports_dir, anns_dir, output):\n    anns_dir = Path(anns_dir)\n    with jsonlines.open(output, 'w') as writer:\n        with os.scandir(reports_dir) as it:\n            for entry in tqdm.tqdm(it):\n                text_file = Path(entry)\n                with open(text_file) as fp:\n                    text = fp.read()\n                text = _preprocess_text(text)\n\n                ann_file = anns_dir / text_file.name\n                if not ann_file.exists():\n                    logging.warning(f'{text_file.stem}: Cannot find ann file {ann_file}')\n                    continue\n\n                with open(ann_file) as fp:\n                    for i, line in enumerate(fp):\n                        line = line.strip()\n                        toks = line.split('||')\n                        type = toks[1]\n                        spans = []\n                        for j in range(3, len(toks), 2):\n                            start = int(toks[j])\n                            end = int(toks[j + 1])\n                            spans.append(Span(start, end, text[start:end]))\n                        a = Annotation(text_file.stem + f'.T{i}', text_file.stem, spans, type)\n                        writer.write(a.to_obj())\n\n\nif __name__ == '__main__':\n    fire.Fire(create_test_gs)\n\n"
  },
  {
    "path": "blue/gs/create_ddi_test_gs.py",
    "content": "import collections\nimport csv\nimport os\nimport re\n\nimport fire\nfrom lxml import etree\n\n\ndef create_test_gs(input_dir, output):\n    counter = collections.Counter()\n    with open(output, 'w') as fp:\n        writer = csv.writer(fp, delimiter='\\t', lineterminator='\\n')\n        writer.writerow(['id', 'docid', 'arg1', 'arg2', 'label'])\n        for root, dirs, files in os.walk(input_dir):\n            for name in files:\n                pathname = os.path.join(root, name)\n                tree = etree.parse(pathname)\n                docid = tree.xpath('/document')[0].get('id')\n                for stag in tree.xpath('/document/sentence'):\n                    entities = {}\n                    for etag in stag.xpath('entity'):\n                        m = re.match('(\\d+)-(\\d+)', etag.get('charOffset'))\n                        assert m is not None\n                        entities[etag.get('id')] = {\n                            'start': int(m.group(1)),\n                            'end': int(m.group(2)),\n                            'type': etag.get('type'),\n                            'id': etag.get('id'),\n                            'text': etag.get('text')\n                        }\n                    for rtag in stag.xpath('pair'):\n                        if rtag.get('ddi') == 'false':\n                            label = 'DDI-false'\n                        else:\n                            label = 'DDI-{}'.format(rtag.get('type'))\n\n                        e1 = entities.get(rtag.get('e1'))\n                        e2 = entities.get(rtag.get('e2'))\n                        writer.writerow([rtag.get(\"id\"), docid, e1['id'], e2['id'], label])\n                        counter[label] += 1\n    for k, v in counter.items():\n        print(k, v)\n\n\nif __name__ == '__main__':\n    fire.Fire(create_test_gs)\n"
  },
  {
    "path": "blue/gs/create_hoc.py",
"content": "import csv\nfrom pathlib import Path\n\nimport fire\nimport tqdm\n\n\ndef split_doc(docid_file, data_dir, dest):\n    with open(docid_file) as fp:\n        docids = [line.strip() for line in fp]\n\n    with open(dest, 'w', encoding='utf8') as fout:\n        writer = csv.writer(fout, delimiter='\\t', lineterminator='\\n')\n        writer.writerow(['index', 'sentence', 'label'])\n\n        for docid in tqdm.tqdm(docids):\n            with open(data_dir / f'{docid}.txt', encoding='utf8') as fp:\n                for i, line in enumerate(fp):\n                    idx = f'{docid}_s{i}'\n                    toks = line.strip().split('\\t')\n                    text = toks[0]\n                    labels = set(l[1:-1] for l in toks[1][1:-1].split(', '))\n                    labels = ','.join(sorted(labels))\n                    writer.writerow([idx, text, labels])\n\n\ndef create_hoc(hoc_dir):\n    hoc_dir = Path(hoc_dir)\n    text_dir = hoc_dir / 'HoCCorpus'\n    for name in ['train', 'dev', 'test']:\n        print('Creating', name)\n        split_doc(hoc_dir / f'{name}_docid.txt', text_dir, hoc_dir / f'{name}.tsv')\n\n\nif __name__ == '__main__':\n    fire.Fire(create_hoc)\n"
  },
  {
    "path": "blue/gs/create_i2b2_test_gs.py",
    "content": "import collections\nimport csv\nimport itertools\nimport os\nimport re\nfrom pathlib import Path\nfrom typing import Match\n\nimport bioc\nimport fire\nimport jsonlines\nimport tqdm\n\nfrom ext.data_structure import Annotation, Span\n\nlabels = ['PIP', 'TeCP', 'TeRP', 'TrAP', 'TrCP', 'TrIP', 'TrNAP', 'TrWP', 'false']\n\n\ndef read_text(pathname):\n    with open(pathname) as fp:\n        text = fp.read()\n    sentences = []\n    offset = 0\n    for sent in text.split('\\n'):\n        sentence = bioc.BioCSentence()\n        sentence.infons['filename'] = pathname.stem\n        sentence.offset = offset\n        sentence.text = sent\n        sentences.append(sentence)\n        i = 0\n        for m in re.finditer(r'\S+', sent):\n            if i == 0 and m.start() != 0:\n                # add an empty placeholder token so token indices\n                # match i2b2 numbering on lines with leading whitespace\n                ann = bioc.BioCAnnotation()\n                ann.id = f'a{i}'\n                ann.text = ''\n                ann.add_location(bioc.BioCLocation(offset, 0))\n                sentence.add_annotation(ann)\n                i += 1\n            ann = bioc.BioCAnnotation()\n            ann.id = f'a{i}'\n            ann.text = m.group()\n            ann.add_location(bioc.BioCLocation(m.start() + offset, len(m.group())))\n            sentence.add_annotation(ann)\n            i += 1\n        offset += len(sent) + 1\n    return sentences\n\n\ndef _get_ann_offset(sentences, match_obj: Match,\n                    start_line_group, start_token_group,\n                    end_line_group, end_token_group,\n                    text_group):\n    assert match_obj.group(start_line_group) == match_obj.group(end_line_group)\n    sentence = sentences[int(match_obj.group(start_line_group)) - 1]\n\n    start_token_idx = int(match_obj.group(start_token_group))\n    end_token_idx = int(match_obj.group(end_token_group))\n    start = sentence.annotations[start_token_idx].total_span.offset\n    end = sentence.annotations[end_token_idx].total_span.end\n    text = match_obj.group(text_group)\n\n    actual = sentence.text[start - sentence.offset:end - sentence.offset].lower()\n    expected = text.lower()\n    assert actual == expected, 'Cannot match at %s:\\n%s\\n%s\\nExpected: %r, Actual: %r' \\\n                               % (\n                                   sentence.infons['filename'], sentence.text, match_obj.string,\n                                   expected,\n                                   actual)\n    return start, end, text\n\n\ndef read_annotations(pathname, sentences):\n    anns = []\n    pattern = re.compile(r'c=\"(.*?)\" (\\d+):(\\d+) (\\d+):(\\d+)\\|\\|t=\"(.*?)\"(\\|\\|a=\"(.*?)\")?')\n    with open(pathname) as fp:\n        for i, line in enumerate(fp):\n            line = line.strip()\n            m = pattern.match(line)\n            assert m is not None\n\n            start, end, text = _get_ann_offset(sentences, m, 2, 3, 4, 5, 1)\n            ann = {\n                'start': start,\n                'end': end,\n                'type': m.group(6),\n                # group 8 holds the text inside the optional ||a=\"...\" part (None if absent)\n                'a': m.group(8),\n                'text': text,\n                'line': int(m.group(2)) - 1,\n                'id': f'{pathname.name}.l{i}'\n            }\n            anns.append(ann)\n    return anns\n\n\ndef _find_anns(anns, start, end):\n    for ann in anns:\n        if ann['start'] == start and ann['end'] == end:\n            return ann\n    raise ValueError(f'Cannot find annotation at {start}:{end}')\n\n\ndef read_relations(pathname, sentences, cons):\n    pattern = re.compile(\n        r'c=\"(.*?)\" (\\d+):(\\d+) (\\d+):(\\d+)\\|\\|r=\"(.*?)\"\\|\\|c=\"(.*?)\" (\\d+):(\\d+) (\\d+):(\\d+)')\n\n    relations = []\n    with open(pathname) as fp:\n        for line in fp:\n            line = line.strip()\n            m = pattern.match(line)\n            assert m is not None\n\n            start, end, text = _get_ann_offset(sentences, m, 2, 3, 4, 5, 1)\n            ann1 = _find_anns(cons, start, end)\n            start, end, text = _get_ann_offset(sentences, m, 8, 9, 10, 11, 7)\n            ann2 = _find_anns(cons, start, end)\n            relations.append({\n                'docid': pathname.stem,\n                'label': m.group(6),\n                'Arg1': ann1['id'],\n                'Arg2': ann2['id'],\n                'string': line\n            })\n    return relations\n\n\ndef find_relations(relations, ann1, ann2):\n    labels = []\n    for i in range(len(relations) - 1, -1, -1):\n        r = relations[i]\n        if (r['Arg1'] == ann1['id'] and r['Arg2'] == ann2['id']) \\\n                or (r['Arg1'] == ann2['id'] and r['Arg2'] == ann1['id']):\n            del relations[i]\n            labels.append(r['label'])\n    return labels\n\n\ndef create_test_gs(input_dir, output_dir):\n    top_dir = Path(input_dir)\n    dest = Path(output_dir)\n\n    counter = collections.Counter()\n    with jsonlines.open(dest / 'test_ann_gs.jsonl', 'w') as writer_ann, \\\n            open(dest / 'test_rel_gs.tsv', 'w') as fp_rel:\n        writer_rel = csv.writer(fp_rel, delimiter='\\t', lineterminator='\\n')\n        writer_rel.writerow(['id', 'docid', 'arg1', 'arg2', 'label'])\n        with os.scandir(top_dir / 'txt') as it:\n            for entry in tqdm.tqdm(it):\n                if not entry.name.endswith('.txt'):\n                    continue\n                text_pathname = Path(entry.path)\n                docid = text_pathname.stem\n\n                sentences = read_text(text_pathname)\n                # read concepts\n                cons = read_annotations(top_dir / 'concept' / f'{text_pathname.stem}.con',\n                                        sentences)\n                for con in cons:\n                    a = Annotation(con['id'], docid,\n                                   [Span(con['start'], con['end'], con['text'])], con['type'])\n                    writer_ann.write(a.to_obj())\n\n                # read relations\n                relations = read_relations(top_dir / 'rel' / f'{text_pathname.stem}.rel',\n                                           sentences, cons)\n                for i, (con1, con2) in enumerate(itertools.combinations(cons, 2)):\n                    if con1['line'] != con2['line']:\n                        continue\n                    labels = find_relations(relations, con1, con2)\n                    if len(labels) == 0:\n                        writer_rel.writerow([f'{docid}.R{i}', docid, con1[\"id\"], con2[\"id\"],\n                                             'false'])\n                        counter['false'] += 1\n                    else:\n                        for label in labels:\n                            writer_rel.writerow([f'{docid}.R{i}', docid, con1[\"id\"], con2[\"id\"], label])\n                            counter[label] += 1\n    for k, v in counter.items():\n        print(k, v)\n\n\nif __name__ == '__main__':\n    fire.Fire(create_test_gs)\n"
  },
  {
    "path": "blue/gs/create_mednli_test_gs.py",
    "content": "import csv\nimport json\n\nimport fire\nimport tqdm\n\nfrom blue.ext import pstring\n\n\ndef create_mednli_test_gs(input, output):\n    with open(input, encoding='utf8') as fin, open(output, 'w', encoding='utf8') as fout:\n        writer = csv.writer(fout, delimiter='\\t', lineterminator='\\n')\n        # header must match the four columns written below\n        writer.writerow(['index', 'sentence1', 'sentence2', 'label'])\n        for line in tqdm.tqdm(fin):\n            line = pstring.printable(line, greeklish=True)\n            obj = json.loads(line)\n            writer.writerow([obj['pairID'], obj['sentence1'], obj['sentence2'], obj['gold_label']])\n\n\nif __name__ == '__main__':\n    fire.Fire(create_mednli_test_gs)\n"
  },
  {
    "path": "blue_plus/README.md",
    "content": "# BLUE+ Protocol\n\n\n## Task/Data description\nPlease provide a high-level description of the dataset to be included in BLUE+, with more details on your own website. We also need a published reference for your dataset.\n\n\n## Original data\nPlease prepare your dataset as follows in order to be included in BLUE+:\n* Each data instance should have a unique ID.\n* The data needs to be split into training, validation, and test sets.\n\n\n## Evaluation script\nThe evaluation script takes the test data and method output as input and generates detailed evaluation results. For example, F1-style metrics require TP, FP, FN, precision, and recall. Please provide your evaluation scripts to be included in BLUE+.\n\n\n## Previous state-of-the-art (SOTA) results\nThe previous SOTA results should come from a published study, with a corresponding reference. The results should be verified by the evaluation script above.\n\n\n## BERT Results\nTo be part of the benchmarking datasets, please report the results of your benchmarking task with a recent BERT model. Specifically, please provide:\n\n 1. BERT-format files of the training, validation, and test sets\n 2. BERT results\n 3. Scripts to train and test BERT models. The scripts can be hosted in your preferred repository and should be available to users for model re-training and results verification.\n\nIf the data belongs to one of the tasks in BLUE (sentence similarity, named entity recognition, relation extraction, document classification, text inference), please follow the examples at NCBI_BERT ([https://github.com/ncbi-nlp/NCBI_BERT](https://github.com/ncbi-nlp/NCBI_BERT)).\n\n\n## Steps\n 1. Pick a name for your dataset (e.g., CRAFT)\n 2. Fork the BLUE_Benchmark project on GitHub ([https://github.com/ncbi-nlp/BLUE_Benchmark](https://github.com/ncbi-nlp/BLUE_Benchmark)), and create a new branch (e.g., craft)\n 3. In your branch, create a subfolder (e.g., CRAFT) in the ‘blue_plus’ folder with at least the following files:\n\t - your_dataset.yml - The configuration file, containing:\n\t\t * Dataset name\n\t\t * Dataset description\n\t\t * Version\n\t\t * The citation to use for this dataset\n\t\t * Links to download the original data, the BERT-formatted data, and its results\n\t\t * Your dataset license information\n\t - your_dataset.py - The script that downloads the dataset and evaluates the results:\n\t\t * Implement a class that inherits from the abstract class BaseDataset in [dataset.py](https://github.com/ncbi-nlp/BLUE_Benchmark/blob/master/blue_plus/dataset.py).\n\t\t * The method `download` should download the data sets from the official internet distribution location (i.e., “links” in the configuration)\n\t\t * The method `evaluate` should evaluate the results\n\t\t * CLI entry points to download the data\n\t - requirements.txt - A list of packages the script your_dataset.py relies on.\n 4. Send a “pull request” back to BLUE-PLUS\n\nAn example dataset can be found at [https://github.com/ncbi-nlp/BLUE_Benchmark/tree/master/blue_plus](https://github.com/ncbi-nlp/BLUE_Benchmark/tree/master/blue_plus)\n\nIt may take 2-3 weeks to review your pull request. We may propose changes or request missing or additional information. Pull requests must be approved before they can be merged. After approval, we will include your dataset and results in the benchmark.\n"
  },
  {
    "path": "blue_plus/__init__.py",
    "content": ""
  },
  {
    "path": "blue_plus/dataset.py",
    "content": "import yaml\n\n\nclass BaseDataset(object):\n    \"\"\"Abstract dataset class\"\"\"\n\n    def __init__(self, config_file):\n        with open(config_file, encoding='utf8') as fp:\n            self.config = yaml.safe_load(fp)\n        self.name = self.config.get('name', '')\n        self.description = self.config.get('description', '')\n        self.version = self.config.get('version', '')\n        self.citation = self.config.get('citation', '')\n        self.links = self.config.get('links', {})\n\n    @property\n    def full_name(self):\n        \"\"\"Full canonical name: (<name>_<version>).\"\"\"\n        return '{}_{}'.format(self.name, self.version)\n\n    def download(self, download_dir='blue_plus_data', override=False):\n        \"\"\"Downloads and prepares dataset for reading.\n\n        Args:\n          download_dir: string\n            directory where downloaded files are stored.\n            Files are saved under \"<download_dir>/<full_name>\".\n          override: bool\n            True to re-download and override existing data\n        Raises:\n          IOError: if there is not enough disk space available.\n        Returns:\n          successful: bool\n            True if download complete\n        \"\"\"\n        raise NotImplementedError\n\n    def evaluate(self, test_file, prediction_file, output_file):\n        \"\"\"Evaluate the predictions.\n\n        Args:\n          test_file: string\n            location of the file containing the gold standards.\n          prediction_file: string\n            location of the file containing the predictions.\n          output_file: string\n            location of the file to store the evaluation results.\n        Returns:\n          results: string or pandas DataFrame containing the evaluation results.\n        \"\"\"\n        raise NotImplementedError\n"
  },
  {
    "path": "blue_plus/example_dataset/__init__.py",
    "content": ""
  },
  {
    "path": "blue_plus/example_dataset/biosses.yml",
    "content": "# dataset name\nname: BIOSSES\n\n# description of this dataset.\ndescription: A corpus of sentence pairs selected from the Biomedical Summarization Track Training\n  Dataset in the biomedical domain.\n\nversion: 1.0\n\n# The citation to use for this dataset.\ncitation: \"Sogancioglu G, Ozturk H, Ozgur A. BIOSSES: a semantic sentence similarity estimation\n  system for the biomedical domain. Bioinformatics. 2017 Jul 12;33(14):i49-58.\"\n\n# Homepages of the dataset\nlinks:\n  # original dataset\n  train.tsv: http://pengyifan.com/tmp/BIOSSES/train.tsv\n  dev.tsv: http://pengyifan.com/tmp/BIOSSES/dev.tsv\n  test.tsv: http://pengyifan.com/tmp/BIOSSES/test.tsv\n  test_results.tsv: http://pengyifan.com/tmp/BIOSSES/test_results.tsv\n  # license information\n  # license.txt\n  # BERT version\n  #  bert_train: bert_train.csv\n  #  bert_dev: bert_dev.csv\n  #  bert_test: bert_test.csv\n  #  bert_test_results: bert_test_results.csv\n\n"
  },
  {
    "path": "blue_plus/example_dataset/biosses_dataset.py",
    "content": "import logging\nimport os\nimport sys\nimport urllib.request\nfrom pathlib import Path\n\nimport pandas as pd\nfrom scipy.stats import pearsonr\n\nsys.path.append(os.path.abspath(os.path.join(__file__, os.pardir, os.pardir, os.pardir)))\nfrom blue_plus.dataset import BaseDataset\n\n\nclass BIOSSES_Dataset(BaseDataset):\n    def download(self, download_dir='blue_plus_data', override=False):\n        download_dir = Path(download_dir)\n        for local_name, url in self.links.items():\n            local_data_path = download_dir / self.full_name / local_name\n            if local_data_path.exists() and not override:\n                logging.info('Reusing dataset %s (%s)', self.name, local_data_path)\n                continue\n            logging.info('Downloading dataset %s (%s) to %s', self.name, url, local_data_path)\n            # make sure the destination folder exists before downloading\n            local_data_path.parent.mkdir(parents=True, exist_ok=True)\n            urllib.request.urlretrieve(url, local_data_path)\n\n    def evaluate(self, test_file, prediction_file, results_file):\n        true_df = pd.read_csv(test_file, sep='\\t')\n        pred_df = pd.read_csv(prediction_file, sep='\\t')\n        assert len(true_df) == len(pred_df), \\\n            f'Gold standard has {len(true_df)} lines but prediction has {len(pred_df)}'\n\n        p, _ = pearsonr(true_df['score'], pred_df['score'])\n        print('Pearson: {:.3f}'.format(p))\n        with open(results_file, 'w') as fp:\n            fp.write('Pearson: {:.3f}'.format(p))\n\n    def evaluate_bert(self, test_file, prediction_file, results_file):\n        return self.evaluate(test_file, prediction_file, results_file)\n\n    def prepare_bert_format(self, input_file, output_file):\n        \"\"\"Optional\"\"\"\n        df = pd.read_csv(input_file, sep='\\t')\n        df = df[['sentence1', 'sentence2', 'score']]\n        df.to_csv(output_file, sep='\\t', index=None)\n\n\ndef main():\n    logging.basicConfig(level=logging.INFO)\n    config_dir = os.path.dirname(os.path.abspath(__file__))\n    d = BIOSSES_Dataset(os.path.join(config_dir, 'biosses.yml'))\n    print('Name:       ', d.full_name)\n    print('Description:', d.description)\n    print('Citation:   ', d.citation)\n\n    data_dir = Path('blue_plus_data') / d.full_name\n    data_dir.mkdir(parents=True, exist_ok=True)\n    d.download(override=True)\n    d.evaluate(data_dir / 'test.tsv', data_dir / 'test_results.tsv', data_dir / 'test_results.txt')\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "blue_plus/example_dataset/requirements.txt",
    "content": "pyyaml>=5.1\nscipy>=1.2.1\npandas>=0.20.1\n"
  },
  {
    "path": "requirements.txt",
    "content": "pyyaml==5.1"
  }
]