Repository: keon/awesome-nlp
Branch: main
Commit: a7bf3eb8495c
Files: 6
Total size: 89.0 KB

Directory structure:
gitextract_3ri7ug8v/
├── CREDITS.md
├── LICENSE
├── PULL_REQUEST_TEMPLATE.md
├── README-ZH-TW.md
├── README.md
└── contributing.md

================================================
FILE CONTENTS
================================================

================================================
FILE: CREDITS.md
================================================
# Credits

Awesome NLP was seeded with curated content from a number of repositories, some of which are listed below | [Back to Top](#contents)

- The Indonesian NLP section was seeded from [id-nlp-resource](https://github.com/kmkurn/id-nlp-resource)
- [ai-reading-list](https://github.com/m0nologuer/AI-reading-list)
- [nlp-reading-group](https://github.com/clulab/nlp-reading-group/wiki/Fall-2015-Reading-Schedule)
- [awesome-spanish-nlp](https://github.com/dav009/awesome-spanish-nlp)
- [jjangsangy's awesome-nlp](https://gist.github.com/jjangsangy/8759f163bc3558779c46)
- [awesome-machine-learning](https://github.com/josephmisiti/awesome-machine-learning/blob/master/README.md)
- [DL4NLP](https://github.com/andrewt3000/DL4NLP)
- [awesome-persian-nlp-ir](https://github.com/mhbashari/awesome-persian-nlp-ir)

================================================
FILE: LICENSE
================================================
Creative Commons Legal Code

CC0 1.0 Universal

CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED HEREUNDER.

Statement of Purpose The laws of most jurisdictions throughout the world automatically confer exclusive Copyright and Related Rights (defined below) upon the creator and subsequent owner(s) (each and all, an "owner") of an original work of authorship and/or a database (each, a "Work"). Certain owners wish to permanently relinquish those rights to a Work for the purpose of contributing to a commons of creative, cultural and scientific works ("Commons") that the public can reliably and without fear of later claims of infringement build upon, modify, incorporate in other works, reuse and redistribute as freely as possible in any form whatsoever and for any purposes, including without limitation commercial purposes. These owners may contribute to the Commons to promote the ideal of a free culture and the further production of creative, cultural and scientific works, or to gain reputation or greater distribution for their Work in part through the use and efforts of others. For these and/or other purposes and motivations, and without any expectation of additional consideration or compensation, the person associating CC0 with a Work (the "Affirmer"), to the extent that he or she is an owner of Copyright and Related Rights in the Work, voluntarily elects to apply CC0 to the Work and publicly distribute the Work under its terms, with knowledge of his or her Copyright and Related Rights in the Work and the meaning and intended legal effect of CC0 on those rights. 1. Copyright and Related Rights. A Work made available under CC0 may be protected by copyright and related or neighboring rights ("Copyright and Related Rights"). Copyright and Related Rights include, but are not limited to, the following: i. the right to reproduce, adapt, distribute, perform, display, communicate, and translate a Work; ii. moral rights retained by the original author(s) and/or performer(s); iii. publicity and privacy rights pertaining to a person's image or likeness depicted in a Work; iv. 
rights protecting against unfair competition in regards to a Work, subject to the limitations in paragraph 4(a), below; v. rights protecting the extraction, dissemination, use and reuse of data in a Work; vi. database rights (such as those arising under Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, and under any national implementation thereof, including any amended or successor version of such directive); and vii. other similar, equivalent or corresponding rights throughout the world based on applicable law or treaty, and any national implementations thereof. 2. Waiver. To the greatest extent permitted by, but not in contravention of, applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and unconditionally waives, abandons, and surrenders all of Affirmer's Copyright and Related Rights and associated claims and causes of action, whether now known or unknown (including existing as well as future claims and causes of action), in the Work (i) in all territories worldwide, (ii) for the maximum duration provided by applicable law or treaty (including future time extensions), (iii) in any current or future medium and for any number of copies, and (iv) for any purpose whatsoever, including without limitation commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each member of the public at large and to the detriment of Affirmer's heirs and successors, fully intending that such Waiver shall not be subject to revocation, rescission, cancellation, termination, or any other legal or equitable action to disrupt the quiet enjoyment of the Work by the public as contemplated by Affirmer's express Statement of Purpose. 3. Public License Fallback. 
Should any part of the Waiver for any reason be judged legally invalid or ineffective under applicable law, then the Waiver shall be preserved to the maximum extent permitted taking into account Affirmer's express Statement of Purpose. In addition, to the extent the Waiver is so judged Affirmer hereby grants to each affected person a royalty-free, non transferable, non sublicensable, non exclusive, irrevocable and unconditional license to exercise Affirmer's Copyright and Related Rights in the Work (i) in all territories worldwide, (ii) for the maximum duration provided by applicable law or treaty (including future time extensions), (iii) in any current or future medium and for any number of copies, and (iv) for any purpose whatsoever, including without limitation commercial, advertising or promotional purposes (the "License"). The License shall be deemed effective as of the date CC0 was applied by Affirmer to the Work. Should any part of the License for any reason be judged legally invalid or ineffective under applicable law, such partial invalidity or ineffectiveness shall not invalidate the remainder of the License, and in such case Affirmer hereby affirms that he or she will not (i) exercise any of his or her remaining Copyright and Related Rights in the Work or (ii) assert any associated claims and causes of action with respect to the Work, in either case contrary to Affirmer's express Statement of Purpose. 4. Limitations and Disclaimers. a. No trademark or patent rights held by Affirmer are waived, abandoned, surrendered, licensed or otherwise affected by this document. b. 
Affirmer offers the Work as-is and makes no representations or warranties of any kind concerning the Work, express, implied, statutory or otherwise, including without limitation warranties of title, merchantability, fitness for a particular purpose, non infringement, or the absence of latent or other defects, accuracy, or the present or absence of errors, whether or not discoverable, all to the greatest extent permissible under applicable law.

c. Affirmer disclaims responsibility for clearing rights of other persons that may apply to the Work or any use thereof, including without limitation any person's Copyright and Related Rights in the Work. Further, Affirmer disclaims responsibility for obtaining any necessary consents, permissions or other rights required for any use of the Work.

d. Affirmer understands and acknowledges that Creative Commons is not a party to this document and has no duty or obligation with respect to this CC0 or use of the Work.

================================================
FILE: PULL_REQUEST_TEMPLATE.md
================================================
# Contribution Guidelines

Please ensure your pull request adheres to the following guidelines:

- Search previous suggestions before making a new one, as yours may be a duplicate.
- Make an individual pull request for each suggestion.
- Use the following format: [Bookmark Title](link): Description.
- The title usually consists of several words. For bookmarks of an article or tutorial, use the title of the article.
- The description is required and consists of one or two sentences about the bookmark.
- If the resource has two or more links, please provide the GitHub repo link.
- Additions should be added to the bottom of the relevant category.
- New categories or improvements to the existing categorization are welcome.
- Check your spelling and grammar.
- Make sure your text editor is set to remove trailing whitespace.
- The pull request and commit should have a useful title.

Thank you for your suggestions! ================================================ FILE: README-ZH-TW.md ================================================ # 令人讚嘆的自然語言處理 [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) > 專門用於自然語言處理的精選資源列表 ![Awesome NLP Logo](/images/logo.jpg) > * 原文地址:[令人讚嘆的自然語言處理](https://github.com/keon/awesome-nlp) > * 原文作者:[Keon](https://github.com/keon), [Martin](https://github.com/outpark), [Nirant](https://github.com/NirantK), [Dhruv](https://github.com/the-ethan-hunt) > * 翻譯:[NeroCube](https://github.com/NeroCube) _請在提交之前閱讀 [貢獻指南](contributing.md) 。請隨時創建 [拉取請求](https://github.com/keonkim/awesome-nlp/pulls)._ ## 內容 * [研究摘要和趨勢](#研究摘要和趨勢) * [教學](#教學) * [閱讀內容](#閱讀內容) * [影片和課程](#影片和課程) * [書籍](#書籍) * [函式庫](#函式庫) * [Node.js](#user-content-node-js) * [Python](#user-content-python) * [C++](#user-content-c++) * [Java](#user-content-java) * [Kotlin](#user-content-kotlin) * [Scala](#user-content-scala) * [R](#user-content-r) * [Clojure](#user-content-clojure) * [Ruby](#user-content-ruby) * [Rust](#user-content-rust) * [服務](#服務) * [註釋工具](#註釋工具) * [資料集](#資料集) * [自然語言處理-韓文](#自然語言處理-韓文) * [自然語言處理-阿拉伯語](#自然語言處理-阿拉伯語) * [自然語言處理-中文](#自然語言處理-中文) * [自然語言處理-德文](#自然語言處理-德文) * [自然語言處理-西班牙語](#自然語言處理-西班牙語) * [自然語言處理-印度語](#自然語言處理-印度語) * [自然語言處理-泰語](#自然語言處理-泰語) * [自然語言處理-丹麥語](#自然語言處理-丹麥語) * [自然語言處理-越南語](#自然語言處理-越南語) * [自然語言處理-印度尼西亞](#自然語言處理-印度尼西亞) * [其他語言](#其他語言) * [貢獻](#貢獻) ## 研究摘要和趨勢 * [自然語言處理-概述](https://nlpoverview.com/) 是應用於自然語言深度學習技術的最新概述,包括理論,實現,應用和最先進的結果。對於研究人員來說,這是一個偉大的Deep NLP簡介。 * [自然語言處理-進展](https://nlpprogress.com/) 追隨自然語言處理的進展,包括資料集和常見自然語言處理任務的當前最新技術。 * [自然語言處理的 ImageNet 時刻已經到來](https://thegradient.pub/nlp-imagenet/) * [ACL 2018 亮點: 在更具挑戰性的設置中理解表示和評估](http://ruder.io/acl-2018-highlights/) * [ACL 2017 的四個深度學習趨勢。第一部分:語言結構和詞語嵌入](https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-1.html) * [ACL 2017 
的四個深度學習趨勢。第二部分:可解釋性和注意力](https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-2.html) * [2017 年 EMNLP 的亮點:激動人心的資料集,集群的回歸與其他更多!](http://blog.aylien.com/highlights-emnlp-2017-exciting-datasets-return-clusters/) * [深度學習自然語言處理 (NLP): 進展與趨勢](https://tryolabs.com/blog/2017/12/12/deep-learning-for-nlp-advancements-and-trends-in-2017/?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=The%20Wild%20Week%20in%20AI) * [自然語言生成的現狀調查](https://arxiv.org/abs/1703.09902) ## 教學 [返回頂部](#內容) ### 閱讀內容 通用機器學習 * 來自 Google 高級創意工程師 Jason 的[機器學習 101](https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/edit?usp=sharing) ,為工程師和管理階層解釋機器學習。 * a16z [AI 劇本](https://aiplaybook.a16z.com/) 是一個很好的鏈接,可以轉發給您的經理或演示內容。 * [繼器學習部落格](https://bmcfee.github.io/#home) by Brian McFee * [Ruder's 部落格](http://ruder.io/#open) 由 [Sebastian Ruder](https://twitter.com/seb_ruder) 進行評論得最好的自然語言處理研究。 自然語言處理介紹與指南 * [理解和實施自然語言處理](https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/) 的終極指南。 * [Hackernoon 的自然語言處理簡介](https://hackernoon.com/learning-ai-if-you-suck-at-math-p7-the-magic-of-natural-language-processing-f3819a689386) 適用於那些賞味數學的人-用他們自己的話來說。 * [Vik Paruchari 的自然語言處理教學](http://www.vikparuchuri.com/blog/natural-language-processing-tutorial/) * [自然語言處理: 一份簡介](https://academic.oup.com/jamia/article/18/5/544/829676) 來自牛津大學。 * [使用 Pytorch 進行自然語言處理的深度學習](https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html) * [動手做 NLTK 教學](https://github.com/hb20007/hands-on-nltk-tutorial) - 以 Jupyter 筆記本形式的實踐 NLTK 教學。 部落格與簡報 * 部落格: [深度學習, 自然語言處理, 與呈現法](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) * 部落格: [圖解 BERT, ELMo, 與 co. 
(自然語言處理是如何破解遷移學習的)](https://jalammar.github.io/illustrated-bert/) 與 [圖解轉換器](https://jalammar.github.io/illustrated-transformer/) * 部落格: Hal Daumé III 的[自然語言處理](https://nlpers.blogspot.com/) * [Radim Řehůřek 的教學](https://radimrehurek.com/gensim/tutorial.html) 使用 Python 與 [gensim](https://radimrehurek.com/gensim/index.html) 處理語言語料庫。 * [arXiv: 自然語言處理 (大部分) 來自 Scratch](https://arxiv.org/pdf/1103.0398.pdf) * [Karpathy 的遞歸神經網絡的不合理有效性](https://karpathy.github.io/2015/05/21/rnn-effectiveness) ### 影片和課程 #### 深度學習與自然語言處理 用於自然語言處理的詞嵌入, 遞歸神經網絡, 長短期記憶神經網絡與卷積神經網路 | [返回頂部](#內容) * Udacity 的[人工智慧入門](https://www.udacity.com/course/intro-to-artificial-intelligence--cs271) 課程涉及到自然語言處理。 * Udacity 的[深度學習](https://udacity.com/course/deep-learning--ud730) 使用Tensorflow 使用深度學習的 NLP 任務的部分(包括 Word2Vec,RNN的 和 LSTMs)。 * 牛津大學的[深度自然語言處理](https://github.com/oxford-cs-deepnlp-2017/lectures)有影片,演講投影片和閱讀素材。 * 斯坦福大學的[自然語言處理深度學習 (cs224-n)](https://web.stanford.edu/class/cs224n/) 由 Richard Socher 和 Christopher Manning 完成。 * Coursera 的[自然語言處理](https://www.coursera.org/learn/language-processing) 由國立研究大學高等經濟學院完成。 * 卡內基梅隆大學的語言技術研究所[自然語言處理的神經網路](http://phontron.com/class/nn4nlp2017/)。 #### 經典自然語言處理 自然語言處理的貝葉斯,統計和語言學方法| | [返回頂部](#內容) * [統計機器翻譯](http://mt-class.org) - 機器翻譯課程,具有很棒的作業和投影片。 * [使用 Python 3 進行 NLTK 自然語言處理](https://www.youtube.com/playlist?list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL) 由 Harrison Kinsley(sentdex) 使用 NLTK 程式碼實現的好教學。 * 由 Jordan Boyd-Graber 在馬里蘭大學的[計算語言學 I](https://www.youtube.com/playlist?list=PLegWUnz91WfuPebLI97-WueAP90JO-15i)講座。 * 由 Yandex 數據學院的[深度自然語言處理課程](https://github.com/yandexdataschool/nlp_course)涵蓋從文本嵌入到機器翻譯的重要思想,包括序列建模,語言模型等。 ### 書籍 * Dan Jurafsy 教授的[語音和語言處理](https://web.stanford.edu/~jurafsky/slp3/) * [R 中的文字探勘](https://www.tidytextmining.com) * [Python 的自然語言處理](https://www.nltk.org/book/) ## 函式庫 [返回頂部](#內容) * **Node.js and Javascript** - 用於自然語言的 Node.js 函式庫 | [返回頂部](#內容) * [Twitter-text](https://github.com/twitter/twitter-text) - 使用 JavaScript 實現的 Twitter 文本處理庫。 * 
[Knwl.js](https://github.com/benhmoore/Knwl.js) - JS中的自然語言處理器。 * [Retext](https://github.com/retextjs/retext) - 用於分析和操縱自然語言的可​​擴展系統。 * [NLP Compromise](https://github.com/spencermountain/compromise) - 瀏覽器中的自然語言處理。 * [Natural](https://github.com/NaturalNode/natural) - 節點的一般自然語言設施。 - [Poplar](https://github.com/synyi/poplar) - 一種基於 Web 的自然語言處理註釋工具(NLP)。 * **Python** - 用於自然語言的 Python 函式庫 | [返回頂部](#內容) * [TextBlob](http://textblob.readthedocs.org/) - 為專研常見的自然語言處理(NLP)任務提供一致的 API。 站在[自然語言工具包 (NLTK)](https://www.nltk.org/) 和 [模式](https://github.com/clips/pattern)膀上,並與兩者很好地配合 :+1: * [spaCy](https://github.com/explosion/spaCy) - 使用 Python 與 Cython 產業強度的自然語言處理 :+1: * [textacy](https://github.com/chartbeat-labs/textacy) - 在spaCy上構建的更高級別的自然與儼處理。 * [gensim](https://radimrehurek.com/gensim/index.html) - 用於從純文本進行無監督語義建模的 函式庫 :+1: * [scattertext](https://github.com/JasonKessler/scattertext) - 用於生成語料庫之間語言差異的 d3 可視化的 Python 函式庫。 * [AllenNLP](https://github.com/allenai/allennlp) - 一個架構在 PyTorch 上的自然語言處理函式庫,用於開發各種語言任務最先進的深度學習模型。 * [PyTorch-NLP](https://github.com/PetrochukM/PyTorch-NLP) - 自然語言處理研究工具包設計來支援快速建立更好的數據加載器,詞向量加載器,神經網路層表示,常見的自然語言處理指標(如BLEU)原型。 * [Rosetta](https://github.com/columbia-applied-data-science/rosetta) - 文本處理工具和包裝 (例如: Vowpal Wabbit) * [PyNLPl](https://github.com/proycon/pynlpl) - Python 自然語言處理函式庫. 
適用於 Python 的通用自然語言處理函式庫。 還包含一些用於解析常見自然語言處理格式的特定模塊, 最常見的是用於 [FoLiA](https://proycon.github.io/folia/),還包括 ARPA 語言模型,Moses 短語表,GIZA ++對齊。 * [jPTDP](https://github.com/datquocnguyen/jPTDP) - 用於聯合詞性(POS)標記和依賴性解析的工具包。jPTDP 提供40多種語言的預訓練模型。 * [BigARTM](https://github.com/bigartm/bigartm) - 一個用於主題建模的快速函式庫。 * [Snips NLU](https://github.com/snipsco/snips-nlu) - 用於意圖解析的產品就緒函式庫。 * [Chazutsu](https://github.com/chakki-works/chazutsu) - 用於下載和解析標準自然語言處理研究數據集的函式庫。 * [Word Forms](https://github.com/gutfeeling/word_forms) - Word forms 可以準確生成所有可能的英語單詞形式。 * [Multilingual Latent Dirichlet Allocation (LDA)](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA) - 一種多語言和可擴展的文檔聚類管道。 * [NLP Architect](https://github.com/NervanaSystems/nlp-architect) - 用於探索 NLP 和 NLU 最先進的深度學習拓撲和技術的函式庫。 * [Flair](https://github.com/zalandoresearch/flair) - 一個非常簡單的框架,用於在 PyTorch 上構建最先進的多語言 NLP。包括 BERT,ELMo 和 Flair 嵌入。 * [Kashgari](https://github.com/BrikerMan/Kashgari) - 簡單的,基於 Keras 的多語言自然語言處理框架,允許您在5分鐘內構建模型,用於命名實體識別(NER),詞性標註(PoS)和文本分類任務。 包括 BERT 和 word2vec 嵌入。 * **C++** - C++ 函式庫 | [返回頂部](#內容) * [MIT 資訊提取工具包 ](https://github.com/mit-nlp/MITIE) - 用於命名實體識別和關係提取的 C,C++ 和Python 工具。 * [CRF++](https://taku910.github.io/crfpp/) - 條件隨機場(CRF)的開源專案,用於實現分割/標記順序數據和其他自然語言處理任務。 * [CRFsuite](http://www.chokkan.org/software/crfsuite/) - CRFsuite 實現用於標記順序數據的條件隨機字段(CRF)。 * [BLLIP Parser](https://github.com/BLLIP/bllip-parser) - BLLIP 自然語言解析器(也稱為 Charniak-Johnson 解析器) * [colibri-core](https://github.com/proycon/colibri-core) - C++ 函式庫,命令行工具和 Python 綁定用於快速且內存有效的方式提取和使用基本語言結構,如 n-gram 和 skipgrams。 * [ucto](https://github.com/LanguageMachines/ucto) - 適用於各種語言的基於 Unicode 的常規表達式標記生成器。工具和 C++函式庫。支持 FoLiA 格式。 * [libfolia](https://github.com/LanguageMachines/libfolia) - 用於 [FoLiA 格式](https://proycon.github.io/folia/)的 C++ 函式庫。 * [frog](https://github.com/LanguageMachines/frog) - 為荷蘭語開發的基於內存的自然語言處理套件:PoS 標記器,lemmatiser,依賴解析器,NER,淺層解析器,形態分析器。 * [MeTA](https://github.com/meta-toolkit/meta) - [MeTA : ModErn Text 
Analysis](https://meta-toolkit.org/) 是一個 C++ 數據科學工具包,可以幫助挖掘大文本數據。 * [Mecab (日文)](https://taku910.github.io/mecab/) * [Moses](http://statmt.org/moses/) * [StarSpace](https://github.com/facebookresearch/StarSpace) - 一個來自 Facebook 的函式庫用於創建單詞級,段級,文檔級和文本分類的嵌入 * **Java** - Java 自然語言處理函式庫 | [返回頂部](#內容) * [斯坦福大學 NLP](https://nlp.stanford.edu/software/index.shtml) * [OpenNLP](https://opennlp.apache.org/) * [NLP4J](https://emorynlp.github.io/nlp4j/) * [Java 中的 Word2vec](https://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec) * [ReVerb](https://github.com/knowitall/reverb/) Web-Scale 開放信息提取。 * [OpenRegex](https://github.com/knowitall/openregex) 一種高效靈活的基於 token 的正則表達式語言和引擎。 * [CogcompNLP](https://github.com/CogComp/cogcomp-nlp) - 在伊利諾伊大學的認知計算組開發的核心函式庫。 * [MALLET](http://mallet.cs.umass.edu/) - 用於 LanguagE Toolkit 的機器學習 - 用於統計自然語言處理,文檔分類,聚類,主題建模,資訊提取和其他機器學習應用程序的文本包。 * [RDRPOSTagger](https://github.com/datquocnguyen/RDRPOSTagger) - 一個穩健的 POS 標記工具包(包括 Java 和 Python)以及40多種語言的預訓練模型。 * **Kotlin** - Kotlin 自然語言處理函式庫 | [返回頂部](#內容) * [Lingua](https://github.com/pemistahl/lingua/) 適用於 Kotlin 和 Java 的語言檢測函式庫,適用於長文本和短文本。 * [Kotidgy](https://github.com/meiblorn/kotidgy) — 一種用 Kotlin 編寫基於索引的文本數據生成器。 * **Scala** - Scala 自然語言處理函式庫 | [返回頂部](#內容) * [Saul](https://github.com/CogComp/saul) - 用於開發自然語言處理系統的函式庫,包括內置模塊,如 SRL,POS 等。 * [ATR4S](https://github.com/ispras/atr4s) - 具有最先進的[自動術語識別](https://en.wikipedia.org/wiki/Terminology_extraction)方法的工具包。 * [tm](https://github.com/ispras/tm) - 基於正則化多語言 [PLSA](https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis) 的主題建模實現。 * [word2vec-scala](https://github.com/Refefer/word2vec-scala) - word2vec 模型的 Scala 接口; 包括對詞距離和詞類比等向量的操作。 * [Epic](https://github.com/dlwh/epic) - Epic 是一個用 Scala 編寫的高性能統計解析器,以及用於構建複雜結構化預測模型的框架。 * **R** - R 自然語言處理函式庫 | [返回頂部](#內容) * [text2vec](https://github.com/dselivanov/text2vec) - R 中的快速矢量化,主題建模,距離和 GloVe 字嵌入。 * [wordVectors](https://github.com/bmschmidt/wordVectors) - 用於創建和探索 word2vec 和其他單詞嵌入模型的 R 包。 * 
[RMallet](https://github.com/mimno/RMallet) - 與 Java 機器學習工具 MALLET 接口的 R 包。 * [dfr-browser](https://github.com/agoldst/dfr-browser) - 創建用於在 Web 瀏覽器中瀏覽文本主題模型的 d3 可視化。 * [dfrtopics](https://github.com/agoldst/dfrtopics) - 用於探索文本主題模型的 R 包。 * [sentiment_classifier](https://github.com/kevincobain2000/sentiment_classifier) - 使用Word Sense Disambiguation 和 WordNet Reader 的情感分類。 * [jProcessing](https://github.com/kevincobain2000/jProcessing) - 日本自然語言處理庫,具有日語情感分類。 * **Clojure** | [返回頂部](#內容) * [Clojure-openNLP](https://github.com/dakrone/clojure-opennlp) - Clojure 中的自然語言處理(opennlp)。 * [Infections-clj](https://github.com/r0man/inflections-clj) - 用於 Clojure 和 ClojureScript 的類似 Rails 的變形函式庫。 * [postagga](https://github.com/fekr/postagga) - 用於解析 Clojure 和 ClojureScript 中的自然語言的函式庫。 * **Ruby** | [返回頂部](#內容) * Kevin Dias 的 [自然語言處理(NLP)Ruby 函式庫,工具和軟件的集合](https://github.com/diasks2/ruby-nlp) * [Ruby 中實用的自然語言處理](https://github.com/arbox/nlp-with-ruby) * **Rust** | [返回頂部](#內容) * [whatlang](https://github.com/greyblake/whatlang-rs) — 基於三元組的自然語言識別函式庫。 - [snips-nlu-rs](https://github.com/snipsco/snips-nlu-rs) - 用於意圖解析的生產就緒等級函示庫。 ### 服務 自然語言處理作為具有更高級功能的 API,例如 NER,主題標記等 | [返回頂部](#內容) - [Wit-ai](https://github.com/wit-ai/wit) - 應用程序和設備的自然語言界面。 - [IBM Watson 的自然語意理解](https://github.com/watson-developer-cloud/natural-language-understanding-nodejs) - API 和 Github 演示。 - [Amazon 理解](https://aws.amazon.com/comprehend/) - NLP 和 ML 套件涵蓋了最常見的任務,如 NER,標記和情感分析。 - [Google 雲端自然語言 API](https://cloud.google.com/natural-language/) - 至少9種語言的語法分析,NER,情感分析和內容標記包括英語和中文(簡體和繁體)。 - [ParallelDots](https://www.paralleldots.com/text-analysis-apis) - 高層次文本分析 API 服務,從情感分析到意圖分析。 - [Microsoft 認知服務](https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/) - [TextRazor](https://www.textrazor.com/) - [Rosette](https://www.rosette.com/) - [Textalytic](https://www.textalytic.com) - 瀏覽器中的自然語言處理,包括情感分析,命名實體提取,POS標記,詞頻,主題建模,文字雲等。 ### 註釋工具 - [GATE](https://gate.ac.uk/overview.html) - 通用架構和文本工程已有15年歷史,免費開源。 
- [Anafora](https://github.com/weitechen/anafora) 是免費的開源,基於 Web 的原始文本註釋工具。 - [brat](https://brat.nlplab.org/) - brat 快速註解工具是一個用於協作文本註釋的在線環境。 - [tagtog](https://www.tagtog.net/), 需花 $。 - [prodigy](https://prodi.gy/) 是一個由主動學習驅動的註釋工具,需花 $。 - [LightTag](https://lighttag.io) - 為團隊提供託管和管理的文本註釋工具,需花 $。 ## 技術 ### 文本嵌入 [返回頂部](#內容) 文本嵌入允許深度學習在較小的數據集上有效。這些通常是深入學習的第一步輸入和自然語言處理中最流行的遷移學習方式。嵌入只是簡單的向量,比實際值的字符串表示更為通用的方式。Word嵌入被認為是大多數深度NLP任務的一個很好的起點。 單詞嵌入中最流行的名字是 Google(Mikolov)的 word2vec 和史丹佛的 PenVe(Pennington,Socher 和Manning)。fastText 似乎是一種非常流行的多語言子詞嵌入。 #### 詞嵌入 [返回頂部](#內容) |嵌入 |論文| 組織| gensim - 培訓支援 |部落格| |---|---|---|---|---| |word2vec|[官方實作](https://code.google.com/archive/p/word2vec/), T.Mikolove et al. 2013. 分散式詞語表達及其組合性。[pdf](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) |Google|是 :heavy_check_mark:| colah 在[深度學習,自然語言處理和陳述](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)中的視覺會解釋; gensim 的[理解 word2vec](https://rare-technologies.com/making-sense-of-word2vec) | |GloVe|Jeffrey Pennington, Richard Socher 與 Christopher D. Manning. 2014. GloVe: 全局向量的字詞表示 [pdf](https://nlp.stanford.edu/pubs/glove.pdf)|史丹佛|否 :negative_squared_cross_mark:|acoyler 的 [GloVe 早報](https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/) | |fastText|[官方實作](https://github.com/facebookresearch/fastText), T. Mikolov et al. 2017. 
使用子詞資訊豐富單詞向量。 [pdf](https://arxiv.org/abs/1607.04606)|Facebook|是 :heavy_check_mark:|[Fasttext: 深入解析](https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3)| 給初學者的筆記: - 經驗法則: **fastText >> GloVe > word2vec** - 你可以找到許多語言[預訓練 fasttext 向量](https://fasttext.cc/docs/en/pretrained-vectors.html)。 - 如果你對 word2vec 和 GloVe 背後的邏輯和直覺感興趣: [詞向量的驚人力量](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)並很好地介紹這些主題。 - [arXiv: 高效文本分類的錦囊妙方](https://arxiv.org/abs/1607.01759), 與 [arXiv: FastText.zip: 壓縮文本分類模型](https://arxiv.org/abs/1612.03651) 作為 fasttext 的一部分發布。 #### 基於句子和語言模型的詞嵌入 [返回頂部](#內容) - _ElMo_ 從[深度情境詞表示](https://arxiv.org/abs/1802.05365) - [PyTorch 實作](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) - [TF 實作](https://github.com/allenai/bilm-tf) - _ULimFit_ Jeremy Howard 與 Sebastian Ruder 的[通用語言模型進行文本分類微調](https://arxiv.org/abs/1801.06146) - _InferSent_ facebook 的 [自然語言推論資料的通用語句表示監督是學習](https://arxiv.org/abs/1705.02364) - _CoVe_ from [在翻譯中學習: 情境詞相量](https://arxiv.org/abs/1708.00107) - _來自[文件與句子的分散式表達](https://cs.stanford.edu/~quocle/paragraph_vector.pdf). 
參閱 [gensim 的 doc2vec 教學](https://rare-technologies.com/doc2vec-tutorial/) - [sense2vec](https://arxiv.org/abs/1511.06388) - 關於詞義消歧。 - [跳過思考象量](https://arxiv.org/abs/1506.06726) - 單詞表示方法。 - [自適應 skip-gram](https://arxiv.org/abs/1502.07257) - 類似的方法,具有自適應屬性。 - [序列到序列學習](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) - 機器翻譯的詞向量。 ### 回答問題與知識提取 [返回頂部](#內容) - Facebook 透過維基百科 [DrQA: 打開領域為題解答](https://github.com/facebookresearch/DrQA) - DocQA: AllenAI 的[簡單而有效的多段閱讀理解](https://github.com/allenai/document-qa) - [用於自然語言問答的馬爾可夫邏輯網絡](https://arxiv.org/pdf/1507.03045v1.pdf) - [基於模板的資訊提取沒有用到模板](https://www.usna.edu/Users/cs/nchamber/pubs/acl2011-chambers-templates.pdf) - [矩陣分解與通用模式的關係提取](https://www.anthology.aclweb.org/N/N13/N13-1008.pdf) - [Privee:自動分析Web隱私策略的體系結構](https://www.sebastianzimmeck.de/zimmeckAndBellovin2014Privee.pdf) - [教學機器閱讀和理解](https://arxiv.org/abs/1506.03340) - DeepMind paper - [走向形式分佈語義:用張量模擬邏輯演算](https://www.aclweb.org/anthology/S13-1001) - [MLN 教學的演示投影片](https://github.com/clulab/nlp-reading-group/blob/master/fall-2015-resources/mln-summary-20150918.ppt) - [MLNs 的 QA 應用演示投影片](https://github.com/clulab/nlp-reading-group/blob/master/fall-2015-resources/Markov%20Logic%20Networks%20for%20Natural%20Language%20Question%20Answering.pdf) - [演示投影片](https://github.com/clulab/nlp-reading-group/blob/master/fall-2015-resources/poon-paper.pdf) ## 資料集 [返回頂部](#內容) - [nlp-datasets](https://github.com/niderhoff/nlp-datasets) 很好的自然語言資料集集合 ## 多語言自然語言處理框架 [返回頂部](#內容) - [UDPipe](https://github.com/ufal/udpipe) 是一個可訓練的管道,用於標記,標記,解釋和解析通用樹庫和其他 CoNLL-U 文件。主要用 C++ 編寫,為多語言NLP處理提供快速可靠的解決方案。 - [NLP-Cube](https://github.com/adobe/NLP-Cube) : 自然語言處理流水線 - 句子分裂,標記化,詞形還原,詞性標註和依賴性分析。用 Dynet 2.0 用 Python 編寫的新平台。提供獨立(CLI / Python 綁定)和服務器功能(REST API)。 ## 自然語言處理-韓文 [返回頂部](#內容) ### 函式庫 - [KoNLPy](http://konlpy.org) - 用於韓語自然語言處理的Python包。 - [Mecab (Korean)](https://eunjeon.blogspot.com/) - 韓文的自然語言處理 C++ 函式庫 - [KoalaNLP](https://koalanlp.github.io/koalanlp/) - 
韓國自然語言處理的 Scala 函式庫。 - [KoNLP](https://cran.r-project.org/package=KoNLP) - 韓文的自然語言處理 R 包。 ### 部落格與教學 - [dsindex 的部落格](https://dsindex.github.io/) - [韓國江原大學的自然語言處理課程](http://cs.kangwon.ac.kr/~leeck/NLP/) ### 資料集 - [KAIST 語料庫](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus) - 韓國高等科學技術研究所的語料庫。 - [韓國 Naver 情感電影語料庫](https://github.com/e9t/nsmc/) - [朝鮮日報檔案館](http://srchdb1.chosun.com/pdf/i_archive/) - 來自韓國主要報紙之一的朝鮮日報的韓文數據集。 ## 自然語言處理-阿拉伯語 [返回頂部](#內容) ### 函式庫 - [goarabic](https://github.com/01walid/goarabic) - Go包用於阿拉伯語文本處理。 - [jsastem](https://github.com/ejtaal/jsastem) - 用於阿拉伯詞幹的Javascript。 - [PyArabic](https://pypi.org/project/PyArabic/) - 阿拉伯語的 Python 函式庫。 ### 資料集 - [多域數據集](https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces) - 阿拉伯語情感分析的最大可用多域資源。 - [LABR](https://github.com/mohamedadaly/labr) - LArge阿拉伯書籍評論數據集。 - [Arabic 停用詞](https://github.com/mohataher/arabic-stop-words) - 來自各種資源的阿拉伯語停用詞列表。 ## 自然語言處理-中文 [返回頂部](#內容) ### 函式庫 - [jieba](https://github.com/fxsjy/jieba#jieba-1) - 中文詞彙分割實用程序的 Python 包。 - [SnowNLP](https://github.com/isnowfy/snownlp) - 中文自然語言處理 Python 包。 - [FudanNLP](https://github.com/FudanNLP/fnlp) - 用於中文文本處理的 Java 函式庫。 ## 自然語言處理-德文 [返回頂部](#內容) - [德文-自然語言處理](https://github.com/adbar/German-NLP) - 開發的開放式訪問/開源/現成資源和工具列表,特別關注德語。 ## 自然語言處理-西班牙語 [返回頂部](#內容) ### 資料 - [哥倫比亞政治演說](https://github.com/dav009/LatinamericanTextResources) - [哥本哈根樹庫](https://mbkromann.github.io/copenhagen-dependency-treebank/) - [西班牙語十億字語料庫與 Word2Vec 嵌入](https://github.com/crscardellino/sbwce) ## 自然語言處理-印度語 [返回頂部](#內容) ### 印地語 ### 資料, 文集與樹庫 - [印地語依賴樹庫](https://ltrc.iiit.ac.in/treebank_H2014/) - 印地語和烏爾都語的多代表性多層樹庫。 - [在印地語的普遍依賴性樹庫](https://universaldependencies.org/treebanks/hi_hdtb/index.html) - [並行通用依賴樹庫印地語](http://universaldependencies.org/treebanks/hi_pud/index.html) - 上述樹庫的一小部分。 ## 自然語言處理-泰語 [返回頂部](#內容) ### 函式庫 - [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) - Python 包中的泰語自然語言處理。 - [JTCC](https://github.com/wittawatj/jtcc) - Java 中的字符集群庫。 - 
[CutKum](https://github.com/pucktada/cutkum) - 在 TensorFlow 中使用深度學習進行分詞。 - [泰語工具包](https://pypi.python.org/pypi/tltk/) - 基於 Wirote Aroonmanakun 於2002年撰寫的一篇論文,其中包括數據集。 - [SynThai](https://github.com/KenjiroAI/SynThai) - 在 Python 中使用深度學習進行分詞和 POS 標記。 ### 資料 - [Inter-BEST](https://www.nectec.or.th/corpus/index.php?league=pm) - 具有500萬個單詞分詞的文本語料庫。 - [Prime Minister 29](https://github.com/PyThaiNLP/lexicon-thai/tree/master/thai-corpus/Prime%20Minister%2029) - 數據集包含現任泰國總理的演講。 ## 自然語言處理-丹麥語 [返回頂部](#內容) - [丹麥的命名實體識別](https://github.com/ITUnlp/daner) ## 自然語言處理-越南語 [返回頂部](#內容) ### 函式庫 - [underthesea](https://github.com/undertheseanlp/underthesea) - 越南自然語言處理工具包。 - [vn.vitk](https://github.com/phuonglh/vn.vitk) - 越南文本處理工具包。 - [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) - 越南自然語言處理工具包。 ### 資料 - [越南樹庫](https://vlsp.hpda.vn/demo/?page=resources&lang=en) - 選區解析任務的10,000個句子。 - [BKTreeBank](https://arxiv.org/pdf/1710.05519.pdf) - 越南依賴樹庫。 - [UD_Vietnamese](https://github.com/UniversalDependencies/UD_Vietnamese-VTB) - 越南通用依賴樹庫。 - [VIVOS](https://ailab.hcmus.edu.vn/vivos/) - 一個免費的越南語言語料庫,由 AILab 的15小時錄音講話組成。 - [VNTQcorpus(big).txt](http://viet.jnlp.org/download-du-lieu-tu-vung-corpus) - 新聞中的175萬句話。 ## 自然語言處理-印度尼西亞 [返回頂部](#內容) ### 資料集 - [ILPS](http://ilps.science.uva.nl/resources/bahasa/) 的Kompas 和 Tempo 系列。 - [用於PoS標記的PANL10N](http://www.panl10n.net/english/outputs/Indonesia/UI/0802/UI-1M-tagged.zip): 39K句子和900K字標記。 - [用於PoS標記的IDN](https://github.com/famrashel/idn-tagged-corpus): 該語料庫包含10K個句子和250K個單詞標記。 - [印度尼西亞樹庫](https://github.com/famrashel/idn-treebank)和 [普遍依賴 - 印度尼西亞語](https://github.com/UniversalDependencies/UD_Indonesian-GSD) - [IndoSum](https://github.com/kata-ai/indosum) 用於文本摘要和分類。 - [Wordnet-Bahasa](http://wn-msa.sourceforge.net/) - 大型,免費的語義詞典。 ### 函式庫與嵌入 - 自然語言工具包 [bahasa](https://github.com/kangfend/bahasa) - [印尼語嵌入](https://github.com/galuhsahid/indonesian-word-embedding) - 預訓練的訓練 [印尼 fastText 文本嵌入](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.id.zip) 
的維基百科。 ## 其他語言 [返回頂部](#內容) - 俄語: [pymorphy2](https://github.com/kmike/pymorphy2) - - 俄語好的詞性標記。 - 亞洲語言: ElasticSearch 中的泰語,老撾語,中文,日語和韓語 [ICU Tokenizer](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html) 實現。 - 古代語言: [CLTK](https://github.com/cltk/cltk): 古典語言工具包是一個 Python 函式庫和用於在古代語言中進行自然語言處理的文本集合。 - Dutch: [python-frog](https://github.com/proycon/python-frog) - Python 綁定到 Frog,一個荷蘭語的自然語言處理套件。(pos 標記,詞形還原,依賴解析,NER - 希伯來語: [NLPH_Resources](https://github.com/NLPH/NLPH_Resources) - 希伯來語自然語言處理的論文,語料庫和語言資源的集合。 ## 貢獻 初始策展人和來源的[貢獻](./CREDITS.md)。 ================================================ FILE: README.md ================================================ # awesome-nlp [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) A curated list of resources dedicated to Natural Language Processing ![Awesome NLP Logo](/images/logo.jpg) Read this in [English](./README.md), [Traditional Chinese](./README-ZH-TW.md) _Please read the [contribution guidelines](contributing.md) before contributing. 
Please add your favourite NLP resource by raising a [pull request](https://github.com/keonkim/awesome-nlp/pulls)_ ## Contents * [Research Summaries and Trends](#research-summaries-and-trends) * [Prominent NLP Research Labs](#prominent-nlp-research-labs) * [Tutorials](#tutorials) * [Reading Content](#reading-content) * [Videos and Courses](#videos-and-online-courses) * [Books](#books) * [Libraries](#libraries) * [Node.js](#node-js) * [Python](#python) * [C++](#c++) * [Java](#java) * [Kotlin](#kotlin) * [Scala](#scala) * [R](#R) * [Clojure](#clojure) * [Ruby](#ruby) * [Rust](#rust) * [NLP++](#NLP++) * [Julia](#julia) * [Services](#services) * [Annotation Tools](#annotation-tools) * [Datasets](#datasets) * [NLP in Korean](#nlp-in-korean) * [NLP in Arabic](#nlp-in-arabic) * [NLP in Chinese](#nlp-in-chinese) * [NLP in German](#nlp-in-german) * [NLP in Polish](#nlp-in-polish) * [NLP in Spanish](#nlp-in-spanish) * [NLP in Indic Languages](#nlp-in-indic-languages) * [NLP in Thai](#nlp-in-thai) * [NLP in Danish](#nlp-in-danish) * [NLP in Vietnamese](#nlp-in-vietnamese) * [NLP for Dutch](#nlp-for-dutch) * [NLP in Indonesian](#nlp-in-indonesian) * [NLP in Urdu](#nlp-in-urdu) * [NLP in Persian](#nlp-in-persian) * [NLP in Ukrainian](#nlp-in-ukrainian) * [NLP in Hungarian](#nlp-in-hungarian) * [NLP in Portuguese](#nlp-in-portuguese) * [Other Languages](#other-languages) * [Citation](#citation) * [Credits](#credits) ## Research Summaries and Trends * [NLP-Overview](https://nlpoverview.com/) is an up-to-date overview of deep learning techniques applied to NLP, including theory, implementations, applications, and state-of-the-art results. This is a great Deep NLP Introduction for researchers. 
* [NLP-Progress](https://nlpprogress.com/) tracks the progress in Natural Language Processing, including the datasets and the current state-of-the-art for the most common NLP tasks * [NLP's ImageNet moment has arrived](https://thegradient.pub/nlp-imagenet/) * [ACL 2018 Highlights: Understanding Representation and Evaluation in More Challenging Settings](http://ruder.io/acl-2018-highlights/) * [Four deep learning trends from ACL 2017. Part One: Linguistic Structure and Word Embeddings](https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-1.html) * [Four deep learning trends from ACL 2017. Part Two: Interpretability and Attention](https://www.abigailsee.com/2017/08/30/four-deep-learning-trends-from-acl-2017-part-2.html) * [Highlights of EMNLP 2017: Exciting Datasets, Return of the Clusters, and More!](http://blog.aylien.com/highlights-emnlp-2017-exciting-datasets-return-clusters/) * [Deep Learning for Natural Language Processing (NLP): Advancements & Trends](https://tryolabs.com/blog/2017/12/12/deep-learning-for-nlp-advancements-and-trends-in-2017/?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=The%20Wild%20Week%20in%20AI) * [Survey of the State of the Art in Natural Language Generation](https://arxiv.org/abs/1703.09902) ## Prominent NLP Research Labs [Back to Top](#contents) * [The Berkeley NLP Group](http://nlp.cs.berkeley.edu/index.shtml) - Notable contributions include a tool that reconstructs long-dead languages (referenced [here](https://www.bbc.com/news/science-environment-21427896)) by taking corpora from 637 languages currently spoken in Asia and the Pacific and recreating their common ancestor.
* [Language Technologies Institute, Carnegie Mellon University](http://www.cs.cmu.edu/~nasmith/nlp-cl.html) - Notable projects include [Avenue Project](http://www.cs.cmu.edu/~avenue/), a syntax-driven machine translation system for endangered languages like Quechua and Aymara, and previously [Noah's Ark](http://www.cs.cmu.edu/~ark/), which created [AQMAR](http://www.cs.cmu.edu/~ark/AQMAR/) to improve NLP tools for Arabic. * [NLP research group, Columbia University](http://www1.cs.columbia.edu/nlp/index.cgi) - Responsible for creating BOLT (interactive error handling for speech translation systems) and an unnamed project to characterize laughter in dialogue. * [The Center for Language and Speech Processing, Johns Hopkins University](http://clsp.jhu.edu/) - Recently in the news for developing speech recognition software to create a diagnostic test for Parkinson's disease, described [here](https://www.clsp.jhu.edu/2019/03/27/speech-recognition-software-and-machine-learning-tools-are-being-used-to-create-diagnostic-test-for-parkinsons-disease/#.XNFqrIkzYdU). * [Computational Linguistics and Information Processing Group, University of Maryland](https://wiki.umiacs.umd.edu/clip/index.php/Main_Page) - Notable contributions include [Human-Computer Cooperation for Word-by-Word Question Answering](http://www.umiacs.umd.edu/~jbg/projects/IIS-1652666) and modeling development of phonetic representations. * [Penn Natural Language Processing, University of Pennsylvania](https://nlp.cis.upenn.edu/) - Famous for creating the [Penn Treebank](https://www.seas.upenn.edu/~pdtb/).
* [The Stanford Natural Language Processing Group](https://nlp.stanford.edu/) - One of the top NLP research labs in the world, notable for creating [Stanford CoreNLP](https://nlp.stanford.edu/software/corenlp.shtml) and their [coreference resolution system](https://nlp.stanford.edu/software/dcoref.shtml) ## Tutorials [Back to Top](#contents) ### Reading Content General Machine Learning * [Machine Learning 101](https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/edit?usp=sharing) from Google's Senior Creative Engineer explains Machine Learning for engineers and executives alike * [AI Playbook](https://aiplaybook.a16z.com/) - a16z AI playbook is a great link to forward to your managers or content for your presentations * [Ruder's Blog](http://ruder.io/#open) by [Sebastian Ruder](https://twitter.com/seb_ruder) for commentary on the best of NLP Research * [How To Label Data](https://www.lighttag.io/how-to-label-data/) guide to managing larger linguistic annotation projects * [Depends on the Definition](https://www.depends-on-the-definition.com/) collection of blog posts covering a wide array of NLP topics with detailed implementation Introductions and Guides to NLP * [Understand & Implement Natural Language Processing](https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/) * [NLP in Python](http://github.com/NirantK/nlp-python-deep-learning) - Collection of GitHub notebooks * [Natural Language Processing: An Introduction](https://academic.oup.com/jamia/article/18/5/544/829676) - Oxford * [Deep Learning for NLP with Pytorch](https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html) * [Hands-On NLTK Tutorial](https://github.com/hb20007/hands-on-nltk-tutorial) - NLTK Tutorials, Jupyter notebooks * [Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit](https://www.nltk.org/book/) - An online and print book
introducing NLP concepts using NLTK. The book's authors also wrote the NLTK library. * [Train a new language model from scratch](https://huggingface.co/blog/how-to-train) - Hugging Face 🤗 * [The Super Duper NLP Repo (SDNLPR)](https://notebooks.quantumstat.com/): Collection of Colab notebooks covering a wide array of NLP task implementations. * [Advanced NLP with spaCy](https://course.spacy.io/en/) - Free online course covering text processing, large-scale data analysis, processing pipelines, and training neural network models for custom NLP tasks. * [Kaggle NLP Learning Guide](https://www.kaggle.com/learn-guide/natural-language-processing) - Beginner-friendly tutorials including getting started guides, deep learning for NLP, and visual explanations of techniques like BERT, GloVe, and TF-IDF. Blogs and Newsletters * [Deep Learning, NLP, and Representations](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) * [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](https://jalammar.github.io/illustrated-bert/) and [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) * [Natural Language Processing](https://nlpers.blogspot.com/) by Hal Daumé III * [arXiv: Natural Language Processing (Almost) from Scratch](https://arxiv.org/pdf/1103.0398.pdf) * [Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness) * [Machine Learning Mastery: Deep Learning for Natural Language Processing](https://machinelearningmastery.com/category/natural-language-processing) * [Visual NLP Paper Summaries](https://amitness.com/categories/#nlp) ### Videos and Online Courses [Back to Top](#contents) * [Advanced Natural Language Processing](https://people.cs.umass.edu/~miyyer/cs685_f20/) - CS 685, UMass Amherst CS * [Deep Natural Language Processing](https://github.com/oxford-cs-deepnlp-2017/lectures) - Lecture series from Oxford * [Deep Learning for Natural Language
Processing (cs224-n)](https://web.stanford.edu/class/cs224n/) - Richard Socher and Christopher Manning's Stanford Course * [Neural Networks for NLP](http://phontron.com/class/nn4nlp2017/) - Carnegie Mellon Language Technologies Institute * [Deep NLP Course](https://github.com/yandexdataschool/nlp_course) by Yandex Data School, covering important ideas from text embedding to machine translation including sequence modeling, language models and so on. * [fast.ai Code-First Intro to Natural Language Processing](https://www.fast.ai/2019/07/08/fastai-nlp/) - This covers a blend of traditional NLP topics (including regex, SVD, naive Bayes, tokenization) and recent neural network approaches (including RNNs, seq2seq, GRUs, and the Transformer), as well as addressing urgent ethical issues, such as bias and disinformation. Find the Jupyter Notebooks [here](https://github.com/fastai/course-nlp) * [Machine Learning University - Accelerated Natural Language Processing](https://www.youtube.com/playlist?list=PL8P_Z6C4GcuWfAq8Pt6PBYlck4OprHXsw) - Lectures go from introduction to NLP and text processing to Recurrent Neural Networks and Transformers. Material can be found [here](https://github.com/aws-samples/aws-machine-learning-university-accelerated-nlp). * [Applied Natural Language Processing](https://www.youtube.com/playlist?list=PLH-xYrxjfO2WyR3pOAB006CYMhNt4wTqp) - Lecture series from IIT Madras that goes from the basics all the way to autoencoders. The GitHub notebooks for this course are also available [here](https://github.com/Ramaseshanr/anlp) * [DeepLearning.AI Natural Language Processing Specialization](https://www.deeplearning.ai/courses/natural-language-processing-specialization/) - 4-course program covering sentiment analysis, word embeddings, RNNs, LSTMs, attention mechanisms, and Transformer models like BERT and T5 for tasks including machine translation and summarization.
### Books * [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) - free, by Prof. Dan Jurafsky * [Natural Language Processing](https://github.com/jacobeisenstein/gt-nlp-class) - free, NLP notes by Dr. Jacob Eisenstein at Georgia Tech * [NLP with PyTorch](https://github.com/joosthub/PyTorchNLPBook) - Brian McMahan & Delip Rao * [Text Mining in R](https://www.tidytextmining.com) * [Natural Language Processing with Python](https://www.nltk.org/book/) * [Practical Natural Language Processing](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/) * [Natural Language Processing with Spark NLP](https://www.oreilly.com/library/view/natural-language-processing/9781492047759/) * [Deep Learning for Natural Language Processing](https://www.manning.com/books/deep-learning-for-natural-language-processing) by Stephan Raaijmakers * [Real-World Natural Language Processing](https://www.manning.com/books/real-world-natural-language-processing) - by Masato Hagiwara * [Natural Language Processing in Action, Second Edition](https://www.manning.com/books/natural-language-processing-in-action-second-edition) - by Hobson Lane and Maria Dyshel * [Transformers in Action](https://www.manning.com/books/transformers-in-action) - by Nicole Koenigstein * [The Math Behind Artificial Intelligence](https://www.freecodecamp.org/news/the-math-behind-artificial-intelligence-book) - by Tiago Monteiro | A free freeCodeCamp book teaching the math behind AI in plain English from an engineering point of view. It covers linear algebra, calculus, probability & statistics, and optimization theory with analogies, real-life applications, and Python code examples.
## Libraries [Back to Top](#contents) * **Node.js and Javascript** - Node.js Libraries for NLP | [Back to Top](#contents) * [Twitter-text](https://github.com/twitter/twitter-text) - A JavaScript implementation of Twitter's text processing library * [Knwl.js](https://github.com/benhmoore/Knwl.js) - A Natural Language Processor in JS * [Retext](https://github.com/retextjs/retext) - Extensible system for analyzing and manipulating natural language * [NLP Compromise](https://github.com/spencermountain/compromise) - Natural Language processing in the browser * [Natural](https://github.com/NaturalNode/natural) - general natural language facilities for node * [Poplar](https://github.com/synyi/poplar) - A web-based annotation tool for natural language processing (NLP) * [NLP.js](https://github.com/axa-group/nlp.js) - An NLP library for building bots * [node-question-answering](https://github.com/huggingface/node-question-answering) - Fast and production-ready question answering w/ DistilBERT in Node.js * **Python** - Python NLP Libraries | [Back to Top](#contents) - [sentimental-onix](https://github.com/sloev/sentimental-onix) - Sentiment models for spaCy using ONNX - [TextAttack](https://github.com/QData/TextAttack) - Adversarial attacks, adversarial training, and data augmentation in NLP - [TextBlob](http://textblob.readthedocs.org/) - Providing a consistent API for diving into common natural language processing (NLP) tasks.
Stands on the giant shoulders of [Natural Language Toolkit (NLTK)](https://www.nltk.org/) and [Pattern](https://github.com/clips/pattern), and plays nicely with both :+1: - [spaCy](https://github.com/explosion/spaCy) - Industrial strength NLP with Python and Cython :+1: - [Speedster](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster) - Automatically apply SOTA optimization techniques to achieve the maximum inference speed-up on your hardware - [textacy](https://github.com/chartbeat-labs/textacy) - Higher level NLP built on spaCy - [gensim](https://radimrehurek.com/gensim/index.html) - Python library to conduct unsupervised semantic modelling from plain text :+1: - [scattertext](https://github.com/JasonKessler/scattertext) - Python library to produce d3 visualizations of how language differs between corpora - [GluonNLP](https://github.com/dmlc/gluon-nlp) - A deep learning toolkit for NLP, built on MXNet/Gluon, for research prototyping and industrial deployment of state-of-the-art models on a wide range of NLP tasks. - [AllenNLP](https://github.com/allenai/allennlp) - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. - [PyTorch-NLP](https://github.com/PetrochukM/PyTorch-NLP) - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, common NLP metrics such as BLEU - [Rosetta](https://github.com/columbia-applied-data-science/rosetta) - Text processing tools and wrappers (e.g. Vowpal Wabbit) - [PyNLPl](https://github.com/proycon/pynlpl) - Python Natural Language Processing Library. General purpose NLP library for Python, handles some specific formats like ARPA language models, Moses phrasetables, GIZA++ alignments. 
- [foliapy](https://github.com/proycon/foliapy) - Python library for working with [FoLiA](https://proycon.github.io/folia/), an XML format for linguistic annotation. - [PySS3](https://github.com/sergioburdisso/pyss3) - Python package that implements a novel white-box machine learning model for text classification, called SS3. Since SS3 has the ability to visually explain its rationale, this package also comes with easy-to-use interactive visualization tools ([online demos](http://tworld.io/ss3/)). - [jPTDP](https://github.com/datquocnguyen/jPTDP) - A toolkit for joint part-of-speech (POS) tagging and dependency parsing. jPTDP provides pre-trained models for 40+ languages. - [BigARTM](https://github.com/bigartm/bigartm) - a fast library for topic modelling - [Snips NLU](https://github.com/snipsco/snips-nlu) - A production ready library for intent parsing - [Chazutsu](https://github.com/chakki-works/chazutsu) - A library for downloading and parsing standard NLP research datasets - [Word Forms](https://github.com/gutfeeling/word_forms) - Word forms can accurately generate all possible forms of an English word - [Multilingual Latent Dirichlet Allocation (LDA)](https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA) - A multilingual and extensible document clustering pipeline - [Natural Language Toolkit (NLTK)](https://www.nltk.org/) - A library containing a wide variety of NLP functionality, supporting over 50 corpora. - [NLP Architect](https://github.com/NervanaSystems/nlp-architect) - A library for exploring the state-of-the-art deep learning topologies and techniques for NLP and NLU - [Flair](https://github.com/zalandoresearch/flair) - A very simple framework for state-of-the-art multilingual NLP built on PyTorch. Includes BERT, ELMo and Flair embeddings.
- [Kashgari](https://github.com/BrikerMan/Kashgari) - Simple, Keras-powered multilingual NLP framework, allows you to build your models in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS) and text classification tasks. Includes BERT and word2vec embedding. - [FARM](https://github.com/deepset-ai/FARM) - Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering. - [Haystack](https://github.com/deepset-ai/haystack) - End-to-end Python framework for building natural language search interfaces to data. Leverages Transformers and the State-of-the-Art of NLP. Supports DPR, Elasticsearch, HuggingFace’s Modelhub, and much more! - [PraisonAI](https://github.com/MervinPraison/PraisonAI) - Multi-AI Agents framework with 100+ LLM support via LiteLLM, MCP integration, agentic workflows, and built-in memory for NLP tasks. - [Rita DSL](https://github.com/zaibacu/rita-dsl) - a DSL, loosely based on [RUTA on Apache UIMA](https://uima.apache.org/ruta.html). Lets you define language patterns (rule-based NLP), which are then translated into [spaCy](https://spacy.io/) patterns or, if you prefer a lighter-weight option with fewer features, into regex patterns. - [Transformers](https://github.com/huggingface/transformers) - Natural Language Processing for TensorFlow 2.0 and PyTorch. - [Tokenizers](https://github.com/huggingface/tokenizers) - Tokenizers optimized for Research and Production. - [fairSeq](https://github.com/pytorch/fairseq) - Facebook AI Research implementations of SOTA seq2seq models in PyTorch. - [corex_topic](https://github.com/gregversteeg/corex_topic) - Hierarchical Topic Modeling with Minimal Domain Knowledge - [Sockeye](https://github.com/awslabs/sockeye) - Neural Machine Translation (NMT) toolkit that powers Amazon Translate. - [DL Translate](https://github.com/xhlulu/dl-translate) - A deep learning-based translation library for 50 languages, built on `transformers` and Facebook's mBART Large.
- [Jury](https://github.com/obss/jury) - Evaluation of NLP model outputs offering various automated metrics. - [python-ucto](https://github.com/proycon/python-ucto) - Unicode-aware regular-expression based tokenizer for various languages. Python binding to C++ library, supports [FoLiA format](https://proycon.github.io/folia). - [Pearmut](https://github.com/zouharvi/pearmut) - Human annotation tool for multilingual NLP tasks, such as machine translation. - **C++** - C++ Libraries | [Back to Top](#contents) - [InsNet](https://github.com/chncwang/InsNet) - A neural network library for building instance-dependent NLP models with padding-free dynamic batching. - [MIT Information Extraction Toolkit](https://github.com/mit-nlp/MITIE) - C, C++, and Python tools for named entity recognition and relation extraction - [CRF++](https://taku910.github.io/crfpp/) - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks. - [CRFsuite](http://www.chokkan.org/software/crfsuite/) - CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data. - [BLLIP Parser](https://github.com/BLLIP/bllip-parser) - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser) - [colibri-core](https://github.com/proycon/colibri-core) - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way. - [ucto](https://github.com/LanguageMachines/ucto) - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format. 
- [libfolia](https://github.com/LanguageMachines/libfolia) - C++ library for the [FoLiA format](https://proycon.github.io/folia/) - [frog](https://github.com/LanguageMachines/frog) - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer. - [MeTA](https://github.com/meta-toolkit/meta) - [MeTA : ModErn Text Analysis](https://meta-toolkit.org/) is a C++ Data Sciences Toolkit that facilitates mining big text data. - [Mecab (Japanese)](https://taku910.github.io/mecab/) - [Moses](http://statmt.org/moses/) - [StarSpace](https://github.com/facebookresearch/StarSpace) - a library from Facebook for creating embeddings of word-level, paragraph-level, document-level and for text classification - [QSMM](http://qsmm.org) - adaptive probabilistic top-down and bottom-up parsers - **Java** - Java NLP Libraries | [Back to Top](#contents) - [Stanford NLP](https://nlp.stanford.edu/software/index.shtml) - [OpenNLP](https://opennlp.apache.org/) - [NLP4J](https://emorynlp.github.io/nlp4j/) - [Word2vec in Java](https://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec) - [ReVerb](https://github.com/knowitall/reverb/) Web-Scale Open Information Extraction - [OpenRegex](https://github.com/knowitall/openregex) An efficient and flexible token-based regular expression language and engine. - [CogcompNLP](https://github.com/CogComp/cogcomp-nlp) - Core libraries developed in the U of Illinois' Cognitive Computation Group. - [MALLET](http://mallet.cs.umass.edu/) - MAchine Learning for LanguagE Toolkit - package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. - [RDRPOSTagger](https://github.com/datquocnguyen/RDRPOSTagger) - A robust POS tagging toolkit available (in both Java & Python) together with pre-trained models for 40+ languages. 
- **Kotlin** - Kotlin NLP Libraries | [Back to Top](#contents) - [Lingua](https://github.com/pemistahl/lingua/) A language detection library for Kotlin and Java, suitable for long and short text alike - [Kotidgy](https://github.com/meiblorn/kotidgy) — an index-based text data generator written in Kotlin - **Scala** - Scala NLP Libraries | [Back to Top](#contents) - [Saul](https://github.com/CogComp/saul) - Library for developing NLP systems, including built in modules like SRL, POS, etc. - [ATR4S](https://github.com/ispras/atr4s) - Toolkit with state-of-the-art [automatic term recognition](https://en.wikipedia.org/wiki/Terminology_extraction) methods. - [tm](https://github.com/ispras/tm) - Implementation of topic modeling based on regularized multilingual [PLSA](https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis). - [word2vec-scala](https://github.com/Refefer/word2vec-scala) - Scala interface to word2vec model; includes operations on vectors like word-distance and word-analogy. - [Epic](https://github.com/dlwh/epic) - Epic is a high performance statistical parser written in Scala, along with a framework for building complex structured prediction models. - [Spark NLP](https://github.com/JohnSnowLabs/spark-nlp) - Spark NLP is a natural language processing library built on top of Apache Spark ML that provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. - **R** - R NLP Libraries | [Back to Top](#contents) - [text2vec](https://github.com/dselivanov/text2vec) - Fast vectorization, topic modeling, distances and GloVe word embeddings in R. 
- [wordVectors](https://github.com/bmschmidt/wordVectors) - An R package for creating and exploring word2vec and other word embedding models - [RMallet](https://github.com/mimno/RMallet) - R package to interface with the Java machine learning tool MALLET - [dfr-browser](https://github.com/agoldst/dfr-browser) - Creates d3 visualizations for browsing topic models of text in a web browser. - [dfrtopics](https://github.com/agoldst/dfrtopics) - R package for exploring topic models of text. - [sentiment_classifier](https://github.com/kevincobain2000/sentiment_classifier) - Sentiment Classification using Word Sense Disambiguation and WordNet Reader - [jProcessing](https://github.com/kevincobain2000/jProcessing) - Japanese Natural Language Processing Libraries, with Japanese sentiment classification - [corporaexplorer](https://kgjerde.github.io/corporaexplorer/) - An R package for dynamic exploration of text collections - [tidytext](https://github.com/juliasilge/tidytext) - Text mining using tidy tools - [spacyr](https://github.com/quanteda/spacyr) - R wrapper to spaCy NLP - [CRAN Task View: Natural Language Processing](https://github.com/cran-task-views/NaturalLanguageProcessing/) - **Clojure** | [Back to Top](#contents) - [Clojure-openNLP](https://github.com/dakrone/clojure-opennlp) - Natural Language Processing in Clojure (opennlp) - [Inflections-clj](https://github.com/r0man/inflections-clj) - Rails-like inflection library for Clojure and ClojureScript - [postagga](https://github.com/fekr/postagga) - A library to parse natural language in Clojure and ClojureScript - **Ruby** | [Back to Top](#contents) - Kevin Dias's [A collection of Natural Language Processing (NLP) Ruby libraries, tools and software](https://github.com/diasks2/ruby-nlp) - [Practical Natural Language Processing done in Ruby](https://github.com/arbox/nlp-with-ruby) - **Rust** | [Back to Top](#contents) - [adk-rust](https://github.com/zavora-ai/adk-rust) - Production-ready AI agent development kit with
model-agnostic design (Gemini, OpenAI, Anthropic), multiple agent types, and MCP support - [whatlang](https://github.com/greyblake/whatlang-rs) — Natural language recognition library based on trigrams - [snips-nlu-rs](https://github.com/snipsco/snips-nlu-rs) - A production ready library for intent parsing - [rust-bert](https://github.com/guillaume-be/rust-bert) - Ready-to-use NLP pipelines and Transformer-based models - **NLP++** - NLP++ Language | [Back to Top](#contents) - [VSCode Language Extension](https://marketplace.visualstudio.com/items?itemName=dehilster.nlp) - NLP++ Language Extension for VSCode - [nlp-engine](https://github.com/VisualText/nlp-engine) - NLP++ engine to run NLP++ code on Linux including a full English parser - [VisualText](http://visualtext.org) - Homepage for the NLP++ Language - [NLP++ Wiki](http://wiki.naturalphilosophy.org/index.php?title=NLP%2B%2B) - Wiki entry for the NLP++ language - **Julia** | [Back to Top](#contents) - [CorpusLoaders](https://github.com/JuliaText/CorpusLoaders.jl) - A variety of loaders for various NLP corpora - [Languages](https://github.com/JuliaText/Languages.jl) - A package for working with human languages - [TextAnalysis](https://github.com/JuliaText/TextAnalysis.jl) - Julia package for text analysis - [TextModels](https://github.com/JuliaText/TextModels.jl) - Neural Network based models for Natural Language Processing - [WordTokenizers](https://github.com/JuliaText/WordTokenizers.jl) - High performance tokenizers for natural language processing and other related tasks - [Word2Vec](https://github.com/JuliaText/Word2Vec.jl) - Julia interface to word2vec ### Services NLP as API with higher level functionality such as NER, Topic tagging and so on | [Back to Top](#contents) - [Vedika API](https://vedika.io) - AI-powered Vedic astrology API with multi-agent swarm intelligence - [Wit-ai](https://github.com/wit-ai/wit) - Natural Language Interface for apps and devices - [IBM Watson's Natural Language 
Understanding](https://github.com/watson-developer-cloud/natural-language-understanding-nodejs) - API and GitHub demo - [Amazon Comprehend](https://aws.amazon.com/comprehend/) - NLP and ML suite that covers most common tasks like NER, tagging, and sentiment analysis - [Google Cloud Natural Language API](https://cloud.google.com/natural-language/) - Syntax Analysis, NER, Sentiment Analysis, and Content tagging in at least 9 languages, including English and Chinese (Simplified and Traditional). - [ParallelDots](https://www.paralleldots.com/text-analysis-apis) - High level Text Analysis API Service ranging from Sentiment Analysis to Intent Analysis - [Microsoft Cognitive Service](https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/) - [TextRazor](https://www.textrazor.com/) - [Rosette](https://www.rosette.com/) - [Textalytic](https://www.textalytic.com) - Natural Language Processing in the Browser with sentiment analysis, named entity extraction, POS tagging, word frequencies, topic modeling, word clouds, and more - [NLP Cloud](https://nlpcloud.io) - SpaCy NLP models (custom and pre-trained ones) served through a RESTful API for named entity recognition (NER), POS tagging, and more.
- [Cloudmersive](https://cloudmersive.com/nlp-api) - Unified and free NLP APIs that perform actions such as speech tagging, text rephrasing, language translation/detection, and sentence parsing ### Annotation Tools - [GATE](https://gate.ac.uk/overview.html) - General Architecture for Text Engineering is 15+ years old, free and open source - [Anafora](https://github.com/weitechen/anafora) is a free and open source, web-based raw text annotation tool - [brat](https://brat.nlplab.org/) - brat rapid annotation tool is an online environment for collaborative text annotation - [doccano](https://github.com/chakki-works/doccano) - doccano is free, open-source, and provides annotation features for text classification, sequence labeling and sequence to sequence - [INCEpTION](https://inception-project.github.io) - A semantic annotation platform offering intelligent assistance and knowledge management - [tagtog](https://www.tagtog.net/), team-first web tool to find, create, maintain, and share datasets - costs $ - [prodigy](https://prodi.gy/) is an annotation tool powered by active learning, costs $ - [LightTag](https://lighttag.io) - Hosted and managed text annotation tool for teams, costs $ - [rstWeb](https://corpling.uis.georgetown.edu/rstweb/info/) - open source local or online tool for discourse tree annotations - [GitDox](https://corpling.uis.georgetown.edu/gitdox/) - open source server annotation tool with GitHub version control and validation for XML data and collaborative spreadsheet grids - [Label Studio](https://www.heartex.ai/) - Hosted and managed text annotation tool for teams, freemium based, costs $ - [Datasaur](https://datasaur.ai/) supports various NLP tasks for individuals or teams, freemium based - [Konfuzio](https://konfuzio.com/en/) - team-first hosted and on-prem text, image and PDF annotation tool powered by active learning, freemium based, costs $ - [UBIAI](https://ubiai.tools/) - Easy-to-use text annotation tool for teams with the most comprehensive
auto-annotation features. Supports NER, relations and document classification as well as OCR annotation for invoice labeling, costs $ - [Shoonya](https://github.com/AI4Bharat/Shoonya-Backend) - Shoonya is a free and open source data annotation platform with a wide variety of organization- and workspace-level management features. Shoonya is data agnostic and can be used by teams to annotate data at scale with multiple levels of verification. - [Annotation Lab](https://www.johnsnowlabs.com/annotation-lab/) - Free End-to-End No-Code platform for text annotation and DL model training/tuning. Out-of-the-box support for Named Entity Recognition, Classification, Relation extraction and Assertion Status Spark NLP models. Unlimited support for users, teams, projects, documents. Not FOSS. - [FLAT](https://github.com/proycon/flat) - FLAT is a web-based linguistic annotation environment based around the [FoLiA format](http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. Free and open source.
## Techniques ### Text Embeddings #### Word Embeddings - Rule of thumb: **fastText >> GloVe > word2vec** - [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) - [implementation](https://code.google.com/archive/p/word2vec/) - [explainer blog](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) - [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) - [explainer blog](https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/) - fastText - [implementation](https://github.com/facebookresearch/fastText) - [paper](https://arxiv.org/abs/1607.04606) - [explainer blog](https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3) #### Sentence and Language Model Based Word Embeddings [Back to Top](#contents) - ELMo - [Deep Contextualized Word Representations](https://arxiv.org/abs/1802.05365) - [PyTorch implementation](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) - [TF Implementation](https://github.com/allenai/bilm-tf) - ULMFiT - [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) by Jeremy Howard and Sebastian Ruder - InferSent - [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364) by Facebook - CoVe - [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) - Paragraph vectors - from [Distributed Representations of Sentences and Documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf).
See [doc2vec tutorial at gensim](https://rare-technologies.com/doc2vec-tutorial/)
- [sense2vec](https://arxiv.org/abs/1511.06388) - on word sense disambiguation
- [Skip Thought Vectors](https://arxiv.org/abs/1506.06726) - word representation method
- [Adaptive skip-gram](https://arxiv.org/abs/1502.07257) - similar approach, with adaptive properties
- [Sequence to Sequence Learning](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) - word vectors for machine translation

### Question Answering and Knowledge Extraction

[Back to Top](#contents)

- [DrQA](https://github.com/facebookresearch/DrQA) - Open Domain Question Answering work by Facebook Research on Wikipedia data
- [Document-QA](https://github.com/allenai/document-qa) - Simple and Effective Multi-Paragraph Reading Comprehension by AllenAI
- [Template-Based Information Extraction without the Templates](https://www.usna.edu/Users/cs/nchamber/pubs/acl2011-chambers-templates.pdf)
- [Privee: An Architecture for Automatically Analyzing Web Privacy Policies](https://www.sebastianzimmeck.de/zimmeckAndBellovin2014Privee.pdf)

## Datasets

[Back to Top](#contents)

- [nlp-datasets](https://github.com/niderhoff/nlp-datasets) - Great collection of NLP datasets
- [gensim-data](https://github.com/RaRe-Technologies/gensim-data) - Data repository for pretrained NLP models and NLP corpora.
- [tiny_qa_benchmark_pp](https://github.com/vincentkoc/tiny_qa_benchmark_pp/) - Repository of tiny NLP multi-lingual QA datasets and library to generate your own synthetic copies.

## Multilingual NLP Frameworks

[Back to Top](#contents)

- [UDPipe](https://github.com/ufal/udpipe) is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files. Primarily written in C++, it offers a fast and reliable solution for multilingual NLP processing.
- [NLP-Cube](https://github.com/adobe/NLP-Cube): Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing. New platform, written in Python with Dynet 2.0. Offers standalone (CLI/Python bindings) and server functionality (REST API).
- [UralicNLP](https://github.com/mikahama/uralicNLP) is an NLP library aimed mostly at endangered Uralic languages such as Sami languages, Mordvin languages, Mari languages, Komi languages and so on. Some non-endangered languages such as Finnish are also supported, together with non-Uralic languages such as Swedish and Arabic. UralicNLP can do morphological analysis, generation, lemmatization and disambiguation.

## NLP in Korean

[Back to Top](#contents)

### Libraries

- [KoNLPy](http://konlpy.org) - Python package for Korean natural language processing.
- [Mecab (Korean)](https://eunjeon.blogspot.com/) - C++ library for Korean NLP
- [KoalaNLP](https://koalanlp.github.io/koalanlp/) - Scala library for Korean Natural Language Processing.
- [KoNLP](https://cran.r-project.org/package=KoNLP) - R package for Korean Natural language processing

### Blogs and Tutorials

- [dsindex's blog](https://dsindex.github.io/)
- [Kangwon University's NLP course in Korean](http://cs.kangwon.ac.kr/~leeck/NLP/)

### Datasets

- [KAIST Corpus](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus) - A corpus from the Korea Advanced Institute of Science and Technology in Korean.
- [Naver Sentiment Movie Corpus in Korean](https://github.com/e9t/nsmc/)
- [Chosun Ilbo archive](http://srchdb1.chosun.com/pdf/i_archive/) - dataset in Korean from one of the major newspapers in South Korea, the Chosun Ilbo.
- [Chat data](https://github.com/songys/Chatbot_data) - Chatbot data in Korean
- [Petitions](https://github.com/akngs/petitions) - Collect expired petition data from the Blue House National Petition Site.
- [Korean Parallel corpora](https://github.com/j-min/korean-parallel-corpora) - Neural Machine Translation (NMT) dataset for **Korean to French** & **Korean to English**
- [KorQuAD](https://korquad.github.io/) - Korean SQuAD dataset with Wiki HTML source. Mentions both v1.0 and v2.1 at the time of adding to Awesome NLP

## NLP in Arabic

[Back to Top](#contents)

### Libraries

- [goarabic](https://github.com/01walid/goarabic) - Go package for Arabic text processing
- [jsastem](https://github.com/ejtaal/jsastem) - JavaScript for Arabic stemming
- [PyArabic](https://pypi.org/project/PyArabic/) - Python libraries for Arabic
- [RFTokenizer](https://github.com/amir-zeldes/RFTokenizer) - trainable Python segmenter for Arabic, Hebrew and Coptic

### Datasets

- [Multidomain Datasets](https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces) - Largest Available Multi-Domain Resources for Arabic Sentiment Analysis
- [LABR](https://github.com/mohamedadaly/labr) - LArge Arabic Book Reviews dataset
- [Arabic Stopwords](https://github.com/mohataher/arabic-stop-words) - A list of Arabic stopwords from various resources

## NLP in Chinese

[Back to Top](#contents)

### Libraries

- [jieba](https://github.com/fxsjy/jieba#jieba-1) - Python package for Chinese word segmentation
- [SnowNLP](https://github.com/isnowfy/snownlp) - Python package for Chinese NLP
- [FudanNLP](https://github.com/FudanNLP/fnlp) - Java library for Chinese text processing
- [HanLP](https://github.com/hankcs/HanLP) - The multilingual NLP library

### Anthology

- [funNLP](https://github.com/fighting41love/funNLP) - Collection of NLP tools and resources mainly for Chinese

## NLP in German

- [German-NLP](https://github.com/adbar/German-NLP) - Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

## NLP in Polish

- [Polish-NLP](https://github.com/ksopyla/awesome-nlp-polish) - A curated list of resources dedicated to Natural Language
Processing (NLP) in Polish. Models, tools, datasets.

## NLP in Spanish

[Back to Top](#contents)

### Libraries

- [spanlp](https://github.com/jfreddypuentes/spanlp) - Python library to detect, censor and clean profanity, vulgarities, hateful words, racism, xenophobia and bullying in texts written in Spanish. It contains data from 21 Spanish-speaking countries.

### Data

- [Colombian Political Speeches](https://github.com/dav009/LatinamericanTextResources)
- [Copenhagen Treebank](https://mbkromann.github.io/copenhagen-dependency-treebank/)
- [Spanish Billion words corpus with Word2Vec embeddings](https://github.com/crscardellino/sbwce)
- [Compilation of Spanish Unannotated Corpora](https://github.com/josecannete/spanish-unannotated-corpora)

### Word and Sentence Embeddings

- [Spanish Word Embeddings Computed with Different Methods and from Different Corpora](https://github.com/dccuchile/spanish-word-embeddings)
- [Spanish Word Embeddings Computed from Large Corpora and Different Sizes Using fastText](https://github.com/BotCenter/spanishWordEmbeddings)
- [Spanish Sentence Embeddings Computed from Large Corpora Using sent2vec](https://github.com/BotCenter/spanishSent2Vec)
- [Beto - BERT for Spanish](https://github.com/dccuchile/beto)

## NLP in Indic languages

[Back to Top](#contents)

### Data, Corpora and Treebanks

- [Hindi Dependency Treebank](https://ltrc.iiit.ac.in/treebank_H2014/) - A multi-representational multi-layered treebank for Hindi and Urdu
- [Universal Dependencies Treebank in Hindi](https://universaldependencies.org/treebanks/hi_hdtb/index.html)
- [Parallel Universal Dependencies Treebank in Hindi](http://universaldependencies.org/treebanks/hi_pud/index.html) - A smaller part of the above-mentioned treebank.
- [ISI FIRE Stopwords List (Hindi and Bangla)](https://www.isical.ac.in/~fire/data/)
- [Peter Graham's Stopwords List](https://github.com/6/stopwords-json)
- [NLTK Corpus](https://www.nltk.org/book/ch02.html) 60k Words POS Tagged, Bangla, Hindi, Marathi, Telugu
- [Hindi Movie Reviews Dataset](https://github.com/goru001/nlp-for-hindi) ~1k Samples, 3 polarity classes
- [BBC News Hindi Dataset](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1) 4.3k Samples, 14 classes
- [IIT Patna Hindi ABSA Dataset](https://github.com/pnisarg/ABSA) 5.4k Samples, 12 Domains, 4k aspect terms, aspect and sentence level polarity in 4 classes
- [Bangla ABSA](https://github.com/AtikRahman/Bangla_Datasets_ABSA) 5.5k Samples, 2 Domains, 10 aspect terms
- [IIT Patna Movie Review Sentiment Dataset](https://www.iitp.ac.in/~ai-nlp-ml/resources.html) 2k Samples, 3 polarity labels

#### Corpora/Datasets that need a login (access can be gained via email)

- [SAIL 2015](http://amitavadas.com/SAIL/) Twitter and Facebook labelled sentiment samples in Hindi, Bengali, Tamil, Telugu.
- [IIT Bombay NLP Resources](http://www.cfilt.iitb.ac.in/Sentiment_Analysis_Resources.html) Sentiwordnet, Movie and Tourism parallel labelled corpora, polarity labelled sense annotated corpus, Marathi polarity labelled corpus.
- [TDIL-DC aggregates a lot of useful resources and provides access to otherwise gated datasets](https://tdil-dc.in/index.php?option=com_catalogue&task=viewTools&id=83&lang=en)

### Language Models and Word Embeddings

- [Hindi2Vec](https://nirantk.com/hindi2vec/) and [nlp-for-hindi](https://github.com/goru001/nlp-for-hindi) ULMFiT-style language model
- [IIT Patna Bilingual Word Embeddings Hi-En](https://www.iitp.ac.in/~ai-nlp-ml/resources.html)
- [Fasttext word embeddings in a whole bunch of languages, trained on Common Crawl](https://fasttext.cc/docs/en/crawl-vectors.html)
- [Hindi and Bengali Word2Vec](https://github.com/Kyubyong/wordvectors)
- [Hindi and Urdu ELMo Model](https://github.com/HIT-SCIR/ELMoForManyLangs)
- [Sanskrit Albert](https://huggingface.co/surajp/albert-base-sanskrit) Trained on Sanskrit Wikipedia and OSCAR corpus

### Libraries and Tooling

- [Multi-Task Deep Morphological Analyzer](https://github.com/Saurav0074/mt-dma) Deep Network based Morphological Parser for Hindi and Urdu
- [Anoop Kunchukuttan](https://github.com/anoopkunchukuttan/indic_nlp_library) 18 Languages, whole host of features from tokenization to translation
- [SivaReddy's Dependency Parser](http://sivareddy.in/downloads) Dependency Parser and POS Tagger for Kannada, Hindi and Telugu. [Python3 Port](https://github.com/CalmDownKarm/sivareddydependencyparser)
- [iNLTK](https://github.com/goru001/inltk) - A Natural Language Toolkit for Indic Languages (Indian subcontinent languages) built on top of Pytorch/Fastai, which aims to provide out of the box support for common NLP tasks.

## NLP in Thai

[Back to Top](#contents)

### Libraries

- [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) - Thai NLP package in Python
- [JTCC](https://github.com/wittawatj/jtcc) - A character cluster library in Java
- [CutKum](https://github.com/pucktada/cutkum) - Word segmentation with deep learning in TensorFlow
- [Thai Language Toolkit](https://pypi.python.org/pypi/tltk/) - Based on a paper by Wirote Aroonmanakun in 2002 with included dataset
- [SynThai](https://github.com/KenjiroAI/SynThai) - Word segmentation and POS tagging using deep learning in Python

### Data

- [Inter-BEST](https://www.nectec.or.th/corpus/index.php?league=pm) - A text corpus with 5 million words with word segmentation
- [Prime Minister 29](https://github.com/PyThaiNLP/lexicon-thai/tree/master/thai-corpus/Prime%20Minister%2029) - Dataset containing speeches of the current Prime Minister of Thailand

## NLP in Danish

- [Named Entity Recognition for Danish](https://github.com/ITUnlp/daner)
- [DaNLP](https://github.com/alexandrainst/danlp) - NLP resources in Danish
- [Awesome Danish](https://github.com/fnielsen/awesome-danish) - A curated list of awesome resources for Danish language technology

## NLP in Vietnamese

### Libraries

- [underthesea](https://github.com/undertheseanlp/underthesea) - Vietnamese NLP Toolkit
- [vn.vitk](https://github.com/phuonglh/vn.vitk) - A Vietnamese Text Processing Toolkit
- [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) - A Vietnamese natural language processing toolkit
- [PhoBERT](https://github.com/VinAIResearch/PhoBERT) - Pre-trained language models for Vietnamese
- [pyvi](https://github.com/trungtv/pyvi) - Python Vietnamese Core NLP Toolkit
- [VieNeu-TTS](https://github.com/pnnbao97/VieNeu-TTS) - An Advanced On-Device Vietnamese Text-to-Speech System With Instant Voice Cloning.

### Data

- [Vietnamese treebank](https://vlsp.hpda.vn/demo/?page=resources&lang=en) - 10,000 sentences for the constituency parsing task
- [BKTreeBank](https://arxiv.org/pdf/1710.05519.pdf) - a Vietnamese Dependency Treebank
- [UD_Vietnamese](https://github.com/UniversalDependencies/UD_Vietnamese-VTB) - Vietnamese Universal Dependency Treebank
- [VIVOS](https://ailab.hcmus.edu.vn/vivos/) - a free Vietnamese speech corpus consisting of 15 hours of recorded speech by AILab
- [VNTQcorpus(big).txt](http://viet.jnlp.org/download-du-lieu-tu-vung-corpus) - 1.75 million sentences in news
- [ViText2SQL](https://github.com/VinAIResearch/ViText2SQL) - A dataset for Vietnamese Text-to-SQL semantic parsing (EMNLP-2020 Findings)
- [EVB Corpus](https://github.com/qhungngo/EVBCorpus) - 20,000,000 words (20 million) from 15 bilingual books, 100 parallel English-Vietnamese / Vietnamese-English texts, 250 parallel law and ordinance texts, 5,000 news articles, and 2,000 film subtitles.

## NLP for Dutch

[Back to Top](#contents)

- [python-frog](https://github.com/proycon/python-frog) - Python binding to Frog, an NLP suite for Dutch. (pos tagging, lemmatisation, dependency parsing, NER)
- [SimpleNLG_NL](https://github.com/rfdj/SimpleNLG-NL) - Dutch surface realiser used for Natural Language Generation in Dutch, based on the SimpleNLG implementation for English and French.
- [Alpino](https://github.com/rug-compling/alpino) - Dependency parser for Dutch (also does PoS tagging and Lemmatisation).
- [Kaldi NL](https://github.com/opensource-spraakherkenning-nl/Kaldi_NL) - Dutch Speech Recognition models based on [Kaldi](http://kaldi-asr.org/).
- [spaCy](https://spacy.io/) - [Dutch model](https://spacy.io/models/nl) available. - Industrial strength NLP with Python and Cython.

## NLP in Indonesian

### Datasets

- Kompas and Tempo collections at [ILPS](http://ilps.science.uva.nl/resources/bahasa/)
- [PANL10N for PoS tagging](http://www.panl10n.net/english/outputs/Indonesia/UI/0802/UI-1M-tagged.zip): 39K sentences and 900K word tokens
- [IDN for PoS tagging](https://github.com/famrashel/idn-tagged-corpus): This corpus contains 10K sentences and 250K word tokens
- [Indonesian Treebank](https://github.com/famrashel/idn-treebank) and [Universal Dependencies-Indonesian](https://github.com/UniversalDependencies/UD_Indonesian-GSD)
- [IndoSum](https://github.com/kata-ai/indosum) for both text summarization and classification
- [Wordnet-Bahasa](http://wn-msa.sourceforge.net/) - large, free, semantic dictionary
- IndoBenchmark [IndoNLU](https://github.com/indobenchmark/indonlu) includes pre-trained language model (IndoBERT), FastText model, Indo4B corpus, and several NLU benchmark datasets

### Libraries & Embedding

- Natural language toolkit [bahasa](https://github.com/kangfend/bahasa)
- [Indonesian Word Embedding](https://github.com/galuhsahid/indonesian-word-embedding)
- Pretrained [Indonesian fastText Text Embedding](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.id.zip) trained on Wikipedia
- IndoBenchmark [IndoNLU](https://github.com/indobenchmark/indonlu) includes pretrained language model (IndoBERT), FastText model, Indo4B corpus, and several NLU benchmark datasets

## NLP in Urdu

### Datasets

- [Collection of Urdu datasets](https://github.com/mirfan899/Urdu) for POS, NER and NLP tasks

### Libraries

- [Natural Language Processing library](https://github.com/urduhack/urduhack) for the Urdu (🇵🇰) language

## NLP in Persian

[Back to Top](#contents)

### Libraries

- [Hazm](https://github.com/roshan-research/hazm) - Persian NLP Toolkit.
- [Parsivar](https://github.com/ICTRC/Parsivar): A Language Processing Toolkit for Persian
- [Perke](https://github.com/AlirezaTheH/perke): Perke is a Python keyphrase extraction package for Persian language.
It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.
- [Perstem](https://github.com/jonsafari/perstem): Persian stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger
- [ParsiAnalyzer](https://github.com/NarimanN2/ParsiAnalyzer): Persian Analyzer For Elasticsearch
- [virastar](https://github.com/aziz/virastar): Cleaning up Persian text!

### Datasets

- [Bijankhan Corpus](https://dbrg.ut.ac.ir/بیژن%E2%80%8Cخان/): Bijankhan corpus is a tagged corpus that is suitable for natural language processing research on the Persian (Farsi) language. This collection is gathered from daily news and common texts. In this collection all documents are categorized into different subjects such as political, cultural and so on. In total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set that contains 40 Persian POS tags.
- [Uppsala Persian Corpus (UPC)](https://sites.google.com/site/mojganserajicom/home/upc): Uppsala Persian Corpus (UPC) is a large, freely available Persian corpus. The corpus is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization containing 2,704,028 tokens and annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in [this table](https://sites.google.com/site/mojganserajicom/home/upc/Table_tag.pdf).
- [Large-Scale Colloquial Persian](http://hdl.handle.net/11234/1-3195): Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem.
LSCP includes 120M sentences from 27M casual Persian tweets with dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity and automatic translations of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT) and Hindi (HI). Learn more about this project at [LSCP webpage](https://iasbs.ac.ir/~ansari/lscp/).
- [ArmanPersoNERCorpus](https://github.com/HaniehP/PersianNER): The dataset includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line. Each sentence is separated with a newline. The NER tags are in IOB format.
- [FarsiYar PersianNER](https://github.com/Text-Mining/Persian-NER): The dataset includes about 25,000,000 tokens and about 1,000,000 Persian sentences in total based on [Persian Wikipedia Corpus](https://github.com/Text-Mining/Persian-Wikipedia-Corpus). The NER tags are in IOB format. More than 1000 volunteers contributed tag improvements to this dataset via web panel or android app. They release updated tags every two weeks.
- [PERLEX](http://farsbase.net/PERLEX.html): The first Persian dataset for relation extraction, which is an expert translated version of the “Semeval-2010-Task-8” dataset. Link to the relevant publication.
- [Persian Syntactic Dependency Treebank](http://dadegan.ir/catalog/perdt): This treebank is supplied for free noncommercial use. For commercial uses feel free to contact us. It contains 29,982 annotated sentences, including samples from almost all verbs of the Persian valency lexicon.
- [Uppsala Persian Dependency Treebank (UPDT)](http://stp.lingfil.uu.se/~mojgan/UPDT.html): Dependency-based syntactically annotated corpus.
- [Hamshahri](https://dbrg.ut.ac.ir/hamshahri/): Hamshahri collection is a standard reliable Persian text collection that was used at Cross Language Evaluation Forum (CLEF) during years 2008 and 2009 for evaluation of Persian information retrieval systems.

## NLP in Ukrainian

[Back to Top](#contents)

- [awesome-ukrainian-nlp](https://github.com/asivokon/awesome-ukrainian-nlp) - a curated list of Ukrainian NLP datasets, models, etc.
- [UkrainianLT](https://github.com/Helsinki-NLP/UkrainianLT) - another curated list with a focus on machine translation and speech processing

## NLP in Hungarian

[Back to Top](#contents)

- [awesome-hungarian-nlp](https://github.com/oroszgy/awesome-hungarian-nlp): A curated list of free resources dedicated to Hungarian Natural Language Processing.

## NLP in Portuguese

[Back to Top](#contents)

- [Portuguese-nlp](https://github.com/ajdavidl/Portuguese-NLP) - a list of resources and tools developed with focus on Portuguese.

## Other Languages

- Russian: [pymorphy2](https://github.com/kmike/pymorphy2) - a good POS tagger for Russian
- Asian Languages: Thai, Lao, Chinese, Japanese, and Korean [ICU Tokenizer](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html) implementation in Elasticsearch
- Ancient Languages: [CLTK](https://github.com/cltk/cltk): The Classical Language Toolkit is a Python library and collection of texts for doing NLP in ancient languages
- Hebrew: [NLPH_Resources](https://github.com/NLPH/NLPH_Resources) - A collection of papers, corpora and linguistic resources for NLP in Hebrew

[Back to Top](#contents)

## Citation

If you find this repository useful, please consider citing this list:

```bibtex
@misc{awesome-nlp,
  title = {Awesome NLP},
  author = {Kim, Keon and Chelikavada, Krish},
  year = {2018},
  url = {https://github.com/keon/awesome-nlp},
  note = {GitHub repository}
}
```

### Core Contributors and Maintainers

- [Krish Chelikavada](https://linkedin.com/in/cskc1)
- [Keon
Kim](https://linkedin.com/in/keon)

[Credits](./CREDITS.md) for initial curators and sources

## License

[License](./LICENSE) - CC0

================================================
FILE: contributing.md
================================================
# Contribution Guidelines

## The pull request should have a useful title

Pull requests with `Update readme.md` as title are not informative enough. Write about what additions you've made and *why* you think they are useful.

## Guidelines

- Make an individual pull request for each suggestion
- Use [title-casing](http://titlecapitalization.com) (AP style)
- Use the following format: `[Relevant Link](link)`
- Link additions should be added to the bottom of the relevant category
- **New categories or improvements to the existing categorization are welcome**
- Check your spelling and grammar
- Make sure your text editor is set to remove trailing whitespace
- The pull request and commit should have a useful title

## How to Contribute?

You'll need a [GitHub account](https://github.com/join)!

1. Access the GitHub page: https://github.com/keon/awesome-nlp
2. Click on the `readme.md` file: ![Step 2 Click on Readme.md](https://cloud.githubusercontent.com/assets/170270/9402920/53a7e3ea-480c-11e5-9d81-aecf64be55eb.png)
3. Now click on the edit icon. ![Step 3 - Click on Edit](https://cloud.githubusercontent.com/assets/170270/9402927/6506af22-480c-11e5-8c18-7ea823530099.png)
4. You can start editing the text of the file in the in-browser editor. Make sure you follow the guidelines above. You can use [GitHub Flavored Markdown](https://help.github.com/articles/github-flavored-markdown/). ![Step 4 - Edit the file](https://cloud.githubusercontent.com/assets/170270/9402932/7301c3a0-480c-11e5-81f5-7e343b71674f.png)
5. Say why you're proposing the changes, and then click on "Propose file change". ![Step 5 - Propose Changes](https://cloud.githubusercontent.com/assets/170270/9402937/7dd0652a-480c-11e5-9138-bd14244593d5.png)
6. Submit the [pull request](https://help.github.com/articles/using-pull-requests/)!

## Updating your Pull Request

Sometimes, a maintainer of an awesome list will ask you to edit your Pull Request before it is included. This is normally due to spelling errors or because your PR didn't match the awesome-* list guidelines.

[Here](https://github.com/RichardLitt/knowledge/blob/master/github/amending-a-commit-guide.md) is a write-up on how to change a Pull Request, and the different ways you can do that.

**Credits**

These contributing guidelines are taken from [awesome's contributing guidelines](https://github.com/sindresorhus/awesome/blob/master/contributing.md)