[
  {
    "path": ".gitignore",
    "content": ".vscode/\n*.cpp\n*.pyd\n*.html\n**/__pycache__/\n*.egg-info\n*.txt*\nbuild\ntmp/\ndata/\nmodels/\n*stats"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2018-2019 pkuseg authors\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
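    "content": "# pkuseg: a toolkit for multi-domain Chinese word segmentation [**(English Version)**](readme/readme_english.md)\r\n\r\npkuseg is a toolkit based on the paper [[Luo et al., 2019](#citation)]. It is easy to use, supports domain-specific segmentation, and improves segmentation accuracy.\r\n\r\n\r\n\r\n## Contents\r\n\r\n* [Highlights](#highlights)\r\n* [Installation](#installation)\r\n* [Performance Comparison](#performance-comparison)\r\n* [Usage](#usage)\r\n* [Citation](#citation)\r\n* [Authors](#authors)\r\n* [FAQ](#faq)\r\n\r\n\r\n\r\n## Highlights\r\n\r\npkuseg has the following features:\r\n\r\n1. Multi-domain segmentation. Unlike earlier general-purpose Chinese word segmentation tools, this toolkit also provides pre-trained models tailored to different domains. Users can choose a model according to the domain of the text to be segmented. We currently provide pre-trained models for the news, web, medicine, and tourism domains, as well as a mixed-domain model. If the domain of the text is known, load the corresponding model. If the domain cannot be determined, we recommend the general model trained on mixed-domain data. For segmentation examples in each domain, see [**example.txt**](https://github.com/lancopku/pkuseg-python/blob/master/example.txt).\r\n2. Higher segmentation accuracy. Compared with other segmentation toolkits, pkuseg achieves higher accuracy when trained and tested on the same data.\r\n3. User-trained models. Users can train new models on their own annotated data.\r\n4. Part-of-speech tagging.\r\n\r\n\r\n## Installation\r\n\r\n- Currently **only Python 3 is supported**\r\n- **For the best accuracy and speed, we strongly recommend updating to the latest version with pip install**\r\n\r\n1. Install from PyPI (model files included):\r\n\t```\r\n\tpip3 install pkuseg\r\n\t# then use it with: import pkuseg\r\n\t```\r\n   **We recommend updating to the latest version** for a better out-of-the-box experience:\r\n   \t```\r\n\tpip3 install -U pkuseg\r\n\t```\r\n2. If downloads from the official PyPI index are slow, consider using a mirror, for example:   \r\n   First-time installation:\r\n\t```\r\n\tpip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg\r\n\t```\r\n   Update:\r\n\t```\r\n\tpip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg\r\n\t```\r\n   \r\n
3. If you prefer to download the code from GitHub instead of using pip, run the following command to install:\r\n\t```\r\n\tpython setup.py build_ext -i\r\n\t```\r\n\t\r\n   The code on GitHub does not include pre-trained models, so you need to download or train a model yourself. Pre-trained models are available on the [release](https://github.com/lancopku/pkuseg-python/releases) page. When using one, set \"model_name\" to the path of the model.\r\n\r\nNote: **installation methods 1 and 2 currently support only 64-bit Python 3 on Linux (Ubuntu), macOS, and Windows**. On other systems, use method 3 to build and install locally.\r\n\t\r\n\r\n## Performance Comparison\r\n\r\nWe compare pkuseg with representative Chinese segmentation toolkits, including jieba and THULAC. For the detailed setup, see [Experimental Environment](readme/environment.md).\r\n\r\n\r\n\r\n#### Domain-Specific Training and Test Results\r\n\r\nThe tables below compare results on different datasets:\r\n\r\n| MSRA   | Precision | Recall |   F-score |\r\n| :----- | --------: | -----: | --------: |\r\n| jieba  |     87.01 |  89.88 |     88.42 |\r\n| THULAC |     95.60 |  95.91 |     95.71 |\r\n| pkuseg |     96.94 |  96.81 | **96.88** |\r\n\r\n\r\n| WEIBO  | Precision | Recall |   F-score |\r\n| :----- | --------: | -----: | --------: |\r\n| jieba  |     87.79 |  87.54 |     87.66 |\r\n| THULAC |     93.40 |  92.40 |     92.87 |\r\n| pkuseg |     93.78 |  94.65 | **94.21** |\r\n\r\n\r\n\r\n\r\n#### Default Model Results Across Domains\r\n\r\nMany users try a segmentation tool with the model bundled in the toolkit. To compare this \"out-of-the-box\" performance directly, we also evaluated each toolkit's default model on test sets from different domains. Note that this comparison only illustrates default behavior and is not necessarily fair.\r\n\r\n| Default | MSRA  | CTB8  | PKU   | WEIBO | All Average |\r\n| ------- | :---: | :---: | :---: | :---: | :---------: |\r\n| jieba   | 81.45 | 79.58 | 81.83 | 83.56 | 81.61       |\r\n| THULAC  | 85.55 | 87.84 | 92.29 | 86.65 | 88.08       |\r\n| pkuseg  | 87.29 | 91.77 | 92.68 | 93.43 | **91.29**   |\r\n\r\nHere, `All Average` is the mean F-score over all test sets.\r\n\r\nFor more detailed comparisons, see [Comparison with Existing Toolkits](readme/comparison.md).\r\n\r\n## Usage\r\n\r\n#### Code Examples\r\n\r\nThe following examples are intended for the interactive Python environment.\r\n\r\nExample 1: segmentation with the default configuration (**if you cannot determine the domain of the text, we recommend the default model**)\r\n```python3\r\nimport pkuseg\r\n\r\nseg = pkuseg.pkuseg()           # load the model with the default configuration\r\ntext = seg.cut('我爱北京天安门')  # segment the text\r\nprint(text)\r\n```\r\n\r\nExample 2: domain-specific segmentation (**if you know the domain of the text, we recommend the corresponding domain model**)\r\n```python3\r\nimport pkuseg\r\n\r\nseg = pkuseg.pkuseg(model_name='medicine')  # the corresponding domain model is downloaded automatically\r\ntext = seg.cut('我爱北京天安门')              # segment the text\r\nprint(text)\r\n```\r\n\r\n
Example 3: segmentation with part-of-speech tagging. For the meaning of each tag, see [tags.txt](https://github.com/lancopku/pkuseg-python/blob/master/tags.txt)\r\n```python3\r\nimport pkuseg\r\n\r\nseg = pkuseg.pkuseg(postag=True)  # enable part-of-speech tagging\r\ntext = seg.cut('我爱北京天安门')    # segment and tag the text\r\nprint(text)\r\n```\r\n\r\n\r\nExample 4: segmenting a file\r\n```python3\r\nimport pkuseg\r\n\r\n# Segment input.txt and write the result to output.txt\r\n# using 20 processes\r\npkuseg.test('input.txt', 'output.txt', nthread=20)     \r\n```\r\n\r\nFor more usage examples, see [Detailed Code Examples](readme/interface.md).\r\n\r\n\r\n\r\n#### Parameters\r\n\r\nModel configuration\r\n```\r\npkuseg.pkuseg(model_name = \"default\", user_dict = \"default\", postag = False)\r\n\tmodel_name\t\tModel path.\r\n\t\t\t        \"default\", the default value, uses our pre-trained mixed-domain model (pip installations only).\r\n\t\t\t\t\"news\", uses the news-domain model.\r\n\t\t\t\t\"web\", uses the web-domain model.\r\n\t\t\t\t\"medicine\", uses the medicine-domain model.\r\n\t\t\t\t\"tourism\", uses the tourism-domain model.\r\n\t\t\t        model_path, loads the model from a user-specified path.\r\n\tuser_dict\t\tUser dictionary.\r\n\t\t\t\t\"default\", the default value, uses the dictionary we provide.\r\n\t\t\t\tNone, uses no dictionary.\r\n\t\t\t\tdict_path, uses a user-defined dictionary in addition to the default one. Pass the path to your dictionary file, which contains one word per line (if part-of-speech tagging is enabled and the tag of a word is known, write the word and its tag on the line, separated by a tab character).\r\n\tpostag\t\t        Whether to perform part-of-speech tagging.\r\n\t\t\t\tFalse, the default value, segments only, without part-of-speech tagging.\r\n\t\t\t\tTrue, performs part-of-speech tagging along with segmentation.\r\n```\r\n\r\nSegmenting a file\r\n```\r\npkuseg.test(input_file, output_file, model_name = \"default\", user_dict = \"default\", nthread = 10, postag = False, verbose = False)\r\n\tinput_file\t\tPath to the input file.\r\n\toutput_file\t\tPath to the output file.\r\n\tmodel_name\t\tModel path. Same as in pkuseg.pkuseg.\r\n\tuser_dict\t\tUser dictionary. Same as in pkuseg.pkuseg.\r\n\tnthread\t\t\tNumber of processes to use.\r\n\tpostag\t\t\tWhether to enable part-of-speech tagging. Same as in pkuseg.pkuseg.\r\n\tverbose\t\t\tWhether to print the time spent in each stage.\r\n```\r\n\r\nModel training\r\n```\r\npkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)\r\n\ttrainFile\t\tPath to the training file.\r\n\ttestFile\t\tPath to the test file.\r\n\tsavedir\t\t\tDirectory where the trained model is saved.\r\n\ttrain_iter\t\tNumber of training iterations.\r\n\tinit_model\t\tInitial model. The default, None, uses the default initialization. To start from an existing model, pass its path, e.g. init_model='./models/'.\r\n```\r\n\r\n\r\n\r\n#### Multiprocess Segmentation\r\n\r\nWhen you run the examples above from a script file and use multiprocessing, be sure to guard module-level statements with `if __name__ == '__main__'`. For details, see [Multiprocess Segmentation](readme/multiprocess.md).\r\n\r\n\r\n\r\n
## Pre-trained Models\r\n\r\nIf you installed pkuseg with pip, set model_name to the desired domain to use domain-specific segmentation. The corresponding model is downloaded automatically.\r\n\r\nIf you downloaded the code from GitHub, you need to download the pre-trained model yourself and set model_name to its path. Pre-trained models can be downloaded from the [release](https://github.com/lancopku/pkuseg-python/releases) page. The pre-trained models are described below:\r\n\r\n- **news**: trained on MSRA (news corpus).\r\n\r\n- **web**: trained on Weibo (web text corpus).\r\n\r\n- **medicine**: trained on medical-domain data.\r\n\r\n- **tourism**: trained on tourism-domain data.\r\n\r\n- **mixed**: general-purpose model trained on a mixed dataset. This is the model bundled with the pip package.\r\n\r\nUsing domain adaptation with unlabeled Wikipedia data, we also built several domain-specific pre-trained models automatically and improved the general model. These models are currently available only from the release page:\r\n\r\n- **art**: trained on the art and culture domain.\r\n\r\n- **entertainment**: trained on the entertainment and sports domain.\r\n\r\n- **science**: trained on the science domain.\r\n\r\n- **default_v2**: an improved general model obtained with domain adaptation. It is larger than the default model but generalizes better.\r\n\r\n\r\n\r\nWe welcome users to share domain-specific models they have trained.\r\n\r\n\r\n\r\n## Version History\r\n\r\nSee [Version History](readme/history.md).\r\n\r\n\r\n## License\r\n1. This code is released under the MIT License.\r\n2. Comments and suggestions on the toolkit are welcome. Please email jingjingxu@pku.edu.cn.\r\n\r\n\r\n\r\n## Citation\r\n\r\nThis toolkit is mainly based on the following paper. If you use it, please cite:\r\n* Ruixuan Luo, Jingjing Xu, Yi Zhang, Zhiyuan Zhang, Xuancheng Ren, Xu Sun. [PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation](https://arxiv.org/abs/1906.11455). arXiv. 2019.\r\n\r\n```\r\n@article{pkuseg,\r\n  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Zhang, Zhiyuan and Ren, Xuancheng and Sun, Xu},\r\n  journal = {CoRR},\r\n  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},\r\n  url = {https://arxiv.org/abs/1906.11455},\r\n  volume = {abs/1906.11455},\r\n  year = 2019\r\n}\r\n```\r\n\r\n## Other Related Papers\r\n\r\n* Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.\r\n* Jingjing Xu and Xu Sun. Dependency-based gated recursive neural network for chinese word segmentation. ACL. 2016.\r\n* Jingjing Xu and Xu Sun. Transfer learning for low-resource chinese word segmentation with a novel neural network. NLPCC. 2017.\r\n\r\n
## FAQ\r\n\r\n\r\n1. [Why was pkuseg released?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#1-为什么要发布pkuseg)\r\n2. [What techniques does pkuseg use?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#2-pkuseg使用了哪些技术)\r\n3. [Multiprocess segmentation and training do not work and raise RuntimeError and BrokenPipeError.](https://github.com/lancopku/pkuseg-python/wiki/FAQ#3-无法使用多进程分词和训练功能提示runtimeerror和brokenpipeerror)\r\n4. [How was pkuseg compared with other toolkits on domain-specific data?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#4-是如何跟其它工具包在细领域数据上进行比较的)\r\n5. [How does pkuseg perform on black-box test sets?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#5-在黑盒测试集上进行比较的话效果如何)\r\n6. [What if I do not know the domain of the text to be segmented?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#6-如果我不了解待分词语料的所属领域呢)\r\n7. [How should segmentation results on specific examples be interpreted?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#7-如何看待在一些特定样例上的分词结果)\r\n8. [What about running speed?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#8-关于运行速度问题)\r\n9. [What about multiprocessing speed?](https://github.com/lancopku/pkuseg-python/wiki/FAQ#9-关于多进程速度问题)\r\n\r\n\r\n## Acknowledgements\r\n\r\nWe thank Prof. Yu Shiwen (俞士汶, Institute of Computational Linguistics, Peking University) and Dr. Qiu Likun (邱立坤) for providing the training datasets!\r\n\r\n## Authors\r\n\r\nRuixuan Luo (罗睿轩), Jingjing Xu (许晶晶), Xuancheng Ren (任宣丞), Yi Zhang (张艺), Zhiyuan Zhang (张之远), Bingzhen Wei (位冰镇), Xu Sun (孙栩)  \r\n\r\nPeking University, [Language Computing and Machine Learning Group](http://lanco.pku.edu.cn/)\r\n"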
  },
  {
    "path": "pkuseg/__init__.py",
    "content": "from __future__ import print_function\nimport sys\n\nif sys.version_info[0] < 3:\n    print(\"pkuseg does not support python2\", file=sys.stderr)\n    sys.exit(1)\n\nimport os\nimport time\nimport pickle as pkl\nimport multiprocessing\n\nfrom multiprocessing import Process, Queue\n\nimport pkuseg.trainer as trainer\nimport pkuseg.inference as _inf\n\nfrom pkuseg.config import config\nfrom pkuseg.feature_extractor import FeatureExtractor\nfrom pkuseg.model import Model\nfrom pkuseg.download import download_model\nfrom pkuseg.postag import Postag\n\nclass TrieNode:\n    \"\"\"建立词典的Trie树节点\"\"\"\n\n    def __init__(self, isword):\n        self.isword = isword\n        self.usertag = ''\n        self.children = {}\n\n\nclass Preprocesser:\n    \"\"\"预处理器，在用户词典中的词强制分割\"\"\"\n\n    def __init__(self, dict_file):\n        \"\"\"初始化建立Trie树\"\"\"\n        if dict_file is None:\n            dict_file = []\n        self.dict_data = dict_file\n        if isinstance(dict_file, str):\n            with open(dict_file, encoding=\"utf-8\") as f:\n                lines = f.readlines()\n            self.trie = TrieNode(False)\n            for line in lines:\n                fields = line.strip().split('\\t')\n                word = fields[0].strip()\n                usertag = fields[1].strip() if len(fields) > 1 else ''\n                self.insert(word, usertag)\n        else:\n            self.trie = TrieNode(False)\n            for w_t in dict_file:\n                if isinstance(w_t, str):\n                    w = w_t.strip()\n                    t = ''\n                else:\n                    assert isinstance(w_t, tuple)\n                    assert len(w_t)==2\n                    w, t = map(lambda x:x.strip(), w_t)\n                self.insert(w, t)\n\n    def insert(self, word, usertag):\n        \"\"\"Trie树中插入单词\"\"\"\n        l = len(word)\n        now = self.trie\n        for i in range(l):\n            c = word[i]\n            if not c in 
now.children:\n                now.children[c] = TrieNode(False)\n            now = now.children[c]\n        now.isword = True\n        now.usertag = usertag\n\n    def solve(self, txt):\n        \"\"\"对文本进行预处理\"\"\"\n        outlst = []\n        iswlst = []\n        taglst = []\n        l = len(txt)\n        last = 0\n        i = 0\n        while i < l:\n            now = self.trie\n            j = i\n            found = False\n            usertag = ''\n            last_word_idx = -1 # 表示从当前位置i往后匹配，最长匹配词词尾的idx\n            while True:\n                c = txt[j]\n                if not c in now.children and last_word_idx != -1:\n                    found = True\n                    break\n                if not c in now.children and last_word_idx == -1:\n                    break\n                now = now.children[c]\n                if now.isword:\n                    last_word_idx = j\n                    usertag = now.usertag\n                j += 1\n                if j == l and last_word_idx == -1:\n                    break\n                if j == l and last_word_idx != -1 :\n                    j = last_word_idx + 1\n                    found = True\n                    break\n            if found:\n                if last != i:\n                    outlst.append(txt[last:i])\n                    iswlst.append(False)\n                    taglst.append('')\n                outlst.append(txt[i:j])\n                iswlst.append(True)\n                taglst.append(usertag)\n                last = j\n                i = j\n            else:\n                i += 1\n        if last < l:\n            outlst.append(txt[last:l])\n            iswlst.append(False)\n            taglst.append('')\n        return outlst, iswlst, taglst\n\nclass Postprocesser:\n    \"\"\"对分词结果后处理\"\"\"\n    def __init__(self, common_name, other_names):\n        if common_name is None and other_names is None:\n            self.do_process = False\n            return\n        
self.do_process = True\n        if common_name is None:\n            self.common_words = set()\n        else:\n            # with open(common_name, encoding='utf-8') as f:\n            #     lines = f.readlines()\n            # self.common_words = set(map(lambda x:x.strip(), lines))\n            with open(common_name, \"rb\") as f:\n                all_words = pkl.load(f).strip().split(\"\\n\")\n            self.common_words = set(all_words)\n        if other_names is None:\n            self.other_words = set()\n        else:\n            self.other_words = set()\n            for other_name in other_names:\n                # with open(other_name, encoding='utf-8') as f:\n                #     lines = f.readlines()\n                # self.other_words.update(set(map(lambda x:x.strip(), lines)))\n                with open(other_name, \"rb\") as f:\n                    all_words = pkl.load(f).strip().split(\"\\n\")\n                self.other_words.update(set(all_words))\n\n    def post_process(self, sent, check_seperated):\n        for m in reversed(range(2, 8)): \n            end = len(sent)-m\n            if end < 0:\n                continue\n            i = 0\n            while (i < end + 1):\n                merged_words = ''.join(sent[i:i+m])\n                if merged_words in self.common_words:\n                    do_seg = True\n                elif merged_words in self.other_words:\n                    if check_seperated:\n                        seperated = all(((w in self.common_words) \n                            or (w in self.other_words)) for w in sent[i:i+m])\n                    else:\n                        seperated = False\n                    if seperated:\n                        do_seg = False\n                    else:\n                        do_seg = True\n                else:\n                    do_seg = False\n                if do_seg:\n                    for k in range(m):\n                        del sent[i]\n                    
sent.insert(i, merged_words)\n                    i += 1\n                    end = len(sent) - m\n                else:\n                    i += 1 \n        return sent\n\n    def __call__(self, sent):\n        if not self.do_process:\n            return sent\n        return self.post_process(sent, check_seperated=True)\n\nclass pkuseg:\n    def __init__(self, model_name=\"default\", user_dict=\"default\", postag=False):\n        \"\"\"初始化函数，加载模型及用户词典\"\"\"\n        # print(\"loading model\")\n        # config = Config()\n        # self.config = config\n        self.postag = postag\n        if model_name in [\"default\"]:\n            config.modelDir = os.path.join(\n                os.path.dirname(os.path.realpath(__file__)),\n                \"models\",\n                model_name,\n            )\n        elif model_name in config.available_models:\n            config.modelDir = os.path.join(\n                config.pkuseg_home,\n                model_name,\n            )\n            download_model(config.model_urls[model_name], config.pkuseg_home, config.model_hash[model_name])\n        else:\n            config.modelDir = model_name\n        # config.fModel = os.path.join(config.modelDir, \"model.txt\")\n        if user_dict is None:\n            file_name = None\n            other_names = None\n        else:\n            if user_dict not in config.available_models:\n                file_name = user_dict\n            else:\n                file_name = None\n            if model_name in config.models_with_dict:\n                other_name = os.path.join(\n                    config.pkuseg_home,\n                    model_name,\n                    model_name+\"_dict.pkl\",\n                )\n                default_name = os.path.join(\n                    os.path.dirname(os.path.realpath(__file__)),\n                    \"dicts\", \"default.pkl\",\n                )\n                other_names = [other_name, default_name]\n            else:\n               
 default_name = os.path.join(\n                    os.path.dirname(os.path.realpath(__file__)),\n                    \"dicts\", \"default.pkl\",\n                )\n                other_names = [default_name]\n\n        self.preprocesser = Preprocesser(file_name)\n        # self.preprocesser = Preprocesser([])\n        self.postprocesser = Postprocesser(None, other_names)\n\n        self.feature_extractor = FeatureExtractor.load()\n        self.model = Model.load()\n\n        self.idx_to_tag = {\n            idx: tag for tag, idx in self.feature_extractor.tag_to_idx.items()\n        }\n\n        self.n_feature = len(self.feature_extractor.feature_to_idx)\n        self.n_tag = len(self.feature_extractor.tag_to_idx)\n\n        if postag:\n            download_model(config.model_urls[\"postag\"], config.pkuseg_home, config.model_hash[\"postag\"])\n            postag_dir = os.path.join(\n                config.pkuseg_home,\n                \"postag\",\n            )\n            self.tagger = Postag(postag_dir)\n\n        # print(\"finish\")\n\n    def _cut(self, text):\n        \"\"\"\n        直接对文本分词\n        \"\"\"\n\n        examples = list(self.feature_extractor.normalize_text(text))\n        length = len(examples)\n\n        all_feature = []  # type: List[List[int]]\n        for idx in range(length):\n            node_feature_idx = self.feature_extractor.get_node_features_idx(\n                idx, examples\n            )\n            # node_feature = self.feature_extractor.get_node_features(\n            #     idx, examples\n            # )\n\n            # node_feature_idx = []\n            # for feature in node_feature:\n            #     feature_idx = self.feature_extractor.feature_to_idx.get(feature)\n            #     if feature_idx is not None:\n            #         node_feature_idx.append(feature_idx)\n            # if not node_feature_idx:\n            #     node_feature_idx.append(0)\n\n            all_feature.append(node_feature_idx)\n\n        _, 
tags = _inf.decodeViterbi_fast(all_feature, self.model)\n\n        words = []\n        current_word = None\n        is_start = True\n        for tag, char in zip(tags, text):\n            if is_start:\n                current_word = char\n                is_start = False\n            elif \"B\" in self.idx_to_tag[tag]:\n                words.append(current_word)\n                current_word = char\n            else:\n                current_word += char\n        if current_word:\n            words.append(current_word)\n\n        return words\n\n    def cut(self, txt):\n        \"\"\"分词，结果返回一个list\"\"\"\n\n        txt = txt.strip()\n\n        ret = []\n        usertags = []\n\n        if not txt:\n            return ret\n\n        imary = txt.split()  # 根据空格分为多个片段\n\n        # 对每个片段分词\n        for w0 in imary:\n            if not w0:\n                continue\n\n            # 根据用户词典拆成更多片段\n            lst, isword, taglst = self.preprocesser.solve(w0)\n\n            for w, isw, usertag in zip(lst, isword, taglst):\n                if isw:\n                    ret.append(w)\n                    usertags.append(usertag)\n                    continue\n\n                output = self._cut(w)\n                post_output = self.postprocesser(output)\n                ret.extend(post_output)\n                usertags.extend(['']*len(post_output))\n        \n        if self.postag:\n            tags = self.tagger.tag(ret.copy())\n            for i, usertag in enumerate(usertags):\n                if usertag:\n                    tags[i] = usertag\n            ret = list(zip(ret, tags))\n        return ret\n\n\ndef train(trainFile, testFile, savedir, train_iter=20, init_model=None):\n    \"\"\"用于训练模型\"\"\"\n    # config = Config()\n    starttime = time.time()\n    if not os.path.exists(trainFile):\n        raise Exception(\"trainfile does not exist.\")\n    if not os.path.exists(testFile):\n        raise Exception(\"testfile does not exist.\")\n    if not 
os.path.exists(config.tempFile):\n        os.makedirs(config.tempFile)\n    if not os.path.exists(config.tempFile + \"/output\"):\n        os.mkdir(config.tempFile + \"/output\")\n    # config.runMode = \"train\"\n    config.trainFile = trainFile\n    config.testFile = testFile\n    config.modelDir = savedir\n    # config.fModel = os.path.join(config.modelDir, \"model.txt\")\n    config.nThread = 1\n    config.ttlIter = train_iter\n    config.init_model = init_model\n\n    os.makedirs(config.modelDir, exist_ok=True)\n\n    trainer.train(config)\n\n    # pkuseg.main.run(config)\n    # clearDir(config.tempFile)\n    print(\"Total time: \" + str(time.time() - starttime))\n\n\ndef _test_single_proc(\n    input_file, output_file, model_name=\"default\", user_dict=\"default\", postag=False, verbose=False\n):\n\n    times = []\n    times.append(time.time())\n    seg = pkuseg(model_name, user_dict, postag=postag)\n\n    times.append(time.time())\n    if not os.path.exists(input_file):\n        raise Exception(\"input_file {} does not exist.\".format(input_file))\n    with open(input_file, \"r\", encoding=\"utf-8\") as f:\n        lines = f.readlines()\n\n    times.append(time.time())\n    results = []\n    for line in lines:\n        if not postag:\n            results.append(\" \".join(seg.cut(line)))\n        else:\n            results.append(\" \".join(map(lambda x:\"/\".join(x), seg.cut(line))))\n\n    times.append(time.time())\n    with open(output_file, \"w\", encoding=\"utf-8\") as f:\n        f.write(\"\\n\".join(results))\n    times.append(time.time())\n\n    print(\"total_time:\\t{:.3f}\".format(times[-1] - times[0]))\n\n    if verbose:\n        time_strs = [\"load_model\", \"read_file\", \"word_seg\", \"write_file\"]\n        for key, value in zip(\n            time_strs,\n            [end - start for start, end in zip(times[:-1], times[1:])],\n        ):\n            print(\"{}:\\t{:.3f}\".format(key, value))\n\n\ndef _proc_deprecated(seg, lines, start, end, 
q):\n    for i in range(start, end):\n        l = lines[i].strip()\n        ret = seg.cut(l)\n        q.put((i, \" \".join(ret)))\n\n\ndef _proc(seg, in_queue, out_queue):\n    # TODO: load seg (json or pickle serialization) in sub_process\n    #       to avoid pickle seg online when using start method other\n    #       than fork\n    while True:\n        item = in_queue.get()\n        if item is None:\n            return\n        idx, line = item\n        if not seg.postag:\n            output_str = \" \".join(seg.cut(line))\n        else:\n            output_str = \" \".join(map(lambda x:\"/\".join(x), seg.cut(line)))\n        out_queue.put((idx, output_str))\n\n\ndef _proc_alt(model_name, user_dict, postag, in_queue, out_queue):\n    seg = pkuseg(model_name, user_dict, postag=postag)\n    while True:\n        item = in_queue.get()\n        if item is None:\n            return\n        idx, line = item\n        if not postag:\n            output_str = \" \".join(seg.cut(line))\n        else:\n            output_str = \" \".join(map(lambda x:\"/\".join(x), seg.cut(line)))\n        out_queue.put((idx, output_str))\n\n\ndef _test_multi_proc(\n    input_file,\n    output_file,\n    nthread,\n    model_name=\"default\",\n    user_dict=\"default\",\n    postag=False,\n    verbose=False,\n):\n\n    alt = multiprocessing.get_start_method() == \"spawn\"\n\n    times = []\n    times.append(time.time())\n\n    if alt:\n        seg = None\n    else:\n        seg = pkuseg(model_name, user_dict, postag)\n\n    times.append(time.time())\n    if not os.path.exists(input_file):\n        raise Exception(\"input_file {} does not exist.\".format(input_file))\n    with open(input_file, \"r\", encoding=\"utf-8\") as f:\n        lines = f.readlines()\n\n    times.append(time.time())\n    in_queue = Queue()\n    out_queue = Queue()\n    procs = []\n    for _ in range(nthread):\n        if alt:\n            p = Process(\n                target=_proc_alt,\n                
args=(model_name, user_dict, postag, in_queue, out_queue),\n            )\n        else:\n            p = Process(target=_proc, args=(seg, in_queue, out_queue))\n        procs.append(p)\n\n    for idx, line in enumerate(lines):\n        in_queue.put((idx, line))\n\n    for proc in procs:\n        in_queue.put(None)\n        proc.start()\n\n    times.append(time.time())\n    result = [None] * len(lines)\n    for _ in result:\n        idx, line = out_queue.get()\n        result[idx] = line\n\n    times.append(time.time())\n    for p in procs:\n        p.join()\n\n    times.append(time.time())\n    with open(output_file, \"w\", encoding=\"utf-8\") as f:\n        f.write(\"\\n\".join(result))\n    times.append(time.time())\n\n    print(\"total_time:\\t{:.3f}\".format(times[-1] - times[0]))\n\n    if verbose:\n        time_strs = [\n            \"load_model\",\n            \"read_file\",\n            \"start_proc\",\n            \"word_seg\",\n            \"join_proc\",\n            \"write_file\",\n        ]\n\n        if alt:\n            times = times[1:]\n            time_strs = time_strs[1:]\n            time_strs[2] = \"load_model & word_seg\"\n\n        for key, value in zip(\n            time_strs,\n            [end - start for start, end in zip(times[:-1], times[1:])],\n        ):\n            print(\"{}:\\t{:.3f}\".format(key, value))\n\n\ndef test(\n    input_file,\n    output_file,\n    model_name=\"default\",\n    user_dict=\"default\",\n    nthread=10,\n    postag=False,\n    verbose=False,\n):\n\n    if nthread > 1:\n        _test_multi_proc(\n            input_file, output_file, nthread, model_name, user_dict, postag, verbose\n        )\n    else:\n        _test_single_proc(\n            input_file, output_file, model_name, user_dict, postag, verbose\n        )\n\n"
  },
  {
    "path": "pkuseg/config.py",
    "content": "import os\nimport tempfile\n\n\nclass Config:\n    lineEnd = \"\\n\"\n    biLineEnd = \"\\n\\n\"\n    triLineEnd = \"\\n\\n\\n\"\n    undrln = \"_\"\n    blank = \" \"\n    tab = \"\\t\"\n    star = \"*\"\n    slash = \"/\"\n    comma = \",\"\n    delimInFeature = \".\"\n    B = \"B\"\n    num = \"0123456789.几二三四五六七八九十千万亿兆零１２３４５６７８９０％\"\n    letter = \"ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺａｂｃｄｅｆｇｈｉｇｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ／・－\"\n    mark = \"*\"\n    model_urls = {\n        \"postag\": \"https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/postag.zip\",\n        \"medicine\": \"https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/medicine.zip\",\n        \"tourism\": \"https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/tourism.zip\",\n        \"news\": \"https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/news.zip\",\n        \"web\": \"https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/web.zip\",\n    }\n    model_hash = {\n        \"postag\": \"afdf15f4e39bc47a39be4c37e3761b0c8f6ad1783f3cd3aff52984aebc0a1da9\",\n        \"medicine\": \"773d655713acd27dd1ea9f97d91349cc1b6aa2fc5b158cd742dc924e6f239dfc\",\n        \"tourism\": \"1c84a0366fe6fda73eda93e2f31fd399923b2f5df2818603f426a200b05cbce9\",\n        \"news\": \"18188b68e76b06fc437ec91edf8883a537fe25fa606641534f6f004d2f9a2e42\",\n        \"web\": \"4867f5817f187246889f4db259298c3fcee07c0b03a2d09444155b28c366579e\",\n    }\n    available_models = [\"default\", \"medicine\", \"tourism\", \"web\", \"news\"]\n    models_with_dict = [\"medicine\", \"tourism\"]\n\n\n    def __init__(self):\n        # main setting\n        self.pkuseg_home = os.path.expanduser(os.getenv('PKUSEG_HOME', '~/.pkuseg'))\n        self.trainFile = os.path.join(\"data\", \"small_training.utf8\")\n        self.testFile = os.path.join(\"data\", \"small_test.utf8\")\n        self._tmp_dir = tempfile.TemporaryDirectory()\n        self.homepath = self._tmp_dir.name\n       
 self.tempFile = os.path.join(self.homepath, \".pkuseg\", \"temp\")\n        self.readFile = os.path.join(\"data\", \"small_test.utf8\")\n        self.outputFile = os.path.join(\"data\", \"small_test_output.utf8\")\n\n        self.modelOptimizer = \"crf.adf\"\n        self.rate0 = 0.05  # init value of decay rate in SGD and ADF training\n        # self.reg = 1\n        # self.regs = [1]\n        # self.regList = self.regs.copy()\n        self.random = (\n            0\n        )  # 0 for 0-initialization of model weights, 1 for random init of model weights\n        self.evalMetric = (\n            \"f1\"\n        )  # tok.acc (token accuracy), str.acc (string accuracy), f1 (F1-score)\n        self.trainSizeScale = 1  # for scaling the size of training data\n        self.ttlIter = 20  # of training iterations\n        self.nUpdate = 10  # for ADF training\n        self.outFolder = os.path.join(self.tempFile, \"output\")\n        self.save = 1  # save model file\n        self.rawResWrite = True\n        self.miniBatch = 1  # mini-batch in stochastic training\n        self.nThread = 10  # number of processes\n        # ADF training\n        self.upper = 0.995  # was tuned for nUpdate = 10\n        self.lower = 0.6  # was tuned for nUpdate = 10\n\n        # global variables\n        self.metric = None\n        self.reg = 1\n        self.outDir = self.outFolder\n        self.testrawDir = \"rawinputs/\"\n        self.testinputDir = \"inputs/\"\n        self.tempDir = os.path.join(self.homepath, \".pkuseg\", \"temp\")\n        self.testoutputDir = \"entityoutputs/\"\n\n        # self.GL_init = True\n        self.weightRegMode = \"L2\"  # choosing weight regularizer: L2, L1)\n\n        self.c_train = os.path.join(self.tempFile, \"train.conll.txt\")\n        self.f_train = os.path.join(self.tempFile, \"train.feat.txt\")\n\n        self.c_test = os.path.join(self.tempFile, \"test.conll.txt\")\n        self.f_test = os.path.join(self.tempFile, \"test.feat.txt\")\n\n        
self.fTune = \"tune.txt\"\n        self.fLog = \"trainLog.txt\"\n        self.fResSum = \"summarizeResult.txt\"\n        self.fResRaw = \"rawResult.txt\"\n        self.fOutput = \"outputTag-{}.txt\"\n\n        self.fFeatureTrain = os.path.join(self.tempFile, \"ftrain.txt\")\n        self.fGoldTrain = os.path.join(self.tempFile, \"gtrain.txt\")\n        self.fFeatureTest = os.path.join(self.tempFile, \"ftest.txt\")\n        self.fGoldTest = os.path.join(self.tempFile, \"gtest.txt\")\n\n        self.modelDir = os.path.join(\n            os.path.dirname(os.path.realpath(__file__)), \"models\", \"ctb8\"\n        )\n        self.fModel = os.path.join(self.modelDir, \"model.txt\")\n\n        # feature\n        self.numLetterNorm = True\n        self.featureTrim = 0\n        self.wordFeature = True\n        self.wordMax = 6\n        self.wordMin = 2\n        self.nLabel = 5\n        self.order = 1\n\n    def globalCheck(self):\n        if self.evalMetric == \"f1\":\n            self.metric = \"f-score\"\n        elif self.evalMetric == \"tok.acc\":\n            self.metric = \"token-accuracy\"\n        elif self.evalMetric == \"str.acc\":\n            self.metric = \"string-accuracy\"\n        else:\n            raise Exception(\"invalid eval metric\")\n        assert self.rate0 > 0\n        assert self.trainSizeScale > 0\n        assert self.ttlIter > 0\n        assert self.nUpdate > 0\n        assert self.miniBatch > 0\n        assert self.reg > 0\n\n\nconfig = Config()\n"
  },
  {
    "path": "pkuseg/data.py",
    "content": "# from .config import Config\n# from pkuseg.feature_generator import\n# import os\nimport copy\nimport random\n\n\n# class dataFormat:\n#     def __init__(self, config):\n#         self.featureIndexMap = {}\n#         self.tagIndexMap = {}\n#         self.config = config\n\n#     def convert(self):\n#         config = self.config\n#         if config.runMode.find(\"train\") >= 0:\n#             self.getMaps(config.fTrain)\n#             self.saveFeature(config.modelDir + \"/featureIndex.txt\")\n#             self.convertFile(config.fTrain)\n#         else:\n#             self.readFeature(config.modelDir + \"/featureIndex.txt\")\n#             self.readTag(config.modelDir + \"/tagIndex.txt\")\n#         self.convertFile(config.fTest)\n#         if config.dev:\n#             self.convertFile(config.fDev)\n\n#     def saveFeature(self, file):\n#         featureList = list(self.featureIndexMap.keys())\n#         num = len(featureList) // 10\n#         for i in range(10):\n#             l = i * num\n#             r = (i + 1) * num if i < 9 else len(featureList)\n#             with open(file + \"_\" + str(i), \"w\", encoding=\"utf-8\") as sw:\n#                 for w in range(l, r):\n#                     word = featureList[w]\n#                     sw.write(word + \" \" + str(self.featureIndexMap[word]) + \"\\n\")\n\n#     def readFeature(self, file):\n#         featureList = []\n#         for i in range(10):\n#             featureList.append([])\n#             with open(file + \"_\" + str(i), encoding=\"utf-8\") as f:\n#                 lines = f.readlines()\n#             for line in lines:\n#                 featureList[i].append(line.strip())\n#         feature = []\n#         for i in range(10):\n#             for line in featureList[i]:\n#                 word, index = line.split(\" \")\n#                 self.featureIndexMap[word] = int(index)\n\n#     def readFeatureNormal(self, path):\n#         with open(path, encoding=\"utf-8\") as f:\n#       
      lines = f.readlines()\n#         for line in lines:\n#             u, v = line.split(\" \")\n#             self.featureIndexMap[u] = int(v)\n\n#     def readTag(self, path):\n#         with open(path, encoding=\"utf-8\") as f:\n#             lines = f.readlines()\n#         for line in lines:\n#             u, v = line.split(\" \")\n#             self.tagIndexMap[u] = int(v)\n\n#     def getMaps(self, file):\n#         config = self.config\n#         if not os.path.exists(file):\n#             print(\"file {} not exist!\".format(file))\n#         print(\"file {} converting...\".format(file))\n#         featureFreqMap = {}\n#         tagSet = set()\n#         with open(file, encoding=\"utf-8\") as f:\n#             lines = f.readlines()\n#         for line in lines:\n#             line = line.replace(\"\\t\", \" \")\n#             line = line.replace(\"\\r\", \"\").strip()\n#             if line == \"\":\n#                 continue\n#             ary = line.split(config.blank)\n#             for i in range(1, len(ary) - 1):\n#                 if ary[i] == \"\" or ary[i] == \"/\":\n#                     continue\n#                 if config.weightRegMode == \"GL\":\n#                     if not config.GL_init and config.groupTrim[i - 1]:\n#                         continue\n\n#                 ary2 = ary[i].split(config.slash)\n#                 feature = str(i) + \".\" + ary2[0]\n#                 if not feature in featureFreqMap:\n#                     featureFreqMap[feature] = 0\n#                 featureFreqMap[feature] += 1\n#             tag = ary[-1]\n#             tagSet.add(tag)\n#         sortList = []\n#         for k in featureFreqMap:\n#             sortList.append(k + \" \" + str(featureFreqMap[k]))\n#         if config.weightRegMode == \"GL\":\n#             sortList.sort(key=lambda x: (int(x.split(config.blank)[1].strip()), x))\n#             with open(\"featureTemp_sorted.txt\", \"w\", encoding=\"utf-8\") as f:\n#                 for x in 
sortList:\n#                     f.write(x + \"\\n\")\n#             config.groupStart = [0]\n#             config.groupEnd = []\n#             for k in range(1, len(sortList)):\n#                 thisAry = sortList[k].split(config.dot)\n#                 preAry = sortList[k - 1].split(config.dot)\n#                 s = thisAry[0]\n#                 preAry = preAry[0]\n#                 if s != preAry:\n#                     config.groupStart.append(k)\n#                     config.groupEnd.append(k)\n#             config.groupEnd.append(len(sortList))\n#         else:\n#             sortList.sort(\n#                 key=lambda x: (int(x.split(config.blank)[1].strip()), x), reverse=True\n#             )\n\n#         if config.weightRegMode == \"GL\" and config.GL_init:\n#             if nFeatTemp != len(config.groupStart):\n#                 raise Exception(\n#                     \"inconsistent # of features per line, check the feature file for consistency!\"\n#                 )\n#         with open(\n#             os.path.join(config.modelDir, \"featureIndex.txt\"), \"w\", encoding=\"utf-8\"\n#         ) as swFeat:\n#             for i, l in enumerate(sortList):\n#                 ary = l.split(config.blank)\n#                 self.featureIndexMap[ary[0]] = i\n#                 swFeat.write(\"{} {}\\n\".format(ary[0].strip(), i))\n#         with open(os.path.join(config.modelDir, \"tagIndex.txt\"), \"w\", encoding=\"utf-8\") as swTag:\n#             tagSortList = []\n#             for tag in tagSet:\n#                 tagSortList.append(tag)\n#             tagSortList.sort()\n#             for i, l in enumerate(tagSortList):\n#                 self.tagIndexMap[l] = i\n#                 swTag.write(\"{} {}\\n\".format(l, i))\n\n#     def convertFile(self, file):\n#         config = self.config\n#         if not os.path.exists(file):\n#             print(\"file {} not exist!\".format(file))\n#         print(\"file converting...\")\n#         if file == 
config.fTrain:\n#             swFeature = open(config.fFeatureTrain, \"w\", encoding=\"utf-8\")\n#             swGold = open(config.fGoldTrain, \"w\", encoding=\"utf-8\")\n#         else:\n#             swFeature = open(config.fFeatureTest, \"w\", encoding=\"utf-8\")\n#             swGold = open(config.fGoldTest, \"w\", encoding=\"utf-8\")\n#         swFeature.write(str(len(self.featureIndexMap)) + \"\\n\\n\")\n#         swGold.write(str(len(self.tagIndexMap)) + \"\\n\\n\")\n#         with open(file, encoding=\"utf-8\") as sr:\n#             readLines = sr.readlines()\n#         featureList = []\n#         goldList = []\n#         for k in range(len(readLines)):\n#             line = readLines[k]\n#             line = line.replace(\"\\t\", \"\").strip()\n#             featureLine = \"\"\n#             goldLine = \"\"\n#             if line == \"\":\n#                 featureLine = featureLine + \"\\n\"\n#                 goldLine = goldLine + \"\\n\\n\"\n#                 featureList.append(featureLine)\n#                 goldList.append(goldLine)\n#                 continue\n#             flag = 0\n#             ary = line.split(config.blank)\n#             tmp = []\n#             for i in ary:\n#                 if i != \"\":\n#                     tmp.append(i)\n#             ary = tmp\n#             for i in range(1, len(ary) - 1):\n#                 if ary[i] == \"/\":\n#                     continue\n#                 ary2 = ary[i].split(config.slash)\n#                 tmp = []\n#                 for j in ary2:\n#                     if j != \"\":\n#                         tmp.append(j)\n#                 ary2 = tmp\n#                 feature = str(i) + \".\" + ary2[0]\n#                 value = \"\"\n#                 real = False\n#                 if len(ary2) > 1:\n#                     value = ary2[1]\n#                     real = True\n#                 if not feature in self.featureIndexMap:\n#                     continue\n#                 flag = 
1\n#                 fIndex = self.featureIndexMap[feature]\n#                 if not real:\n#                     featureLine = featureLine + str(fIndex) + \",\"\n#                 else:\n#                     featureLine = featureLine + str(fIndex) + \"/\" + value + \",\"\n#             if flag == 0:\n#                 featureLine = featureLine + \"0\"\n#             featureLine = featureLine + \"\\n\"\n#             tag = ary[-1]\n#             tIndex = self.tagIndexMap[tag]\n#             goldLine = goldLine + str(tIndex) + \",\"\n#             featureList.append(featureLine)\n#             goldList.append(goldLine)\n#         for i in range(len(featureList)):\n#             swFeature.write(featureList[i])\n#             swGold.write(goldList[i])\n#         swFeature.close()\n#         swGold.close()\n\n\nclass DataSet:\n    def __init__(self, n_tag=0, n_feature=0):\n        self.lst = []  # type: List[Example]\n        self.n_tag = n_tag\n        self.n_feature = n_feature\n        # if len(args) == 2:\n        #     if type(args[0]) == int:\n        #         self.nTag, self.nFeature = args\n        #     else:\n        #         self.load(args[0], args[1])\n\n    def __len__(self):\n        return len(self.lst)\n\n    def __iter__(self):\n        return self.iterator()\n\n    def __getitem__(self, x):\n        return self.lst[x]\n\n    def iterator(self):\n        for i in self.lst:\n            yield i\n\n    def append(self, x):\n        self.lst.append(x)\n\n    def clear(self):\n        self.lst = []\n\n    def randomShuffle(self):\n        cp = copy.deepcopy(self)\n        random.shuffle(cp.lst)\n        return cp\n\n    # def setDataInfo(self, X):\n    #     self.nTag = X.nTag\n    #     self.nFeature = X.nFeature\n\n    def resize(self, scale):\n        dataset = DataSet(self.n_tag, self.n_feature)\n        new_size = int(len(self) * scale)\n        old_size = len(self)\n        for i in range(new_size):\n            if i >= old_size:\n                
i %= old_size\n            dataset.append(self[i])\n        return dataset\n\n    @classmethod\n    def load(cls, feature_idx_file, tag_idx_file):\n        dataset = cls.__new__(cls)\n\n        # def load(self, fileFeature, fileTag):\n        with open(feature_idx_file, encoding=\"utf-8\") as f_reader, open(\n            tag_idx_file, encoding=\"utf-8\"\n        ) as t_reader:\n\n            example_strs = f_reader.read().split(\"\\n\\n\")[:-1]\n            tags_strs = t_reader.read().split(\"\\n\\n\")[:-1]\n\n        assert len(example_strs) == len(\n            tags_strs\n        ), \"lengths do not match:\\t{}\\n{}\\n\".format(example_strs, tags_strs)\n\n        n_feature = int(example_strs[0])\n        n_tag = int(tags_strs[0])\n\n        dataset.n_feature = n_feature\n        dataset.n_tag = n_tag\n        dataset.lst = []\n\n        for example_str, tags_str in zip(example_strs[1:], tags_strs[1:]):\n            features = [\n                list(map(int, feature_line.split(\",\")))\n                for feature_line in example_str.split(\"\\n\")\n            ]\n            tags = tags_str.split(\",\")\n            example = Example(features, tags)\n            dataset.lst.append(example)\n\n        return dataset\n        # txt = srfileFeature.read()\n        # txt.replace(\"\\r\", \"\")\n        # fAry = txt.split(Config.biLineEnd)\n        # tmp = []\n        # for i in fAry:\n        #     if i != \"\":\n        #         tmp.append(i)\n        # fAry = tmp\n        # txt = srfileTag.read()\n        # txt.replace(\"\\r\", \"\")\n        # tAry = txt.split(Config.biLineEnd)\n        # tmp = []\n        # for i in tAry:\n        #     if i != \"\":\n        #         tmp.append(i)\n        # tAry = tmp\n\n        # assert len(fAry) == len(tAry)\n        # self.nFeature = int(fAry[0])\n        # self.nTag = int(tAry[0])\n        # for i in range(1, len(fAry)):\n        #     features = fAry[i]\n        #     tags = tAry[i]\n        #     seq = dataSeq()\n      
  #     seq.read(features, tags)\n        #     self.append(seq)\n\n    # @property\n    # def NTag(self):\n    #     return self.nTag\n\n\nclass Example:\n    def __init__(self, features, tags):\n        self.features = features  # type: List[List[int]]\n        self.tags = list(map(int, tags))  # type: List[int]\n        self.predicted_tags = None\n\n    def __len__(self):\n        return len(self.features)\n\n\n# class dataSeq:\n#     def __init__(self, *args):\n#         self.featureTemps = []\n#         self.yGold = []\n#         if len(args) == 2:\n#             self.featureTemps = copy.deepcopy(args[0])\n#             self.yGold = copy.deepcopy(args[1])\n#         elif len(args) == 3:\n#             x, n, length = args\n#             end = min(n + length, len(x))\n#             for i in range(n, end):\n#                 self.featureTemps.append(x.featureTemps[i])\n#                 yGold.append(x.yGold[i])\n\n#     def __len__(self):\n#         return len(self.featureTemps)\n\n#     def read(self, a, b):\n#         lineAry = a.split(Config.lineEnd)\n#         for im in lineAry:\n#             if im == \"\":\n#                 continue\n#             nodeList = []\n#             imAry = im.split(Config.comma)\n#             for imm in imAry:\n#                 if imm == \"\":\n#                     continue\n#                 if imm.find(\"/\") >= 0:\n#                     biAry = imm.split(Config.slash)\n#                     ft = featureTemp(int(biAry[0], float(biAry[1])))\n#                     nodeList.append(ft)\n#                 else:\n#                     ft = featureTemp(int(imm), 1)\n#                     nodeList.append(ft)\n#             self.featureTemps.append(nodeList)\n#         lineAry = b.split(Config.comma)\n#         for im in lineAry:\n#             if im == \"\":\n#                 continue\n#             self.yGold.append(int(im))\n\n#     # def load(self, feature):\n#     #     for imAry in feature:\n#     #         nodeList = []\n#   
  #         for imm in imAry:\n#     #             if imm == \"\":\n#     #                 continue\n#     #             if imm.find(\"/\") >= 0:\n#     #                 biAry = imm.split(Config.slash)\n#     #                 ft = featureTemp(int(biAry[0], float(biAry[1])))\n#     #                 nodeList.append(ft)\n#     #             else:\n#     #                 ft = featureTemp(int(imm), 1)\n#     #                 nodeList.append(ft)\n#     #         self.featureTemps.append(nodeList)\n#     #         self.yGold.append(0)\n\n#     def getFeatureTemp(self, *args):\n#         return (\n#             self.featureTemps if len(args) == 0 else self.featureTemps[args[0]]\n#         )\n\n#     def getTags(self, *args):\n#         return self.yGold if len(args) == 0 else self.yGold[args[0]]\n\n#     def setTags(self, lst):\n#         assert len(lst) == len(self.yGold)\n#         for i in range(len(lst)):\n#             self.yGold[i] = lst[i]\n\n\n# class dataSeqTest:\n#     def __init__(self, x, yOutput):\n#         self._x = x\n#         self._yOutput = yOutput\n"
  },
  {
    "path": "pkuseg/dicts/__init__.py",
    "content": ""
  },
  {
    "path": "pkuseg/download.py",
    "content": "from __future__ import absolute_import, division, print_function, unicode_literals\n\nimport hashlib\nimport os\nimport re\nimport shutil\nimport sys\nimport tempfile\nimport zipfile\n\ntry:\n    from requests.utils import urlparse\n    from requests import get as urlopen\n    requests_available = True\nexcept ImportError:\n    requests_available = False\n    if sys.version_info[0] == 2:\n        from urlparse import urlparse  # noqa f811\n        from urllib2 import urlopen  # noqa f811\n    else:\n        from urllib.request import urlopen\n        from urllib.parse import urlparse\ntry:\n    from tqdm import tqdm\nexcept ImportError:\n    tqdm = None  # defined below\n\nHASH_REGEX = re.compile(r'-([a-f0-9]*)\\.')\n\ndef download_model(url, model_dir, hash_prefix, progress=True):\n    if not os.path.exists(model_dir):\n        os.makedirs(model_dir)\n    parts = urlparse(url)\n    filename = os.path.basename(parts.path)\n    cached_file = os.path.join(model_dir, filename)\n    if not os.path.exists(cached_file):\n        sys.stderr.write('Downloading: \"{}\" to {}\\n'.format(url, cached_file))\n        _download_url_to_file(url, cached_file, hash_prefix, progress=progress)\n        unzip_file(cached_file, os.path.join(model_dir, filename.split('.')[0]))\n\n\ndef _download_url_to_file(url, dst, hash_prefix, progress):\n    if requests_available:\n        u = urlopen(url, stream=True, timeout=5)\n        file_size = int(u.headers[\"Content-Length\"])\n        u = u.raw\n    else:\n        u = urlopen(url, timeout=5)\n        meta = u.info()\n        if hasattr(meta, 'getheaders'):\n            file_size = int(meta.getheaders(\"Content-Length\")[0])\n        else:\n            file_size = int(meta.get_all(\"Content-Length\")[0])\n\n    f = tempfile.NamedTemporaryFile(delete=False)\n    try:\n        if hash_prefix is not None:\n            sha256 = hashlib.sha256()\n        with tqdm(total=file_size, disable=not progress) as pbar:\n            while 
True:\n                buffer = u.read(8192)\n                if len(buffer) == 0:\n                    break\n                f.write(buffer)\n                if hash_prefix is not None:\n                    sha256.update(buffer)\n                pbar.update(len(buffer))\n\n        f.close()\n        if hash_prefix is not None:\n            digest = sha256.hexdigest()\n            if digest[:len(hash_prefix)] != hash_prefix:\n                raise RuntimeError('invalid hash value (expected \"{}\", got \"{}\")'\n                                   .format(hash_prefix, digest))\n        shutil.move(f.name, dst)\n    finally:\n        f.close()\n        if os.path.exists(f.name):\n            os.remove(f.name)\n\n\nif tqdm is None:\n    # fake tqdm if it's not installed\n    class tqdm(object):\n\n        def __init__(self, total, disable=False):\n            self.total = total\n            self.disable = disable\n            self.n = 0\n\n        def update(self, n):\n            if self.disable:\n                return\n\n            self.n += n\n            sys.stderr.write(\"\\r{0:.1f}%\".format(100 * self.n / float(self.total)))\n            sys.stderr.flush()\n\n        def __enter__(self):\n            return self\n\n        def __exit__(self, exc_type, exc_val, exc_tb):\n            if self.disable:\n                return\n\n            sys.stderr.write('\\n')\n\ndef unzip_file(zip_name, target_dir):\n    if not os.path.exists(target_dir):\n        os.makedirs(target_dir)\n    file_zip = zipfile.ZipFile(zip_name, 'r')\n    for file in file_zip.namelist():\n        file_zip.extract(file, target_dir)\n    file_zip.close()\n\nif __name__ == '__main__':\n    url = 'https://github.com/lancopku/pkuseg-python/releases/download/v0.0.14/mixed.zip'\n    # download_model() requires hash_prefix; passing None skips the SHA-256 check\n    download_model(url, '.', None)\n"
  },
  {
    "path": "pkuseg/feature_extractor.pyx",
    "content": "# distutils: language = c++\n# cython: infer_types=True\n# cython: language_level=3\nimport json\nimport os\nimport sys\nimport pickle\nfrom collections import Counter\nfrom itertools import product\n\nimport cython\nfrom pkuseg.config import config\n\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncpdef get_slice_str(iterable, int start, int length, int all_len):\n    if start < 0 or start >= all_len:\n        return \"\"\n    if start + length >= all_len + 1:\n        return \"\"\n    return \"\".join(iterable[start : start + length])\n\n\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\n@cython.nonecheck(False)\ndef __get_node_features_idx(object config not None, int idx, list nodes not None, dict feature_to_idx not None, set unigram not None):\n\n    cdef:\n        list flist = []\n        Py_ssize_t i = idx\n        int length = len(nodes)\n        int word_max = config.wordMax\n        int word_min = config.wordMin\n        int word_range = word_max - word_min + 1\n\n\n    c = nodes[i]\n\n    # $$ starts feature\n    flist.append(0)\n\n    # 8 unigram/bgiram feature\n    feat  = 'c.' + c\n    if feat in feature_to_idx:\n        feature = feature_to_idx[feat]\n        flist.append(feature)\n\n\n    if i > 0:\n        prev_c = nodes[i-1]\n        feat = 'c-1.' + prev_c\n        if feat in feature_to_idx:\n            feature = feature_to_idx[feat]\n            flist.append(feature)\n\n        feat = 'c-1c.' + prev_c + '.' + c\n        if feat in feature_to_idx:\n            feature = feature_to_idx[feat]\n            flist.append(feature)\n\n    if i + 1 < length:\n        next_c = nodes[i+1]\n\n        feat = 'c1.' + next_c\n        if feat in feature_to_idx:\n            feature = feature_to_idx[feat]\n            flist.append(feature)\n\n        feat = 'cc1.' + c + '.' 
+ next_c\n        if feat in feature_to_idx:\n            feature = feature_to_idx[feat]\n            flist.append(feature)\n\n\n    if i > 1:\n        prepre_char = nodes[i-2]\n        feat = 'c-2.' + prepre_char\n        if feat in feature_to_idx:\n            feature = feature_to_idx[feat]\n            flist.append(feature)\n\n        feat = 'c-2c-1.' + prepre_char + '.' + nodes[i-1]\n        if feat in feature_to_idx:\n            feature = feature_to_idx[feat]\n            flist.append(feature)\n\n\n\n    if i + 2 < length:\n        feat = 'c2.' + nodes[i+2]\n        if feat in feature_to_idx:\n            feature = feature_to_idx[feat]\n            flist.append(feature)\n\n\n    # no num/letter based features\n    if not config.wordFeature:\n        return flist\n\n\n    # 2 * (wordMax-wordMin+1) word features (default: 2*(6-2+1)=10 )\n    # the character starts or ends a word\n\n    prelst_in = []\n    for l in range(word_max, word_min - 1, -1):\n        # length 6 ... 2 (default)\n        # \"prefix including current c\" wordary[n-l+1, n]\n        # current character ends word\n        tmp = get_slice_str(nodes, i - l + 1, l, length)\n        if tmp in unigram:\n            feat = 'w-1.' + tmp\n            if feat in feature_to_idx:\n                feature = feature_to_idx[feat]\n                flist.append(feature)\n\n            prelst_in.append(tmp)\n        else:\n            prelst_in.append(\"**noWord\")\n\n\n\n    postlst_in = []\n    for l in range(word_max, word_min - 1, -1):\n        # \"suffix\" wordary[n, n+l-1]\n        # current character starts word\n        tmp = get_slice_str(nodes, i, l, length)\n        if tmp in unigram:\n            feat = 'w1.' 
+ tmp\n            if feat in feature_to_idx:\n                feature = feature_to_idx[feat]\n                flist.append(feature)\n\n            postlst_in.append(tmp)\n        else:\n            postlst_in.append(\"**noWord\")\n\n\n    # these are not in feature list\n    prelst_ex = []\n    for l in range(word_max, word_min - 1, -1):\n        # \"prefix excluding current c\" wordary[n-l, n-1]\n        tmp = get_slice_str(nodes, i - l, l, length)\n        if tmp in unigram:\n            prelst_ex.append(tmp)\n        else:\n            prelst_ex.append(\"**noWord\")\n\n\n    postlst_ex = []\n    for l in range(word_max, word_min - 1, -1):\n        # \"suffix excluding current c\" wordary[n+1, n+l]\n        tmp = get_slice_str(nodes, i + 1, l, length)\n        if tmp in unigram:\n            postlst_ex.append(tmp)\n        else:\n            postlst_ex.append(\"**noWord\")\n\n\n    # this character is in the middle of a word\n    # 2*(wordMax-wordMin+1)^2 (default: 2*(6-2+1)^2=50)\n\n    for pre in prelst_ex:\n        for post in postlst_in:\n            feat = 'ww.l.' + pre + '*' + post\n            if feat in feature_to_idx:\n                feature = feature_to_idx[feat]\n                flist.append(feature)\n\n\n    for pre in prelst_in:\n        for post in postlst_ex:\n            feat = 'ww.r.' 
+ pre + '*' + post\n            if feat in feature_to_idx:\n                feature = feature_to_idx[feat]\n                flist.append(feature)\n\n\n    return flist\n\n\nclass FeatureExtractor:\n\n    keywords = \"-._,|/*:\"\n\n    num = set(\"0123456789.\" \"几二三四五六七八九十千万亿兆零\" \"１２３４５６７８９０％\")\n    letter = set(\n        \"ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ\" \"ａｂｃｄｅｆｇｈｉｇｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ\" \"／・－\"\n    )\n\n    keywords_translate_table = str.maketrans(\"-._,|/*:\", \"&&&&&&&&\")\n\n    @classmethod\n    def keyword_rename(cls, text):\n        return text.translate(cls.keywords_translate_table)\n\n    @classmethod\n    def _num_letter_normalize_char(cls, character):\n        if character in cls.num:\n            return \"**Num\"\n        if character in cls.letter:\n            return \"**Letter\"\n        return character\n\n    @classmethod\n    def normalize_text(cls, text):\n        text = cls.keyword_rename(text)\n        for character in text:\n            if config.numLetterNorm:\n                yield cls._num_letter_normalize_char(character)\n            else:\n                yield character\n\n\n    def __init__(self):\n\n        self.unigram = set()  # type: Set[str]\n        self.bigram = set()  # type: Set[str]\n        self.feature_to_idx = {}  # type: Dict[str, int]\n        self.tag_to_idx = {}  # type: Dict[str, int]\n\n    def build(self, train_file):\n        with open(train_file, \"r\", encoding=\"utf8\") as reader:\n            lines = reader.readlines()\n\n        examples = []  # type: List[List[List[str]]]\n\n        # first pass to collect unigram and bigram and tag info\n        word_length_info = Counter()\n        specials = set()\n        for line in lines:\n            line = line.strip(\"\\n\\r\")  # .replace(\"\\t\", \" \")\n            if not line:\n                continue\n\n            line = self.keyword_rename(line)\n\n            # str.split() without sep sees consecutive whiltespaces as one separator\n            # e.g., '\\ra \\t　
b \\r\\n'.split() = ['a', 'b']\n            words = [word for word in line.split()]\n\n            word_length_info.update(map(len, words))\n            specials.update(word for word in words if len(word)>=10)\n            self.unigram.update(words)\n\n            for pre, suf in zip(words[:-1], words[1:]):\n                self.bigram.add(\"{}*{}\".format(pre, suf))\n\n            example = [\n                self._num_letter_normalize_char(character)\n                for word in words\n                for character in word\n            ]\n            examples.append(example)\n\n        max_word_length = max(word_length_info.keys())\n        for length in range(1, max_word_length + 1):\n            print(\"length = {} : {}\".format(length, word_length_info[length]))\n        # print('special words: {}'.format(', '.join(specials)))\n        # second pass to get features\n\n        feature_freq = Counter()\n\n        for example in examples:\n            for i, _ in enumerate(example):\n                node_features = self.get_node_features(i, example)\n                feature_freq.update(\n                    feature for feature in node_features if feature != \"/\"\n                )\n\n        feature_set = (\n            feature\n            for feature, freq in feature_freq.most_common()\n            if freq > config.featureTrim\n        )\n       \n        tot = len(self.feature_to_idx)\n        for feature in feature_set:\n            if not feature in self.feature_to_idx:\n                self.feature_to_idx[feature] = tot\n                tot += 1\n        # self.feature_to_idx = {\n        #     feature: idx for idx, feature in enumerate(feature_set)\n        # }\n\n        if config.nLabel == 2:\n            B = B_single = \"B\"\n            I_first = I = I_end = \"I\"\n        elif config.nLabel == 3:\n            B = B_single = \"B\"\n            I_first = I = \"I\"\n            I_end = \"I_end\"\n        elif config.nLabel == 4:\n            B = \"B\"\n 
           B_single = \"B_single\"\n            I_first = I = \"I\"\n            I_end = \"I_end\"\n        elif config.nLabel == 5:\n            B = \"B\"\n            B_single = \"B_single\"\n            I_first = \"I_first\"\n            I = \"I\"\n            I_end = \"I_end\"\n\n        tag_set = {B, B_single, I_first, I, I_end}\n        self.tag_to_idx = {tag: idx for idx, tag in enumerate(sorted(tag_set))}\n\n\n\n    def get_node_features_idx(self, idx, nodes):\n        return __get_node_features_idx(config, idx, nodes, self.feature_to_idx, self.unigram)\n\n\n    def get_node_features(self, idx, wordary):\n        cdef int length = len(wordary)\n        w = wordary[idx]\n        flist = []\n\n        # 1 start feature\n        flist.append(\"$$\")\n\n        # 8 unigram/bgiram feature\n        flist.append(\"c.\" + w)\n        if idx > 0:\n            flist.append(\"c-1.\" + wordary[idx - 1])\n        else:\n            flist.append(\"/\")\n        if idx < len(wordary) - 1:\n            flist.append(\"c1.\" + wordary[idx + 1])\n        else:\n            flist.append(\"/\")\n        if idx > 1:\n            flist.append(\"c-2.\" + wordary[idx - 2])\n        else:\n            flist.append(\"/\")\n        if idx < len(wordary) - 2:\n            flist.append(\"c2.\" + wordary[idx + 2])\n        else:\n            flist.append(\"/\")\n        if idx > 0:\n            flist.append(\"c-1c.\" + wordary[idx - 1] + config.delimInFeature + w)\n        else:\n            flist.append(\"/\")\n        if idx < len(wordary) - 1:\n            flist.append(\"cc1.\" + w + config.delimInFeature + wordary[idx + 1])\n        else:\n            flist.append(\"/\")\n        if idx > 1:\n            flist.append(\n                \"c-2c-1.\"\n                + wordary[idx - 2]\n                + config.delimInFeature\n                + wordary[idx - 1]\n            )\n        else:\n            flist.append(\"/\")\n\n        # no num/letter based features\n        if not 
config.wordFeature:\n            return flist\n\n        # 2 * (wordMax-wordMin+1) word features (default: 2*(6-2+1)=10 )\n        # the character starts or ends a word\n        tmplst = []\n        for l in range(config.wordMax, config.wordMin - 1, -1):\n            # length 6 ... 2 (default)\n            # \"prefix including current c\" wordary[n-l+1, n]\n            # current character ends word\n            tmp = get_slice_str(wordary, idx - l + 1, l, length)\n            if tmp != \"\":\n                if tmp in self.unigram:\n                    flist.append(\"w-1.\" + tmp)\n                    tmplst.append(tmp)\n                else:\n                    flist.append(\"/\")\n                    tmplst.append(\"**noWord\")\n            else:\n                flist.append(\"/\")\n                tmplst.append(\"**noWord\")\n        prelst_in = tmplst\n\n        tmplst = []\n        for l in range(config.wordMax, config.wordMin - 1, -1):\n            # \"suffix\" wordary[n, n+l-1]\n            # current character starts word\n            tmp = get_slice_str(wordary, idx, l, length)\n            if tmp != \"\":\n                if tmp in self.unigram:\n                    flist.append(\"w1.\" + tmp)\n                    tmplst.append(tmp)\n                else:\n                    flist.append(\"/\")\n                    tmplst.append(\"**noWord\")\n            else:\n                flist.append(\"/\")\n                tmplst.append(\"**noWord\")\n        postlst_in = tmplst\n\n        # these are not in feature list\n        tmplst = []\n        for l in range(config.wordMax, config.wordMin - 1, -1):\n            # \"prefix excluding current c\" wordary[n-l, n-1]\n            tmp = get_slice_str(wordary, idx - l, l, length)\n            if tmp != \"\":\n                if tmp in self.unigram:\n                    tmplst.append(tmp)\n                else:\n                    tmplst.append(\"**noWord\")\n            else:\n                
tmplst.append(\"**noWord\")\n        prelst_ex = tmplst\n\n        tmplst = []\n        for l in range(config.wordMax, config.wordMin - 1, -1):\n            # \"suffix excluding current c\" wordary[n+1, n+l]\n            tmp = get_slice_str(wordary, idx + 1, l, length)\n            if tmp != \"\":\n                if tmp in self.unigram:\n                    tmplst.append(tmp)\n                else:\n                    tmplst.append(\"**noWord\")\n            else:\n                tmplst.append(\"**noWord\")\n        postlst_ex = tmplst\n\n        # this character is in the middle of a word\n        # 2*(wordMax-wordMin+1)^2 (default: 2*(6-2+1)^2=50)\n\n        for pre in prelst_ex:\n            for post in postlst_in:\n                bigram = pre + \"*\" + post\n                if bigram in self.bigram:\n                    flist.append(\"ww.l.\" + bigram)\n                else:\n                    flist.append(\"/\")\n\n        for pre in prelst_in:\n            for post in postlst_ex:\n                bigram = pre + \"*\" + post\n                if bigram in self.bigram:\n                    flist.append(\"ww.r.\" + bigram)\n                else:\n                    flist.append(\"/\")\n\n        return flist\n\n    def convert_feature_file_to_idx_file(\n        self, feature_file, feature_idx_file, tag_idx_file\n    ):\n\n        with open(feature_file, \"r\", encoding=\"utf8\") as reader:\n            lines = reader.readlines()\n\n        with open(feature_idx_file, \"w\", encoding=\"utf8\") as f_writer, open(\n            tag_idx_file, \"w\", encoding=\"utf8\"\n        ) as t_writer:\n\n            f_writer.write(\"{}\\n\\n\".format(len(self.feature_to_idx)))\n            t_writer.write(\"{}\\n\\n\".format(len(self.tag_to_idx)))\n\n            tags_idx = []  # type: List[str]\n            features_idx = []  # type: List[List[str]]\n            for line in lines:\n                line = line.strip()\n                if not line:\n                    # 
sentence finish\n                    for feature_idx in features_idx:\n                        if not feature_idx:\n                            f_writer.write(\"0\\n\")\n                        else:\n                            f_writer.write(\",\".join(map(str, feature_idx)))\n                            f_writer.write(\"\\n\")\n                    f_writer.write(\"\\n\")\n\n                    t_writer.write(\",\".join(map(str, tags_idx)))\n                    t_writer.write(\"\\n\\n\")\n\n                    tags_idx = []\n                    features_idx = []\n                    continue\n\n                splits = line.split(\" \")\n                feature_idx = [\n                    self.feature_to_idx[feat]\n                    for feat in splits[:-1]\n                    if feat in self.feature_to_idx\n                ]\n                features_idx.append(feature_idx)\n                tags_idx.append(self.tag_to_idx[splits[-1]])\n\n    def convert_text_file_to_feature_file(\n        self, text_file, conll_file=None, feature_file=None\n    ):\n\n        if conll_file is None:\n            conll_file = \"{}.conll{}\".format(*os.path.split(text_file))\n        if feature_file is None:\n            feature_file = \"{}.feat{}\".format(*os.path.split(text_file))\n\n        if config.nLabel == 2:\n            B = B_single = \"B\"\n            I_first = I = I_end = \"I\"\n        elif config.nLabel == 3:\n            B = B_single = \"B\"\n            I_first = I = \"I\"\n            I_end = \"I_end\"\n        elif config.nLabel == 4:\n            B = \"B\"\n            B_single = \"B_single\"\n            I_first = I = \"I\"\n            I_end = \"I_end\"\n        elif config.nLabel == 5:\n            B = \"B\"\n            B_single = \"B_single\"\n            I_first = \"I_first\"\n            I = \"I\"\n            I_end = \"I_end\"\n\n        conll_line_format = \"{} {}\\n\"\n\n        with open(text_file, \"r\", encoding=\"utf8\") as reader, open(\n         
   conll_file, \"w\", encoding=\"utf8\"\n        ) as c_writer, open(feature_file, \"w\", encoding=\"utf8\") as f_writer:\n            for line in reader:\n                line = line.strip()\n                if not line:\n                    continue\n                words = self.keyword_rename(line).split()\n                example = []\n                tags = []\n                for word in words:\n                    word_length = len(word)\n                    for idx, character in enumerate(word):\n                        if word_length == 1:\n                            tag = B_single\n                        elif idx == 0:\n                            tag = B\n                        elif idx == word_length - 1:\n                            tag = I_end\n                        elif idx == 1:\n                            tag = I_first\n                        else:\n                            tag = I\n                        c_writer.write(conll_line_format.format(character, tag))\n\n                        if config.numLetterNorm:\n                            example.append(\n                                self._num_letter_normalize_char(character)\n                            )\n                        else:\n                            example.append(character)\n                        tags.append(tag)\n                c_writer.write(\"\\n\")\n\n                for idx, tag in enumerate(tags):\n                    features = self.get_node_features(idx, example)\n                    features = [\n                        (feature if feature in self.feature_to_idx else \"/\")\n                        for feature in features\n                    ]\n                    features.append(tag)\n                    f_writer.write(\" \".join(features))\n                    f_writer.write(\"\\n\")\n                f_writer.write(\"\\n\")\n\n    def save(self, model_dir=None):\n        if model_dir is None:\n            model_dir = config.modelDir\n        data = 
{}\n        data[\"unigram\"] = sorted(list(self.unigram))\n        data[\"bigram\"] = sorted(list(self.bigram))\n        data[\"feature_to_idx\"] = self.feature_to_idx\n        data[\"tag_to_idx\"] = self.tag_to_idx\n\n        with open(os.path.join(model_dir, 'features.pkl'), 'wb') as writer:\n            pickle.dump(data, writer, protocol=pickle.HIGHEST_PROTOCOL)\n\n\n        # with open(\n        #     os.path.join(config.modelDir, \"features.json\"), \"w\", encoding=\"utf8\"\n        # ) as writer:\n        #     json.dump(data, writer, ensure_ascii=False)\n\n    @classmethod\n    def load(cls, model_dir=None):\n        if model_dir is None:\n            model_dir = config.modelDir\n        extractor = cls.__new__(cls)\n\n        feature_path = os.path.join(model_dir, \"features.pkl\")\n        if os.path.exists(feature_path):\n            with open(feature_path, \"rb\") as reader:\n                data = pickle.load(reader)\n            extractor.unigram = set(data[\"unigram\"])\n            extractor.bigram = set(data[\"bigram\"])\n            extractor.feature_to_idx = data[\"feature_to_idx\"]\n            extractor.tag_to_idx = data[\"tag_to_idx\"]\n\n            return extractor\n\n\n        print(\n            \"WARNING: features.pkl does not exist, try loading features.json\",\n            file=sys.stderr,\n        )\n\n\n        feature_path = os.path.join(model_dir, \"features.json\")\n        if os.path.exists(feature_path):\n            with open(feature_path, \"r\", encoding=\"utf8\") as reader:\n                data = json.load(reader)\n            extractor.unigram = set(data[\"unigram\"])\n            extractor.bigram = set(data[\"bigram\"])\n            extractor.feature_to_idx = data[\"feature_to_idx\"]\n            extractor.tag_to_idx = data[\"tag_to_idx\"]\n            extractor.save(model_dir)\n            return extractor\n        print(\n            \"WARNING: features.json does not exist, try loading using old format\",\n            
file=sys.stderr,\n        )\n\n        with open(\n            os.path.join(model_dir, \"unigram_word.txt\"),\n            \"r\",\n            encoding=\"utf8\",\n        ) as reader:\n            extractor.unigram = set(line.strip() for line in reader)\n\n        with open(\n            os.path.join(model_dir, \"bigram_word.txt\"),\n            \"r\",\n            encoding=\"utf8\",\n        ) as reader:\n            extractor.bigram = set(line.strip() for line in reader)\n\n        extractor.feature_to_idx = {}\n        feature_base_name = os.path.join(model_dir, \"featureIndex.txt\")\n        for i in range(10):\n            with open(\n                \"{}_{}\".format(feature_base_name, i), \"r\", encoding=\"utf8\"\n            ) as reader:\n                for line in reader:\n                    feature, index = line.split(\" \")\n                    feature = \".\".join(feature.split(\".\")[1:])\n                    extractor.feature_to_idx[feature] = int(index)\n\n        extractor.tag_to_idx = {}\n        with open(\n            os.path.join(model_dir, \"tagIndex.txt\"), \"r\", encoding=\"utf8\"\n        ) as reader:\n            for line in reader:\n                tag, index = line.split(\" \")\n                extractor.tag_to_idx[tag] = int(index)\n\n        # save() writes features.pkl, so report that file name\n        print(\n            \"INFO: features.pkl is saved\",\n            file=sys.stderr,\n        )\n        extractor.save(model_dir)\n\n        return extractor\n"
  },
  {
    "path": "pkuseg/gradient.py",
    "content": "import pkuseg.model\nfrom typing import List\n\nimport pkuseg.inference as _inf\nimport pkuseg.data\n\n\ndef get_grad_SGD_minibatch(\n    grad: List[float], model: pkuseg.model.Model, X: List[pkuseg.data.Example]\n):\n    # if idset is not None:\n    #     idset.clear()\n    all_id_set = set()\n    errors = 0\n    for x in X:\n        error, id_set = get_grad_CRF(grad, model, x)\n        errors += error\n        all_id_set.update(id_set)\n\n    return errors, all_id_set\n\n\ndef get_grad_CRF(\n    grad: List[float], model: pkuseg.model.Model, x: pkuseg.data.Example\n):\n\n    id_set = set()\n\n    n_tag = model.n_tag\n    bel = _inf.belief(len(x), n_tag)\n    belMasked = _inf.belief(len(x), n_tag)\n\n    Ylist, YYlist, maskYlist, maskYYlist = _inf.getYYandY(model, x)\n    Z, sum_edge = _inf.get_beliefs(bel, model, x, Ylist, YYlist)\n    ZGold, sum_edge_masked = _inf.get_beliefs(belMasked, model, x, maskYlist, maskYYlist)\n\n    for i, node_feature_list in enumerate(x.features):\n        for feature_id in node_feature_list:\n            trans_id = model._get_node_tag_feature_id(feature_id, 0)\n            id_set.update(range(trans_id, trans_id + n_tag))\n            grad[trans_id:trans_id+n_tag] += bel.belState[i] - belMasked.belState[i]\n\n    backoff = model.n_feature * n_tag\n    grad[backoff:] += sum_edge - sum_edge_masked\n    id_set.update(range(backoff, backoff + n_tag * n_tag))\n\n    return Z - ZGold, id_set\n"
  },
  {
    "path": "pkuseg/inference.pyx",
    "content": "# distutils: language = c++\n# cython: infer_types=True\n# cython: language_level=3\ncimport cython\nimport numpy as np\ncimport numpy as np\n\n\nfrom libcpp.vector cimport vector\nfrom libc.math cimport exp, log\n\n\nnp.import_array()\n\n\nclass belief:\n    def __init__(self, nNodes, nStates):\n        self.belState = np.zeros((nNodes, nStates))\n        self.belEdge = np.zeros((nNodes, nStates * nStates))\n        self.Z = 0\n\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncpdef get_beliefs(object bel, object m, object x, np.ndarray[double, ndim=2] Y, np.ndarray[double, ndim=2] YY):\n    cdef:\n        np.ndarray[double, ndim=2] belState = bel.belState\n        np.ndarray[double, ndim=2] belEdge = bel.belEdge\n        int nNodes = len(x)\n        int nTag = m.n_tag\n        double Z = 0\n        np.ndarray[double, ndim=1] alpha_Y = np.zeros(nTag)\n        np.ndarray[double, ndim=1] newAlpha_Y = np.zeros(nTag)\n        np.ndarray[double, ndim=1] tmp_Y = np.zeros(nTag)\n        np.ndarray[double, ndim=2] YY_trans = YY.transpose()\n        np.ndarray[double, ndim=1] YY_t_r = YY_trans.reshape(-1)\n        np.ndarray[double, ndim=1] sum_edge = np.zeros(nTag * nTag)\n\n    for i in range(nNodes - 1, 0, -1):\n        tmp_Y = belState[i] + Y[i]\n        belState[i-1] = logMultiply(YY, tmp_Y)\n\n    for i in range(nNodes):\n        if i > 0:\n            tmp_Y = alpha_Y.copy()\n            newAlpha_Y = logMultiply(YY_trans, tmp_Y) + Y[i]\n        else:\n            newAlpha_Y = Y[i].copy()\n        if i > 0:\n            tmp_Y = Y[i] + belState[i]\n            belEdge[i] = YY_t_r\n            for yPre in range(nTag):\n                for y in range(nTag):\n                    belEdge[i, y * nTag + yPre] += tmp_Y[y] + alpha_Y[yPre]\n        belState[i] = belState[i] + newAlpha_Y\n        alpha_Y = newAlpha_Y\n    Z = logSum(alpha_Y)\n    for i in range(nNodes):\n        belState[i] = np.exp(belState[i] - Z)\n    for i in range(1, nNodes):\n      
  sum_edge += np.exp(belEdge[i] - Z)\n    return Z, sum_edge\n\n\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncpdef run_viterbi(np.ndarray[double, ndim=2] node_score, np.ndarray[double, ndim=2] edge_score):\n    cdef int i, y, y_pre, i_pre, tag, w=node_score.shape[0], h=node_score.shape[1]\n    cdef double ma, sc\n    cdef np.ndarray[double, ndim=2] max_score = np.zeros((w, h), dtype=np.float64)\n    cdef np.ndarray[int, ndim=2] pre_tag = np.zeros((w, h), dtype=np.int32)\n    cdef np.ndarray[unsigned char, ndim=2] init_check = np.zeros((w, h), dtype=np.uint8)\n    cdef np.ndarray[int, ndim=1] states = np.zeros(w, dtype=np.int32)\n    for y in range(h):\n        max_score[w-1, y] = node_score[w-1, y]\n    for i in range(w - 2, -1, -1):\n        for y in range(h):\n            for y_pre in range(h):\n                i_pre = i + 1\n                sc = max_score[i_pre, y_pre] + node_score[i, y] + edge_score[y, y_pre]\n                if not init_check[i, y]:\n                    init_check[i, y] = 1\n                    max_score[i, y] = sc\n                    pre_tag[i, y] = y_pre\n                elif sc >= max_score[i, y]:\n                    max_score[i, y] = sc\n                    pre_tag[i, y] = y_pre\n    ma = max_score[0, 0]\n    tag = 0\n    for y in range(1, h):\n        sc = max_score[0, y]\n        if ma < sc:\n            ma = sc\n            tag = y\n    states[0] = tag\n    for i in range(1, w):\n        tag = pre_tag[i-1, tag]\n        states[i] = tag\n    if ma > 300:\n        ma = 300\n    return exp(ma), states\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncpdef getLogYY(vector[vector[int]] feature_temp, int num_tag, int backoff, np.ndarray[double, ndim=1] w, double scalar):\n    cdef:\n        int num_node = feature_temp.size()\n        np.ndarray[double, ndim=2] node_score = np.zeros((num_node, num_tag), dtype=np.float64)\n        np.ndarray[double, ndim=2] edge_score = np.ones((num_tag, num_tag), dtype=np.float64)\n    
    int s, s_pre, i\n        double maskValue, tmp\n        vector[int] f_list\n        int f, ft\n    for i in range(num_node):\n        f_list = feature_temp[i]\n        for ft in f_list:\n            for s in range(num_tag):\n                f = ft * num_tag + s\n                node_score[i, s] += w[f] * scalar\n    for s in range(num_tag):\n        for s_pre in range(num_tag):\n            f = backoff + s * num_tag + s_pre\n            edge_score[s_pre, s] += w[f] * scalar\n    return node_score, edge_score\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncpdef maskY(object tags, int nNodes, int nTag, np.ndarray[double, ndim=2] Y):\n    cdef np.ndarray[double, ndim=2] mask_Yi = Y.copy()\n    cdef double maskValue = -1e100\n    cdef list tagList = tags\n    cdef int i\n    for i in range(nNodes):\n        for s in range(nTag):\n            if tagList[i] != s:\n                mask_Yi[i, s] = maskValue\n    return mask_Yi\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncdef logMultiply(np.ndarray[double, ndim=2] A, np.ndarray[double, ndim=1] B):\n    cdef int r, c\n    cdef np.ndarray[double, ndim=2] toSumLists = np.zeros_like(A)\n    cdef np.ndarray[double, ndim=1] ret = np.zeros(A.shape[0])\n    for r in range(A.shape[0]):\n        for c in range(A.shape[1]):\n            toSumLists[r, c] = A[r, c] + B[c]\n    for r in range(A.shape[0]):\n        ret[r] = logSum(toSumLists[r])\n    return ret\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncdef logSum(double[:] a):\n    cdef int n = a.shape[0]\n    cdef double s = a[0]\n    cdef double m1\n    cdef double m2\n    for i in range(1, n):\n        if s >= a[i]:\n            m1, m2 = s, a[i]\n        else:\n            m1, m2 = a[i], s\n        s = m1 + log(1 + exp(m2 - m1))\n    return s\n\n\ndef decodeViterbi_fast(feature_temp, model):\n    Y, YY = getLogYY(feature_temp, model.n_tag, model.n_feature*model.n_tag, model.w, 1.0)\n    numer, tags = run_viterbi(Y, YY)\n    tags = 
list(tags)\n    return numer, tags\n\n\ndef getYYandY(model, example):\n    Y, YY = getLogYY(example.features, model.n_tag, model.n_feature*model.n_tag, model.w, 1.0)\n    mask_Y = maskY(example.tags, len(example), model.n_tag, Y)\n    mask_YY = YY\n    return Y, YY, mask_Y, mask_YY\n\n"
  },
  {
    "path": "pkuseg/model.py",
    "content": "import os\nimport sys\nimport numpy as np\n\nfrom .config import config\n\n\nclass Model:\n    def __init__(self, n_feature, n_tag):\n\n        self.n_tag = n_tag\n        self.n_feature = n_feature\n        self.n_transition_feature = n_tag * (n_feature + n_tag)\n        if config.random:\n            self.w = np.random.random(size=(self.n_transition_feature,)) * 2 - 1\n        else:\n            self.w = np.zeros(self.n_transition_feature)\n\n    def expand(self, n_feature, n_tag):\n        new_transition_feature = n_tag * (n_feature + n_tag)\n        if config.random:\n            new_w = np.random.random(size=(new_transition_feature,)) * 2 - 1\n        else:\n            new_w = np.zeros(new_transition_feature)\n        n_node = self.n_tag * self.n_feature\n        n_edge = self.n_tag * self.n_tag\n        new_w[:n_node] = self.w[:n_node]\n        new_w[-n_edge:] = self.w[-n_edge:]\n        self.n_tag = n_tag\n        self.n_feature = n_feature\n        self.n_transition_feature = new_transition_feature\n        self.w = new_w\n\n    def _get_node_tag_feature_id(self, feature_id, tag_id):\n        return feature_id * self.n_tag + tag_id\n\n    def _get_tag_tag_feature_id(self, pre_tag_id, tag_id):\n        return self.n_feature * self.n_tag + tag_id * self.n_tag + pre_tag_id\n\n    @classmethod\n    def load(cls, model_dir=None):\n        if model_dir is None:\n            model_dir = config.modelDir\n        model_path = os.path.join(model_dir, \"weights.npz\")\n        if os.path.exists(model_path):\n            npz = np.load(model_path)\n            sizes = npz[\"sizes\"]\n            w = npz[\"w\"]\n            model = cls.__new__(cls)\n            model.n_tag = int(sizes[0])\n            model.n_feature = int(sizes[1])\n            model.n_transition_feature = model.n_tag * (\n                model.n_feature + model.n_tag\n            )\n            model.w = w\n            assert model.w.shape[0] == model.n_transition_feature\n            
return model\n\n        print(\n            \"WARNING: weights.npz does not exist, try loading using old format\",\n            file=sys.stderr,\n        )\n\n        model_path = os.path.join(model_dir, \"model.txt\")\n        with open(model_path, encoding=\"utf-8\") as f:\n            ary = f.readlines()\n\n        model = cls.__new__(cls)\n        model.n_tag = int(ary[0].strip())\n        wsize = int(ary[1].strip())\n        w = np.zeros(wsize)\n        # two header lines (n_tag, wsize) precede the wsize weight lines\n        for i in range(2, wsize + 2):\n            w[i - 2] = float(ary[i].strip())\n        model.w = w\n        model.n_feature = wsize // model.n_tag - model.n_tag\n        model.n_transition_feature = wsize\n\n        model.save(model_dir)\n        return model\n\n    @classmethod\n    def new(cls, model, copy_weight=True):\n\n        new_model = cls.__new__(cls)\n        new_model.n_tag = model.n_tag\n        if copy_weight:\n            new_model.w = model.w.copy()\n        else:\n            new_model.w = np.zeros_like(model.w)\n        new_model.n_feature = (\n            new_model.w.shape[0] // new_model.n_tag - new_model.n_tag\n        )\n        new_model.n_transition_feature = new_model.w.shape[0]\n        return new_model\n\n    def save(self, model_dir=None):\n        if model_dir is None:\n            model_dir = config.modelDir\n        sizes = np.array([self.n_tag, self.n_feature])\n        np.savez(\n            os.path.join(model_dir, \"weights.npz\"), sizes=sizes, w=self.w\n        )\n        # np.save\n        # with open(file, \"w\", encoding=\"utf-8\") as f:\n        #     f.write(\"{}\\n{}\\n\".format(self.n_tag, self.w.shape[0]))\n        #     for value in self.w:\n        #         f.write(\"{:.4f}\\n\".format(value))\n"
  },
  {
    "path": "pkuseg/optimizer.py",
    "content": "import random\n\nimport numpy as np\nimport pkuseg.gradient as _grad\n\n# from pkuseg.config import config\n\n\nclass Optimizer:\n    def __init__(self):\n        self._preVals = []\n\n    def converge_test(self, err):\n        val = 1e100\n        if len(self._preVals) > 1:\n            prevVal = self._preVals[0]\n            if len(self._preVals) == 10:\n                self._preVals.pop(0)\n            avgImprovement = (prevVal - err) / len(self._preVals)\n            relAvg = avgImprovement / abs(err)\n            val = relAvg\n        self._preVals.append(err)\n        return val\n\n    def optimize(self):\n        raise NotImplementedError()\n\n\nclass ADF(Optimizer):\n    def __init__(self, config, dataset, model):\n\n        super().__init__()\n\n        self.config = config\n\n        self._model = model\n        self._X = dataset\n        self.decayList = np.ones_like(self._model.w) * config.rate0\n\n    def optimize(self):\n        config = self.config\n        sample_size = 0\n        w = self._model.w\n        fsize = w.shape[0]\n        xsize = len(self._X)\n        grad = np.zeros(fsize)\n        error = 0\n\n        feature_count_list = np.zeros(fsize)\n        # feature_count_list = [0] * fsize\n        ri = list(range(xsize))\n        random.shuffle(ri)\n\n        update_interval = xsize // config.nUpdate\n\n        # config.interval = xsize // config.nUpdate\n        n_sample = 0\n        for t in range(0, xsize, config.miniBatch):\n            XX = []\n            end = False\n            for k in range(t, t + config.miniBatch):\n                i = ri[k]\n                x = self._X[i]\n                XX.append(x)\n                if k == xsize - 1:\n                    end = True\n                    break\n            mb_size = len(XX)\n            n_sample += mb_size\n\n            # fSet = set()\n\n            err, feature_set = _grad.get_grad_SGD_minibatch(\n                grad, self._model, XX\n            )\n            
error += err\n\n            feature_set = list(feature_set)\n\n            feature_count_list[feature_set] += 1\n\n            # for i in feature_set:\n            #     feature_count_list[i] += 1\n            check = False\n\n            for k in range(t, t + config.miniBatch):\n                if t != 0 and k % update_interval == 0:\n                    check = True\n\n            # update decay rates\n            if check or end:\n\n                self.decayList *= (\n                    config.upper\n                    - (config.upper - config.lower)\n                    * feature_count_list\n                    / n_sample\n                )\n                feature_count_list.fill(0)\n\n                # for i in range(fsize):\n                #     v = feature_count_list[i]\n                #     u = v / n_sample\n                #     eta = config.upper - (config.upper - config.lower) * u\n                #     self.decayList[i] *= eta\n                # feature_count_list\n                # for i in range(len(feature_count_list)):\n                #     feature_count_list[i] = 0\n            # update weights\n\n            w[feature_set] -= self.decayList[feature_set] * grad[feature_set]\n            grad[feature_set] = 0\n            # for i in feature_set:\n            #     w[i] -= self.decayList[i] * grad[i]\n            #     grad[i] = 0\n            # reg\n            if check or end:\n                if config.reg != 0:\n                    w -= self.decayList * (\n                        w / (config.reg * config.reg) * n_sample / xsize\n                    )\n\n                    # for i in range(fsize):\n                    #     grad_i = (\n                    #         w[i] / (config.reg * config.reg) * (n_sample / xsize)\n                    #     )\n                    #     w[i] -= self.decayList[i] * grad_i\n                n_sample = 0\n            sample_size += mb_size\n        if config.reg != 0:\n            s = (w * w).sum()\n        
    error += s / (2.0 * config.reg * config.reg)\n        diff = self.converge_test(error)\n        return error, sample_size, diff\n"
  },
  {
    "path": "pkuseg/postag/__init__.py",
    "content": "from __future__ import print_function\nimport sys\n\nimport os\nimport time\n\nfrom ..inference import decodeViterbi_fast\n\n\nfrom .feature_extractor import FeatureExtractor\nfrom .model import Model\n\nclass Postag:\n    def __init__(self, model_name):\n        modelDir = model_name\n        self.feature_extractor = FeatureExtractor.load(modelDir)\n        self.model = Model.load(modelDir)\n\n        self.idx_to_tag = {\n            idx: tag for tag, idx in self.feature_extractor.tag_to_idx.items()\n        }\n\n        self.n_feature = len(self.feature_extractor.feature_to_idx)\n        self.n_tag = len(self.feature_extractor.tag_to_idx)\n\n        # print(\"finish\")\n\n    def _cut(self, text):\n        examples = list(self.feature_extractor.normalize_text(text))\n        length = len(examples)\n\n        all_feature = []  # type: List[List[int]]\n        for idx in range(length):\n            node_feature_idx = self.feature_extractor.get_node_features_idx(\n                idx, examples\n            )\n            # node_feature = self.feature_extractor.get_node_features(\n            #     idx, examples\n            # )\n\n            # node_feature_idx = []\n            # for feature in node_feature:\n            #     feature_idx = self.feature_extractor.feature_to_idx.get(feature)\n            #     if feature_idx is not None:\n            #         node_feature_idx.append(feature_idx)\n            # if not node_feature_idx:\n            #     node_feature_idx.append(0)\n\n            all_feature.append(node_feature_idx)\n\n        _, tags = decodeViterbi_fast(all_feature, self.model)\n        tags = list(map(lambda x:self.idx_to_tag[x], tags))\n        return tags\n\n    def tag(self, sen):\n        \"\"\"txt: list[str], tags: list[str]\"\"\"\n        tags = self._cut(sen)\n        return tags\n\n"
  },
  {
    "path": "pkuseg/postag/feature_extractor.pyx",
    "content": "# distutils: language = c++\n# cython: infer_types=True\n# cython: language_level=3\nimport json\nimport os\nimport sys\nimport pickle\nfrom collections import Counter\nfrom itertools import product\n\nimport cython\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ncpdef get_slice_str(iterable, int start, int length, int all_len):\n    if start < 0 or start >= all_len:\n        return \"\"\n    if start + length >= all_len + 1:\n        return \"\"\n    return \"\".join(iterable[start : start + length])\n\n\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\n@cython.nonecheck(False)\ndef __get_node_features_idx(int idx, list nodes not None, dict feature_to_idx not None):\n\n    cdef:\n        list flist = []\n        Py_ssize_t i = idx\n        int length = len(nodes)\n        int j\n\n\n    w = nodes[i]\n\n    # $$ starts feature\n    flist.append(0)\n\n    # unigram/bgiram feature\n    feat  = \"w.\" + w\n    if feat in feature_to_idx:\n        feature = feature_to_idx[feat]\n        flist.append(feature)\n\n    for j in range(1, 4):\n        if len(w)>=j:\n            feat = \"tr1.pre.%d.%s\"%(j, w[:j])\n            if feat in feature_to_idx:\n                flist.append(feature_to_idx[feat])\n            feat = \"tr1.post.%d.%s\"%(j, w[-j:])\n            if feat in feature_to_idx:\n                flist.append(feature_to_idx[feat])\n\n    if i > 0:\n        feat = \"tr1.w-1.\" + nodes[i - 1]\n    else:\n        feat = \"tr1.w-1.BOS\"\n    if feat in feature_to_idx:\n        flist.append(feature_to_idx[feat])\n    if i < length - 1:\n        feat = \"tr1.w1.\" + nodes[i + 1]\n    else:\n        feat = \"tr1.w1.EOS\"\n    if feat in feature_to_idx:\n        flist.append(feature_to_idx[feat])\n    if i > 1:\n        feat = \"tr1.w-2.\" + nodes[i - 2]\n    else:\n        feat = \"tr1.w-2.BOS\"\n    if feat in feature_to_idx:\n        flist.append(feature_to_idx[feat])\n    if i < length - 2:\n        feat = \"tr1.w2.\" + nodes[i + 
2]\n    else:\n        feat = \"tr1.w2.EOS\"\n    if feat in feature_to_idx:\n        flist.append(feature_to_idx[feat])\n    if i > 0:\n        feat = \"tr1.w_-1_0.\" + nodes[i - 1] + \".\" + w\n    else:\n        feat = \"tr1.w_-1_0.BOS\"\n    if feat in feature_to_idx:\n        flist.append(feature_to_idx[feat])\n    if i < length - 1:\n        feat = \"tr1.w_0_1.\" + w + \".\" + nodes[i + 1]\n    else:\n        feat = \"tr1.w_0_1.EOS\"\n    if feat in feature_to_idx:\n        flist.append(feature_to_idx[feat])\n\n    return flist\n\n\nclass FeatureExtractor:\n\n    keywords = \"-._,|/*:\"\n\n    num = set(\"0123456789.\" \"几二三四五六七八九十千万亿兆零\" \"１２３４５６７８９０％\")\n    letter = set(\n        \"ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ\" \"ａｂｃｄｅｆｇｈｉｇｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ\" \"／・－\"\n    )\n\n    keywords_translate_table = str.maketrans(\"-._,|/*:\", \"&&&&&&&&\")\n\n    @classmethod\n    def keyword_rename(cls, text):\n        return text.translate(cls.keywords_translate_table)\n\n    @classmethod\n    def _num_letter_normalize(cls, word):\n        if not list(filter(lambda x:x not in cls.num, word)):\n            return \"**Num\"\n        return word\n\n    @classmethod\n    def normalize_text(cls, text):\n        for i in range(len(text)):\n            text[i] = cls.keyword_rename(text[i])\n        for character in text:\n            yield cls._num_letter_normalize(character)\n\n\n    def __init__(self):\n\n        # self.unigram = set()  # type: Set[str]\n        # self.bigram = set()  # type: Set[str]\n        self.feature_to_idx = {}  # type: Dict[str, int]\n        self.tag_to_idx = {}  # type: Dict[str, int]\n    \n    def get_node_features_idx(self, idx, nodes):\n        return __get_node_features_idx(idx, nodes, self.feature_to_idx)\n\n    def get_node_features(self, idx, wordary):\n        cdef int length = len(wordary)\n        w = wordary[idx]\n        flist = []\n\n        # 1 start feature\n        flist.append(\"$$\")\n\n        # 8 unigram/bgiram feature\n        
flist.append(\"w.\" + w)\n\n        # prefix/suffix\n        for i in range(1, 4):\n            if len(w)>=i:\n                flist.append(\"tr1.pre.%d.%s\"%(i, w[:i]))\n                flist.append(\"tr1.post.%d.%s\"%(i, w[-i:]))\n            else:\n                flist.append(\"/\")\n                flist.append(\"/\")\n\n        if idx > 0:\n            flist.append(\"tr1.w-1.\" + wordary[idx - 1])\n        else:\n            flist.append(\"tr1.w-1.BOS\")\n        if idx < len(wordary) - 1:\n            flist.append(\"tr1.w1.\" + wordary[idx + 1])\n        else:\n            flist.append(\"tr1.w1.EOS\")\n        if idx > 1:\n            flist.append(\"tr1.w-2.\" + wordary[idx - 2])\n        else:\n            flist.append(\"tr1.w-2.BOS\")\n        if idx < len(wordary) - 2:\n            flist.append(\"tr1.w2.\" + wordary[idx + 2])\n        else:\n            flist.append(\"tr1.w2.EOS\")\n        if idx > 0:\n            flist.append(\"tr1.w_-1_0.\" + wordary[idx - 1] + \".\" + w)\n        else:\n            flist.append(\"tr1.w_-1_0.BOS\")\n        if idx < len(wordary) - 1:\n            flist.append(\"tr1.w_0_1.\" + w + \".\" + wordary[idx + 1])\n        else:\n            flist.append(\"tr1.w_0_1.EOS\")\n\n        return flist\n\n    def convert_feature_file_to_idx_file(\n        self, feature_file, feature_idx_file, tag_idx_file\n    ):\n\n        with open(feature_file, \"r\", encoding=\"utf8\") as reader:\n            lines = reader.readlines()\n\n        with open(feature_idx_file, \"w\", encoding=\"utf8\") as f_writer, open(\n            tag_idx_file, \"w\", encoding=\"utf8\"\n        ) as t_writer:\n\n            f_writer.write(\"{}\\n\\n\".format(len(self.feature_to_idx)))\n            t_writer.write(\"{}\\n\\n\".format(len(self.tag_to_idx)))\n\n            tags_idx = []  # type: List[str]\n            features_idx = []  # type: List[List[str]]\n            for line in lines:\n                line = line.strip()\n                if not line:\n         
           # sentence finish\n                    for feature_idx in features_idx:\n                        if not feature_idx:\n                            f_writer.write(\"0\\n\")\n                        else:\n                            f_writer.write(\",\".join(map(str, feature_idx)))\n                            f_writer.write(\"\\n\")\n                    f_writer.write(\"\\n\")\n\n                    t_writer.write(\",\".join(map(str, tags_idx)))\n                    t_writer.write(\"\\n\\n\")\n\n                    tags_idx = []\n                    features_idx = []\n                    continue\n\n                splits = line.split(\" \")\n                feature_idx = [\n                    self.feature_to_idx[feat]\n                    for feat in splits[:-1]\n                    if feat in self.feature_to_idx\n                ]\n                features_idx.append(feature_idx)\n                if not splits[-1] in self.tag_to_idx:\n                    tags_idx.append(-1)\n                else:\n                    tags_idx.append(self.tag_to_idx[splits[-1]])\n\n    def convert_text_file_to_feature_file(\n        self, text_file, conll_file=None, feature_file=None\n    ):\n\n        if conll_file is None:\n            conll_file = \"{}.conll{}\".format(*os.path.split(text_file))\n        if feature_file is None:\n            feature_file = \"{}.feat{}\".format(*os.path.split(text_file))\n\n        conll_line_format = \"{} {}\\n\"\n\n        with open(text_file, \"r\", encoding=\"utf8\") as reader, open(\n            conll_file, \"w\", encoding=\"utf8\"\n        ) as c_writer, open(feature_file, \"w\", encoding=\"utf8\") as f_writer:\n            for line in reader.read().strip().replace(\"\\r\", \"\").split(\"\\n\\n\"):\n                line = line.strip()\n                if not line:\n                    continue\n                line = self.keyword_rename(line).split(\"\\n\")\n                words = []\n                tags = []\n                
for word_tag in line:\n                    word, tag = word_tag.split()\n                    words.append(word)\n                    tags.append(tag)\n                example = [\n                    self._num_letter_normalize(word)\n                    for word in words\n                ]\n                for word, tag in zip(example, tags):\n                    c_writer.write(conll_line_format.format(word, tag))\n                c_writer.write(\"\\n\")\n\n                for idx, tag in enumerate(tags):\n                    features = self.get_node_features(idx, example)\n                    features = [\n                        (feature if feature in self.feature_to_idx else \"/\")\n                        for feature in features\n                    ]\n                    features.append(tag)\n                    f_writer.write(\" \".join(features))\n                    f_writer.write(\"\\n\")\n                f_writer.write(\"\\n\")\n\n    def save(self, model_dir):\n        data = {}\n        data[\"feature_to_idx\"] = self.feature_to_idx\n        data[\"tag_to_idx\"] = self.tag_to_idx\n\n        with open(os.path.join(model_dir, 'features.pkl'), 'wb') as writer:\n            pickle.dump(data, writer, protocol=pickle.HIGHEST_PROTOCOL)\n\n\n    @classmethod\n    def load(cls, model_dir):\n        extractor = cls.__new__(cls)\n\n        feature_path = os.path.join(model_dir, \"features.pkl\")\n        if os.path.exists(feature_path):\n            with open(feature_path, \"rb\") as reader:\n                data = pickle.load(reader)\n            extractor.feature_to_idx = data[\"feature_to_idx\"]\n            extractor.tag_to_idx = data[\"tag_to_idx\"]\n\n            return extractor\n\n\n        print(\n            \"WARNING: features.pkl does not exist, try loading features.json\",\n            file=sys.stderr,\n        )\n\n\n        feature_path = os.path.join(model_dir, \"features.json\")\n        if os.path.exists(feature_path):\n            with 
open(feature_path, \"r\", encoding=\"utf8\") as reader:\n                data = json.load(reader)\n            extractor.feature_to_idx = data[\"feature_to_idx\"]\n            extractor.tag_to_idx = data[\"tag_to_idx\"]\n            extractor.save(model_dir)\n            return extractor\n        print(\n            \"WARNING: features.json does not exist, try loading using old format\",\n            file=sys.stderr,\n        )\n\n        extractor.feature_to_idx = {}\n        feature_base_name = os.path.join(model_dir, \"featureIndex.txt\")\n        for i in range(10):\n            with open(\n                \"{}_{}\".format(feature_base_name, i), \"r\", encoding=\"utf8\"\n            ) as reader:\n                for line in reader:\n                    feature, index = line.split(\" \")\n                    feature = \".\".join(feature.split(\".\")[1:])\n                    extractor.feature_to_idx[feature] = int(index)\n\n        extractor.tag_to_idx = {}\n        with open(\n            os.path.join(model_dir, \"tagIndex.txt\"), \"r\", encoding=\"utf8\"\n        ) as reader:\n            for line in reader:\n                tag, index = line.split(\" \")\n                extractor.tag_to_idx[tag] = int(index)\n\n        # save() writes features.pkl, so report that file name\n        print(\n            \"INFO: features.pkl is saved\",\n            file=sys.stderr,\n        )\n        extractor.save(model_dir)\n\n        return extractor\n"
  },
  {
    "path": "pkuseg/postag/model.py",
    "content": "import os\nimport sys\nimport numpy as np\n\nclass Model:\n    def __init__(self, n_feature, n_tag):\n\n        self.n_tag = n_tag\n        self.n_feature = n_feature\n        self.n_transition_feature = n_tag * (n_feature + n_tag)\n        self.w = np.zeros(self.n_transition_feature)\n\n    def _get_node_tag_feature_id(self, feature_id, tag_id):\n        return feature_id * self.n_tag + tag_id\n\n    def _get_tag_tag_feature_id(self, pre_tag_id, tag_id):\n        return self.n_feature * self.n_tag + tag_id * self.n_tag + pre_tag_id\n\n    @classmethod\n    def load(cls, model_dir):\n        model_path = os.path.join(model_dir, \"weights.npz\")\n        if os.path.exists(model_path):\n            npz = np.load(model_path)\n            sizes = npz[\"sizes\"]\n            w = npz[\"w\"]\n            model = cls.__new__(cls)\n            model.n_tag = int(sizes[0])\n            model.n_feature = int(sizes[1])\n            model.n_transition_feature = model.n_tag * (\n                model.n_feature + model.n_tag\n            )\n            model.w = w\n            assert model.w.shape[0] == model.n_transition_feature\n            return model\n\n        print(\n            \"WARNING: weights.npz does not exist, try loading using old format\",\n            file=sys.stderr,\n        )\n\n        model_path = os.path.join(model_dir, \"model.txt\")\n        with open(model_path, encoding=\"utf-8\") as f:\n            ary = f.readlines()\n\n        model = cls.__new__(cls)\n        model.n_tag = int(ary[0].strip())\n        wsize = int(ary[1].strip())\n        w = np.zeros(wsize)\n        for i in range(2, wsize):\n            w[i - 2] = float(ary[i].strip())\n        model.w = w\n        model.n_feature = wsize // model.n_tag - model.n_tag\n        model.n_transition_feature = wsize\n\n        model.save(model_dir)\n        return model\n\n    @classmethod\n    def new(cls, model, copy_weight=True):\n\n        new_model = cls.__new__(cls)\n        
new_model.n_tag = model.n_tag\n        if copy_weight:\n            new_model.w = model.w.copy()\n        else:\n            new_model.w = np.zeros_like(model.w)\n        new_model.n_feature = (\n            new_model.w.shape[0] // new_model.n_tag - new_model.n_tag\n        )\n        new_model.n_transition_feature = new_model.w.shape[0]\n        return new_model\n\n    def save(self, model_dir):\n        sizes = np.array([self.n_tag, self.n_feature])\n        np.savez(\n            os.path.join(model_dir, \"weights.npz\"), sizes=sizes, w=self.w\n        )\n        # np.save\n        # with open(file, \"w\", encoding=\"utf-8\") as f:\n        #     f.write(\"{}\\n{}\\n\".format(self.n_tag, self.w.shape[0]))\n        #     for value in self.w:\n        #         f.write(\"{:.4f}\\n\".format(value))\n"
  },
  {
    "path": "pkuseg/res_summarize.py",
    "content": "import numpy as np\nfrom .config import Config\nimport os\n\n\ndef tomatrix(s):\n    lines = s.split(Config.lineEnd)\n    lst = []\n    for line in lines:\n        if line == \"\":\n            continue\n        if not line.startswith(\"%\"):\n            tmp = []\n            for i in line.split(Config.comma):\n                tmp.append(float(i))\n            lst.append(tmp)\n    return np.array(lst)\n\n\ndef summarize(config):\n    with open(\n        os.path.join(config.outDir, config.fResRaw), encoding=\"utf-8\"\n    ) as sr:\n        txt = sr.read()\n    txt = txt.replace(\"\\r\", \"\")\n    regions = txt.split(config.triLineEnd)\n\n    with open(\n        os.path.join(config.outDir, config.fResSum), \"w\", encoding=\"utf-8\"\n    ) as sw:\n        for region in regions:\n            if region == \"\":\n                continue\n\n            blocks = region.split(config.biLineEnd)\n            mList = []\n            for im in blocks:\n                mList.append(tomatrix(im))\n\n            avgM = np.zeros_like(mList[0])\n            for m in mList:\n                avgM = avgM + m\n            avgM = avgM / len(mList)\n\n            sqravgM = np.zeros_like(mList[0])\n            for m in mList:\n                sqravgM += m * m\n            sqravgM = sqravgM / len(mList)\n\n            deviM = (sqravgM - avgM * avgM) ** 0.5\n\n            sw.write(\"%averaged values:\\n\")\n            for i in range(avgM.shape[0]):\n                for j in range(avgM.shape[1]):\n                    sw.write(\"{:.2f},\".format(avgM[i, j]))\n                sw.write(\"\\n\")\n\n            sw.write(\"\\n%deviations:\\n\")\n            for i in range(deviM.shape[0]):\n                for j in range(deviM.shape[1]):\n                    sw.write(\"{:.2f},\".format(deviM[i, j]))\n                    # sw.write((\"%.2f\" % deviM[i, j]) + \",\")\n                sw.write(\"\\n\")\n\n            sw.write(\"\\n%avg & devi:\\n\")\n            for i in 
range(avgM.shape[0]):\n                for j in range(avgM.shape[1]):\n                    sw.write(\"{:.2f}+-{:.2f},\".format(avgM[i, j], deviM[i, j]))\n                sw.write(\"\\n\")\n\n            sw.write(\"%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\\n\\n\\n\")\n\n\ndef write(config, timeList, errList, diffList, scoreListList):\n    def log(message):\n        if config.rawResWrite:\n            config.swResRaw.write(message)\n\n    log(\"% training results:\" + config.metric + \"\\n\")\n    for i in range(config.ttlIter):\n        it = i\n        log(\"% iter#={}  \".format(it))\n        lst = scoreListList[i]\n        if config.evalMetric == \"f1\":\n            log(\n                \"% f-score={:.2f}%  precision={:.2f}%  recall={:.2f}%  \".format(\n                    lst[0], lst[1], lst[2]\n                )\n            )\n        else:\n            log(\"% {}={:.2f}%  \".format(config.metric, lst[0]))\n        time = 0\n        for k in range(i + 1):\n            time += timeList[k]\n        log(\n            \"cumulative-time(sec)={:.2f}  objective={:.2f}  diff={:.2f}\\n\".format(\n                time, errList[i], diffList[i]\n            )\n        )\n\n    # #ttlScore = 0\n    # for i in range(config.ttlIter):\n    #     it = i + 1\n    #     log(\"% iter#={}  \".format(it))\n    #     lst = scoreListList[i]\n    #    # ttlScore += lst[0]\n    #     if config.evalMetric == \"f1\":\n    #         log(\n    #             \"% f-score={:.2f}%  precision={:.2f}%  recall={:.2f}%  \".format(\n    #                 lst[0], lst[1], lst[2]\n    #             )\n    #         )\n    #     else:\n    #         log(\"% {}={:.2f}%  \".format(config.metric, lst[0]))\n    #     time = 0\n    #     for k in range(i + 1):\n    #         time += timeList[k]\n    #     log(\n    #         \"cumulative-time(sec)={:.2f}  objective={:.2f}  diff={:.2f}\\n\".format(\n    #             time, errList[i], diffList[i]\n    #         )\n    #     )\n"
  },
  {
    "path": "pkuseg/scorer.py",
    "content": "from pkuseg.config import Config\n\n\ndef getFscore(goldTagList, resTagList, idx_to_chunk_tag):\n    scoreList = []\n    assert len(resTagList) == len(goldTagList)\n    getNewTagList(idx_to_chunk_tag, goldTagList)\n    getNewTagList(idx_to_chunk_tag, resTagList)\n    goldChunkList = getChunks(goldTagList)\n    resChunkList = getChunks(resTagList)\n    gold_chunk = 0\n    res_chunk = 0\n    correct_chunk = 0\n    for i in range(len(goldChunkList)):\n        res = resChunkList[i]\n        gold = goldChunkList[i]\n        resChunkAry = res.split(Config.comma)\n        tmp = []\n        for t in resChunkAry:\n            if len(t) > 0:\n                tmp.append(t)\n        resChunkAry = tmp\n        goldChunkAry = gold.split(Config.comma)\n        tmp = []\n        for t in goldChunkAry:\n            if len(t) > 0:\n                tmp.append(t)\n        goldChunkAry = tmp\n        gold_chunk += len(goldChunkAry)\n        res_chunk += len(resChunkAry)\n        goldChunkSet = set()\n        for im in goldChunkAry:\n            goldChunkSet.add(im)\n        for im in resChunkAry:\n            if im in goldChunkSet:\n                correct_chunk += 1\n    pre = correct_chunk / res_chunk * 100\n    rec = correct_chunk / gold_chunk * 100\n    f1 = 0 if correct_chunk == 0 else 2 * pre * rec / (pre + rec)\n    scoreList.append(f1)\n    scoreList.append(pre)\n    scoreList.append(rec)\n    infoList = []\n    infoList.append(gold_chunk)\n    infoList.append(res_chunk)\n    infoList.append(correct_chunk)\n    return scoreList, infoList\n\n\ndef getNewTagList(tagMap, tagList):\n    tmpList = []\n    for im in tagList:\n        tagAry = im.split(Config.comma)\n        for i in range(len(tagAry)):\n            if tagAry[i] == \"\":\n                continue\n            index = int(tagAry[i])\n            if not index in tagMap:\n                raise Exception(\"Error\")\n            tagAry[i] = tagMap[index]\n        newTags = \",\".join(tagAry)\n        
tmpList.append(newTags)\n    tagList.clear()\n    for im in tmpList:\n        tagList.append(im)\n\n\ndef getChunks(tagList):\n    tmpList = []\n    for im in tagList:\n        tagAry = im.split(Config.comma)\n        tmp = []\n        for t in tagAry:\n            if t != \"\":\n                tmp.append(t)\n        tagAry = tmp\n        chunks = \"\"\n        for i in range(len(tagAry)):\n            if tagAry[i].startswith(\"B\"):\n                pos = i\n                length = 1\n                ty = tagAry[i]\n                for j in range(i + 1, len(tagAry)):\n                    if tagAry[j] == \"I\":\n                        length += 1\n                    else:\n                        break\n                chunk = ty + \"*\" + str(length) + \"*\" + str(pos)\n                chunks = chunks + chunk + \",\"\n        tmpList.append(chunks)\n    return tmpList\n"
  },
  {
    "path": "pkuseg/trainer.py",
    "content": "# from .config import config\n# from .feature import *\n# from .data_format import *\n# from .toolbox import *\nimport os\nimport time\nfrom multiprocessing import Process, Queue\n\nfrom pkuseg import res_summarize\n\n# from .inference import *\n# from .config import Config\nfrom pkuseg.config import Config, config\nfrom pkuseg.data import DataSet\nfrom pkuseg.feature_extractor import FeatureExtractor\n\n# from .feature_generator import *\nfrom pkuseg.model import Model\nimport pkuseg.inference as _inf\n\n# from .inference import *\n# from .gradient import *\nfrom pkuseg.optimizer import ADF\nfrom pkuseg.scorer import getFscore\n\n# from typing import TextIO\n\n# from .res_summarize import summarize\n# from .res_summarize import write as reswrite\n\n# from pkuseg.trainer import Trainer\n\n\ndef train(config=None):\n    if config is None:\n        config = Config()\n\n    if config.init_model is None:\n        feature_extractor = FeatureExtractor()\n    else:\n        feature_extractor = FeatureExtractor.load(config.init_model)\n    feature_extractor.build(config.trainFile)\n    feature_extractor.save()\n\n    feature_extractor.convert_text_file_to_feature_file(\n        config.trainFile, config.c_train, config.f_train\n    )\n    feature_extractor.convert_text_file_to_feature_file(\n        config.testFile, config.c_test, config.f_test\n    )\n\n    feature_extractor.convert_feature_file_to_idx_file(\n        config.f_train, config.fFeatureTrain, config.fGoldTrain\n    )\n    feature_extractor.convert_feature_file_to_idx_file(\n        config.f_test, config.fFeatureTest, config.fGoldTest\n    )\n\n    config.globalCheck()\n\n    config.swLog = open(os.path.join(config.outDir, config.fLog), \"w\")\n    config.swResRaw = open(os.path.join(config.outDir, config.fResRaw), \"w\")\n    config.swTune = open(os.path.join(config.outDir, config.fTune), \"w\")\n\n    print(\"\\nstart training...\")\n    config.swLog.write(\"\\nstart training...\\n\")\n\n    
print(\"\\nreading training & test data...\")\n    config.swLog.write(\"\\nreading training & test data...\\n\")\n\n    trainset = DataSet.load(config.fFeatureTrain, config.fGoldTrain)\n    testset = DataSet.load(config.fFeatureTest, config.fGoldTest)\n\n    trainset = trainset.resize(config.trainSizeScale)\n\n    print(\n        \"done! train/test data sizes: {}/{}\".format(len(trainset), len(testset))\n    )\n    config.swLog.write(\n        \"done! train/test data sizes: {}/{}\\n\".format(\n            len(trainset), len(testset)\n        )\n    )\n\n    config.swLog.write(\"\\nr: {}\\n\".format(config.reg))\n    print(\"\\nr: {}\".format(config.reg))\n    if config.rawResWrite:\n        config.swResRaw.write(\"\\n%r: {}\\n\".format(config.reg))\n\n    trainer = Trainer(config, trainset, feature_extractor)\n\n    time_list = []\n    err_list = []\n    diff_list = []\n    score_list_list = []\n\n    for i in range(config.ttlIter):\n        # config.glbIter += 1\n        time_s = time.time()\n        err, sample_size, diff = trainer.train_epoch()\n        time_t = time.time() - time_s\n        time_list.append(time_t)\n        err_list.append(err)\n        diff_list.append(diff)\n\n        score_list = trainer.test(testset, i)\n        score_list_list.append(score_list)\n        score = score_list[0]\n\n        logstr = \"iter{}  diff={:.2e}  train-time(sec)={:.2f}  {}={:.2f}%\".format(\n            i, diff, time_t, config.metric, score\n        )\n        config.swLog.write(logstr + \"\\n\")\n        config.swLog.write(\"------------------------------------------------\\n\")\n        config.swLog.flush()\n        print(logstr)\n\n    res_summarize.write(config, time_list, err_list, diff_list, score_list_list)\n    if config.save == 1:\n        trainer.model.save()\n\n    config.swLog.close()\n    config.swResRaw.close()\n    config.swTune.close()\n\n    res_summarize.summarize(config)\n\n    print(\"finished.\")\n\n\nclass Trainer:\n    def __init__(self, config, 
dataset, feature_extractor):\n        self.config = config\n        self.X = dataset\n        self.n_feature = dataset.n_feature\n        self.n_tag = dataset.n_tag\n\n        if config.init_model is None:\n            self.model = Model(self.n_feature, self.n_tag)\n        else:\n            self.model = Model.load(config.init_model)\n            self.model.expand(self.n_feature, self.n_tag)\n\n        self.optim = self._get_optimizer(dataset, self.model)\n\n        self.feature_extractor = feature_extractor\n        self.idx_to_chunk_tag = {}\n        for tag, idx in feature_extractor.tag_to_idx.items():\n            if tag.startswith(\"I\"):\n                tag = \"I\"\n            if tag.startswith(\"O\"):\n                tag = \"O\"\n            self.idx_to_chunk_tag[idx] = tag\n\n    def _get_optimizer(self, dataset, model):\n        config = self.config\n        if \"adf\" in config.modelOptimizer:\n            return ADF(config, dataset, model)\n\n        raise ValueError(\"Invalid Optimizer\")\n\n    def train_epoch(self):\n        return self.optim.optimize()\n\n    def test(self, testset, iteration):\n\n        outfile = os.path.join(config.outDir, config.fOutput.format(iteration))\n\n        func_mapping = {\n            \"tok.acc\": self._decode_tokAcc,\n            \"str.acc\": self._decode_strAcc,\n            \"f1\": self._decode_fscore,\n        }\n\n        with open(outfile, \"w\", encoding=\"utf8\") as writer:\n            score_list = func_mapping[config.evalMetric](\n                testset, self.model, writer\n            )\n\n        for example in testset:\n            example.predicted_tags = None\n\n        return score_list\n\n    def _decode(self, testset: DataSet, model: Model):\n        if config.nThread == 1:\n            self._decode_single(testset, model)\n        else:\n            self._decode_multi_proc(testset, model)\n\n    def _decode_single(self, testset: DataSet, model: Model):\n        # n_tag = model.n_tag\n        for 
example in testset:\n            _, tags = _inf.decodeViterbi_fast(example.features, model)\n            example.predicted_tags = tags\n\n    @staticmethod\n    def _decode_proc(model, in_queue, out_queue):\n        while True:\n            item = in_queue.get()\n            if item is None:\n                return\n            idx, features = item\n            _, tags = _inf.decodeViterbi_fast(features, model)\n            out_queue.put((idx, tags))\n\n    def _decode_multi_proc(self, testset: DataSet, model: Model):\n        in_queue = Queue()\n        out_queue = Queue()\n        procs = []\n        nthread = self.config.nThread\n        for i in range(nthread):\n            p = Process(\n                target=self._decode_proc, args=(model, in_queue, out_queue)\n            )\n            procs.append(p)\n\n        for idx, example in enumerate(testset):\n            in_queue.put((idx, example.features))\n\n        for proc in procs:\n            in_queue.put(None)\n            proc.start()\n\n        for _ in range(len(testset)):\n            idx, tags = out_queue.get()\n            testset[idx].predicted_tags = tags\n\n        for p in procs:\n            p.join()\n\n    # token accuracy\n    def _decode_tokAcc(self, dataset, model, writer):\n        config = self.config\n\n        self._decode(dataset, model)\n        n_tag = model.n_tag\n        all_correct = [0] * n_tag\n        all_pred = [0] * n_tag\n        all_gold = [0] * n_tag\n\n        for example in dataset:\n            pred = example.predicted_tags\n            gold = example.tags\n\n            if writer is not None:\n                writer.write(\",\".join(map(str, pred)))\n                writer.write(\"\\n\")\n\n            for pred_tag, gold_tag in zip(pred, gold):\n                all_pred[pred_tag] += 1\n                all_gold[gold_tag] += 1\n                if pred_tag == gold_tag:\n                    all_correct[gold_tag] += 1\n\n        config.swLog.write(\n            \"% tag-type 
 #gold  #output  #correct-output  token-precision  token-recall  token-f-score\\n\"\n        )\n        sumGold = 0\n        sumOutput = 0\n        sumCorrOutput = 0\n\n        for i, (correct, gold, pred) in enumerate(\n            zip(all_correct, all_gold, all_pred)\n        ):\n            sumGold += gold\n            sumOutput += pred\n            sumCorrOutput += correct\n\n            if gold == 0:\n                rec = 0\n            else:\n                rec = correct * 100.0 / gold\n\n            if pred == 0:\n                prec = 0\n            else:\n                prec = correct * 100.0 / pred\n\n            config.swLog.write(\n                \"% {}:  {}  {}  {}  {:.2f}  {:.2f}  {:.2f}\\n\".format(\n                    i,\n                    gold,\n                    pred,\n                    correct,\n                    prec,\n                    rec,\n                    # per-tag f-score; 0 when precision and recall are both 0\n                    (2 * prec * rec / (prec + rec)) if prec + rec > 0 else 0.0,\n                )\n            )\n\n        if sumGold == 0:\n            rec = 0\n        else:\n            rec = sumCorrOutput * 100.0 / sumGold\n        if sumOutput == 0:\n            prec = 0\n        else:\n            prec = sumCorrOutput * 100.0 / sumOutput\n\n        if prec == 0 and rec == 0:\n            fscore = 0\n        else:\n            fscore = 2 * prec * rec / (prec + rec)\n\n        config.swLog.write(\n            \"% overall-tags:  {}  {}  {}  {:.2f}  {:.2f}  {:.2f}\\n\".format(\n                sumGold, sumOutput, sumCorrOutput, prec, rec, fscore\n            )\n        )\n        config.swLog.flush()\n        return [fscore]\n\n    def _decode_strAcc(self, dataset, model, writer):\n\n        config = self.config\n\n        self._decode(dataset, model)\n\n        correct = 0\n        total = len(dataset)\n\n        for example in dataset:\n            pred = example.predicted_tags\n            gold = example.tags\n\n            if writer is not None:\n                writer.write(\",\".join(map(str, pred)))\n       
         writer.write(\"\\n\")\n\n            for pred_tag, gold_tag in zip(pred, gold):\n                if pred_tag != gold_tag:\n                    break\n            else:\n                correct += 1\n\n        acc = correct / total * 100.0\n        config.swLog.write(\n            \"total-tag-strings={}  correct-tag-strings={}  string-accuracy={}%\".format(\n                total, correct, acc\n            )\n        )\n        return [acc]\n\n    def _decode_fscore(self, dataset, model, writer):\n        config = self.config\n\n        self._decode(dataset, model)\n\n        gold_tags = []\n        pred_tags = []\n\n        for example in dataset:\n            pred = example.predicted_tags\n            gold = example.tags\n\n            pred_str = \",\".join(map(str, pred))\n            pred_tags.append(pred_str)\n            if writer is not None:\n                writer.write(pred_str)\n                writer.write(\"\\n\")\n            gold_tags.append(\",\".join(map(str, gold)))\n\n        scoreList, infoList = getFscore(\n            gold_tags, pred_tags, self.idx_to_chunk_tag\n        )\n        config.swLog.write(\n            \"#gold-chunk={}  #output-chunk={}  #correct-output-chunk={}  precision={:.2f}  recall={:.2f}  f-score={:.2f}\\n\".format(\n                infoList[0],\n                infoList[1],\n                infoList[2],\n                scoreList[1],\n                scoreList[2],\n                scoreList[0],\n            )\n        )\n        return scoreList\n\n    #     acc = correct / total * 100.0\n    #     config.swLog.write(\n    #         \"total-tag-strings={}  correct-tag-strings={}  string-accuracy={}%\".format(\n    #             total, correct, acc\n    #         )\n    #     )\n\n    #     goldTagList = []\n    #     resTagList = []\n    #     for x in X2:\n    #         res = \"\"\n    #         for im in x._yOutput:\n    #             res += str(im) + \",\"\n    #         resTagList.append(res)\n    #         # if 
not dynamic:\n    #         if writer is not None:\n    #             for i in range(len(x._yOutput)):\n    #                 writer.write(str(x._yOutput[i]) + \",\")\n    #             writer.write(\"\\n\")\n    #         goldTags = x._x.getTags()\n    #         gold = \"\"\n    #         for im in goldTags:\n    #             gold += str(im) + \",\"\n    #         goldTagList.append(gold)\n    #     # if dynamic:\n    #     #     return resTagList\n    #     scoreList = []\n\n    #     if config.runMode == \"train\":\n    #         infoList = []\n    #         scoreList = getFscore(\n    #             goldTagList, resTagList, infoList, self.idx_to_chunk_tag\n    #         )\n    #         config.swLog.write(\n    #             \"#gold-chunk={}  #output-chunk={}  #correct-output-chunk={}  precision={:.2f}  recall={:.2f}  f-score={:.2f}\\n\".format(\n    #                 infoList[0],\n    #                 infoList[1],\n    #                 infoList[2],\n    #                 \"%.2f\" % scoreList[1],\n    #                 \"%.2f\" % scoreList[2],\n    #                 \"%.2f\" % scoreList[0],\n    #             )\n    #         )\n    #     return scoreList\n\n    # # def multiThreading(self, X, X2):\n    #     config = self.config\n    #     # if dynamic:\n    #     #     for i in range(len(X)):\n    #     #         X2.append(dataSeqTest(X[i], []))\n    #     #     for k, x in enumerate(X2):\n    #     #         tags = []\n    #     #         prob = self.Inf.decodeViterbi_fast(self.Model, x._x, tags)\n    #     #         X2[k]._yOutput.clear()\n    #     #         X2[k]._yOutput.extend(tags)\n    #     #     return\n\n    #     for i in range(len(X)):\n    #         X2.append(dataSeqTest(X[i], []))\n    #     if len(X) < config.nThread:\n    #         config.nThread = len(X)\n    #     interval = (len(X2) + config.nThread - 1) // config.nThread\n    #     procs = []\n    #     Q = Queue(5000)\n    #     for i in range(config.nThread):\n    #         start = i 
* interval\n    #         end = min(start + interval, len(X2))\n    #         proc = Process(\n    #             target=Trainer.taskRunner_test,\n    #             args=(self.Inf, self.Model, X2, start, end, Q),\n    #         )\n    #         proc.start()\n    #         procs.append(proc)\n    #     for i in range(len(X2)):\n    #         t = Q.get()\n    #         k, tags = t\n    #         X2[k]._yOutput.clear()\n    #         X2[k]._yOutput.extend(tags)\n    #     for proc in procs:\n    #         proc.join()\n\n    # @staticmethod\n    # def taskRunner_test(Inf, Model, X2, start, end, Q):\n    #     for k in range(start, end):\n    #         x = X2[k]\n    #         tags = []\n    #         prob = Inf.decodeViterbi_fast(Model, x._x, tags)\n    #         Q.put((k, tags))\n"
  },
  {
    "path": "readme/comparison.md",
    "content": "\n\n# ϸѵԽ\n\nڲͬݼϵĶԱȽ\n\n| MSRA   | Precision | Recall |   F-score |\n| :----- | --------: | -----: | --------: |\n| jieba  |     87.01 |  89.88 |     88.42 |\n| THULAC |     95.60 |  95.91 |     95.71 |\n| pkuseg |     96.94 |  96.81 | **96.88** |\n\n\n| CTB8   | Precision | Recall |   F-score |\n| :----- | --------: | -----: | --------: |\n| jieba  |     88.63 |  85.71 |     87.14 |\n| THULAC |     93.90 |  95.30 |     94.56 |\n| pkuseg |     95.99 |  95.39 | **95.69** |\n\n| WEIBO  | Precision | Recall |   F-score |\n| :----- | --------: | -----: | --------: |\n| jieba  |     87.79 |  87.54 |     87.66 |\n| THULAC |     93.40 |  92.40 |     92.87 |\n| pkuseg |     93.78 |  94.65 | **94.21** |\n\n\n\n#### Խ\n\nѡ˻CTB8ϵѵѵͬʱвԣģģڡںݡϵķִЧѡCTB8ϵԭǣCTB8ڻϣµЧãڲǷCTB8ѵģͣй߰ԶԻøߵƽЧǿԵĽ\n\n| CTB8 Training | MSRA  | CTB8  | PKU   | WEIBO | All Average | OOD Average |\n| ------------- | ----- | ----- | ----- | ----- | ----------- | ----------- |\n| jieba         | 82.75 | 87.14 | 87.12 | 85.68 | 85.67       | 85.18       |\n| THULAC        | 83.50 | 94.56 | 89.13 | 91.00 | 89.55       | 87.88       |\n| pkuseg        | 83.67 | 95.69 | 89.67 | 91.19 | 90.06       | **88.18**   |\n\nУ`All Average`ʾвԼ(CTB8Լ)F-scoreƽ`OOD Average` (Out-of-domain Average)ʾڳCTB8Լƽ\n\n\n\n#### ĬģڲͬĲЧ\n\nǵܶûڳԷִʹߵʱ򣬴ʱʹù߰ԴģͲԡΪֱӶԱȡʼܣҲȽ˸߰ĬģڲͬĲЧע⣬ıȽֻΪ˵ĬµЧһǹƽġ\n\n| Default | MSRA  | CTB8  | PKU   | WEIBO | All Average |\n| ------- | :---: | :---: | :---: | :---: | :---------: |\n| jieba  | 81.45 | 79.58 | 81.83 | 83.56 | 81.61       |\n| THULAC |\t85.55 | 87.84 | 92.29 | 86.65 | 88.08 |\n| pkuseg | 87.29 | 91.77 | 92.68 | 93.43 | **91.29**   |\n\nУ`All Average`ʾвԼF-scoreƽ\n"
  },
  {
    "path": "readme/environment.md",
    "content": "# 实验环境\n\n考虑到jieba分词和THULAC工具包等并没有提供细领域的预训练模型，为了便于比较，我们重新使用它们提供的训练接口在细领域的数据集上进行训练，用训练得到的模型进行中文分词。\n\n我们选择Linux作为测试环境，在新闻数据(MSRA)、混合型文本(CTB8)、网络文本(WEIBO)数据上对不同工具包进行了准确率测试。我们使用了第二届国际汉语分词评测比赛提供的分词评价脚本。其中MSRA与WEIBO使用标准训练集测试集划分，CTB8采用随机划分。对于不同的分词工具包，训练测试数据的划分都是一致的；**即所有的分词工具包都在相同的训练集上训练，在相同的测试集上测试**。对于所有数据集，pkuseg使用了不使用词典的训练和测试接口。以下是pkuseg训练和测试代码示例:\n\n```\npkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models')\npkuseg.test('msr_test.raw', 'output.txt', user_dict=None)\n```\n\n\n"
  },
  {
    "path": "readme/history.md",
    "content": "# 汾ʷ\n\n\n- v0.0.11(2019-01-09)\n  - ޶ĬãCTB8ΪĬģͣʹôʵ\n- v0.0.14(2019-01-23)\n  - ޸˴ʵ䴦˴ʵ䣬ִЧ\n  - **ЧʽŻٶȽ֮ǰ汾9**\n  - ڴģݼѵͨģͣΪĬʹģ\n- v0.0.15(2019-01-30)\n  - ֧fine-tuneѵԤصģͼѵ֧趨ѵ\n- v0.0.18(2019-02-20)\n  - **ִ֧Աעҽơϸģ**"
  },
  {
    "path": "readme/interface.md",
    "content": "# 代码示例\n\n以下代码示例适用于python交互式环境。\n\n代码示例1：使用默认配置进行分词（**如果用户无法确定分词领域，推荐使用默认模型分词**）\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg()           # 以默认配置加载模型\ntext = seg.cut('我爱北京天安门')  # 进行分词\nprint(text)\n```\n\n代码示例2：细领域分词（**如果用户明确分词领域，推荐使用细领域模型分词**）\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg(model_name='medicine')  # 程序会自动下载所对应的细领域模型\ntext = seg.cut('我爱北京天安门')              # 进行分词\nprint(text)\n```\n\n代码示例3：分词同时进行词性标注，各词性标签的详细含义可参考 [tags.txt](https://github.com/lancopku/pkuseg-python/blob/master/tags.txt)\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg(postag=True)  # 开启词性标注功能\ntext = seg.cut('我爱北京天安门')    # 进行分词和词性标注\nprint(text)\n```\n\n\n代码示例4：对文件分词\n```python3\nimport pkuseg\n\n# 对input.txt的文件分词输出到output.txt中\n# 开20个进程\npkuseg.test('input.txt', 'output.txt', nthread=20)     \n```\n\n\n代码示例5：额外使用用户自定义词典\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg(user_dict='my_dict.txt')  # 给定用户词典为当前目录下的\"my_dict.txt\"\ntext = seg.cut('我爱北京天安门')                # 进行分词\nprint(text)\n```\n\n\n代码示例6：使用自训练模型分词（以CTB8模型为例）\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg(model_name='./ctb8')  # 假设用户已经下载好了ctb8的模型并放在了'./ctb8'目录下，通过设置model_name加载该模型\ntext = seg.cut('我爱北京天安门')            # 进行分词\nprint(text)\n```\n\n\n\n代码示例7：训练新模型 （模型随机初始化）\n```python3\nimport pkuseg\n\n# 训练文件为'msr_training.utf8'\n# 测试文件为'msr_test_gold.utf8'\n# 训练好的模型存到'./models'目录下\n# 训练模式下会保存最后一轮模型作为最终模型\n# 目前仅支持utf-8编码，训练集和测试集要求所有单词以单个或多个空格分开\npkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models')\t\n```\n\n\n代码示例8：fine-tune训练（从预加载的模型继续训练）\n```python3\nimport pkuseg\n\n# 训练文件为'train.txt'\n# 测试文件为'test.txt'\n# 加载'./pretrained'目录下的模型，训练好的模型保存在'./models'，训练10轮\npkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')\n"
  },
  {
    "path": "readme/multiprocess.md",
    "content": "\n# 多进程分词\n\n当将以上代码示例置于文件中运行时，如涉及多进程功能，请务必使用`if __name__ == '__main__'`保护全局语句，如：  \nmp.py文件\n```python3\nimport pkuseg\n\nif __name__ == '__main__':\n    pkuseg.test('input.txt', 'output.txt', nthread=20)\n    pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)\t\n```\n运行\n```\npython3 mp.py\n```\n详见[无法使用多进程分词和训练功能，提示RuntimeError和BrokenPipeError](https://github.com/lancopku/pkuseg-python/wiki#3-无法使用多进程分词和训练功能提示runtimeerror和brokenpipeerror)。\n\n**在Windows平台上，请当文件足够大时再使用多进程分词功能**，详见[关于多进程速度问题](https://github.com/lancopku/pkuseg-python/wiki#9-关于多进程速度问题)。\n"
  },
  {
    "path": "readme/readme_english.md",
    "content": "\n# Pkuseg \n\nA multi-domain Chinese word segmentation toolkit.\n\n## Highlights\n\nThe pkuseg-python toolkit has the following features:\n\n1.\tSupporting multi-domain Chinese word segmentation. Pkuseg-python supports multi-domain segmentation, including domains like news, web, medicine, and tourism. Users are free to choose different pre-trained models according to the domain features of the text to be segmented. If not sure the domain of the text, users are recommended to use the default model trained on mixed-domain data.\n\n2.\tHigher word segmentation results. Compared with existing word segmentation toolkits, pkuseg-python can achieve higher F1 scores on the same dataset.\n\n3.\tSupporting model training. Pkuseg-python  also supports users to train a new segmentation model with their own data.\n\n4.\tSupporting POS tagging. We also provide users POS tagging interfaces for further lexical analysis. \n\n\n\n## Installation\n\n- Requirements: python3\n\n1. Install pkuseg-python by using PyPI: (with the default model trained on mixed-doimain data)\n\t```\n\tpip3 install pkuseg\n\t```\n   or update to the latest version (**suggested**):\n   \t```\n\tpip3 install -U pkuseg\n\t```\n2. Install pkuseg-python by using image source for fast speed:\n\t```\n\tpip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg\n\t```\n   or update to the latest version (**suggested**):\n\t```\n\tpip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg\n\t```\n   Note: The previous two installing commands only support python3.5, python3.6, python3.7 on linux, mac, and **windows 64 bit**.\n3. 
If the code is downloaded from GitHub, run the following command to install pkuseg-python:\n\t```\n\tpython setup.py build_ext -i\n\t```\n\t\n   Note: The GitHub code does not contain the pre-trained models. Download them from [release](https://github.com/lancopku/pkuseg-python/releases) and set the parameter 'model_name' to the model path.\n   \n   \n\t\n\n## Usage\n\n#### Examples\n\n\nExample 1: Segmentation with the default configuration. **If users are not sure of the domain of the text to be segmented, the default configuration is recommended.**\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg()  # load the default model\ntext = seg.cut('我爱北京天安门')\nprint(text)\n```\n\nExample 2: Domain-specific segmentation. **If users know the domain of the text, they can select the matching pre-trained domain model.**\n\n```python3\nimport pkuseg\n\n# The domain-specific model is downloaded automatically.\nseg = pkuseg.pkuseg(model_name='medicine')\ntext = seg.cut('我爱北京天安门')\nprint(text)\n```\n\nExample 3: Segmentation and POS tagging. For the detailed meaning of each POS tag, please refer to [tags.txt](https://github.com/lancopku/pkuseg-python/blob/master/tags.txt).\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg(postag=True)  # enable POS tagging\ntext = seg.cut('我爱北京天安门')\nprint(text)\n```\n\n\nExample 4: Segmentation with a text file as input.\n```python3\nimport pkuseg\n\n# Take file 'input.txt' as input.\n# The segmented result is stored in file 'output.txt'.\npkuseg.test('input.txt', 'output.txt', nthread=20)\n```\n\n\nExample 5: Segmentation with a user-defined dictionary.\n```python3\nimport pkuseg\n\nseg = pkuseg.pkuseg(user_dict='my_dict.txt')\ntext = seg.cut('我爱北京天安门')\nprint(text)\n```\n\n\nExample 6: Segmentation with a user-trained model. 
Take CTB8 as an example.\n```python3\nimport pkuseg\n\n# Load the model stored in './ctb8' by passing its path as model_name.\nseg = pkuseg.pkuseg(model_name='./ctb8')\ntext = seg.cut('我爱北京天安门')\nprint(text)\n```\n\n\n\nExample 7: Training a new model (randomly initialized).\n\n```python3\nimport pkuseg\n\n# Training file: 'msr_training.utf8'.\n# Test file: 'msr_test_gold.utf8'.\n# Save the trained model to './models'.\n# The training and test files are in utf-8 encoding.\npkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models')\t\n```\n\nExample 8: Fine-tuning. Take a pre-trained model as input.\n```python3\nimport pkuseg\n\n# Training file: 'train.txt'.\n# Test file: 'test.txt'.\n# The path of the pre-trained model: './pretrained'.\n# Save the trained model to './models'.\n# The training and test files are in utf-8 encoding.\npkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')\n```\n\n\n\n#### Parameter Settings\n\nSegmentation for sentences.\n```\npkuseg.pkuseg(model_name = \"default\", user_dict = \"default\", postag = False)\n\tmodel_name\t\tThe name or path of the model to use.\n\t\t\t\t\"default\". The default mixed-domain model.\n\t\t\t\t\"news\". The model trained on news domain data.\n\t\t\t\t\"web\". The model trained on web domain data.\n\t\t\t\t\"medicine\". The model trained on medicine domain data.\n\t\t\t\t\"tourism\". The model trained on tourism domain data.\n\t\t\t\tmodel_path. Load a model from the user-specified path.\n\tuser_dict\t\tSet up the user dictionary.\n\t\t\t\t\"default\". Use the default dictionary.\n\t\t\t\tNone. No dictionary is used.\n\t\t\t\tdict_path. The path of the user-defined dictionary. Each line contains a single word.\n\tpostag\t\t\tPOS tagging or not.\n\t\t\t\tFalse. The default setting. Segmentation without POS tagging.\n\t\t\t\tTrue. 
Segmentation with POS tagging.\n```\n\nSegmentation for documents.\n\n```\npkuseg.test(readFile, outputFile, model_name = \"default\", user_dict = \"default\", postag = False, nthread = 10)\n\treadFile\t\tThe path of the input file.\n\toutputFile\t\tThe path of the output file.\n\tmodel_name\t\tThe name or path of the model to use. Refer to pkuseg.pkuseg.\n\tuser_dict\t\tThe path of the user dictionary. Refer to pkuseg.pkuseg.\n\tpostag\t\t\tPOS tagging or not. Refer to pkuseg.pkuseg.\n\tnthread\t\t\tThe number of processes.\n```\n\nModel training.\n```\npkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)\n\ttrainFile\t\tThe path of the training file.\n\ttestFile\t\tThe path of the test file.\n\tsavedir\t\t\tThe directory where the trained model is saved.\n\ttrain_iter\t\tThe maximum number of training epochs.\n\tinit_model\t\tBy default, None means random initialization. Users can also load a pre-trained model as initialization, like init_model='./models/'.\n```\n\n\n## Publication\n\nThe toolkit is mainly based on the following publication. If you use the toolkit, please cite the paper:\n* Ruixuan Luo, Jingjing Xu, Yi Zhang, Zhiyuan Zhang, Xuancheng Ren, Xu Sun. [PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation](https://arxiv.org/abs/1906.11455). arXiv. 2019.\n\n```\n\n@article{pkuseg,\n  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Zhang, Zhiyuan and Ren, Xuancheng and Sun, Xu},\n  journal = {CoRR},\n  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},\n  url = {https://arxiv.org/abs/1906.11455},\n  volume = {abs/1906.11455},\n  year = 2019\n}\n```\n\n## Related Work\n\n* Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.\n* Jingjing Xu and Xu Sun. Dependency-based gated recursive neural network for Chinese word segmentation. ACL. 2016.\n* Jingjing Xu and Xu Sun. 
Transfer learning for low-resource Chinese word segmentation with a novel neural network. NLPCC. 2017.\n\n\n## Authors\n\nRuixuan Luo, Jingjing Xu, Xuancheng Ren, Yi Zhang, Zhiyuan Zhang, Bingzhen Wei, Xu Sun  \n\n[Language Computing and Machine Learning Group](http://lanco.pku.edu.cn/), Peking University\n\n\n"
  },
  {
    "path": "setup.py",
    "content": "import setuptools\nimport os\nfrom distutils.extension import Extension\n\nimport numpy as np\n\ndef is_source_release(path):\n    return os.path.exists(os.path.join(path, \"PKG-INFO\"))\n\ndef setup_package():\n    root = os.path.abspath(os.path.dirname(__file__))\n\n    long_description = \"pkuseg-python\"\n\n    extensions = [\n        Extension(\n            \"pkuseg.inference\",\n            [\"pkuseg/inference.pyx\"],\n            include_dirs=[np.get_include()],\n            language=\"c++\"\n        ),\n        Extension(\n            \"pkuseg.feature_extractor\",\n            [\"pkuseg/feature_extractor.pyx\"],\n            include_dirs=[np.get_include()],\n        ),\n        Extension(\n            \"pkuseg.postag.feature_extractor\",\n            [\"pkuseg/postag/feature_extractor.pyx\"],\n            include_dirs=[np.get_include()],\n        ),\n    ]\n    \n    if not is_source_release(root):\n        from Cython.Build import cythonize\n        extensions = cythonize(extensions, annotate=True)\n\n\n    setuptools.setup(\n        name=\"pkuseg\",\n        version=\"0.0.25\",\n        author=\"Lanco\",\n        author_email=\"luoruixuan97@pku.edu.cn\",\n        description=\"A small package for Chinese word segmentation\",\n        long_description=long_description,\n        long_description_content_type=\"text/markdown\",\n        url=\"https://github.com/lancopku/pkuseg-python\",\n        packages=setuptools.find_packages(),\n        package_data={\"\": [\"*.txt*\", \"*.pkl\", \"*.npz\", \"*.pyx\", \"*.pxd\"]},\n        classifiers=[\n            \"Programming Language :: Python :: 3\",\n            \"License :: Other/Proprietary License\",\n            \"Operating System :: OS Independent\",\n        ],\n        install_requires=[\"cython\", \"numpy>=1.16.0\"],\n        setup_requires=[\"cython\", \"numpy>=1.16.0\"],\n        ext_modules=extensions,\n        zip_safe=False,\n    )\n\n\nif __name__ == \"__main__\":\n    
setup_package()\n"
  }
]