Full Code of bojone/bytepiece for AI

Repository: bojone/bytepiece
Branch: main
Commit: 2e72f3ccd8c6
Files: 9
Total size: 46.3 KB

Directory structure:
gitextract_wh3iysyy/

├── LICENSE
├── MANIFEST.in
├── README.md
├── README_en.md
├── bytepiece/
│   ├── __init__.py
│   ├── bytepiece.py
│   └── faster.pyx
├── models/
│   └── README.md
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: MANIFEST.in
================================================
include README_en.md


================================================
FILE: README.md
================================================
[Chinese|[English](https://github.com/bojone/bytepiece/blob/main/README_en.md)]

# BytePiece
BytePiece is a byte-based Unigram tokenizer implemented in pure Python, making it easier to read and extend. Thanks to a new training algorithm, its compression rate is usually higher than that of existing tokenizers, and it supports multiprocess-accelerated training. Moreover, it operates directly on the UTF-8 bytes of the text with almost no preprocessing, making it purer and language-independent.

Blog posts:
- https://kexue.fm/archives/9752
- https://kexue.fm/archives/9768

## Characteristics

An ideal tokenizer and its training algorithm should have the following properties:
- Lossless reconstruction
- High compression rate
- Language independence
- Data-driven
- Training-friendly

The mainstream [SentencePiece](https://github.com/google/sentencepiece) already has most of these properties, but some problems remain. For example, it supports both the BPE and Unigram algorithms; BPE usually achieves a higher compression rate, but its training is extremely slow and memory-hungry. SentencePiece also applies a small amount of language-dependent preprocessing to the text, so it is not purely "language-independent". In addition, it is written in C++, which makes it a black box for most users and hard to study and modify.

BytePiece introduces a new training scheme based on a **Byte-based N-gram Language Model (BNLM)**, which yields a vocabulary with a higher compression rate and supports multiprocess training; on the same corpus it trains noticeably faster than SentencePiece's BPE. The code is pure Python, easy to read and modify. In addition, BytePiece provides a random tokenization algorithm that is more efficient than [Subword Regularization](https://arxiv.org/abs/1804.10959).

## Principle

BytePiece is not merely a byte-based, multiprocess rewrite of the existing Unigram model; it designs a new training scheme for Unigram, which is one of the key reasons for its higher compression rate.

The new training scheme is based on an N-gram language model's new-word discovery algorithm, first proposed seven years ago in the author's blog post ["[Chinese Word Segmentation Series] 5. Unsupervised Segmentation Based on a Language Model"](https://kexue.fm/archives/3956); see that post for details.

For the new random tokenization algorithm, see ["A First Look at Random Tokenization: From Viterbi Decoding to Viterbi Sampling"](https://kexue.fm/archives/9768) and ["Random Tokenization Revisited: From Viterbi Sampling to a Perfect Sampling Algorithm"](https://kexue.fm/archives/9811).

## Installation

BytePiece runs only on Python 3 and uses [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) to accelerate training. Since BytePiece is byte-based while the pyahocorasick package on PyPI is Unicode-based, the PyPI package cannot be used directly; install the byte-based version of pyahocorasick as follows:
```bash
# If already installed, uninstall first
pip uninstall pyahocorasick

# Compile and install directly from git; note the AHOCORASICK_BYTES environment variable
AHOCORASICK_BYTES=1 pip install git+https://github.com/WojciechMula/pyahocorasick.git
```
Then install Cython:
```bash
pip install Cython
```
After that, install BytePiece via pip:
```bash
pip install bytepiece==0.6.3
```

## Usage

All of BytePiece's source code is in a single file, containing two classes, `Trainer` and `Tokenizer`, for training and tokenization respectively.

### Training

To train a tokenizer, simply import the `Trainer` class:
```python
from bytepiece import Trainer
```
Then prepare the training corpus. BytePiece does not require loading the whole corpus into memory at once, but since training makes two passes over the data, a one-shot generator will not work; the corpus must be a re-iterable object, for example:
```python
import json

class corpus:
    def __iter__(self):
        path = 'data_sample.json'
        with open(path) as f:
            for l in f:
                yield json.loads(l)['text']  # yield one unicode string at a time
```
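The distinction matters because training traverses the corpus twice. A minimal sketch of why a one-shot generator fails while a class implementing `__iter__` works (the `ReIterable` class below is a hypothetical illustration, not part of BytePiece):

```python
# A one-shot generator is exhausted after the first pass,
# so a second pass over the data sees nothing:
gen = (line for line in ['a', 'b'])
assert list(gen) == ['a', 'b']
assert list(gen) == []  # already exhausted

# An object whose __iter__ returns a fresh iterator each time
# can be traversed as many times as needed:
class ReIterable:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

data = ReIterable(['a', 'b'])
assert list(data) == ['a', 'b']
assert list(data) == ['a', 'b']  # second pass still works
```

The `corpus` class above follows the same pattern: each call to `__iter__` reopens the file, so both training passes see the full data.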
Then start the actual training:
```python
trainer = Trainer(order=6, max_vocab_size=100000, min_count=32)
trainer.train(corpus(), workers=64, batch_size=1000)
trainer.save('bytepiece.model')
```
Here `order` is the order of the n-gram language model; the default `order=6` is recommended. The other parameters:
- `max_vocab_size`: the maximum vocabulary size. Due to redundancy removal, the final vocabulary may be slightly smaller than `max_vocab_size`.
- `min_count`: the minimum token frequency. It can be increased for large corpora and generally does not noticeably affect the result.
- `workers`: the number of parallel training processes; it can saturate all cores of the machine.
- `batch_size`: the batch size. It does not affect the training result and usually does not need to be changed; increase it if CPU utilization is low.

In addition, version `0.4.1` adds an `isolate_digits` parameter, which defaults to `False`. When set to `True`, every Arabic digit is guaranteed to be split into a single character:
```python
trainer = Trainer(order=6, max_vocab_size=100000, min_count=32, isolate_digits=True)
```
Version `0.6.0` adds an `ensure_unicode` parameter, which guarantees that every multi-byte token can be decoded back to unicode. Since current results show that enabling `ensure_unicode` usually yields a slightly higher compression rate, it defaults to `True`. When set to `False` (equivalent to versions before 0.6.0), multi-byte tokens may need `decode(errors='ignore')` to be decoded back to unicode:
```python
trainer = Trainer(order=6, max_vocab_size=100000, min_count=32, ensure_unicode=True)
```

### Tokenization

After training, usage is as follows:
```python
from bytepiece import Tokenizer

tokenizer = Tokenizer('bytepiece.model')
text = '今天天气不错'

tokens = tokenizer.tokenize(text)  # returns a list of bytes
print(b' '.join(tokens).decode(errors='ignore'))  # visualize the tokenization result

ids = tokenizer.encode(text)  # returns the ids of the tokens
print(tokenizer.decode(ids))  # decode the ids back to unicode text
ids = tokenizer.encode(text, iter=True)  # returns a generator of ids

tokens = tokenizer.tokenize(text, alpha=0.2)  # random tokenization
print(b' '.join(tokens).decode(errors='ignore'))  # visualize the tokenization result
```

## Comparison

Small-corpus comparison:

|  | Training Time↓ | Peak Memory↓ | Compression Rate↑ | Tokenization Speed↑ |
| :----: | :----: | :----: | :----: | :----: |
| SP-BPE | 55.3 min | 5.2GB | 4.80 | 5.47 |
| SP-Unigram | 1.6 min | 2.5GB | 4.73 | 7.84 |
| BytePiece | 6.5 min | 4.3GB | 5.05 | 2.50 |

Large-corpus comparison:

|  | Training Time↓ | Peak Memory↓ | Compression Rate (Homologous)↑ | Compression Rate (Heterologous)↑ | Tokenization Speed↑ |
| :----: | :----: | :----: | :----: | :----: | :----: |
| SP-BPE | 19.21 h | 97GB | 4.52 | 4.46 | 1.27 |
| SP-Unigram | 2.02 h | 384GB | 4.51 | 4.48 | 5.55 |
| BytePiece | 2.24 h | 51GB | 5.39 | 4.51 | 1.92 |

The unit of compression rate is "bytes/token", i.e., the average number of bytes per token; the unit of speed is "M bytes/second", i.e., the average number of bytes tokenized per second, in millions. For other details, see [here](https://kexue.fm/archives/9752#%E6%95%88%E6%9E%9C%E6%B5%8B%E8%AF%95).
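As a concrete illustration of the metric (with hypothetical tokens, not numbers from the tables above), the compression rate is simply the total UTF-8 byte length divided by the token count:

```python
# Hypothetical tokenization of a short Chinese string into three tokens
tokens = ['今天', '天气', '不错']

# Each CJK character occupies 3 bytes in UTF-8, so 6 characters = 18 bytes
total_bytes = sum(len(t.encode('utf-8')) for t in tokens)
compression_rate = total_bytes / len(tokens)  # bytes per token

assert total_bytes == 18
assert compression_rate == 6.0
```

A higher value means each token covers more raw bytes, i.e., the tokenizer compresses the text into fewer tokens.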

In the first table the dataset's average text length is short, and BytePiece is slower than both SP-BPE and SP-Unigram; in the second table the texts are generally longer, and BytePiece turns out faster than SP-BPE. This shows that BPE's tokenization speed is quite sensitive to text length, and that the Cython-accelerated BytePiece is now comparable to SentencePiece in tokenization speed.

## Download

For open-source BytePiece models, see [models](https://github.com/bojone/bytepiece/tree/main/models).

## Conversion

Version `0.6.2` introduces the `convert_to_sentencepiece` method, which converts an `ensure_unicode` model into a sentencepiece model that can then be loaded with sentencepiece:
```python
from bytepiece import Tokenizer
tokenizer1 = Tokenizer('bytepiece.model')
tokenizer1.convert_to_sentencepiece('bytepiece_sp.model')

import sentencepiece as spm
tokenizer2 = spm.SentencePieceProcessor('bytepiece_sp.model')

tokenizer1.encode('今天天气不错')
tokenizer2.encode('今天天气不错')
```
For most inputs, the two versions of the model produce the same tokens and the same ids. Nevertheless, bytepiece and sentencepiece do not process text identically: bytepiece is purer, while sentencepiece applies many unnecessary preprocessing steps, so the two models cannot be fully aligned. One known issue is that the results may diverge when the input contains multiple consecutive newline characters (\n).

## Citation

```
@misc{bytepiece2023,
  title={BytePiece: A more pure and effective tokenizer},
  author={Jianlin Su},
  year={2023},
  howpublished={\url{https://github.com/bojone/bytepiece}},
}
```

## Contact
QQ group: 67729435; for the WeChat group, add the bot spaces_ac_cn



================================================
FILE: README_en.md
================================================
[[中文](https://github.com/bojone/bytepiece/blob/main/README.md)|English]

# BytePiece
BytePiece is a byte-based Unigram tokenizer implemented in pure Python, making it easier to read and extend. Thanks to a new training algorithm, its compression rate is usually higher than that of existing tokenizers, and it supports multiprocess-accelerated training. Moreover, it operates directly on the UTF-8 bytes of the text with almost no preprocessing, making it purer and language-independent.

Blog: 
- https://kexue.fm/archives/9752
- https://kexue.fm/archives/9768

## Characteristics

An ideal Tokenizer and its training algorithm should have the following characteristics:
- Lossless reconstruction
- High compression rate
- Language-independent
- Data-driven
- Training-friendly

The mainstream [SentencePiece](https://github.com/google/sentencepiece) already has most of these characteristics, but some problems remain. For example, it supports both the BPE and Unigram algorithms; BPE usually achieves a higher compression rate, but its training is extremely slow and memory-hungry. SentencePiece also applies some language-dependent preprocessing to the text, so it is not purely "language-independent". Besides, it is written in C++, which makes it a black box for most users and hard to study and modify.

BytePiece introduces a new training method based on a **Byte-based N-gram Language Model (BNLM)**, which yields a vocabulary with a higher compression rate and supports multiprocess training; on the same corpus it trains significantly faster than SentencePiece's BPE. The code is pure Python, easy for everyone to read and modify. In addition, BytePiece also provides a random tokenization algorithm that is more efficient than [Subword Regularization](https://arxiv.org/abs/1804.10959).

## Principle

BytePiece is not merely a byte-based, multiprocess rewrite of the existing Unigram model; it designs a new training method for Unigram, which is one of the key reasons for its higher compression rate.

The new training method is based on an N-gram language model's new-word discovery algorithm, first proposed seven years ago in the author's blog post ["[Chinese Word Segmentation Series] 5. Unsupervised Word Segmentation Based on a Language Model"](https://kexue.fm/archives/3956); please see that post for details.

For the new random segmentation algorithm, you can refer to ["A Brief Exploration of Random Segmentation: From Viterbi Decoding to Viterbi Sampling"](https://kexue.fm/archives/9768) and ["Further Exploration of Random Segmentation: From Viterbi Sampling to Perfect Sampling Algorithm"](https://kexue.fm/archives/9811).

## Installation

BytePiece can only run on Python3 and uses [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) to accelerate the training process. Since BytePiece is Byte-based, and the pyahocorasick on PyPi is Unicode-based, it cannot be used directly. Please follow the instructions below to install the Byte-based version of pyahocorasick:
```bash
# If already installed, please uninstall first
pip uninstall pyahocorasick

# Compile and install directly from git, note to pass the environment variable AHOCORASICK_BYTES
AHOCORASICK_BYTES=1 pip install git+https://github.com/WojciechMula/pyahocorasick.git
```
Then install Cython:
```bash
pip install Cython
```
After that, you can install BytePiece via pip:
```bash
pip install bytepiece==0.6.3
```

## Usage

All of BytePiece's source code is in a single file, containing two classes, `Trainer` and `Tokenizer`, for training and tokenization respectively.

### Training

To train Tokenizer, you just need to import the `Trainer` class:
```python
from bytepiece import Trainer
```
Then prepare the training corpus. BytePiece does not require loading the whole corpus into memory at once, but since training makes two passes over the data, a one-shot generator will not work; the corpus must be a re-iterable object, for example:
```python
import json

class corpus:
    def __iter__(self):
        path = 'data_sample.json'
        with open(path) as f:
            for l in f:
                yield json.loads(l)['text']  # yield one unicode string at a time
```
Then you can start the actual training:
```python
trainer = Trainer(order=6, max_vocab_size=100000, min_count=32)
trainer.train(corpus(), workers=64, batch_size=1000)
trainer.save('bytepiece.model')
```
Here `order` is the order of the n-gram language model; the default `order=6` is recommended. The other parameters:
- `max_vocab_size`: the maximum vocabulary size. Due to redundancy removal, the final vocabulary may be slightly smaller than `max_vocab_size`.
- `min_count`: the minimum token frequency. It can be increased for large corpora and generally does not noticeably affect the result.
- `workers`: the number of parallel training processes; it can saturate all cores of the machine.
- `batch_size`: the batch size. It does not affect the training result and usually does not need to be changed; increase it if CPU utilization is low.

In addition, version `0.4.1` adds an `isolate_digits` parameter, which defaults to `False`. When set to `True`, every Arabic digit is guaranteed to be split into a single character:
```python
trainer = Trainer(order=6, max_vocab_size=100000, min_count=32, isolate_digits=True)
```
Version `0.6.0` adds an `ensure_unicode` parameter, which guarantees that every multi-byte token can be decoded back to unicode. Since current results show that enabling `ensure_unicode` usually yields a slightly higher compression rate, it defaults to `True`. When set to `False` (equivalent to versions before 0.6.0), multi-byte tokens may need `decode(errors='ignore')` to be decoded back to unicode:
```python
trainer = Trainer(order=6, max_vocab_size=100000, min_count=32, ensure_unicode=True)
```
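To see why `errors='ignore'` can be necessary without `ensure_unicode`, consider a token whose bytes end mid-way through a multi-byte UTF-8 character (the byte string below is an illustrative fragment, not an actual model token):

```python
# '今' encodes to the 3-byte UTF-8 sequence b'\xe4\xbb\x8a'; a token that
# captures only the first two bytes is not valid UTF-8 on its own.
fragment = '今'.encode('utf-8')[:2]  # b'\xe4\xbb'

try:
    fragment.decode('utf-8')
except UnicodeDecodeError:
    pass  # strict decoding fails on the partial character

# With errors='ignore', the dangling bytes are silently dropped instead
assert fragment.decode('utf-8', errors='ignore') == ''
```

With `ensure_unicode=True`, the trainer forbids pieces that start on a UTF-8 continuation byte, so every multi-byte token decodes cleanly.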

### Tokenization

After training, usage is as follows:
```python
from bytepiece import Tokenizer

tokenizer = Tokenizer('bytepiece.model')
text = "Today's weather is great"

tokens = tokenizer.tokenize(text)  # Returns a list of bytes
print(b' '.join(tokens).decode(errors='ignore'))  # Visualize the tokenization result

ids = tokenizer.encode(text)  # Returns the ids corresponding to tokens
print(tokenizer.decode(ids))  # Decode the ids back to unicode text
ids = tokenizer.encode(text, iter=True)  # Returns the generator of ids

tokens = tokenizer.tokenize(text, alpha=0.2)  # Random Tokenization
print(b' '.join(tokens).decode(errors='ignore'))  # Visualize the tokenization result
```

## Comparison

Comparison with small data volume:

|  | Training Time↓ | Maximum Memory Usage↓ | Compression Rate↑ | Tokenization Speed↑ |
| :----: | :----: | :----: | :----: | :----: |
| SP-BPE | 55.3 minutes | 5.2GB | 4.80 | 5.47 |
| SP-Unigram | 1.6 minutes | 2.5GB | 4.73 | 7.84 |
| BytePiece | 6.5 minutes | 4.3GB | 5.05 | 2.50 |

Comparison with large data volume:

|  | Training Time↓ | Maximum Memory Usage↓ | Compression Rate (Homologous)↑ | Compression Rate (Heterologous)↑ | Tokenization Speed↑ |
| :----: | :----: | :----: | :----: | :----: | :----: |
| SP-BPE | 19.21 hours | 97GB | 4.52 | 4.46 | 1.27 |
| SP-Unigram | 2.02 hours | 384GB | 4.51 | 4.48 | 5.55 |
| BytePiece | 2.24 hours | 51GB | 5.39 | 4.51 | 1.92 |

The unit of compression rate is "bytes/token", i.e., the average number of bytes per token; the unit of speed is "M bytes/second", i.e., the average number of bytes that can be segmented per second (in millions). For other details, please refer to [here](https://kexue.fm/archives/9752#%E6%95%88%E6%9E%9C%E6%B5%8B%E8%AF%95).

In the first table the dataset's average text length is short, and BytePiece is slower than both SP-BPE and SP-Unigram; in the second table the texts are generally longer, and BytePiece turns out faster than SP-BPE. This shows that BPE's tokenization speed is quite sensitive to text length, and that the Cython-accelerated BytePiece is now comparable to SentencePiece in tokenization speed.

## Download

To download the open-source BytePiece model, please go to [models](https://github.com/bojone/bytepiece/tree/main/models).

## Citation

```
@misc{bytepiece2023,
  title={BytePiece: A more pure and effective tokenizer},
  author={Jianlin Su},
  year={2023},
  howpublished={\url{https://github.com/bojone/bytepiece}},
}
```

## Communication
QQ group: 67729435; for the WeChat group, add the bot spaces_ac_cn


================================================
FILE: bytepiece/__init__.py
================================================
#! -*- coding: utf-8 -*-

from .bytepiece import *

__version__ = '0.6.3'


================================================
FILE: bytepiece/bytepiece.py
================================================
# -*- coding: utf-8 -*-
# Reference 1: https://kexue.fm/archives/9752
# Reference 2: https://kexue.fm/archives/9768

import numpy as np
import re, json, unicodedata
from itertools import chain
from functools import partial
from tqdm import tqdm, trange
from base64 import b64encode, b64decode
from multiprocessing import Pool, Queue
import ahocorasick
from . import faster


def normalize(text, maxlen=0, isolate_digits=False):
    """NFC-normalize text and split it into a list of UTF-8 byte chunks.

    Runs of newlines stay attached to the preceding chunk; if maxlen > 0,
    chunks are capped at maxlen characters; if isolate_digits is True,
    every digit becomes its own chunk.
    """
    text = unicodedata.normalize('NFC', text)
    # raw strings avoid invalid-escape warnings on newer Python versions
    if maxlen > 0:
        if isolate_digits:
            regex = r'\d|[^\n\d]{,%d}\n{1,100}|[^\n\d]{1,%d}' % (maxlen, maxlen)
        else:
            regex = r'.{,%d}\n{1,100}|.{1,%d}' % (maxlen, maxlen)
    else:
        if isolate_digits:
            regex = r'\d|[^\n\d]*\n+|[^\n\d]+'
        else:
            regex = r'.*\n+|.+'
    return [t.encode() for t in re.findall(regex, text)]


class Trainer:
    """A novel unsupervised training algorithm for Unigram
    Reference: https://kexue.fm/archives/3956
    """
    def __init__(
        self,
        order=6,
        max_vocab_size=10000,
        max_piece_length=36,
        min_count=2,
        isolate_digits=False,
        ensure_unicode=True
    ):
        self.order = order
        self.max_piece_length = max_piece_length
        self.min_count = min_count
        self.isolate_digits = isolate_digits
        self.ensure_unicode = ensure_unicode
        if isinstance(max_vocab_size, list):
            self.max_vocab_size = sorted(max_vocab_size)[::-1]
        else:
            self.max_vocab_size = [max_vocab_size]

    def count_ngrams(self, texts):
        # count every byte n-gram of length 0..order (the 0-gram entry
        # accumulates the total number of positions)
        ngrams = [{} for i in range(self.order + 1)]
        for text in texts:
            for i in range(len(text)):
                for j in range(self.order + 1):
                    k = text[i:i + j]
                    ngrams[j][k] = ngrams[j].get(k, 0) + 1
        return ngrams

    def prune_ngrams(self, ngrams):
        # make sure every single byte is present in the vocabulary
        for i in range(256):
            if bytes([i]) not in ngrams[1]:
                ngrams[1][bytes([i])] = 1
                ngrams[0][b''] += 1
        # convert counts to log conditional probabilities, from high order down
        for i in trange(len(ngrams) - 1, -1, -1, desc='Prune Ngrams', ncols=0):
            ngrams[i] = {
                k: np.log(v)
                for k, v in ngrams[i].items()
                if len(k) == i and v >= (self.min_count if i > 1 else 0)
            }
            if i < len(ngrams) - 1:
                # log p(byte | prefix) = log count(prefix + byte) - log count(prefix)
                ngrams[i + 1] = {
                    k: v - ngrams[i][k[:i]]
                    for k, v in ngrams[i + 1].items()
                }
        return ngrams

    @property
    def trans(self):
        # Viterbi transition matrix over within-piece states: from any state
        # one may either start a new piece (back to state 0) or advance one
        # position (capped at order - 1); all other transitions score -inf
        if not hasattr(self, '_trans'):
            self._trans = np.full((self.order, self.order), -np.inf)
            for i in range(self.order):
                self._trans[i, 0] = 0
                self._trans[i, min(i + 1, self.order - 1)] = 0
        return self._trans

    def _tokenize(self, text):
        # Nodes
        nodes = np.full((len(text), self.order), -np.inf)
        for j in range(self.order):
            for i in range(j, len(text)):
                nodes[i, j] = self.ngrams[j + 1].get(text[i - j:i + 1], -np.inf)
        if self.ensure_unicode:
            # a UTF-8 continuation byte (0x80-0xBF) must not start a new piece
            text_array = np.frombuffer(text, dtype=np.uint8)
            nodes[(text_array >= 128) & (text_array < 192), 0] -= np.inf
        # Viterbi
        routes = np.zeros((len(text) - 1, self.order), dtype='int32')
        for i in range(1, len(nodes)):
            scores = nodes[i - 1][:, None] + self.trans + nodes[i]
            routes[i - 1] = scores.argmax(0)
            nodes[i] = scores.max(0)
        # Output
        opt_route = [nodes[-1].argmax()]
        for i in range(1, len(nodes)):
            opt_route.append(routes[-i][opt_route[-1]])
        opt_route = np.array(opt_route[::-1])
        opt_route = np.append(np.where(opt_route == 0)[0], len(nodes))
        return [text[s:e] for s, e in zip(opt_route, opt_route[1:])]

    def count_pieces(self, texts):
        pieces = {}
        for text in texts:
            for p in self._tokenize(text):
                pieces[p] = pieces.get(p, 0) + 1
        return pieces

    def split_pieces(self, keep, drop):
        tokenizer, counter = Tokenizer(self.dump(keep)), {}
        for k, v in drop:
            for p in tokenizer._tokenize(k):
                counter[p] = counter.get(p, 0) + v
        return counter

    def prune_pieces(self, pieces, workers=1, batch_size=1000):
        desc = 'Prune Pieces'
        split_pieces = partial(
            self.psplit_pieces, workers=workers, batch_size=batch_size
        ) if workers > 1 else self.split_pieces
        # Complete all bytes
        for i in range(256):
            if bytes([i]) not in pieces:
                pieces[bytes([i])] = 1
        # Prune by frequency and length
        keep_pieces, drop_pieces = {}, {}
        for k, v in pieces.items():
            if len(k) == 1 or (
                len(k) <= self.max_piece_length and v >= self.min_count
            ):
                keep_pieces[k] = v
            else:
                drop_pieces[k] = v
        drop_pieces = tqdm(drop_pieces.items(), desc=desc, ncols=0)
        for k, v in split_pieces(keep_pieces, drop_pieces).items():
            keep_pieces[k] += v
        # Prune wasted pieces
        while True:
            len_keep_pieces = len(keep_pieces)
            drop_pieces = tqdm(keep_pieces.items(), desc=desc, ncols=0)
            keep_pieces = split_pieces(keep_pieces, drop_pieces)
            if len_keep_pieces == len(keep_pieces):
                break
        # Prune by max_vocab_size
        final_pieces = []
        for max_vocab_size in self.max_vocab_size:
            if len(keep_pieces) <= max_vocab_size - 3:
                final_pieces.append(keep_pieces)
                continue
            pieces = sorted(
                keep_pieces.items(),
                key=lambda t: (len(t[0]) > 1, -t[1], -len(t[0]), t[0])
            )
            keep_pieces = dict(pieces[:max_vocab_size - 3])
            drop_pieces = tqdm(pieces[max_vocab_size - 3:], desc=desc, ncols=0)
            for k, v in split_pieces(keep_pieces, drop_pieces).items():
                keep_pieces[k] += v
            # Prune wasted pieces
            while True:
                len_keep_pieces = len(keep_pieces)
                drop_pieces = tqdm(keep_pieces.items(), desc=desc, ncols=0)
                keep_pieces = split_pieces(keep_pieces, drop_pieces)
                if len_keep_pieces == len(keep_pieces):
                    break
            final_pieces.append(keep_pieces)
        # Output
        return final_pieces

    def norm(self, texts):
        for text in texts:
            for t in normalize(text, 10000, self.isolate_digits):
                yield t

    def train(self, texts, workers=1, batch_size=1000):
        if workers > 1:
            texts1 = self.norm(tqdm(texts, desc='Count Ngrams'))
            self.ngrams = self.pcount_ngrams(texts1, workers, batch_size)
            self.ngrams = self.prune_ngrams(self.ngrams)
            texts2 = self.norm(tqdm(texts, desc='Count Pieces'))
            self.pieces = self.pcount_pieces(texts2, workers, batch_size)
            self.pieces = self.prune_pieces(self.pieces, workers, batch_size)
        else:
            texts1 = self.norm(tqdm(texts, desc='Count Ngrams'))
            self.ngrams = self.count_ngrams(texts1)
            self.ngrams = self.prune_ngrams(self.ngrams)
            texts2 = self.norm(tqdm(texts, desc='Count Pieces'))
            self.pieces = self.count_pieces(texts2)
            self.pieces = self.prune_pieces(self.pieces)

    def dump(self, pieces):
        pieces = sorted(pieces.items(), key=lambda t: (len(t[0]), t[0]))
        return {
            b64encode(k).decode(): [i + 3, k.decode(errors='ignore'), v]
            for i, (k, v) in enumerate(pieces)
        }

    def save(self, path):
        if len(self.pieces) == 1:
            paths = [path]
        else:
            paths = ['%s.%s' % (path, size) for size in self.max_vocab_size]
        for pieces, path in zip(self.pieces, paths):
            json.dump(
                self.dump(pieces),
                open(path, 'w'),
                indent=4,
                ensure_ascii=False
            )

    def pcount(self, inputs, count, merge, init, desc, workers, batch_size):
        def worker_func(in_queue, out_queue):
            counter = init()
            while True:
                inputs = in_queue.get()
                if inputs is None:
                    break
                merge(counter, count(inputs))
            out_queue.put(counter)

        # Count
        in_queue, out_queue = Queue(workers + 1), Queue()
        pool = Pool(workers, worker_func, (in_queue, out_queue))
        batch = []
        for input in inputs:
            batch.append(input)
            if len(batch) == batch_size:
                in_queue.put(batch)
                batch = []
        if batch:
            in_queue.put(batch)
        for i in range(workers):
            in_queue.put(None)
        # Merge
        counter = init()
        for _ in trange(workers, desc=desc, ncols=0):
            merge(counter, out_queue.get())
        pool.terminate()
        return counter

    def pcount_ngrams(self, texts, workers=1, batch_size=1000):
        def merge(ngrams1, ngrams2):
            for i, G in enumerate(ngrams2):
                for k, v in G.items():
                    ngrams1[i][k] = ngrams1[i].get(k, 0) + v

        init = lambda: [{} for i in range(self.order + 1)]
        return self.pcount(
            texts, self.count_ngrams, merge, init, 'Merge Ngrams', workers,
            batch_size
        )

    def psplit_pieces(self, keep, drop, workers=1, batch_size=1000):
        def merge(pieces1, pieces2):
            for k, v in pieces2.items():
                pieces1[k] = pieces1.get(k, 0) + v

        split_pieces = lambda drop: self.split_pieces(keep, drop)
        return self.pcount(
            drop, split_pieces, merge, dict, 'Merge Pieces', workers,
            batch_size * 10
        )

    def pcount_pieces(self, texts, workers=1, batch_size=1000):
        def merge(pieces1, pieces2):
            for k, v in pieces2.items():
                pieces1[k] = pieces1.get(k, 0) + v

        return self.pcount(
            texts, self.count_pieces, merge, dict, 'Merge Pieces', workers,
            batch_size // 10
        )


class Tokenizer:
    """Unigram tokenizer with Aho-Corasick automaton
    """
    def __init__(self, pieces, seed=None):
        if isinstance(pieces, str):
            pieces = json.load(open(pieces))
        pieces = {b64decode(k): v for k, v in pieces.items()}
        self._pieces = {k: v[-1] for k, v in pieces.items()}
        self._piece2id = {k: v[0] for k, v in pieces.items()}
        for i, k in enumerate(['<pad>', '<bos>', '<eos>']):
            self._piece2id[k] = i
        self._id2piece = {v: k for k, v in self._piece2id.items()}
        self.vocab_size = len(self._pieces) + 3
        # Aho-Corasick automaton
        log_total = np.log(sum(self._pieces.values()))
        self._automaton = ahocorasick.Automaton()
        for k, v in self._pieces.items():
            self._automaton.add_word(k, (len(k), np.log(v) - log_total))
        self._automaton.make_automaton()
        self.set_seed(seed)

    def set_seed(self, seed):
        if seed is not None:
            faster.set_seed(seed)

    def _tokenize(self, text, alpha=-1):
        return faster._tokenize(self, text, alpha)

    def tokenize(self, text, alpha=-1, iter=False):
        pieces = chain(*(self._tokenize(t, alpha) for t in normalize(text)))
        if iter:
            return pieces
        return list(pieces)

    def piece_to_id(self, p):
        return self._piece2id[p]

    def id_to_piece(self, i):
        return self._id2piece[i]

    def pieces_to_ids(self, pieces):
        return [self._piece2id[p] for p in pieces]

    def ids_to_pieces(self, ids):
        return [self._id2piece[i] for i in ids]

    def encode(self, text, add_bos=False, add_eos=False, alpha=-1, iter=False):
        def generator():
            if add_bos:
                yield 1
            for p in self.tokenize(text, alpha, True):
                yield self._piece2id[p]
            if add_eos:
                yield 2

        if iter:
            return generator()
        return list(generator())

    def decode(self, ids):
        pieces = [self._id2piece[i] for i in ids if i > 2]
        return b''.join(pieces).decode(errors='ignore')

    def convert_to_sentencepiece(self, path):
        from sentencepiece.sentencepiece_model_pb2 import TrainerSpec, NormalizerSpec, ModelProto
        SentencePiece = ModelProto.SentencePiece

        pieces, others = [
            SentencePiece(piece='<unk>', score=0, type=2),
            SentencePiece(piece='<s>', score=0, type=3),
            SentencePiece(piece='</s>', score=0, type=3)
        ], []
        for i in range(3, self.vocab_size):
            p = self._id2piece[i]
            s = self._automaton.get(p)[1]
            if len(p) > 1 or len(str(p)) == 4:
                if len(p) == 1:
                    p2 = '<0x{:02X}>'.format(ord(p))
                    others.append(SentencePiece(piece=p2, score=-100, type=6))
                p = re.sub(' ', '▁', p.decode())
                pieces.append(SentencePiece(piece=p, score=s))
            else:
                p = '<0x{:02X}>'.format(ord(p))
                pieces.append(SentencePiece(piece=p, score=s, type=6))

        trainer_spec = TrainerSpec(
            model_type=1,  # Unigram
            vocab_size=len(pieces + others),
            split_by_unicode_script=False,
            byte_fallback=True
        )
        normalizer_spec = NormalizerSpec(
            name='identity',
            precompiled_charsmap=b'',
            add_dummy_prefix=False,
            remove_extra_whitespaces=False
        )
        model = ModelProto(
            pieces=pieces + others,
            trainer_spec=trainer_spec,
            normalizer_spec=normalizer_spec
        )
        with open(path, 'wb') as fw:
            fw.write(model.SerializeToString())


def convert_to_bytepiece(pieces, path):
    pieces = {
        k if isinstance(k, bytes) else k.encode(): v
        for k, v in pieces.items()
    }
    trainer = Trainer()
    trainer.max_vocab_size = [len(pieces) + 259]
    trainer.max_piece_length = np.inf
    trainer.min_count = 1
    trainer.pieces = trainer.prune_pieces(pieces)
    trainer.save(path)


================================================
FILE: bytepiece/faster.pyx
================================================
# cython: language_level=3
from libc.time cimport time
from libc.stdlib cimport RAND_MAX, rand, srand
from libc.math cimport INFINITY, exp, log

srand(time(NULL))


cpdef set_seed(unsigned int seed):
    srand(seed)


cdef inline double logsumexp(double x, double y):
    if x < y:
        x, y = y, x
    return x + log(1 + exp(y - x))


cdef inline bint choice(double x, double y):
    return rand() < exp(x - y) * RAND_MAX


def _tokenize(self, bytes text, double alpha=-1):
    cdef int e, k, s
    cdef double v, score
    cdef list scores = [0] + [-INFINITY] * len(text)
    cdef list routes = list(range(len(text) + 1))
    cdef list tokens = []
    for e, (k, v) in self._automaton.iter(text):
        s, e = e - k + 1, e + 1
        if alpha < 0:
            score = scores[s] + v
            if score > scores[e]:
                scores[e], routes[e] = score, s
        else:
            score = scores[s] + alpha * v
            scores[e] = logsumexp(scores[e], score)
            if choice(score, scores[e]):
                routes[e] = s
    while text:
        s = routes[e]
        tokens.append(text[s:e])
        text, e = text[:s], s
    return tokens[::-1]
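
For readers without a Cython toolchain, the deterministic (`alpha < 0`) branch of `_tokenize` above can be sketched in pure Python. This illustrative `viterbi_tokenize` is not part of the package: it replaces the Aho-Corasick automaton with brute-force substring lookup and assumes a plain dict of piece log-probabilities.

```python
import math

def viterbi_tokenize(text: bytes, log_probs: dict, max_len: int = 8) -> list:
    # Pure-Python sketch of the alpha < 0 branch of faster._tokenize:
    # dynamic programming over end positions, keeping the best-scoring
    # segmentation ending at each position.  log_probs maps candidate
    # pieces (bytes) to their log-probabilities; all single bytes that
    # occur in the text must be present so segmentation cannot fail.
    n = len(text)
    scores = [0.0] + [-math.inf] * n
    routes = list(range(n + 1))
    for e in range(1, n + 1):
        for s in range(max(0, e - max_len), e):
            piece = text[s:e]
            if piece in log_probs:
                score = scores[s] + log_probs[piece]
                if score > scores[e]:
                    scores[e], routes[e] = score, s
    # Backtrack from the end, mirroring the while-loop in the Cython code.
    tokens, e = [], n
    while e > 0:
        s = routes[e]
        tokens.append(text[s:e])
        e = s
    return tokens[::-1]
```

On a toy vocabulary where `b'ab'` is far more probable than `b'a'` and `b'b'` separately, the segmenter prefers the merged piece and returns `[b'ab', b'c']` for input `b'abc'`.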


================================================
FILE: models/README.md
================================================
# Shared Models

## Base version

Two models trained on a 38 GB mixed Chinese-English corpus (Chinese-to-English ratio 3:5):

|  | vocab size | compression rate (bytes/token) |
| :----: | :----: | :----: |
| [bytepiece_80k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece_80k.zip) | 79,896 | 5.09 |
| [bytepiece_160k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece_160k.zip) | 159,896 | 5.34 |

## Plus version

Models trained on a 185 GB mixed corpus (Chinese, English, and code in a 3:5:0.5 ratio):

|  | vocab size | compression rate (bytes/token) |
| :----: | :----: | :----: |
| [bytepiece.plus.40k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.plus.40k.zip) | 39,843 | 4.63 |
| [bytepiece.plus.80k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.plus.80k.zip) | 79,812 | 5.13 |
| [bytepiece.plus.160k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.plus.160k.zip) | 159,846 | 5.56 |
| [bytepiece.plus.240k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.plus.240k.zip) | 239,858 | 5.74 |
| [bytepiece.plus.320k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.plus.320k.zip) | 319,768 | 5.83 |
| [bytepiece.id.plus.40k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.plus.40k.zip) | 39,857 | 4.51 |
| [bytepiece.id.plus.80k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.plus.80k.zip) | 79,827 | 4.96 |
| [bytepiece.id.plus.160k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.plus.160k.zip) | 159,868 | 5.34 |
| [bytepiece.id.plus.240k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.plus.240k.zip) | 239,888 | 5.50 |
| [bytepiece.id.plus.320k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.plus.320k.zip) | 319,808 | 5.58 |
| [bytepiece.eu.plus.40k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.eu.plus.40k.zip) | 39,842 | 4.59 |
| [bytepiece.eu.plus.80k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.eu.plus.80k.zip) | 79,816 | 5.11 |
| [bytepiece.eu.plus.160k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.eu.plus.160k.zip) | 159,831 | 5.57 |
| [bytepiece.eu.plus.240k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.eu.plus.240k.zip) | 239,862 | 5.76 |
| [bytepiece.eu.plus.320k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.eu.plus.320k.zip) | 319,767 | 5.86 |
| [bytepiece.id.eu.plus.40k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.eu.plus.40k.zip) | 39,857 | 4.65 |
| [bytepiece.id.eu.plus.80k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.eu.plus.80k.zip) | 79,829 | 5.08 |
| [bytepiece.id.eu.plus.160k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.eu.plus.160k.zip) | 159,869 | 5.41 |
| [bytepiece.id.eu.plus.240k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.eu.plus.240k.zip) | 239,884 | 5.55 |
| [bytepiece.id.eu.plus.320k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.id.eu.plus.320k.zip) | 319,811 | 5.61 |

Here, `id` stands for isolate digits, i.e. Arabic digits are split into separate tokens; `eu` stands for ensure unicode, guaranteeing that every multi-byte token can be decoded as unicode. As the tables show, for a fixed corpus mix, once vocab_size grows large enough, increasing it further brings no significant improvement in compression rate.
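
The "ensure unicode" (eu) criterion can be illustrated with a hypothetical helper that checks whether a candidate multi-byte piece decodes as valid UTF-8. This is only an illustration of the criterion, not the trainer's actual implementation:

```python
def decodes_as_unicode(piece: bytes) -> bool:
    # Illustrative check for the "eu" criterion: a multi-byte piece is
    # acceptable only if its bytes form a complete, valid UTF-8 sequence.
    try:
        piece.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

For example, the three bytes of `'中'.encode()` pass the check, while a truncated two-byte prefix of that sequence does not.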

## Chinese version

Models trained on nearly 200 GB of cleaned wudao corpus (Chinese):

|  | vocab size | Chinese compression rate (bytes/token) |
| :----: | :----: | :----: |
| [bytepiece.zh.id.eu.40k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.zh.id.eu.40k.zip) | 39,683 | 4.88 |
| [bytepiece.zh.id.eu.80k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.zh.id.eu.80k.zip) | 79,363 | 5.34 |
| [bytepiece.zh.id.eu.160k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.zh.id.eu.160k.zip) | 159,220 | 5.79 |
| [bytepiece.zh.id.eu.240k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.zh.id.eu.240k.zip) | 239,066 | 5.99 |
| [bytepiece.zh.id.eu.320k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.zh.id.eu.320k.zip) | 318,582 | 6.09 |

## Jieba version

Converted from the [jieba](https://github.com/fxsjy/jieba) vocabulary: it largely preserves jieba's original word list and frequencies, and merges in the punctuation, single-character, and English pieces of [bytepiece.eu.plus.320k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.eu.plus.320k.zip). Its segmentation is essentially identical to jieba's, though English and digits may differ slightly (jieba was not designed as a tokenizer, so its vocabulary lacks punctuation, English, and similar tokens; these must be supplemented, and with only a limited supplement, identical segmentation cannot be guaranteed).

|  | vocab size | compression rate (bytes/token) | Chinese compression rate (bytes/token) |
| :----: | :----: | :----: | :----: |
| [bytepiece.jieba.410k](https://github.com/bojone/bytepiece/blob/main/models/bytepiece.jieba.410k.zip) | 409,629 | 2.87 | 4.43 |

The purpose of this conversion is a tokenizer whose Chinese segmentation matches the conventional notion of Chinese word segmentation, rather than maximal compression. It can be viewed as a simple supervised bytepiece model, or as jieba segmentation with built-in id conversion. The bytepiece version tokenizes roughly twice as fast as original jieba (HMM=False) and on par with [jieba_fast](https://github.com/deepcs233/jieba_fast); as inputs grow longer, both jieba and jieba_fast slow down noticeably and end up clearly slower than bytepiece.

Conversion code:
```python
from bytepiece import Tokenizer, convert_to_bytepiece
import re

pieces = {}
with open('/root/miniconda3/lib/python3.10/site-packages/jieba/dict.txt') as f:
    for l in f:
        k, v = l.strip().split(' ')[:2]
        pieces[k.encode()] = int(v)

tokenizer = Tokenizer('bytepiece.eu.plus.320k.model')
pieces2 = {}
for k, v in tokenizer._pieces.items():
    if len(k) == 1:
        pieces2[k] = pieces2.get(k, 0) + v
    elif len(k.decode()) == 1:
        pieces2[k] = pieces2.get(k, 0) + v
    else:
        for k in k.split():
            if len(re.findall(b'[a-zA-Z0-9]', k)) == len(k):
                pieces2[k] = pieces2.get(k, 0) + v

r = pieces2['的'.encode()] / pieces['的'.encode()]
pieces = {k: int(round(v * r)) for k, v in pieces.items()}
for k, v in pieces2.items():
    if k not in pieces:
        pieces[k] = v

convert_to_bytepiece(pieces, 'bytepiece.jieba.410k.model')
```


================================================
FILE: setup.py
================================================
#! -*- coding: utf-8 -*-

from setuptools import setup, find_packages
from Cython.Build import cythonize

setup(
    name='bytepiece',
    version='0.6.3',
    python_requires='>=3',
    description='Smarter Byte-based Tokenizer',
    long_description=open('README_en.md',encoding="utf-8").read(),
    long_description_content_type='text/markdown',
    license='Apache License 2.0',
    url='https://github.com/bojone/bytepiece',
    author='bojone',
    author_email='bojone@spaces.ac.cn',
    install_requires=['numpy', 'tqdm'],
    packages=find_packages(),
    ext_modules=cythonize('bytepiece/*.pyx'),
    package_data={'bytepiece': ['*.pyx']},
    include_package_data=True
)
SYMBOL INDEX (31 symbols across 1 files)

FILE: bytepiece/bytepiece.py
  function normalize (line 16) | def normalize(text, maxlen=0, isolate_digits=False):
  class Trainer (line 31) | class Trainer:
    method __init__ (line 35) | def __init__(
    method count_ngrams (line 54) | def count_ngrams(self, texts):
    method prune_ngrams (line 63) | def prune_ngrams(self, ngrams):
    method trans (line 82) | def trans(self):
    method _tokenize (line 90) | def _tokenize(self, text):
    method count_pieces (line 113) | def count_pieces(self, texts):
    method split_pieces (line 120) | def split_pieces(self, keep, drop):
    method prune_pieces (line 127) | def prune_pieces(self, pieces, workers=1, batch_size=1000):
    method norm (line 180) | def norm(self, texts):
    method train (line 185) | def train(self, texts, workers=1, batch_size=1000):
    method dump (line 201) | def dump(self, pieces):
    method save (line 208) | def save(self, path):
    method pcount (line 221) | def pcount(self, inputs, count, merge, init, desc, workers, batch_size):
    method pcount_ngrams (line 251) | def pcount_ngrams(self, texts, workers=1, batch_size=1000):
    method psplit_pieces (line 263) | def psplit_pieces(self, keep, drop, workers=1, batch_size=1000):
    method pcount_pieces (line 274) | def pcount_pieces(self, texts, workers=1, batch_size=1000):
  class Tokenizer (line 285) | class Tokenizer:
    method __init__ (line 288) | def __init__(self, pieces, seed=None):
    method set_seed (line 306) | def set_seed(self, seed):
    method _tokenize (line 310) | def _tokenize(self, text, alpha=-1):
    method tokenize (line 313) | def tokenize(self, text, alpha=-1, iter=False):
    method piece_to_id (line 319) | def piece_to_id(self, p):
    method id_to_piece (line 322) | def id_to_piece(self, i):
    method pieces_to_ids (line 325) | def pieces_to_ids(self, pieces):
    method ids_to_pieces (line 328) | def ids_to_pieces(self, ids):
    method encode (line 331) | def encode(self, text, add_bos=False, add_eos=False, alpha=-1, iter=Fa...
    method decode (line 344) | def decode(self, ids):
    method convert_to_sentencepiece (line 348) | def convert_to_sentencepiece(self, path):
  function convert_to_bytepiece (line 391) | def convert_to_bytepiece(pieces, path):

About this extraction

This page contains the full source code of the bojone/bytepiece GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 9 files (46.3 KB), approximately 13.4k tokens, and a symbol index with 31 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
