Full Code of xueyouluo/ccks2021-track2-code for AI

master 688bbd8c5285 cached

42 files

2.8 MB

723.7k tokens

283 symbols

1 requests

Download .txt

Showing preview only (3,351K chars total). Download the full file or copy to clipboard to get everything.

Repository: xueyouluo/ccks2021-track2-code
Branch: master
Commit: 688bbd8c5285
Files: 42
Total size: 2.8 MB

Directory structure:
gitextract_8qq4ysmv/

├── .dockerignore
├── .gitignore
├── .vscode/
│   └── settings.json
├── Dockerfile
├── README.md
├── code/
│   ├── assemble.py
│   ├── conlleval.py
│   ├── create_raw_text.py
│   ├── electra-pretrain/
│   │   ├── .gitignore
│   │   ├── LICENSE
│   │   ├── README.md
│   │   ├── build_pretraining_dataset.py
│   │   ├── config/
│   │   │   ├── base_discriminator_config.json
│   │   │   ├── base_generator_config.json
│   │   │   ├── large_discriminator_config.json
│   │   │   └── large_generator_config.json
│   │   ├── configure_pretraining.py
│   │   ├── model/
│   │   │   ├── __init__.py
│   │   │   ├── modeling.py
│   │   │   ├── optimization.py
│   │   │   └── tokenization.py
│   │   ├── pretrain/
│   │   │   ├── __init__.py
│   │   │   ├── pretrain_data.py
│   │   │   └── pretrain_helpers.py
│   │   ├── pretrain.sh
│   │   ├── run_pretraining.py
│   │   └── util/
│   │       ├── __init__.py
│   │       ├── training_utils.py
│   │       └── utils.py
│   ├── modeling.py
│   ├── optimization.py
│   ├── pipeline.py
│   ├── prepare.sh
│   ├── pretrain.sh
│   ├── run.sh
│   ├── run_biaffine_ner.py
│   ├── simple_run.sh
│   ├── tokenization.py
│   └── utils.py
└── user_data/
    └── extra_data/
        ├── dev.txt
        ├── test.txt
        └── train.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: .dockerignore
================================================
.git/
code/__pycache__
__pycache__/
user_data/models/
user_data/pretrain_tfrecords/
user_data/texts/
user_data/tcdata/
user_data/emb/
user_data/chinese_roberta_wwm_ext_L-12_H-768_A-12/

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

tcdata/
user_data/*
!user_data/extra_data
!user_data/track3


================================================
FILE: .vscode/settings.json
================================================
{
  "python.pythonPath": "/home/xueyou/.conda/envs/jason_py3/bin/python"
}

================================================
FILE: Dockerfile
================================================
FROM nvcr.io/nvidia/tensorflow:19.10-py3

# set noninteractive installation
ENV DEBIAN_FRONTEND=noninteractive

# install tzdata & curl package
RUN apt update && apt-get install -y tzdata wget curl

RUN ln -fs /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
&& dpkg-reconfigure -f noninteractive tzdata

# pretained models and datas
COPY user_data/electra /user_data/electra
COPY user_data/extra_data /user_data/extra_data
# COPY user_data/track3 /user_data/track3

# add code
COPY Dockerfile /Dockerfile

COPY code /code

WORKDIR /code
CMD ["sh","run.sh"]

================================================
FILE: README.md
================================================
# CCKS2021-赛道二-中文NLP地址要素解析

团队：xueyouluo

初赛：1 - 93.63

复赛：3 - 91.32

> 这里的代码是复赛的全流程代码，需要在32G显存的卡上才能正常跑通，如果没有这么大的显存，可以考虑将seq_length改成32，以及减小batch size。

## 解决方案

### 初赛

整体还是以预训练+finetune的思路，主要在模型结构、预训练、模型泛化能力提升、数据增强、融合、伪标签、后处理等方面做了优化。

#### 模型

现在的实体识别方案很多，包括BERT+CRF的序列标注、基于Span的方法、基于MRC的方法，我这里使用的是基于BERT的Biaffine结构，直接预测文本构成的所有span的类别。相比单纯基于span预测和基于MRC的预测，Biaffine的结构可以同时考虑所有span之间的关系，从而提高预测的准确率。

> Biaffine意思双仿射，如果`W*X`是单仿射的话，`X*W*Y`就是双仿射了。本质上就是输入一个长度为`L`的序列，预测一个`L*L*C`的tensor，预测每个span的类别信息。

具体来说参考了论文[Named Entity Recognition as Dependency Parsing](https://arxiv.org/abs/2005.07150)，但是稍有区别：

- 纯粹基于bert进行finetune，不利用fasttext、bert等做context embedding抽取，这也是为了简化模型
- 不区分char word的embedding，默认就是char【中文的BERT基本都是char】
- 原来的论文中有上下文的多句话，这里默认都是一句话【数据决定】
- 同时改进了原有greedy的decoding方法，使用基于DAG的动态规划算法找到全局最优解

但是这种方法也有一些局限：

- 对边界判断不是特别准
- 有大量的负样本

> 原来我也实现过[Biaffine-BERT-NER](https://github.com/xueyouluo/Biaffine-BERT-NER),但这里的版本优化了一些。

#### 预训练

在比较了大部分开源的预训练模型后，哈工大的electra效果比较好，因此我们采用了electra的预训练方法。使用了本赛道的所有数据+赛道三的初赛所有数据，构建了预训练样本，分别继续预训练base和large的模型33K步【大概15个epoch】。

> 继续预训练模型可以提升1个百分点左右的效果，还是非常有效的。

#### 泛化能力提升

这些应该是属于比较基本的操作了，主要包括：

- 使用了对抗学习（FGM）的方法，但代价是训练速度慢了一倍
- 在Dropout方面加入了spatial dropout和embedding dropout
- 使用SWA的方法避免局部最优解

需要在验证集上调参找到比较合适的值。

#### 数据增强

我们用到了开源的一份地址解析数据，来自[《Neural Chinese Address Parsing》](https://github.com/leodotnet/neural-chinese-address-parsing)。参考赛道二的标注规范，使用规则将数据进行清洗，并用这份数据作为数据增强的语料。同时利用统计信息稍微优化了一下数据，即认为一个span如果被标注次数大于10，并且有一个类别占比不到10%且标注数量小于5就认为是不合理的并将其抛弃。

我们使用了同类型实体替换的方法进行数据增强，然后将预训练后的模型在这份数据上finetune。最后用赛道本身的数据进行二次finetune。初赛上，上面的流程走下来可以在dev上达到94.71，线上92.56。

#### 融合

融合的提升非常明显。在融合上，我们使用了electra-base和electra-large两个模型，分别进行预训练和finetune，然后5-fold。

最后对实体进行投票，其中base权重1/3，large权重2/3，只选择投票结果大于3的实体作为最终结果。

> 初赛上，base单独5-fold融合为93.0，large单独5-fold融合为93.477。二者加权融合为93.537。

#### 伪标签

在融合的基础上，我们进一步使用了伪标签，即将上面的融合后预测的测试集结果作为伪标签，重新训练了base模型的一个fold，再进行预测，最终线上可以到93.5920。后面我也实验了训练5-fold的模型，测试下来可以到93.6087。

#### 后处理

我这边后处理比较简单，主要对特殊符号进行了处理，由于一些特殊符号在训练集没有见过，导致模型预测错误。对于包含特殊符号的实体，如果特殊符号是在实体的边界，那么直接去除特殊符号，保留原来的实体类型；如果不是，则去除这个实体。在伪标签结果的基础上加后处理，线上到93.6212。

#### 实验结果

| 序号 |                  实验                   | Dev指标 | 线上指标 |
| :--: | :-------------------------------------: | ------- | :------: |
|  1   |         Biaffine + roberta ext          | 92.15   |          |
|  2   |         Biaffine + google bert          | 92.33   |          |
|  3   |                 2 + FGM                 | 92.79   |          |
|  4   | 3 + spatial dropout + embedding dropout | 92.94   |  90.65   |
|  5   |        4 + extra data + finetune        | 93.74   |          |
|  6   |              5 + 数据增强               | 93.98   |          |
|  7   |        6 + roberta ext pretrain         | 94.15   |  92.08   |
|  8   |            5 + electra base             | 94.19   |          |
|  9   |            5 + electra large            | 94.32   |  92.13   |
|  10  |        5 + electra base pretrain        | 94.71   |  92.56   |
|  11  |       5 + electra large pretrain        | 94.54   |          |
|  12  |               10 + 5-fold               | -       |  93.009  |
|  13  |               11 + 5-fold               | -       |  93.499  |
|  14  |                 12 + 13                 | -       |  93.537  |
|  15  |             14 + pseudo tag             | -       |  93.62   |
|  16  |               15 + 5-fold               | -       |  93.63   |

### 复赛

复赛上我对原来的流程基本没有做什么改动【主要也是我也没想到什么好改进的点了】，就是预训练改了一下。

复赛由于线上训练时间12h的限制，我不可能跑那么久的预训练了【我线下训练large的模型花了20多个小时😂】，因此预训练的语料只用了本赛道的数据集+开源的数据集来减少预训练的时间。

> 唯一非常折腾我的是，large模型在复赛的时候效果一直比不上base模型，可能是预训练不够导致的。

我在复赛的时候都是全流程提交的，直接线上调参了。大概的结果如下【都是5-fold】：

| 序号 |           实验           | 线上指标 |
| :--: | :----------------------: | :------: |
|  1   |       Electra-base       |  89.15   |
|  2   |      Electra-large       |  89.58   |
|  3   | Electra-base + pretrain  |  90.74   |
|  4   | Electra-large + pretrain |  90.75   |
|  5   |          3 + 4           |  91.08   |
|  6   |     5 + fake 1-fold      |  91.31   |
|  7   |     6 + fake 5-fold      |  91.32   |

最终复赛的结果就是91.32，离第一还是有2个千分点差距的。更多细节就看代码吧，毕竟全都在代码里面了。

## 运行

### 运行环境

我们选择了英伟达提供的[docker](nvcr.io/nvidia/tensorflow:19.10-py3)作为基础镜像进行训练，主要是为了避免配环境的各种问题。

具体：

- Unbuntu == 16.04
- Python == 3.6.8
- GPU V100 32G
- 1.14.0 <= Tensorflow-gpu <= 1.15.*

### 数据准备

#### 赛道数据

这里不提供比赛的数据，大家自己下载好放在tcdata目录下。

#### 预训练模型

预训练模型我们使用了哈工大开源的[中文ELECTRA模型](https://github.com/ymcui/Chinese-ELECTRA#%E5%A4%A7%E8%AF%AD%E6%96%99%E7%89%88%E6%96%B0%E7%89%88180g%E6%95%B0%E6%8D%AE)，具体为大语料版本的模型：

- [ELECTRA-180g-large, Chinese](https://drive.google.com/file/d/1P9yAuW0-HR7WvZ2r2weTnx3slo6f5u9q/view?usp=sharing)

- [ELECTRA-180g-base, Chinese](https://drive.google.com/file/d/1RlmfBgyEwKVBFagafYvJgyCGuj7cTHfh/view?usp=sharing)

下载后解压在user_data/electra目录下。

#### 额外数据

下载[neural-chinese-address-parsing](https://github.com/leodotnet/neural-chinese-address-parsing)中data目录下train、dev、test数据到user_data/extra_data目录下。

#### 目录结构

```
├── code
│   ├── electra-pretrain
│   └── ...
├── tcdata
│   ├── dev.conll
│   ├── final_test.txt
│   └── train.conll
├── user_data
│   ├── electra
│   │   ├── electra_180g_base
│   │   │   ├── base_discriminator_config.json
│   │   │   ├── base_generator_config.json
│   │   │   ├── electra_180g_base.ckpt.data-00000-of-00001
│   │   │   ├── electra_180g_base.ckpt.index
│   │   │   ├── electra_180g_base.ckpt.meta
│   │   │   └── vocab.txt
│   │   └── electra_180g_large
│   │       ├── electra_180g_large.ckpt.data-00000-of-00001
│   │       ├── electra_180g_large.ckpt.index
│   │       ├── electra_180g_large.ckpt.meta
│   │       ├── large_discriminator_config.json
│   │       ├── large_generator_config.json
│   │       └── vocab.txt
│   ├── extra_data
│   │   ├── dev.txt
│   │   ├── test.txt
│   │   └── train.txt
│   └── track3 # 这里可以不需要
│       ├── final_test.txt #这是初赛的测试集
│       ├── Xeon3NLP_round1_test_20210524.txt #可以不用，复赛没有使用这个数据
│       └── Xeon3NLP_round1_train_20210524.txt #可以不用，复赛没有使用这个数据
```

### 运行

在code目录下运行

```
sh run.sh
```

具体训练细节参考`pipeline.py`文件。

也有一个简化版本的，把seq_len改成了32，没有5-fold，自己测试跑下来dev上大概为94。

```
sh simple_run.sh
```


================================================
FILE: code/assemble.py
================================================
'''
模型结果融合
'''
import re
from collections import Counter, defaultdict
from glob import glob

from utils import convert_data_format, iob_iobes


def refine_entity(w,s,e):
  # 去除包含特殊字符的实体
  if re.findall('[，。（）()]',w):
    nw = w.strip('，。（）()')
    if not nw:
      return False,None
    else:
      start = w.find(nw)
      s = s + start
      e = s + len(nw) - 1
      return True,(s,e)
  else:
    return True,(s,e)

def convert(entity, refine=False):
    tmp = []
    for k,words in entity.items():
        for w,spans in words.items():
            for span in spans:
                if refine:
                    should_keep,span = refine_entity(w,span[0],span[1])
                    if not should_keep:
                        continue
                tmp.append((k,w,span[0],span[1]))
    return tmp

def get_entities(text,tags):
    tag_words = []
    word = ''
    tag = ''
    for i,(c,t) in enumerate(zip(text,tags)):
        if t[0] in ['B','S','O']:
            if word:
                tag_words.append((word,i,tag))
            if t[0] == 'O':
                word = ''
                tag = ''
                continue
            word = c
            tag = t[2:]
        else:
            word += c
    if word:
        tag_words.append((word,i+1,tag))

    entities = {}
    for w,i,t in tag_words:
        if t not in entities:
            entities[t] = {}
        if w in entities[t]:
            entities[t][w].append([i-len(w),i-1])
        else:
            entities[t][w] = [[i-len(w),i-1]]
    return entities
  
def check_special(text):
  text = re.sub('[\u4e00-\u9fa5]','',text)
  text = re.sub('[0A-]','',text)
  if text.strip():
    return True
  else:
    return False

def merge_by_4_tuple(raw_texts,data,weights,threshold=3.0, refine=False):
  '''
  根据（类型、实体文本、起始位置、结束位置）四元组进行投票确定最终的结果
  '''
  new_tags = []
  ent_cnt = 0
  special_cnt = 0
  check_fail = 0
  fail_cnt = 0

  for i,gtags in enumerate(data):
    _,text = raw_texts[i]
    cnt = Counter()
    assert len(weights) == len(gtags), 'weight {} != tags {}'.format(len(weights),len(gtags))
    for j,tags in enumerate(gtags):
      entities = convert(get_entities(text,tags))
      ratio = weights[j]
      for x in entities:
        cnt[x] += ratio

    ntags = ['O'] * len(text)
    for m,n in cnt.most_common():
      # k = 类型, w = 实体文本, s = 实体起始位置, e = 实体结束位置
      (k,w,s,e) = m
      if n < threshold:
        fail_cnt += 1
        continue

      if refine:
        should_keep,span = refine_entity(w,s,e)
        if not should_keep:
          continue
        else:
          s,e = span

      # 检查是否有其他实体占据span
      if not all(x=='O' for x in ntags[s:e+1]):
        continue

      ent_cnt += 1
      try:
        if check_special(text[s:e+1]):
          special_cnt += 1
      except:
        check_fail += 1
      ntags[s:e+1] = ['I-'+k] * (e-s+1)
      ntags[s] = 'B-'+k
    new_tags.append(iob_iobes(ntags))

  with open('/tmp/entity_cnt.txt','w') as f:
    f.write('fail_cnt - {}, ent_cnt - {}, special_cnt - {}\n'.format(fail_cnt,ent_cnt,special_cnt))

  return new_tags


def assemble_fake():
  base_dir = '../user_data/models'
  output_file= '../user_data/tcdata/fake.conll'

  patterns = [
    base_dir + '/k-fold/bif_electra_base_pretrain_fold_*/export/f1_export/result.txt',
    base_dir + '/k-fold/bif_electra_large_pretrain_fold_*/export/f1_export/result.txt',
  ]

  weights = [1/2] * 5 + [1/2] * 5 
  threshold = 3.0
  refine = True
  
  data = []
  raw_texts = []

  for pattern in patterns:
    for fname in glob(pattern):
      for i,line in enumerate(open(fname)):
        idx,text,tags = line.strip().split('\x01')
        if len(data) <= i:
          data.append([])
        data[i].append(tags.split(' '))
        if len(raw_texts) <= i:
          raw_texts.append((idx,text))
  
  assert len(data[0]) == len(weights)
  new_tags = merge_by_4_tuple(raw_texts,data,weights,threshold,refine)

  seen_texts = set()

  with open(output_file,'w') as f:
    for (idx,text),tags in zip(raw_texts,new_tags):
      if len(text) != len(tags):
        continue
      if text in seen_texts:
        continue
      else:
        seen_texts.add(text)
        
      for c,t in zip(text,tags):
        f.write(c + ' ' + t + '\n')
      f.write('\n')

def assemble_final():
  base_dir = '../user_data/models'
  output_file= './result.txt'

  patterns = [
    base_dir + '/k-fold/bif_fake_tags_fold_*/export/f1_export/result.txt',
  ]

  weights = [1] * 5 
  threshold = 3.0
  refine = True
  
  data = []
  raw_texts = []

  for pattern in patterns:
    for fname in glob(pattern):
      for i,line in enumerate(open(fname)):
        idx,text,tags = line.strip().split('\x01')
        if len(data) <= i:
          data.append([])
        data[i].append(tags.split(' '))
        if len(raw_texts) <= i:
          raw_texts.append((idx,text))
  
  assert len(data[0]) == len(weights)
  new_tags = merge_by_4_tuple(raw_texts,data,weights,threshold,refine)

  with open(output_file,'w') as f:
    for (idx,text),tags in zip(raw_texts,new_tags):
        assert len(text) == len(tags)
        f.write('\x01'.join([idx,text,' '.join(tags)]) + '\n')

================================================
FILE: code/conlleval.py
================================================
# Python version of the evaluation script from CoNLL'00-
# Originates from: https://github.com/spyysalo/conlleval.py


# Intentional differences:
# - accept any space as delimiter by default
# - optional file argument (default STDIN)
# - option to set boundary (-b argument)
# - LaTeX output (-l argument) not supported
# - raw tags (-r argument) not supported

import sys
import re
import codecs
from collections import defaultdict, namedtuple

ANY_SPACE = '<SPACE>'


class FormatError(Exception):
    pass

Metrics = namedtuple('Metrics', 'tp fp fn prec rec fscore')


class EvalCounts(object):
    def __init__(self):
        self.correct_chunk = 0    # number of correctly identified chunks
        self.correct_tags = 0     # number of correct chunk tags
        self.found_correct = 0    # number of chunks in corpus
        self.found_guessed = 0    # number of identified chunks
        self.token_counter = 0    # token counter (ignores sentence breaks)

        # counts by type
        self.t_correct_chunk = defaultdict(int)
        self.t_found_correct = defaultdict(int)
        self.t_found_guessed = defaultdict(int)


def parse_args(argv):
    import argparse
    parser = argparse.ArgumentParser(
        description='evaluate tagging results using CoNLL criteria',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    arg = parser.add_argument
    arg('-b', '--boundary', metavar='STR', default='-X-',
        help='sentence boundary')
    arg('-d', '--delimiter', metavar='CHAR', default=ANY_SPACE,
        help='character delimiting items in input')
    arg('-o', '--otag', metavar='CHAR', default='O',
        help='alternative outside tag')
    arg('file', nargs='?', default=None)
    return parser.parse_args(argv)


def parse_tag(t):
    m = re.match(r'^([^-]*)-(.*)$', t)
    return m.groups() if m else (t, '')


def evaluate(iterable, options=None):
    if options is None:
        options = parse_args([])    # use defaults

    counts = EvalCounts()
    num_features = None       # number of features per line
    in_correct = False        # currently processed chunks is correct until now
    last_correct = 'O'        # previous chunk tag in corpus
    last_correct_type = ''    # type of previously identified chunk tag
    last_guessed = 'O'        # previously identified chunk tag
    last_guessed_type = ''    # type of previous chunk tag in corpus

    for line in iterable:
        line = line.rstrip('\r\n')

        if options.delimiter == ANY_SPACE:
            features = line.split()
        else:
            features = line.split(options.delimiter)

        if num_features is None:
            num_features = len(features)
        elif num_features != len(features) and len(features) != 0:
            raise FormatError('unexpected number of features: %d (%d)' %
                              (len(features), num_features), line)

        if len(features) == 0 or features[0] == options.boundary:
            features = [options.boundary, 'O', 'O']
        if len(features) < 3:
            raise FormatError('unexpected number of features in line %s' % line)

        guessed, guessed_type = parse_tag(features.pop())
        correct, correct_type = parse_tag(features.pop())
        first_item = features.pop(0)

        if first_item == options.boundary:
            guessed = 'O'

        end_correct = end_of_chunk(last_correct, correct,
                                   last_correct_type, correct_type)
        end_guessed = end_of_chunk(last_guessed, guessed,
                                   last_guessed_type, guessed_type)
        start_correct = start_of_chunk(last_correct, correct,
                                       last_correct_type, correct_type)
        start_guessed = start_of_chunk(last_guessed, guessed,
                                       last_guessed_type, guessed_type)

        if in_correct:
            if (end_correct and end_guessed and
                last_guessed_type == last_correct_type):
                in_correct = False
                counts.correct_chunk += 1
                counts.t_correct_chunk[last_correct_type] += 1
            elif (end_correct != end_guessed or guessed_type != correct_type):
                in_correct = False

        if start_correct and start_guessed and guessed_type == correct_type:
            in_correct = True

        if start_correct:
            counts.found_correct += 1
            counts.t_found_correct[correct_type] += 1
        if start_guessed:
            counts.found_guessed += 1
            counts.t_found_guessed[guessed_type] += 1
        if first_item != options.boundary:
            if correct == guessed and guessed_type == correct_type:
                counts.correct_tags += 1
            counts.token_counter += 1

        last_guessed = guessed
        last_correct = correct
        last_guessed_type = guessed_type
        last_correct_type = correct_type

    if in_correct:
        counts.correct_chunk += 1
        counts.t_correct_chunk[last_correct_type] += 1

    return counts


def uniq(iterable):
  seen = set()
  return [i for i in iterable if not (i in seen or seen.add(i))]


def calculate_metrics(correct, guessed, total):
    tp, fp, fn = correct, guessed-correct, total-correct
    p = 0 if tp + fp == 0 else 1.*tp / (tp + fp)
    r = 0 if tp + fn == 0 else 1.*tp / (tp + fn)
    f = 0 if p + r == 0 else 2 * p * r / (p + r)
    return Metrics(tp, fp, fn, p, r, f)


def metrics(counts):
    c = counts
    overall = calculate_metrics(
        c.correct_chunk, c.found_guessed, c.found_correct
    )
    by_type = {}
    for t in uniq(list(c.t_found_correct) + list(c.t_found_guessed)):
        by_type[t] = calculate_metrics(
            c.t_correct_chunk[t], c.t_found_guessed[t], c.t_found_correct[t]
        )
    return overall, by_type


def report(counts, out=None):
    if out is None:
        out = sys.stdout

    overall, by_type = metrics(counts)

    c = counts
    out.write('processed %d tokens with %d phrases; ' %
              (c.token_counter, c.found_correct))
    out.write('found: %d phrases; correct: %d.\n' %
              (c.found_guessed, c.correct_chunk))

    if c.token_counter > 0:
        out.write('accuracy: %6.2f%%; ' %
                  (100.*c.correct_tags/c.token_counter))
        out.write('precision: %6.2f%%; ' % (100.*overall.prec))
        out.write('recall: %6.2f%%; ' % (100.*overall.rec))
        out.write('FB1: %6.2f\n' % (100.*overall.fscore))

    for i, m in sorted(by_type.items()):
        out.write('%17s: ' % i)
        out.write('precision: %6.2f%%; ' % (100.*m.prec))
        out.write('recall: %6.2f%%; ' % (100.*m.rec))
        out.write('FB1: %6.2f  %d\n' % (100.*m.fscore, c.t_found_guessed[i]))


def report_notprint(counts, out=None):
    if out is None:
        out = sys.stdout

    overall, by_type = metrics(counts)

    c = counts
    final_report = []
    line = []
    line.append('processed %d tokens with %d phrases; ' %
              (c.token_counter, c.found_correct))
    line.append('found: %d phrases; correct: %d.\n' %
              (c.found_guessed, c.correct_chunk))
    final_report.append("".join(line))

    if c.token_counter > 0:
        line = []
        line.append('accuracy: %6.2f%%; ' %
                  (100.*c.correct_tags/c.token_counter))
        line.append('precision: %6.2f%%; ' % (100.*overall.prec))
        line.append('recall: %6.2f%%; ' % (100.*overall.rec))
        line.append('FB1: %6.2f\n' % (100.*overall.fscore))
        final_report.append("".join(line))

    for i, m in sorted(by_type.items()):
        line = []
        line.append('%17s: ' % i)
        line.append('precision: %6.2f%%; ' % (100.*m.prec))
        line.append('recall: %6.2f%%; ' % (100.*m.rec))
        line.append('FB1: %6.2f  %d\n' % (100.*m.fscore, c.t_found_guessed[i]))
        final_report.append("".join(line))
    return final_report


def end_of_chunk(prev_tag, tag, prev_type, type_):
    # check if a chunk ended between the previous and current word
    # arguments: previous and current chunk tags, previous and current types
    chunk_end = False

    if prev_tag == 'E': chunk_end = True
    if prev_tag == 'S': chunk_end = True

    if prev_tag == 'B' and tag == 'B': chunk_end = True
    if prev_tag == 'B' and tag == 'S': chunk_end = True
    if prev_tag == 'B' and tag == 'O': chunk_end = True
    if prev_tag == 'I' and tag == 'B': chunk_end = True
    if prev_tag == 'I' and tag == 'S': chunk_end = True
    if prev_tag == 'I' and tag == 'O': chunk_end = True

    if prev_tag != 'O' and prev_tag != '.' and prev_type != type_:
        chunk_end = True

    # these chunks are assumed to have length 1
    if prev_tag == ']': chunk_end = True
    if prev_tag == '[': chunk_end = True

    return chunk_end


def start_of_chunk(prev_tag, tag, prev_type, type_):
    # check if a chunk started between the previous and current word
    # arguments: previous and current chunk tags, previous and current types
    chunk_start = False

    if tag == 'B': chunk_start = True
    if tag == 'S': chunk_start = True

    if prev_tag == 'E' and tag == 'E': chunk_start = True
    if prev_tag == 'E' and tag == 'I': chunk_start = True
    if prev_tag == 'S' and tag == 'E': chunk_start = True
    if prev_tag == 'S' and tag == 'I': chunk_start = True
    if prev_tag == 'O' and tag == 'E': chunk_start = True
    if prev_tag == 'O' and tag == 'I': chunk_start = True

    if tag != 'O' and tag != '.' and prev_type != type_:
        chunk_start = True

    # these chunks are assumed to have length 1
    if tag == '[': chunk_start = True
    if tag == ']': chunk_start = True

    return chunk_start


def return_report(input_file):
    with codecs.open(input_file, "r", "utf8") as f:
        counts = evaluate(f)
    return report_notprint(counts)


def main(argv):
    args = parse_args(argv[1:])

    if args.file is None:
        counts = evaluate(sys.stdin, args)
    else:
        with open(args.file) as f:
            counts = evaluate(f, args)
    report(counts)

if __name__ == '__main__':
    sys.exit(main(sys.argv))

================================================
FILE: code/create_raw_text.py
================================================
import re
import json
import random

from collections import Counter,defaultdict

from utils import normalize, read_data, convert_back_to_bio, convert_data_format, iob_iobes

random.seed(20190525)

TCDATA_DIR = '../user_data/tcdata/'
USERDATA_DIR = '../user_data/'


def read_conll(fname):
    lines = []
    line = ''
    for x in open(fname):
        x = x.strip()
        if not x:
            lines.append(line)
            line = ''
            continue
        else:
            line += x.split(' ')[0]
    return lines

def read_track3(fname):
    lines = []
    for x in open(fname):
        x = json.loads(x)
        lines.append(x['query'])
        for y in x['candidate']:
            lines.append(y['text'])
    return [normalize(x) for x in lines]

def create_preatrain_data():
    # 构建预训练语料
    data = open(TCDATA_DIR + 'final_test.txt').readlines()
    data = [x.strip().split('\x01')[1] for x in data]
    train = read_conll(TCDATA_DIR + 'train.conll')
    dev = read_conll(TCDATA_DIR + 'dev.conll')
    # 复赛没有使用
    # train3 = read_track3(
    #     USERDATA_DIR + 'track3/Xeon3NLP_round1_train_20210524.txt')
    # test3 = read_track3(
    #     USERDATA_DIR + 'track3/Xeon3NLP_round1_test_20210524.txt')
    extra_data = read_data([USERDATA_DIR + 'extra_data/train.txt', USERDATA_DIR +
                      'extra_data/dev.txt', USERDATA_DIR + 'extra_data/test.txt'])
    extra_data = [''.join([x[0] for x in item]) for item in extra_data]
    extra_data = [normalize(x) for x in extra_data]
    # old_test = open(USERDATA_DIR + 'track3/final_test.txt').readlines()
    # old_test = [x.strip().split('\x01')[1] for x in old_test]

    texts = list(set(data+train+dev+extra_data))
    texts = [t for t in texts if t.strip()]
    random.shuffle(texts)

    with open(USERDATA_DIR + 'texts/raw_text.txt', 'w') as f:
        for x in texts:
            f.write(x+'\n')

def convert_distance(item,tags):
    # 根据规则将assit中与距离相关的转换为distance标签
    text = item['text']
    spans = [x for x in re.finditer('(0+|(十?[一二三四五六七八九几]+(十|百)?[一二三四五六七八九几]?))米',text)]
    for sp in spans:
        start,end = sp.span()
        if tags[start][2:] == 'assist':
            tags[start:end] = ['I-distance'] * (end-start)
            tags[start] = 'B-distance'
            if end < len(tags) and tags[end][0] == 'I':
                tags[end] = 'B' + tags[end][1:]
    return tags,spans

def convert_village(item,tags):
    # 根据规则转换village_group标签
    text = item['text']
    spans = [x for x in re.finditer('(0+|(十?[一二三四五六七八九])|([一二三四五六七八九]十[一二三四五六七八九]?))[组队社]',text)]
    for sp in spans:
        start,end = sp.span()
        if start > 0 and tags[start-1][2:] == 'community':
            tags[start:end] = ['I-village_group'] * (end-start)
            tags[start] = 'B-village_group'
            if end < len(tags) and tags[end][0] == 'I':
                tags[end] = 'B' + tags[end][1:]
    return tags, spans

def convert_intersection(item,tags,pattern):
    # 根据在训练验证集出现过的intersection字段对标签进行转换
    text = item['text']
    spans = [x for x in re.finditer(pattern,text)]
    for sp in spans:
        start,end = sp.span()
        if tags[start][2:] == 'assist' or text[start:end] == '路口':
            if text[start:end] == '路口':
                if tags[start][2:] == 'road' and tags[start+1][2:] == 'assist':
                    start = start + 1
                elif tags[start-1][2:] == 'road' and text[start-1] not in ['街','路']:
                    tags[start] = 'I-road'
                    start = start + 1
            tags[start:end] = ['I-intersection'] * (end-start)
            tags[start] = 'B-intersection'
            if end < len(tags) and tags[end][0] == 'I':
                tags[end] = 'B' + tags[end][1:]
    return tags,spans

def get_intersection_pattern():
    # 根据赛道2的训练数据获取路口的模式匹配
    train = read_data(TCDATA_DIR+'train.conll')
    dev = read_data(TCDATA_DIR+'dev.conll')
    train = [convert_data_format(x) for x in train]
    dev = [convert_data_format(x) for x in dev]

    inter_cnt = Counter()
    for x in train+dev:
        inter = x['label'].get('intersection','')
        if inter:
            for k in inter:
                inter_cnt[k] += 1

    inter_words = [x[0] for x in inter_cnt.most_common() if len(x[0]) > 1]
    pattern = '|'.join(['({})'.format(x) for x in inter_words])
    return pattern

def check_devzone(name):
    for x in ['经济开发区','园区','开发区','工业园','工业区','科技园','工业园区','创意园','产业园','软件谷','软件园','电商园','智慧国','智慧园','未来科技城','科创中心','机电城','工业城','商务园']:
        if name.endswith(x):
            return True
    return False

def convert_data_format_v2(sentence):
    word = ''
    tag = ''
    text = ''
    tag_words = []
    for i,(c,t) in enumerate(sentence):
        c = normalize(c)
        if t[0] in ['B','S','O']:
            if word:
                tag_words.append((word,len(text),tag))
            if t[0] == 'O':
                word = ''
                tag = ''
                continue
            word = c
            tag = t[2:]
        else:
            word += c
        text += c
        
    if word:
        tag_words.append((word,len(text),tag))
        

    entities = {}
    for w,i,t in tag_words:
        if check_devzone(w):
            t = 'devzone'
        if t not in entities:
            entities[t] = {}
        if w in entities[t]:
            entities[t][w].append([i-len(w),i-1])
        else:
            entities[t][w] = [[i-len(w),i-1]]
    
    return {"text":text,"label":entities}

def _get_refine_entity(raw_files):
    data = read_data(raw_files)
    ent_tp_cnt = defaultdict(Counter)
    ent_cnt = Counter()
    for sentence in data:
        entities = convert_data_format(sentence)['label']
        for k in entities:
            for name in entities[k]:
                ent_tp_cnt[name][k] += 1
                ent_cnt[name] += 1
    
    for name in ent_tp_cnt:
        if ent_cnt[name] < 10:
                continue
        if len(ent_tp_cnt[name]) == 1:
            continue
        if len(ent_tp_cnt[name]) >= 2:
            pop = []
            for tp in ent_tp_cnt[name]:
                if ent_tp_cnt[name][tp] / ent_cnt[name] < 0.1 and ent_tp_cnt[name][tp] < 5:
                    pop.append(tp)
            for tp in pop:
                ent_tp_cnt[name].pop(tp)
    return ent_tp_cnt

def _fix_data(ent_tp_cnt, update_files, iob=False):
    data = read_data(update_files)
    new_data = []
    wcnt = 0
    for sentence in data:
        entities = convert_data_format(sentence)['label']
        new_entities = {}
        for k in entities:
            for name in entities[k]:
                spans = entities[k][name]
                cnt = ent_tp_cnt[name]
                nk = k
                if k not in cnt:
                    # print(''.join([w[0] for w in sentence]))
                    try:
                        nk = ent_tp_cnt[name].most_common(1)[0][0]
                    except:
                        # print('no entity', name,ent_tp_cnt[name],k,entities[k])
                        continue
                    # print("wrong:",name,k,'->',nk)
                    wcnt += 1
                new_entities[nk] = {}
                new_entities[nk][name] = spans
        if iob:
            tags = convert_back_to_bio(new_entities,[w[0] for w in sentence])
        else:
            tags = iob_iobes(convert_back_to_bio(new_entities,[w[0] for w in sentence]))
        new_data.append([(a[0],b) for a,b in zip(sentence,tags)])
    print('# total wrong',wcnt)
    return new_data

def fix_data():
    ent_tp_cnt = _get_refine_entity([TCDATA_DIR + 'train.conll', TCDATA_DIR + 'dev.conll',TCDATA_DIR + 'extra_train.conll'])
    extra_files = TCDATA_DIR + 'extra_train.conll'
    new_data = _fix_data(ent_tp_cnt,extra_files,iob=True)
    with open(TCDATA_DIR + 'extra_train_v2.conll','w') as f:
        for s in new_data:
            for x in s:
                f.write(x[0] + ' ' + x[1] + '\n')
            f.write('\n')   
    
    new_data = _fix_data(ent_tp_cnt,TCDATA_DIR + 'train.conll')
    with open(TCDATA_DIR + 'train_v2.conll','w') as f:
        for s in new_data:
            for x in s:
                f.write(x[0] + ' ' + x[1] + '\n')
            f.write('\n')  
    new_data = _fix_data(ent_tp_cnt,TCDATA_DIR + 'dev.conll')
    with open(TCDATA_DIR + 'dev_v2.conll','w') as f:
        for s in new_data:
            for x in s:
                f.write(x[0] + ' ' + x[1] + '\n')
            f.write('\n')

def create_extra_train_data():
    # 额外的训练数据
    # 数据来源：https://github.com/leodotnet/neural-chinese-address-parsing
    data = read_data([USERDATA_DIR + 'extra_data/train.txt', USERDATA_DIR +
                      'extra_data/dev.txt', USERDATA_DIR + 'extra_data/test.txt'])
    pattern = get_intersection_pattern()

    new_data = []
    for sentence in data:
        item = convert_data_format_v2(sentence)
        tags = convert_back_to_bio(item['label'],item['text'])

        # 对数据标签进行映射    
        new_tags = []
        for i,t in enumerate(tags):
            tt = t[2:]
            if tt in ['country','roomno','otherinfo','redundant']:
                new_tags.append('O')
            elif tt == 'person':
                new_tags.append(t[:2] + 'subpoi')
            elif tt == 'devZone':
                new_tags.append(t[:2] + 'devzone')
            elif tt in ['subRoad','subroad']:
                new_tags.append(t[:2] + 'road')
            elif tt in ['subRoadno','subroadno']:
                new_tags.append(t[:2] + 'roadno')
            else:
                new_tags.append(t) 

        # 处理distance
        new_tags,_ = convert_distance(item,new_tags)
        # 处理village_group
        new_tags,_ = convert_village(item, new_tags)
        # 处理intersection
        new_tags,_ = convert_intersection(item,new_tags,pattern)

        # 两个路之间的和字改成O
        spans = re.finditer('与|和',item['text'])
        for sp in spans:
            start,end = sp.span()
            if new_tags[start][2:]=='assist' and start > 0 and new_tags[start-1][2:] == 'road' and start < len(new_tags) and new_tags[start+1][2:] == 'road':
                new_tags[start] = 'O'
        
        # 去除噪声开头
        valid_start = ['B-prov','B-city','B-district','B-town','B-road','B-poi','B-devzone','B-community']
        for i,t in enumerate(new_tags):
            if t not in valid_start:
                continue
            break     
        new_tags = new_tags[i:]
        text = item['text'][i:]

        # 去除过短文本
        if len(text) <= 2:
            continue
        
        text = normalize(text)
        assert len(new_tags) == len(text),(text,new_tags,item,sentence)
        s = [(a,b) for a,b in zip(text,new_tags)]
        new_data.append(s)
    
    with open(TCDATA_DIR + 'extra_train.conll','w') as f:
        for s in new_data:
            for x in s:
                f.write(x[0] + ' ' + x[1] + '\n')
            f.write('\n')

if __name__ == '__main__':
    print('# create pretrain data')
    create_preatrain_data()
    print('# create extra data')
    create_extra_train_data()
    print('# fix wrong data')
    fix_data()


================================================
FILE: code/electra-pretrain/.gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/


================================================
FILE: code/electra-pretrain/LICENSE
================================================

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: code/electra-pretrain/README.md
================================================
# Electra Pretrain

在哈工大训练的electra基础上使用领域数据继续进行预训练，一般能够提升下游任务效果。

## 改动

- 由于我们的语料是单句粒度，修改数据构建方法，只构建单句的语料
- 针对中文，使用更简单的tokenizer，即将所有字符直接拆分【主要是适配下游的NER任务】
- 修改预训练代码，支持加载预训练的模型的参数

## 使用

新建个DATA_DIR，然后在里面新建texts目录，将文本数据放入。

需要根据自己的语料，修改configure_pretraining的参数，包括max_seq_len，num_train_steps等。

运行pretrain.sh【根据自己的实际场景修改参数】。

> 建议自己阅读run_pretrain.py的代码，理解里面的各种参数配置。

## 效果

在ccks2021-track2赛道上进行了测试，用track2和track3的数据继续预训练electra-base，训练33k步后，指标为：

```python
disc_accuracy = 0.96376425
disc_auc = 0.97588205
disc_loss = 0.11515158
disc_precision = 0.79076445
disc_recall = 0.32165003
global_step = 33000
loss = 6.575825
masked_lm_accuracy = 0.7298883
masked_lm_loss = 1.2599187
sampled_masked_lm_accuracy = 0.6684708
```

在track2这个NER任务上，直接使用中文的electra-base模型，dev的F1指标为94.19，继续预训练后可以提升到94.74【线上为92.567，单模型】。

================================================
FILE: code/electra-pretrain/build_pretraining_dataset.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Writes out text data as tfrecords that ELECTRA can be pre-trained on."""

import argparse
import multiprocessing
import os
import random
import time
import tensorflow.compat.v1 as tf

from model import tokenization
from util import utils


def create_int_feature(values):
  feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
  return feature


class ExampleBuilder(object):
  """Given a stream of input text, creates pretraining examples."""

  def __init__(self, tokenizer, max_length):
    self._tokenizer = tokenizer
    self._current_sentences = []
    self._max_length = max_length

  def add_line(self, line):
    """Adds a line of text to the current example being built."""
    line = line.strip().replace("\n", " ")
    bert_tokens = self._tokenizer.tokenize(line)
    bert_tokids = self._tokenizer.convert_tokens_to_ids(bert_tokens)
    self._current_sentences.append(bert_tokids)
    return self._create_example()

  def _create_example(self):
    """Creates a pre-training example from the current list of sentences."""
    first_segment = []
    for sentence in self._current_sentences:
        first_segment += sentence

    # trim to max_length while accounting for not-yet-added [CLS]/[SEP] tokens
    first_segment = first_segment[:self._max_length - 2]

    # prepare to start building the next example
    self._current_sentences = []
    return self._make_tf_example(first_segment, None)

  def _make_tf_example(self, first_segment, second_segment):
    """Converts two "segments" of text into a tf.train.Example."""
    vocab = self._tokenizer.vocab
    input_ids = [vocab["[CLS]"]] + first_segment + [vocab["[SEP]"]]
    segment_ids = [0] * len(input_ids)
    if second_segment:
      input_ids += second_segment + [vocab["[SEP]"]]
      segment_ids += [1] * (len(second_segment) + 1)
    input_mask = [1] * len(input_ids)
    input_ids += [0] * (self._max_length - len(input_ids))
    input_mask += [0] * (self._max_length - len(input_mask))
    segment_ids += [0] * (self._max_length - len(segment_ids))
    tf_example = tf.train.Example(features=tf.train.Features(feature={
        "input_ids": create_int_feature(input_ids),
        "input_mask": create_int_feature(input_mask),
        "segment_ids": create_int_feature(segment_ids)
    }))
    return tf_example


class ExampleWriter(object):
  """Writes pre-training examples to disk."""

  def __init__(self, job_id, vocab_file, output_dir, max_seq_length,
               num_jobs, blanks_separate_docs,
               num_out_files=1):
    self._blanks_separate_docs = blanks_separate_docs
    tokenizer = tokenization.SimpleTokenizer(vocab_file=vocab_file)
    self._example_builder = ExampleBuilder(tokenizer, max_seq_length)
    self._writers = []
    for i in range(num_out_files):
      if i % num_jobs == job_id:
        output_fname = os.path.join(
            output_dir, "pretrain_data.tfrecord-{:}-of-{:}".format(
                i, num_out_files))
        self._writers.append(tf.io.TFRecordWriter(output_fname))
    self.n_written = 0

  def write_examples(self, input_file):
    """Writes out examples from the provided input file."""
    with tf.io.gfile.GFile(input_file) as f:
      for line in f:
        line = line.strip()
        if line or self._blanks_separate_docs:
          example = self._example_builder.add_line(line)
          if example:
            self._writers[self.n_written % len(self._writers)].write(
                example.SerializeToString())
            self.n_written += 1
            if self.n_written % 5000 == 0:
              print('processed',self.n_written)

  def finish(self):
    for writer in self._writers:
      writer.close()


def write_examples(job_id, args):
  """A single process creating and writing out pre-processed examples."""

  def log(*args):
    msg = " ".join(map(str, args))
    print("Job {}:".format(job_id), msg)

  log("Creating example writer")
  example_writer = ExampleWriter(
      job_id=job_id,
      vocab_file=args.vocab_file,
      output_dir=args.output_dir,
      max_seq_length=args.max_seq_length,
      num_jobs=args.num_processes,
      blanks_separate_docs=args.blanks_separate_docs
  )
  log("Writing tf examples")
  fnames = sorted(tf.io.gfile.listdir(args.corpus_dir))
  fnames = [f for (i, f) in enumerate(fnames)
            if i % args.num_processes == job_id]
  random.shuffle(fnames)
  start_time = time.time()
  for file_no, fname in enumerate(fnames):
    if file_no > 0:
      elapsed = time.time() - start_time
      log("processed {:}/{:} files ({:.1f}%), ELAPSED: {:}s, ETA: {:}s, "
          "{:} examples written".format(
              file_no, len(fnames), 100.0 * file_no / len(fnames), int(elapsed),
              int((len(fnames) - file_no) / (file_no / elapsed)),
              example_writer.n_written))
    example_writer.write_examples(os.path.join(args.corpus_dir, fname))
  example_writer.finish()
  log("Done!")


def main():
  parser = argparse.ArgumentParser(description=__doc__)
  parser.add_argument("--corpus-dir", required=True,
                      help="Location of pre-training text files.")
  parser.add_argument("--vocab-file", required=True,
                      help="Location of vocabulary file.")
  parser.add_argument("--output-dir", required=True,
                      help="Where to write out the tfrecords.")
  parser.add_argument("--max-seq-length", default=64, type=int,
                      help="Number of tokens per example.")
  parser.add_argument("--num-processes", default=1, type=int,
                      help="Parallelize across multiple processes.")
  parser.add_argument("--blanks-separate-docs", default=False, type=bool,
                      help="Whether blank lines indicate document boundaries.")
  args = parser.parse_args()

  utils.rmkdir(args.output_dir)
  if args.num_processes == 1:
    write_examples(0, args)
  else:
    jobs = []
    for i in range(args.num_processes):
      job = multiprocessing.Process(target=write_examples, args=(i, args))
      jobs.append(job)
      job.start()
    for job in jobs:
      job.join()


if __name__ == "__main__":
  main()


================================================
FILE: code/electra-pretrain/config/base_discriminator_config.json
================================================
{
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "type_vocab_size": 2,
  "vocab_size": 21128
}

================================================
FILE: code/electra-pretrain/config/base_generator_config.json
================================================
{
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "embedding_size": 768,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 192,
  "initializer_range": 0.02,
  "intermediate_size": 768,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 3,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "type_vocab_size": 2,
  "vocab_size": 21128
}

================================================
FILE: code/electra-pretrain/config/large_discriminator_config.json
================================================
{
  "attention_probs_dropout_prob": 0.1,
  "embedding_size": 1024,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "type_vocab_size": 2,
  "vocab_size": 21128
}

================================================
FILE: code/electra-pretrain/config/large_generator_config.json
================================================
{
  "attention_probs_dropout_prob": 0.1,
  "embedding_size": 1024,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "type_vocab_size": 2,
  "vocab_size": 21128
}

================================================
FILE: code/electra-pretrain/configure_pretraining.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Config controlling hyperparameters for pre-training ELECTRA."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os


class PretrainingConfig(object):
  """Defines pre-training hyperparameters."""

  def __init__(self, model_name, data_dir, **kwargs):
    self.model_name = model_name
    self.init_checkpoint = kwargs.get('init_checkpoint',"/nfs/users/xueyou/data/bert_pretrain/electra_180g_base/electra_180g_base.ckpt")
    self.embedding_file = kwargs.get('embedding_file',None)
    self.debug = False  # debug mode for quickly running things
    self.do_train = True  # pre-train ELECTRA
    self.do_eval = False  # evaluate generator/discriminator on unlabeled data

    # loss functions
    self.electra_objective = True  # if False, use the BERT objective instead
    self.gen_weight = 1.0  # masked language modeling / generator loss
    self.disc_weight = 50.0  # discriminator loss
    self.mask_prob = 0.15  # percent of input tokens to mask out / replace

    # optimization
    self.learning_rate = 2e-4
    self.lr_decay_power = 1.0  # linear weight decay by default
    self.weight_decay_rate = 0.01
    self.num_warmup_steps = 700
    self.use_amp = False
    self.accumulation_step = 1

    # training settings
    self.iterations_per_loop = 200
    self.save_checkpoints_steps = 30000
    self.num_train_steps = 7000
    self.num_eval_steps = 100

    # model settings
    self.model_size = "large"  # one of "small", "base", or "large"
    # override the default transformer hparams for the provided model size; see
    # modeling.BertConfig for the possible hparams and util.training_utils for
    # the defaults
    self.model_hparam_overrides = (
        kwargs["model_hparam_overrides"]
        if "model_hparam_overrides" in kwargs else {})
    self.embedding_size = None  # bert hidden size by default
    self.vocab_size = 21128  # number of tokens in the vocabulary
    self.do_lower_case = True  # lowercase the input?

    # generator settings
    self.uniform_generator = False  # generator is uniform at random
    self.untied_generator_embeddings = False  # tie generator/discriminator
                                              # token embeddings?
    self.untied_generator = True  # tie all generator/discriminator weights?
    self.generator_layers = 1.0  # frac of discriminator layers for generator
    self.generator_hidden_size = 0.25  # frac of discrim hidden size for gen
    self.disallow_correct = False  # force the generator to sample incorrect
                                   # tokens (so 15% of tokens are always
                                   # fake)
    self.temperature = 1.0  # temperature for sampling from generator

    # batch sizes
    self.max_seq_length = 64
    self.train_batch_size = 32
    self.eval_batch_size = 128

    # TPU settings
    self.use_tpu = False
    self.num_tpu_cores = 1
    self.tpu_job_name = None
    self.tpu_name = None  # cloud TPU to use for training
    self.tpu_zone = None  # GCE zone where the Cloud TPU is located in
    self.gcp_project = None  # project name for the Cloud TPU-enabled project

    # default locations of data files
    self.pretrain_tfrecords = os.path.join(
        data_dir, "pretrain_tfrecords/pretrain_data.tfrecord*")
    self.vocab_file = kwargs.get('vocab_file','/nfs/users/xueyou/data/bert_pretrain/electra_180g_base/vocab.txt')
    self.model_dir = os.path.join(data_dir, "models", model_name)
    results_dir = os.path.join(self.model_dir, "results")
    self.results_txt = os.path.join(results_dir, "unsup_results.txt")
    self.results_pkl = os.path.join(results_dir, "unsup_results.pkl")

    # update defaults with passed-in hyperparameters
    self.update(kwargs)

    self.max_predictions_per_seq = int((self.mask_prob + 0.005) *
                                       self.max_seq_length)

    # debug-mode settings
    if self.debug:
      self.train_batch_size = 8
      self.num_train_steps = 20
      self.eval_batch_size = 4
      self.iterations_per_loop = 1
      self.num_eval_steps = 2

    # defaults for different-sized model
    if self.model_size == "small":
      self.embedding_size = 256
    if self.model_size == "base":
      self.embedding_size = 768
    
    if self.model_size == 'large':
      self.embedding_size = 1024

    # passed-in-arguments override (for example) debug-mode defaults
    self.update(kwargs)

  def update(self, kwargs):
    for k, v in kwargs.items():
      if k not in self.__dict__:
        raise ValueError("Unknown hparam " + k)
      self.__dict__[k] = v


================================================
FILE: code/electra-pretrain/model/__init__.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

================================================
FILE: code/electra-pretrain/model/modeling.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""The transformer encoder used by ELECTRA. Essentially BERT's with a few
additional functionalities added.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import copy
import json
import math
import re

import numpy as np
import six
import tensorflow.compat.v1 as tf
from tensorflow.contrib import layers as contrib_layers


class BertConfig(object):
  """Configuration for `BertModel` (ELECTRA uses the same model as BERT)."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=2,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.io.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


class BertModel(object):
  """BERT model. Although the training algorithm is different, the transformer
  model for ELECTRA is the same as BERT's.

  Example usage:

  ```python
  # Already been converted into WordPiece token ids
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
  token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

  config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

  model = modeling.BertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

  label_embeddings = tf.get_variable(...)
  pooled_output = model.get_pooled_output()
  logits = tf.matmul(pooled_output, label_embeddings)
  ...
  ```
  """

  def __init__(self,
               bert_config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=True,
               scope=None,
               embedding_size=None,
               input_embeddings=None,
               input_reprs=None,
               update_embeddings=True,
               untied_embeddings=False,
               embedding_file=None):
    """Constructor for BertModel.

    Args:
      bert_config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
        it is much faster if this is True, on the CPU or GPU, it is faster if
        this is False.
      scope: (optional) variable scope. Defaults to "electra".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
    """
    bert_config = copy.deepcopy(bert_config)
    if not is_training:
      bert_config.hidden_dropout_prob = 0.0
      bert_config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(token_type_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    assert token_type_ids is not None

    if input_reprs is None:
      if input_embeddings is None:
        with tf.variable_scope(
            (scope if untied_embeddings else "electra") + "/embeddings",
            reuse=tf.AUTO_REUSE):
          # Perform embedding lookup on the word ids
          if embedding_size is None:
            embedding_size = bert_config.hidden_size
          (self.token_embeddings, self.embedding_table) = embedding_lookup(
              input_ids=input_ids,
              vocab_size=bert_config.vocab_size,
              embedding_size=embedding_size,
              initializer_range=bert_config.initializer_range,
              word_embedding_name="word_embeddings",
              use_one_hot_embeddings=use_one_hot_embeddings,
              embedding_file=embedding_file)
      else:
        self.token_embeddings = input_embeddings

      with tf.variable_scope(
          (scope if untied_embeddings else "electra") + "/embeddings",
          reuse=tf.AUTO_REUSE):
        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.token_embeddings,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=bert_config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=bert_config.initializer_range,
            max_position_embeddings=bert_config.max_position_embeddings,
            dropout_prob=bert_config.hidden_dropout_prob)
    else:
      self.embedding_output = input_reprs
    if not update_embeddings:
      self.embedding_output = tf.stop_gradient(self.embedding_output)

    with tf.variable_scope(scope, default_name="electra"):
      if self.embedding_output.shape[-1] != bert_config.hidden_size:
        self.embedding_output = tf.layers.dense(
            self.embedding_output, bert_config.hidden_size,
            name="embeddings_project")

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            token_type_ids, input_mask)

        # Run the stacked transformer. Output shapes
        # sequence_output: [batch_size, seq_length, hidden_size]
        # pooled_output: [batch_size, hidden_size]
        # all_encoder_layers: [n_layers, batch_size, seq_length, hidden_size].
        # attn_maps: [n_layers, batch_size, n_heads, seq_length, seq_length]
        (self.all_layer_outputs, self.attn_maps) = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=bert_config.hidden_size,
            num_hidden_layers=bert_config.num_hidden_layers,
            num_attention_heads=bert_config.num_attention_heads,
            intermediate_size=bert_config.intermediate_size,
            intermediate_act_fn=get_activation(bert_config.hidden_act),
            hidden_dropout_prob=bert_config.hidden_dropout_prob,
            attention_probs_dropout_prob=
            bert_config.attention_probs_dropout_prob,
            initializer_range=bert_config.initializer_range,
            do_return_all_layers=True)
        self.sequence_output = self.all_layer_outputs[-1]
        self.pooled_output = self.sequence_output[:, 0]

  def get_pooled_output(self):
    return self.pooled_output

  def get_sequence_output(self):
    """Gets final hidden layer of encoder.

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the final hidden of the transformer encoder.
    """
    return self.sequence_output

  def get_all_encoder_layers(self):
    return self.all_layer_outputs

  def get_embedding_output(self):
    """Gets output of the embedding lookup (i.e., input to the transformer).

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the output of the embedding layer, after summing the word
      embeddings with the positional embeddings and the token type embeddings,
      then performing layer normalization. This is the input to the transformer.
    """
    return self.embedding_output

  def get_embedding_table(self):
    return self.embedding_table


def gelu(input_tensor):
  """Gaussian Error Linear Unit.

  This is a smoother version of the RELU.
  Original paper: https://arxiv.org/abs/1606.08415

  Args:
    input_tensor: float Tensor to perform activation.

  Returns:
    `input_tensor` with the GELU activation applied.
  """
  cdf = 0.5 * (1.0 + tf.math.erf(input_tensor / tf.sqrt(2.0)))
  return input_tensor * cdf


def get_activation(activation_string):
  """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.

  Args:
    activation_string: String name of the activation function.

  Returns:
    A Python function corresponding to the activation function. If
    `activation_string` is None, empty, or "linear", this will return None.
    If `activation_string` is not a string, it will return `activation_string`.

  Raises:
    ValueError: The `activation_string` does not correspond to a known
      activation.
  """

  # We assume that anything that"s not a string is already an activation
  # function, so we just return it.
  if not isinstance(activation_string, six.string_types):
    return activation_string

  if not activation_string:
    return None

  act = activation_string.lower()
  if act == "linear":
    return None
  elif act == "relu":
    return tf.nn.relu
  elif act == "gelu":
    return gelu
  elif act == "tanh":
    return tf.tanh
  else:
    raise ValueError("Unsupported activation: %s" % act)


def get_assignment_map_from_checkpoint(tvars, init_checkpoint, prefix="", update_vocab=False):
  """Compute the union of the current variables and checkpoint variables."""
  name_to_variable = collections.OrderedDict()
  for var in tvars:
    name = var.name
    m = re.match("^(.*):\\d+$", name)
    if m is not None:
      name = m.group(1)
    name_to_variable[name] = var

  initialized_variable_names = {}
  assignment_map = collections.OrderedDict()
  for x in tf.train.list_variables(init_checkpoint):
    (name, var) = (x[0], x[1])
    if prefix + name not in name_to_variable:
      continue
    if update_vocab:
      if 'word_embeddings' in name or 'output_bias' in name:
        continue
    assignment_map[name] = prefix + name
    initialized_variable_names[name] = 1
    initialized_variable_names[name + ":0"] = 1

  return assignment_map, initialized_variable_names


def dropout(input_tensor, dropout_prob):
  """Perform dropout.

  Args:
    input_tensor: float Tensor.
    dropout_prob: Python float. The probability of dropping out a value (NOT of
      *keeping* a dimension as in `tf.nn.dropout`).

  Returns:
    A version of `input_tensor` with dropout applied.
  """
  if dropout_prob is None or dropout_prob == 0.0:
    return input_tensor

  output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
  return output


def layer_norm(input_tensor, name=None):
  """Run layer normalization on the last dimension of the tensor."""
  return contrib_layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)


def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
  """Runs layer normalization followed by dropout."""
  output_tensor = layer_norm(input_tensor, name)
  output_tensor = dropout(output_tensor, dropout_prob)
  return output_tensor


def create_initializer(initializer_range=0.02):
  """Creates a `truncated_normal_initializer` with the given range."""
  return tf.truncated_normal_initializer(stddev=initializer_range)

def load_pretrained_embedding(embedding_file, vocab_size, embedding_size):
  pretrained = np.random.normal(size=(vocab_size,embedding_size))
  for i,line in enumerate(open(embedding_file)):
    fields = line.strip().split()
    word = fields[0]
    ebd = np.asarray(fields[1:])
    if len(ebd) != embedding_size:
      tf.logging.warning(f'第{i}行embedding大小为{len(ebd)} != {embedding_size}')
      raise
    else:
      pretrained[i] = ebd
  return pretrained

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False,
                     embedding_file=None):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  original_dims = input_ids.shape.ndims
  if original_dims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  if embedding_file:
    print(f'##### 从{embedding_file}加载预训练词向量 #####')
    pretrained = load_pretrained_embedding(embedding_file,vocab_size,embedding_size)
    if pretrained is not None:
      initializer = tf.constant_initializer(value=pretrained)
      embedding_table = tf.get_variable(
        name=word_embedding_name,
        initializer=lambda : initializer([vocab_size,embedding_size]))
    else:
      raise Exception('初始化词向量失败')
  else:
    embedding_table = tf.get_variable(
        name=word_embedding_name,
        shape=[vocab_size, embedding_size],
        initializer=create_initializer(initializer_range))

  if original_dims == 3:
    input_shape = get_shape_list(input_ids)
    tf.reshape(input_ids, [-1, input_shape[-1]])
    output = tf.matmul(input_ids, embedding_table)
    output = tf.reshape(output,
                        [input_shape[0], input_shape[1], embedding_size])
  else:
    if use_one_hot_embeddings:
      flat_input_ids = tf.reshape(input_ids, [-1])
      one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
      output = tf.matmul(one_hot_input_ids, embedding_table)
    else:
      output = tf.nn.embedding_lookup(embedding_table, input_ids)

    input_shape = get_shape_list(input_ids)

    output = tf.reshape(output,
                        input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return output, embedding_table


def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output


def create_attention_mask_from_input_mask(from_tensor, to_mask):
  """Create 3D attention mask from a 2D tensor mask.

  Args:
    from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
    to_mask: int32 Tensor of shape [batch_size, to_seq_length].

  Returns:
    float Tensor of shape [batch_size, from_seq_length, to_seq_length].
  """
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # We don't assume that `from_tensor` is a mask (although it could be). We
  # don't actually care if we attend *from* padding tokens (only *to* padding)
  # tokens so we create a tensor of all ones.
  #
  # `broadcast_ones` = [batch_size, from_seq_length, 1]
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # Here we broadcast along two dimensions to create the mask.
  mask = broadcast_ones * to_mask

  return mask


def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the same, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-with vector.

  This function first projects `from_tensor` into a "query" tensor and
  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
  of tensors of length `num_attention_heads`, where each tensor is of shape
  [batch_size, seq_length, size_per_head].

  Then, the query and key tensors are dot-producted and scaled. These are
  softmaxed to obtain attention probabilities. The value tensors are then
  interpolated by these probabilities, then concatenated back to a single
  tensor and returned.

  In practice, the multi-headed attention are done with transposes and
  reshapes rather than actual separate tensors.

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
    float Tensor of shape [batch_size, from_seq_length,
      num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
      true, this will be of shape [batch_size * from_seq_length,
      num_attention_heads * size_per_head]).

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if batch_size is None or from_seq_length is None or to_seq_length is None:
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer, attention_probs


def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  attn_maps = []
  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head, probs = attention_layer(
              from_tensor=prev_output,
              to_tensor=prev_output,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)
          attn_maps.append(probs)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + prev_output)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        prev_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        prev_output = dropout(prev_output, hidden_dropout_prob)
        prev_output = layer_norm(prev_output + attention_output)
        all_layer_outputs.append(prev_output)

  attn_maps = tf.stack(attn_maps, 0)
  if do_return_all_layers:
    return tf.stack([reshape_from_matrix(layer, input_shape)
                     for layer in all_layer_outputs], 0), attn_maps
  else:
    return reshape_from_matrix(prev_output, input_shape), attn_maps


def get_shape_list(tensor, expected_rank=None, name=None):
  """Returns a list of the shape of tensor, preferring static dimensions.

  Args:
    tensor: A tf.Tensor object to find the shape of.
    expected_rank: (optional) int. The expected rank of `tensor`. If this is
      specified and the `tensor` has a different rank, and exception will be
      thrown.
    name: Optional name of the tensor for the error message.

  Returns:
    A list of dimensions of the shape of tensor. All static dimensions will
    be returned as python integers, and dynamic dimensions will be returned
    as tf.Tensor scalars.
  """
  if isinstance(tensor, np.ndarray) or isinstance(tensor, list):
    shape = np.array(tensor).shape
    if isinstance(expected_rank, six.integer_types):
      assert len(shape) == expected_rank
    elif expected_rank is not None:
      assert len(shape) in expected_rank
    return shape

  if name is None:
    name = tensor.name

  if expected_rank is not None:
    assert_rank(tensor, expected_rank, name)

  shape = tensor.shape.as_list()

  non_static_indexes = []
  for (index, dim) in enumerate(shape):
    if dim is None:
      non_static_indexes.append(index)

  if not non_static_indexes:
    return shape

  dyn_shape = tf.shape(tensor)
  for index in non_static_indexes:
    shape[index] = dyn_shape[index]
  return shape


def reshape_to_matrix(input_tensor):
  """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
  ndims = input_tensor.shape.ndims
  if ndims < 2:
    raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
                     (input_tensor.shape))
  if ndims == 2:
    return input_tensor

  width = input_tensor.shape[-1]
  output_tensor = tf.reshape(input_tensor, [-1, width])
  return output_tensor


def reshape_from_matrix(output_tensor, orig_shape_list):
  """Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
  if len(orig_shape_list) == 2:
    return output_tensor

  output_shape = get_shape_list(output_tensor)

  orig_dims = orig_shape_list[0:-1]
  width = output_shape[-1]

  return tf.reshape(output_tensor, orig_dims + [width])


def assert_rank(tensor, expected_rank, name=None):
  """Raises an exception if the tensor rank is not of the expected rank.

  Args:
    tensor: A tf.Tensor to check the rank of.
    expected_rank: Python integer or list of integers, expected rank.
    name: Optional name of the tensor for the error message.

  Raises:
    ValueError: If the expected shape doesn't match the actual shape.
  """
  if name is None:
    name = tensor.name

  expected_rank_dict = {}
  if isinstance(expected_rank, six.integer_types):
    expected_rank_dict[expected_rank] = True
  else:
    for x in expected_rank:
      expected_rank_dict[x] = True

  actual_rank = tensor.shape.ndims
  if actual_rank not in expected_rank_dict:
    scope_name = tf.get_variable_scope().name
    raise ValueError(
        "For the tensor `%s` in scope `%s`, the actual rank "
        "`%d` (shape = %s) is not equal to the expected rank `%s`" %
        (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))


================================================
FILE: code/electra-pretrain/model/optimization.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Functions and classes related to optimization (weight updates).
Modified from the original BERT code to allow for having separate learning
rates for different layers of the network.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import re
import tensorflow.compat.v1 as tf


def create_optimizer(
    loss, learning_rate, num_train_steps, weight_decay_rate=0.0, use_tpu=False,
    warmup_steps=0, warmup_proportion=0, lr_decay_power=1.0,
    layerwise_lr_decay_power=-1, n_transformer_layers=None,
    amp=False,accumulation_step=1):
  """Creates an optimizer and training op."""
  global_step = tf.train.get_or_create_global_step()
  learning_rate = tf.train.polynomial_decay(
      learning_rate,
      global_step,
      num_train_steps,
      end_learning_rate=0.0,
      power=lr_decay_power,
      cycle=False)
  warmup_steps = max(num_train_steps * warmup_proportion, warmup_steps)
  learning_rate *= tf.minimum(
      1.0, tf.cast(global_step, tf.float32) / tf.cast(warmup_steps, tf.float32))

  if layerwise_lr_decay_power > 0:
    learning_rate = _get_layer_lrs(learning_rate, layerwise_lr_decay_power,
                                   n_transformer_layers)
  optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=weight_decay_rate,
      beta_1=0.9,
      beta_2=0.999,
      epsilon=1e-6,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
  if use_tpu:
    optimizer = tf.tpu.CrossShardOptimizer(optimizer)
  
  tvars = tf.trainable_variables()

  if amp:  
    optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
    
  grads_and_vars = optimizer.compute_gradients(loss * 1.0 / accumulation_step, tvars)

  if accumulation_step > 1:
    print('### Using Gradient Accumulation with {} ###'.format(accumulation_step))
    local_step = tf.get_variable(name="local_step", shape=[], dtype=tf.int32, trainable=False,
                                    initializer=tf.zeros_initializer)
    batch_finite = tf.get_variable(name="batch_finite", shape=[], dtype=tf.bool, trainable=False,
                                    initializer=tf.ones_initializer)
    accum_vars = [tf.get_variable(
        name=tvar.name.split(":")[0] + "/accum",
        shape=tvar.shape.as_list(),
        dtype=tf.float32,
        trainable=False,
        initializer=tf.zeros_initializer()) for tvar in tf.trainable_variables()]

    reset_step = tf.cast(tf.math.equal(local_step % accumulation_step, 0), dtype=tf.bool)
    local_step = tf.cond(reset_step, lambda:local_step.assign(tf.ones_like(local_step)), lambda:local_step.assign_add(1))

    grads_and_vars_and_accums = [(gv[0],gv[1],accum_vars[i]) for i, gv in enumerate(grads_and_vars) if gv[0] is not None]
    grads, tvars, accum_vars = list(zip(*grads_and_vars_and_accums))

    all_are_finite = tf.reduce_all([tf.reduce_all(tf.is_finite(g)) for g in grads]) if amp else tf.constant(True, dtype=tf.bool)
    batch_finite = tf.cond(reset_step,
      lambda: batch_finite.assign(tf.math.logical_and(tf.constant(True, dtype=tf.bool), all_are_finite)),
      lambda: batch_finite.assign(tf.math.logical_and(batch_finite, all_are_finite)))

    # This is how the model was pre-trained.
    # ensure global norm is a finite number
    # to prevent clip_by_global_norm from having a hizzy fit.
    (clipped_grads, _) = tf.clip_by_global_norm(
          grads, clip_norm=1.0,
          use_norm=tf.cond(
              all_are_finite,
              lambda: tf.global_norm(grads),
              lambda: tf.constant(1.0)))

    accum_vars = tf.cond(reset_step,
            lambda: [accum_vars[i].assign(grad) for i, grad in enumerate(clipped_grads)],
            lambda: [accum_vars[i].assign_add(grad) for i, grad in enumerate(clipped_grads)])

    def update(accum_vars):
      return optimizer.apply_gradients(list(zip(accum_vars, tvars)))

    update_step = tf.identity(tf.cast(tf.math.equal(local_step % accumulation_step, 0), dtype=tf.bool), name="update_step")
    update_op = tf.cond(update_step,
                        lambda: update(accum_vars), lambda: tf.no_op())

    new_global_step = tf.cond(tf.math.logical_and(update_step, batch_finite),
                              lambda: global_step+1,
                              lambda: global_step)
    new_global_step = tf.identity(new_global_step, name='step_update')
    train_op = tf.group(update_op, [global_step.assign(new_global_step)])

  else:
    grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
    grads, tvars = list(zip(*grads_and_vars))
    all_are_finite = tf.reduce_all(
        [tf.reduce_all(tf.is_finite(g)) for g in grads]) if amp else tf.constant(True, dtype=tf.bool)

    # This is how the model was pre-trained.
    # ensure global norm is a finite number
    # to prevent clip_by_global_norm from having a hizzy fit.
    (clipped_grads, _) = tf.clip_by_global_norm(
        grads, clip_norm=1.0,
        use_norm=tf.cond(
            all_are_finite,
            lambda: tf.global_norm(grads),
            lambda: tf.constant(1.0)))

    train_op = optimizer.apply_gradients(
        list(zip(clipped_grads, tvars)))

    new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    new_global_step = tf.identity(new_global_step, name='step_update')
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])

    # grads = tf.gradients(loss, tvars)
    # (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
    # train_op = optimizer.apply_gradients(
    #     zip(grads, tvars), global_step=global_step)
    # new_global_step = global_step + 1
    # train_op = tf.group(train_op, [global_step.assign(new_global_step)])
  return train_op


class AdamWeightDecayOptimizer(tf.train.Optimizer):
  """A basic Adam optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="AdamWeightDecayOptimizer"):
    """Constructs a AdamWeightDecayOptimizer."""
    super(AdamWeightDecayOptimizer, self).__init__(False, name)

    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay

  def _apply_gradients(self, grads_and_vars, learning_rate):
    """See base class."""
    assignments = []
    for (grad, param) in grads_and_vars:
      if grad is None or param is None:
        continue

      param_name = self._get_variable_name(param.name)

      m = tf.get_variable(
          name=param_name + "/adam_m",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())
      v = tf.get_variable(
          name=param_name + "/adam_v",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())

      # Standard Adam update.
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))
      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want ot decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self.weight_decay_rate > 0:
        if self._do_use_weight_decay(param_name):
          update += self.weight_decay_rate * param

      update_with_lr = learning_rate * update
      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])

    return assignments

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    if isinstance(self.learning_rate, dict):
      key_to_grads_and_vars = {}
      for grad, var in grads_and_vars:
        update_for_var = False
        for key in self.learning_rate:
          if key in var.name:
            update_for_var = True
            if key not in key_to_grads_and_vars:
              key_to_grads_and_vars[key] = []
            key_to_grads_and_vars[key].append((grad, var))
        if not update_for_var:
          raise ValueError("No learning rate specified for variable", var)
      assignments = []
      for key, key_grads_and_vars in key_to_grads_and_vars.items():
        assignments += self._apply_gradients(key_grads_and_vars,
                                             self.learning_rate[key])
    else:
      assignments = self._apply_gradients(grads_and_vars, self.learning_rate)
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", param_name)
    if m is not None:
      param_name = m.group(1)
    return param_name


def _get_layer_lrs(learning_rate, layer_decay, n_layers):
  """Have lower learning rates for layers closer to the input."""
  key_to_depths = collections.OrderedDict({
      "/embeddings/": 0,
      "/embeddings_project/": 0,
      "task_specific/": n_layers + 2,
  })
  for layer in range(n_layers):
    key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
  return {
      key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
      for key, depth in key_to_depths.items()
  }


================================================
FILE: code/electra-pretrain/model/tokenization.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tokenization classes, the same as used for BERT."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import unicodedata
import six
import tensorflow.compat.v1 as tf



def convert_to_unicode(text):
  """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text.decode("utf-8", "ignore")
    elif isinstance(text, unicode):
      return text
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")


def printable_text(text):
  """Returns text encoded in a way suitable for print or `tf.logging`."""

  # These functions want `str` for both Python2 and Python3, but in one case
  # it's a Unicode string and in the other it's a byte string.
  if six.PY3:
    if isinstance(text, str):
      return text
    elif isinstance(text, bytes):
      return text.decode("utf-8", "ignore")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  elif six.PY2:
    if isinstance(text, str):
      return text
    elif isinstance(text, unicode):
      return text.encode("utf-8")
    else:
      raise ValueError("Unsupported string type: %s" % (type(text)))
  else:
    raise ValueError("Not running on Python2 or Python 3?")


def load_vocab(vocab_file):
  """Loads a vocabulary file into a dictionary."""
  vocab = collections.OrderedDict()
  index = 0
  with tf.io.gfile.GFile(vocab_file, "r") as reader:
    while True:
      token = convert_to_unicode(reader.readline())
      if not token:
        break
      token = token.strip()
      vocab[token] = index
      index += 1
  return vocab


def convert_by_vocab(vocab, items):
  """Converts a sequence of [tokens|ids] using the vocab."""
  output = []
  for item in items:
    output.append(vocab[item])
  return output


def convert_tokens_to_ids(vocab, tokens):
  return convert_by_vocab(vocab, tokens)


def convert_ids_to_tokens(inv_vocab, ids):
  return convert_by_vocab(inv_vocab, ids)


def whitespace_tokenize(text):
  """Runs basic whitespace cleaning and splitting on a piece of text."""
  text = text.strip()
  if not text:
    return []
  tokens = text.split()
  return tokens

class SimpleTokenizer(object):
  def __init__(self, vocab_file):
    self.vocab = load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}

  def tokenize(self, text):
    text = text.lower()
    return [token if token in self.vocab else '[UNK]' for token in text.strip()]

  def convert_tokens_to_ids(self, tokens):
    return convert_by_vocab(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    return convert_by_vocab(self.inv_vocab, ids)


class FullTokenizer(object):
  """Runs end-to-end tokenziation."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}
    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)

    return split_tokens

  def convert_tokens_to_ids(self, tokens):
    return convert_by_vocab(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    return convert_by_vocab(self.inv_vocab, ids)


class BasicTokenizer(object):
  """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

  def __init__(self, do_lower_case=True):
    """Constructs a BasicTokenizer.

    Args:
      do_lower_case: Whether to lower case the input.
    """
    self.do_lower_case = do_lower_case

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)

    # This was added on November 1st, 2018 for the multilingual and Chinese
    # models. This is also applied to the English models now, but it doesn't
    # matter since the English models were not trained on any Chinese data
    # and generally don't have any Chinese data in them (there are Chinese
    # characters in the vocabulary because Wikipedia does have some Chinese
    # words in the English Wikipedia.).
    text = self._tokenize_chinese_chars(text)

    orig_tokens = whitespace_tokenize(text)
    split_tokens = []
    for token in orig_tokens:
      if self.do_lower_case:
        token = token.lower()
        token = self._run_strip_accents(token)
      split_tokens.extend(self._run_split_on_punc(token))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens

  def _run_strip_accents(self, text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
      cat = unicodedata.category(char)
      if cat == "Mn":
        continue
      output.append(char)
    return "".join(output)

  def _run_split_on_punc(self, text):
    """Splits punctuation on a piece of text."""
    chars = list(text)
    i = 0
    start_new_word = True
    output = []
    while i < len(chars):
      char = chars[i]
      if _is_punctuation(char):
        output.append([char])
        start_new_word = True
      else:
        if start_new_word:
          output.append([])
        start_new_word = False
        output[-1].append(char)
      i += 1

    return ["".join(x) for x in output]

  def _tokenize_chinese_chars(self, text):
    """Adds whitespace around any CJK character."""
    output = []
    for char in text:
      cp = ord(char)
      if self._is_chinese_char(cp):
        output.append(" ")
        output.append(char)
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)

  def _is_chinese_char(self, cp):
    """Checks whether CP is the codepoint of a CJK character."""
    # This defines a "chinese character" as anything in the CJK Unicode block:
    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
    #
    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
    # despite its name. The modern Korean Hangul alphabet is a different block,
    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
    # space-separated words, so they are not treated specially and handled
    # like the all of the other languages.
    if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
        (cp >= 0x3400 and cp <= 0x4DBF) or  #
        (cp >= 0x20000 and cp <= 0x2A6DF) or  #
        (cp >= 0x2A700 and cp <= 0x2B73F) or  #
        (cp >= 0x2B740 and cp <= 0x2B81F) or  #
        (cp >= 0x2B820 and cp <= 0x2CEAF) or
        (cp >= 0xF900 and cp <= 0xFAFF) or  #
        (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
      return True

    return False

  def _clean_text(self, text):
    """Performs invalid character removal and whitespace cleanup on text."""
    output = []
    for char in text:
      cp = ord(char)
      if cp == 0 or cp == 0xfffd or _is_control(char):
        continue
      if _is_whitespace(char):
        output.append(" ")
      else:
        output.append(char)
    return "".join(output)


class WordpieceTokenizer(object):
  """Runs WordPiece tokenziation."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    self.vocab = vocab
    self.unk_token = unk_token
    self.max_input_chars_per_word = max_input_chars_per_word

  def tokenize(self, text):
    """Tokenizes a piece of text into its word pieces.

    This uses a greedy longest-match-first algorithm to perform tokenization
    using the given vocabulary.

    For example:
      input = "unaffable"
      output = ["un", "##aff", "##able"]

    Args:
      text: A single token or whitespace separated tokens. This should have
        already been passed through `BasicTokenizer.

    Returns:
      A list of wordpiece tokens.
    """

    text = convert_to_unicode(text)

    output_tokens = []
    for token in whitespace_tokenize(text):
      chars = list(token)
      if len(chars) > self.max_input_chars_per_word:
        output_tokens.append(self.unk_token)
        continue

      is_bad = False
      start = 0
      sub_tokens = []
      while start < len(chars):
        end = len(chars)
        cur_substr = None
        while start < end:
          substr = "".join(chars[start:end])
          if start > 0:
            substr = "##" + substr
          if substr in self.vocab:
            cur_substr = substr
            break
          end -= 1
        if cur_substr is None:
          is_bad = True
          break
        sub_tokens.append(cur_substr)
        start = end

      if is_bad:
        output_tokens.append(self.unk_token)
      else:
        output_tokens.extend(sub_tokens)
    return output_tokens


def _is_whitespace(char):
  """Checks whether `chars` is a whitespace character."""
  # \t, \n, and \r are technically contorl characters but we treat them
  # as whitespace since they are generally considered as such.
  if char == " " or char == "\t" or char == "\n" or char == "\r":
    return True
  cat = unicodedata.category(char)
  if cat == "Zs":
    return True
  return False


def _is_control(char):
  """Checks whether `chars` is a control character."""
  # These are technically control characters but we count them as whitespace
  # characters.
  if char == "\t" or char == "\n" or char == "\r":
    return False
  cat = unicodedata.category(char)
  if cat.startswith("C"):
    return True
  return False


def _is_punctuation(char):
  """Checks whether `chars` is a punctuation character."""
  cp = ord(char)
  # We treat all non-letter/number ASCII as punctuation.
  # Characters such as "^", "$", and "`" are not in the Unicode
  # Punctuation class but we treat them as punctuation anyways, for
  # consistency.
  if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
      (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
    return True
  cat = unicodedata.category(char)
  if cat.startswith("P"):
    return True
  return False


================================================
FILE: code/electra-pretrain/pretrain/__init__.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

================================================
FILE: code/electra-pretrain/pretrain/pretrain_data.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Helpers for preparing pre-training data and supplying them to the model."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections

import numpy as np
import tensorflow.compat.v1 as tf

import configure_pretraining
from model import tokenization
from util import utils


def get_input_fn(config: configure_pretraining.PretrainingConfig, is_training,
                 num_cpu_threads=8):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  input_files = []
  for input_pattern in config.pretrain_tfrecords.split(","):
    input_files.extend(tf.io.gfile.glob(input_pattern))

  def input_fn(params):
    """The actual input function."""
    # batch_size = params["batch_size"]
    batch_size = config.train_batch_size


    name_to_features = {
        "input_ids": tf.io.FixedLenFeature([config.max_seq_length], tf.int64),
        "input_mask": tf.io.FixedLenFeature([config.max_seq_length], tf.int64),
        "segment_ids": tf.io.FixedLenFeature([config.max_seq_length], tf.int64),
    }

    d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
    d = d.repeat()
    d = d.shuffle(buffer_size=len(input_files))

    # `cycle_length` is the number of parallel files that get read.
    cycle_length = min(num_cpu_threads, len(input_files))

    # `sloppy` mode means that the interleaving is not exact. This adds
    # even more randomness to the training pipeline.
    d = d.apply(
        tf.data.experimental.parallel_interleave(
            tf.data.TFRecordDataset,
            sloppy=is_training,
            cycle_length=cycle_length))
    d = d.shuffle(buffer_size=10000)

    # We must `drop_remainder` on training because the TPU requires fixed
    # size dimensions. For eval, we assume we are evaluating on the CPU or GPU
    # and we *don"t* want to drop the remainder, otherwise we wont cover
    # every sample.
    d = d.apply(
        tf.data.experimental.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            num_parallel_batches=num_cpu_threads,
            drop_remainder=True))
    return d

  return input_fn


def _decode_record(record, name_to_features):
  """Decodes a record to a TensorFlow example."""
  example = tf.io.parse_single_example(record, name_to_features)

  # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
  # So cast all int64 to int32.
  for name in list(example.keys()):
    t = example[name]
    if t.dtype == tf.int64:
      t = tf.cast(t, tf.int32)
    example[name] = t

  return example


# model inputs - it's a bit nicer to use a namedtuple rather than keep the
# features as a dict
Inputs = collections.namedtuple(
    "Inputs", ["input_ids", "input_mask", "segment_ids", "masked_lm_positions",
               "masked_lm_ids", "masked_lm_weights"])


def features_to_inputs(features):
  return Inputs(
      input_ids=features["input_ids"],
      input_mask=features["input_mask"],
      segment_ids=features["segment_ids"],
      masked_lm_positions=(features["masked_lm_positions"]
                           if "masked_lm_positions" in features else None),
      masked_lm_ids=(features["masked_lm_ids"]
                     if "masked_lm_ids" in features else None),
      masked_lm_weights=(features["masked_lm_weights"]
                         if "masked_lm_weights" in features else None),
  )


def get_updated_inputs(inputs, **kwargs):
  features = inputs._asdict()
  for k, v in kwargs.items():
    features[k] = v
  return features_to_inputs(features)


ENDC = "\033[0m"
COLORS = ["\033[" + str(n) + "m" for n in list(range(91, 97)) + [90]]
RED = COLORS[0]
BLUE = COLORS[3]
CYAN = COLORS[5]
GREEN = COLORS[1]


def print_tokens(inputs: Inputs, inv_vocab, updates_mask=None):
  """Pretty-print model inputs."""
  pos_to_tokid = {}
  for tokid, pos, weight in zip(
      inputs.masked_lm_ids[0], inputs.masked_lm_positions[0],
      inputs.masked_lm_weights[0]):
    if weight == 0:
      pass
    else:
      pos_to_tokid[pos] = tokid

  text = ""
  provided_update_mask = (updates_mask is not None)
  if not provided_update_mask:
    updates_mask = np.zeros_like(inputs.input_ids)
  for pos, (tokid, um) in enumerate(
      zip(inputs.input_ids[0], updates_mask[0])):
    token = inv_vocab[tokid]
    if token == "[PAD]":
      break
    if pos in pos_to_tokid:
      token = RED + token + " (" + inv_vocab[pos_to_tokid[pos]] + ")" + ENDC
      if provided_update_mask:
        assert um == 1
    else:
      if provided_update_mask:
        assert um == 0
    text += token + " "
  utils.log(tokenization.printable_text(text))


================================================
FILE: code/electra-pretrain/pretrain/pretrain_helpers.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Helper functions for pre-training. These mainly deal with the gathering and
scattering needed so the generator only makes predictions for the small number
of masked tokens.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow.compat.v1 as tf

import configure_pretraining
from model import modeling
from model import tokenization
from pretrain import pretrain_data


def gather_positions(sequence, positions):
  """Gathers the vectors at the specific positions over a minibatch.

  Args:
    sequence: A [batch_size, seq_length] or
        [batch_size, seq_length, depth] tensor of values
    positions: A [batch_size, n_positions] tensor of indices

  Returns: A [batch_size, n_positions] or
    [batch_size, n_positions, depth] tensor of the values at the indices
  """
  shape = modeling.get_shape_list(sequence, expected_rank=[2, 3])
  depth_dimension = (len(shape) == 3)
  if depth_dimension:
    B, L, D = shape
  else:
    B, L = shape
    D = 1
    sequence = tf.expand_dims(sequence, -1)
  position_shift = tf.expand_dims(L * tf.range(B), -1)
  flat_positions = tf.reshape(positions + position_shift, [-1])
  flat_sequence = tf.reshape(sequence, [B * L, D])
  gathered = tf.gather(flat_sequence, flat_positions)
  if depth_dimension:
    return tf.reshape(gathered, [B, -1, D])
  else:
    return tf.reshape(gathered, [B, -1])


def scatter_update(sequence, updates, positions):
  """Scatter-update a sequence.

  Args:
    sequence: A [batch_size, seq_len] or [batch_size, seq_len, depth] tensor
    updates: A tensor of size batch_size*seq_len(*depth)
    positions: A [batch_size, n_positions] tensor

  Returns: A tuple of two tensors. First is a [batch_size, seq_len] or
    [batch_size, seq_len, depth] tensor of "sequence" with elements at
    "positions" replaced by the values at "updates." Updates to index 0 are
    ignored. If there are duplicated positions the update is only applied once.
    Second is a [batch_size, seq_len] mask tensor of which inputs were updated.
  """
  shape = modeling.get_shape_list(sequence, expected_rank=[2, 3])
  depth_dimension = (len(shape) == 3)
  if depth_dimension:
    B, L, D = shape
  else:
    B, L = shape
    D = 1
    sequence = tf.expand_dims(sequence, -1)
  N = modeling.get_shape_list(positions)[1]

  shift = tf.expand_dims(L * tf.range(B), -1)
  flat_positions = tf.reshape(positions + shift, [-1, 1])
  flat_updates = tf.reshape(updates, [-1, D])
  updates = tf.scatter_nd(flat_positions, flat_updates, [B * L, D])
  updates = tf.reshape(updates, [B, L, D])

  flat_updates_mask = tf.ones([B * N], tf.int32)
  updates_mask = tf.scatter_nd(flat_positions, flat_updates_mask, [B * L])
  updates_mask = tf.reshape(updates_mask, [B, L])
  not_first_token = tf.concat([tf.zeros((B, 1), tf.int32),
                               tf.ones((B, L - 1), tf.int32)], -1)
  updates_mask *= not_first_token
  updates_mask_3d = tf.expand_dims(updates_mask, -1)

  # account for duplicate positions
  if sequence.dtype == tf.float32:
    updates_mask_3d = tf.cast(updates_mask_3d, tf.float32)
    updates /= tf.maximum(1.0, updates_mask_3d)
  else:
    assert sequence.dtype == tf.int32
    updates = tf.math.floordiv(updates, tf.maximum(1, updates_mask_3d))
  updates_mask = tf.minimum(updates_mask, 1)
  updates_mask_3d = tf.minimum(updates_mask_3d, 1)

  updated_sequence = (((1 - updates_mask_3d) * sequence) +
                      (updates_mask_3d * updates))
  if not depth_dimension:
    updated_sequence = tf.squeeze(updated_sequence, -1)

  return updated_sequence, updates_mask


def _get_candidates_mask(inputs: pretrain_data.Inputs, vocab,
                         disallow_from_mask=None):
  """Returns a mask tensor of positions in the input that can be masked out."""
  ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]
  candidates_mask = tf.ones_like(inputs.input_ids, tf.bool)
  for ignore_id in ignore_ids:
    candidates_mask &= tf.not_equal(inputs.input_ids, ignore_id)
  candidates_mask &= tf.cast(inputs.input_mask, tf.bool)
  if disallow_from_mask is not None:
    candidates_mask &= ~disallow_from_mask
  return candidates_mask


def mask(config: configure_pretraining.PretrainingConfig,
         inputs: pretrain_data.Inputs, mask_prob, proposal_distribution=1.0,
         disallow_from_mask=None, already_masked=None):
  """Implementation of dynamic masking. The optional arguments aren't needed for
  BERT/ELECTRA and are from early experiments in "strategically" masking out
  tokens instead of uniformly at random.

  Args:
    config: configure_pretraining.PretrainingConfig
    inputs: pretrain_data.Inputs containing input input_ids/input_mask
    mask_prob: percent of tokens to mask
    proposal_distribution: for non-uniform masking can be a [B, L] tensor
                           of scores for masking each position.
    disallow_from_mask: a boolean tensor of [B, L] of positions that should
                        not be masked out
    already_masked: a boolean tensor of [B, N] of already masked-out tokens
                    for multiple rounds of masking
  Returns: a pretrain_data.Inputs with masking added
  """
  # Get the batch size, sequence length, and max masked-out tokens
  N = config.max_predictions_per_seq
  B, L = modeling.get_shape_list(inputs.input_ids)

  # Find indices where masking out a token is allowed
  vocab = tokenization.FullTokenizer(
      config.vocab_file, do_lower_case=config.do_lower_case).vocab
  candidates_mask = _get_candidates_mask(inputs, vocab, disallow_from_mask)

  # Set the number of tokens to mask out per example
  num_tokens = tf.cast(tf.reduce_sum(inputs.input_mask, -1), tf.float32)
  num_to_predict = tf.maximum(1, tf.minimum(
      N, tf.cast(tf.round(num_tokens * mask_prob), tf.int32)))
  masked_lm_weights = tf.cast(tf.sequence_mask(num_to_predict, N), tf.float32)
  if already_masked is not None:
    masked_lm_weights *= (1 - already_masked)

  # Get a probability of masking each position in the sequence
  candidate_mask_float = tf.cast(candidates_mask, tf.float32)
  sample_prob = (proposal_distribution * candidate_mask_float)
  sample_prob /= tf.reduce_sum(sample_prob, axis=-1, keepdims=True)

  # Sample the positions to mask out
  sample_prob = tf.stop_gradient(sample_prob)
  sample_logits = tf.log(sample_prob)
  masked_lm_positions = tf.random.categorical(
      sample_logits, N, dtype=tf.int32)
  masked_lm_positions *= tf.cast(masked_lm_weights, tf.int32)

  # Get the ids of the masked-out tokens
  shift = tf.expand_dims(L * tf.range(B), -1)
  flat_positions = tf.reshape(masked_lm_positions + shift, [-1, 1])
  masked_lm_ids = tf.gather_nd(tf.reshape(inputs.input_ids, [-1]),
                               flat_positions)
  masked_lm_ids = tf.reshape(masked_lm_ids, [B, -1])
  masked_lm_ids *= tf.cast(masked_lm_weights, tf.int32)

  # Update the input ids
  replace_with_mask_positions = masked_lm_positions * tf.cast(
      tf.less(tf.random.uniform([B, N]), 0.85), tf.int32)
  inputs_ids, _ = scatter_update(
      inputs.input_ids, tf.fill([B, N], vocab["[MASK]"]),
      replace_with_mask_positions)

  return pretrain_data.get_updated_inputs(
      inputs,
      input_ids=tf.stop_gradient(inputs_ids),
      masked_lm_positions=masked_lm_positions,
      masked_lm_ids=masked_lm_ids,
      masked_lm_weights=masked_lm_weights
  )


def unmask(inputs: pretrain_data.Inputs):
  unmasked_input_ids, _ = scatter_update(
      inputs.input_ids, inputs.masked_lm_ids, inputs.masked_lm_positions)
  return pretrain_data.get_updated_inputs(inputs, input_ids=unmasked_input_ids)


def sample_from_softmax(logits, disallow=None):
  if disallow is not None:
    logits -= 1000.0 * disallow
  uniform_noise = tf.random.uniform(
      modeling.get_shape_list(logits), minval=0, maxval=1)
  gumbel_noise = -tf.log(-tf.log(uniform_noise + 1e-9) + 1e-9)
  return tf.one_hot(tf.argmax(tf.nn.softmax(logits + gumbel_noise), -1,
                              output_type=tf.int32), logits.shape[-1])


================================================
FILE: code/electra-pretrain/pretrain.sh
================================================
export DATA_DIR=../../user_data
export ELECTRA_DIR=../../user_data/electra

echo 'Prepare pretraining data...'
python build_pretraining_dataset.py \
  --corpus-dir=${DATA_DIR}/texts \
  --max-seq-length=64 \
  --vocab-file=${ELECTRA_DIR}/electra_180g_base/vocab.txt \
  --output-dir=${DATA_DIR}/pretrain_tfrecords 

echo "Pretrain base electra model ~= 1 hour on V100"
python run_pretraining.py \
  --data-dir=${DATA_DIR} \
  --model-name=base \
  --hparams='{"use_amp": true, "learning_rate": 0.0002,"model_size": "base","eval_batch_size":128,"train_batch_size": 128, "init_checkpoint": "../../user_data/electra/electra_180g_base/electra_180g_base.ckpt", "vocab_file": "../../user_data/electra/electra_180g_base/vocab.txt"}'


echo "pretrain large electra model ~= 2.2 hours on V100"
python run_pretraining.py \
  --data-dir=${DATA_DIR} \
  --model-name=large \
  --hparams='{"num_train_steps": 5000, "num_warmup_steps": 500, "model_size": "large", "train_batch_size": 43, "learning_rate": 5e-05, "init_checkpoint": "../../user_data/electra/electra_180g_large/electra_180g_large.ckpt", "use_amp": true, "accumulation_step": 3, "vocab_file": "../../user_data/electra/electra_180g_large/vocab.txt"}'


================================================
FILE: code/electra-pretrain/run_pretraining.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Pre-trains an ELECTRA model."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import collections
import json

import tensorflow.compat.v1 as tf

import configure_pretraining
from model import modeling
from model import optimization
from pretrain import pretrain_data
from pretrain import pretrain_helpers
from util import training_utils
from util import utils


class PretrainingModel(object):
  """Transformer pre-training using the replaced-token-detection task."""

  def __init__(self, config: configure_pretraining.PretrainingConfig,
               features, is_training):
    # Set up model config
    self._config = config
    self._bert_config = training_utils.get_bert_config(config)
    if config.debug:
      self._bert_config.num_hidden_layers = 3
      self._bert_config.hidden_size = 144
      self._bert_config.intermediate_size = 144 * 4
      self._bert_config.num_attention_heads = 4

    # Mask the input
    masked_inputs = pretrain_helpers.mask(
        config, pretrain_data.features_to_inputs(features), config.mask_prob)

    # Generator
    embedding_size = (
        self._bert_config.hidden_size if config.embedding_size is None else
        config.embedding_size)
    if config.uniform_generator:
      mlm_output = self._get_masked_lm_output(masked_inputs, None)
    elif config.electra_objective and config.untied_generator:
      generator = self._build_transformer(
          masked_inputs, is_training,
          bert_config=get_generator_config(config, self._bert_config),
          embedding_size=(None if config.untied_generator_embeddings
                          else embedding_size),
          untied_embeddings=config.untied_generator_embeddings,
          name="generator",
          embedding_file=config.embedding_file)
      mlm_output = self._get_masked_lm_output(masked_inputs, generator)
    else:
      generator = self._build_transformer(
          masked_inputs, is_training, embedding_size=embedding_size,
          embedding_file=config.embedding_file)
      mlm_output = self._get_masked_lm_output(masked_inputs, generator)
    fake_data = self._get_fake_data(masked_inputs, mlm_output.logits)
    self.mlm_output = mlm_output
    self.total_loss = config.gen_weight * mlm_output.loss

    # Discriminator
    disc_output = None
    if config.electra_objective:
      discriminator = self._build_transformer(
          fake_data.inputs, is_training, reuse=not config.untied_generator,
          embedding_size=embedding_size,
          embedding_file=config.embedding_file)
      disc_output = self._get_discriminator_output(
          fake_data.inputs, discriminator, fake_data.is_fake_tokens)
      self.total_loss += config.disc_weight * disc_output.loss

    # Evaluation
    eval_fn_inputs = {
        "input_ids": masked_inputs.input_ids,
        "masked_lm_preds": mlm_output.preds,
        "mlm_loss": mlm_output.per_example_loss,
        "masked_lm_ids": masked_inputs.masked_lm_ids,
        "masked_lm_weights": masked_inputs.masked_lm_weights,
        "input_mask": masked_inputs.input_mask
    }
    if config.electra_objective:
      eval_fn_inputs.update({
          "disc_loss": disc_output.per_example_loss,
          "disc_labels": disc_output.labels,
          "disc_probs": disc_output.probs,
          "disc_preds": disc_output.preds,
          "sampled_tokids": tf.argmax(fake_data.sampled_tokens, -1,
                                      output_type=tf.int32)
      })
    eval_fn_keys = eval_fn_inputs.keys()
    eval_fn_values = [eval_fn_inputs[k] for k in eval_fn_keys]

    def metric_fn(*args):
      """Computes the loss and accuracy of the model."""
      d = {k: arg for k, arg in zip(eval_fn_keys, args)}
      metrics = dict()
      metrics["masked_lm_accuracy"] = tf.metrics.accuracy(
          labels=tf.reshape(d["masked_lm_ids"], [-1]),
          predictions=tf.reshape(d["masked_lm_preds"], [-1]),
          weights=tf.reshape(d["masked_lm_weights"], [-1]))
      metrics["masked_lm_loss"] = tf.metrics.mean(
          values=tf.reshape(d["mlm_loss"], [-1]),
          weights=tf.reshape(d["masked_lm_weights"], [-1]))
      if config.electra_objective:
        metrics["sampled_masked_lm_accuracy"] = tf.metrics.accuracy(
            labels=tf.reshape(d["masked_lm_ids"], [-1]),
            predictions=tf.reshape(d["sampled_tokids"], [-1]),
            weights=tf.reshape(d["masked_lm_weights"], [-1]))
        if config.disc_weight > 0:
          metrics["disc_loss"] = tf.metrics.mean(d["disc_loss"])
          metrics["disc_auc"] = tf.metrics.auc(
              d["disc_labels"] * d["input_mask"],
              d["disc_probs"] * tf.cast(d["input_mask"], tf.float32))
          metrics["disc_accuracy"] = tf.metrics.accuracy(
              labels=d["disc_labels"], predictions=d["disc_preds"],
              weights=d["input_mask"])
          metrics["disc_precision"] = tf.metrics.accuracy(
              labels=d["disc_labels"], predictions=d["disc_preds"],
              weights=d["disc_preds"] * d["input_mask"])
          metrics["disc_recall"] = tf.metrics.accuracy(
              labels=d["disc_labels"], predictions=d["disc_preds"],
              weights=d["disc_labels"] * d["input_mask"])
      return metrics
    self.eval_metrics = (metric_fn, eval_fn_values)

  def _get_masked_lm_output(self, inputs: pretrain_data.Inputs, model):
    """Masked language modeling softmax layer."""
    masked_lm_weights = inputs.masked_lm_weights
    with tf.variable_scope("generator_predictions"):
      if self._config.uniform_generator:
        logits = tf.zeros(self._bert_config.vocab_size)
        logits_tiled = tf.zeros(
            modeling.get_shape_list(inputs.masked_lm_ids) +
            [self._bert_config.vocab_size])
        logits_tiled += tf.reshape(logits, [1, 1, self._bert_config.vocab_size])
        logits = logits_tiled
      else:
        relevant_hidden = pretrain_helpers.gather_positions(
            model.get_sequence_output(), inputs.masked_lm_positions)
        hidden = tf.layers.dense(
            relevant_hidden,
            units=modeling.get_shape_list(model.get_embedding_table())[-1],
            activation=modeling.get_activation(self._bert_config.hidden_act),
            kernel_initializer=modeling.create_initializer(
                self._bert_config.initializer_range))
        hidden = modeling.layer_norm(hidden)
        output_bias = tf.get_variable(
            "output_bias",
            shape=[self._bert_config.vocab_size],
            initializer=tf.zeros_initializer())
        logits = tf.matmul(hidden, model.get_embedding_table(),
                           transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)

      oh_labels = tf.one_hot(
          inputs.masked_lm_ids, depth=self._bert_config.vocab_size,
          dtype=tf.float32)

      probs = tf.nn.softmax(logits)
      log_probs = tf.nn.log_softmax(logits)
      label_log_probs = -tf.reduce_sum(log_probs * oh_labels, axis=-1)

      numerator = tf.reduce_sum(inputs.masked_lm_weights * label_log_probs)
      denominator = tf.reduce_sum(masked_lm_weights) + 1e-6
      loss = numerator / denominator
      preds = tf.argmax(log_probs, axis=-1, output_type=tf.int32)

      MLMOutput = collections.namedtuple(
          "MLMOutput", ["logits", "probs", "loss", "per_example_loss", "preds"])
      return MLMOutput(
          logits=logits, probs=probs, per_example_loss=label_log_probs,
          loss=loss, preds=preds)

  def _get_discriminator_output(self, inputs, discriminator, labels):
    """Discriminator binary classifier."""
    with tf.variable_scope("discriminator_predictions"):
      hidden = tf.layers.dense(
          discriminator.get_sequence_output(),
          units=self._bert_config.hidden_size,
          activation=modeling.get_activation(self._bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              self._bert_config.initializer_range))
      logits = tf.squeeze(tf.layers.dense(hidden, units=1), -1)
      weights = tf.cast(inputs.input_mask, tf.float32)
      labelsf = tf.cast(labels, tf.float32)
      losses = tf.nn.sigmoid_cross_entropy_with_logits(
          logits=logits, labels=labelsf) * weights
      per_example_loss = (tf.reduce_sum(losses, axis=-1) /
                          (1e-6 + tf.reduce_sum(weights, axis=-1)))
      loss = tf.reduce_sum(losses) / (1e-6 + tf.reduce_sum(weights))
      probs = tf.nn.sigmoid(logits)
      preds = tf.cast(tf.round((tf.sign(logits) + 1) / 2), tf.int32)
      DiscOutput = collections.namedtuple(
          "DiscOutput", ["loss", "per_example_loss", "probs", "preds",
                         "labels"])
      return DiscOutput(
          loss=loss, per_example_loss=per_example_loss, probs=probs,
          preds=preds, labels=labels,
      )

  def _get_fake_data(self, inputs, mlm_logits):
    """Sample from the generator to create corrupted input."""
    inputs = pretrain_helpers.unmask(inputs)
    disallow = tf.one_hot(
        inputs.masked_lm_ids, depth=self._bert_config.vocab_size,
        dtype=tf.float32) if self._config.disallow_correct else None
    sampled_tokens = tf.stop_gradient(pretrain_helpers.sample_from_softmax(
        mlm_logits / self._config.temperature, disallow=disallow))
    sampled_tokids = tf.argmax(sampled_tokens, -1, output_type=tf.int32)
    updated_input_ids, masked = pretrain_helpers.scatter_update(
        inputs.input_ids, sampled_tokids, inputs.masked_lm_positions)
    labels = masked * (1 - tf.cast(
        tf.equal(updated_input_ids, inputs.input_ids), tf.int32))
    updated_inputs = pretrain_data.get_updated_inputs(
        inputs, input_ids=updated_input_ids)
    FakedData = collections.namedtuple("FakedData", [
        "inputs", "is_fake_tokens", "sampled_tokens"])
    return FakedData(inputs=updated_inputs, is_fake_tokens=labels,
                     sampled_tokens=sampled_tokens)

  def _build_transformer(self, inputs: pretrain_data.Inputs, is_training,
                         bert_config=None, name="electra", reuse=False, embedding_file=None, **kwargs):
    """Build a transformer encoder network."""
    if bert_config is None:
      bert_config = self._bert_config
    with tf.variable_scope(tf.get_variable_scope(), reuse=reuse):
      return modeling.BertModel(
          bert_config=bert_config,
          is_training=is_training,
          input_ids=inputs.input_ids,
          input_mask=inputs.input_mask,
          token_type_ids=inputs.segment_ids,
          use_one_hot_embeddings=self._config.use_tpu,
          scope=name,
          embedding_file=embedding_file,
          **kwargs)


def get_generator_config(config: configure_pretraining.PretrainingConfig,
                         bert_config: modeling.BertConfig):
  """Get model config for the generator network."""
  gen_config = modeling.BertConfig.from_dict(bert_config.to_dict())
  gen_config.hidden_size = int(round(
      bert_config.hidden_size * config.generator_hidden_size))
  gen_config.num_hidden_layers = int(round(
      bert_config.num_hidden_layers * config.generator_layers))
  gen_config.intermediate_size = 4 * gen_config.hidden_size
  gen_config.num_attention_heads = max(1, gen_config.hidden_size // 64)
  return gen_config


def model_fn_builder(config: configure_pretraining.PretrainingConfig):
  """Build the model for training."""

  def model_fn(features, labels, mode, params):
    """Build the model for training."""
    model = PretrainingModel(config, features,
                             mode == tf.estimator.ModeKeys.TRAIN)
    utils.log("Model is built!")
    if mode == tf.estimator.ModeKeys.TRAIN:
      tvars = tf.trainable_variables()
      # for t in tvars:
      #   print(t)
      initialized_variable_names = {}
      init_checkpoint = config.init_checkpoint
      if init_checkpoint:
          (assignment_map,
            initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint,update_vocab=config.embedding_file is not None)
          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
      utils.log("**** Trainable Variables ****")

      for var in tvars:
          init_string = ""
          if var.name in initialized_variable_names:
              init_string = ", *INIT_FROM_CKPT*"
          utils.log("  name = %s, shape = %s%s"% ( var.name, var.shape,
                                    init_string))
      train_op = optimization.create_optimizer(
          model.total_loss, config.learning_rate, config.num_train_steps,
          weight_decay_rate=config.weight_decay_rate,
          use_tpu=config.use_tpu,
          warmup_steps=config.num_warmup_steps,
          lr_decay_power=config.lr_decay_power,
          amp=config.use_amp,
          accumulation_step=config.accumulation_step
      )
      # output_spec = tf.estimator.tpu.TPUEstimatorSpec(
      output_spec = tf.estimator.EstimatorSpec(
          mode=mode,
          loss=model.total_loss,
          train_op=train_op,
          training_hooks=[training_utils.ETAHook(
              {} if config.use_tpu else dict(loss=model.total_loss),
              config.num_train_steps, config.iterations_per_loop,
              config.use_tpu,model_name=config.model_name)]
      )
    elif mode == tf.estimator.ModeKeys.EVAL:
      output_spec = tf.estimator.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=model.total_loss,
          eval_metrics=model.eval_metrics,
          evaluation_hooks=[training_utils.ETAHook(
              {} if config.use_tpu else dict(loss=model.total_loss),
              config.num_eval_steps, config.iterations_per_loop,
              config.use_tpu, is_training=False)])
    else:
      raise ValueError("Only TRAIN and EVAL modes are supported")
    return output_spec

  return model_fn


def train_or_eval(config: configure_pretraining.PretrainingConfig):
  """Run pre-training or evaluate the pre-trained model."""
  if config.do_train == config.do_eval:
    raise ValueError("Exactly one of `do_train` or `do_eval` must be True.")
  if config.debug:
    utils.rmkdir(config.model_dir)
  utils.heading("Config:")
  utils.log_config(config)

  is_per_host = tf.estimator.tpu.InputPipelineConfig.PER_HOST_V2
  tpu_cluster_resolver = None
  tf_config = tf.ConfigProto()
  tf_config.gpu_options.allow_growth = True
  tf_config.allow_soft_placement = True

  if config.use_tpu and config.tpu_name:
    tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        config.tpu_name, zone=config.tpu_zone, project=config.gcp_project)
  
  run_config = tf.estimator.RunConfig(
        model_dir=config.model_dir,
        session_config=tf_config,
        save_checkpoints_steps=config.save_checkpoints_steps,
        keep_checkpoint_max=1)

  # tpu_config = tf.estimator.tpu.TPUConfig(
  #     iterations_per_loop=config.iterations_per_loop,
  #     num_shards=(config.num_tpu_cores if config.do_train else
  #                 config.num_tpu_cores),
  #     tpu_job_name=config.tpu_job_name,
  #     per_host_input_for_training=is_per_host)
  # run_config = tf.estimator.tpu.RunConfig(
  #     cluster=tpu_cluster_resolver,
  #     model_dir=config.model_dir,
  #     save_checkpoints_steps=config.save_checkpoints_steps,
  #     keep_checkpoint_max=1,
  #     tpu_config=tpu_config)
  model_fn = model_fn_builder(config=config)

  estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        config=run_config)
  # estimator = tf.estimator.tpu.TPUEstimator(
  #     use_tpu=config.use_tpu,
  #     model_fn=model_fn,
  #     config=run_config,
  #     train_batch_size=config.train_batch_size,
  #     eval_batch_size=config.eval_batch_size)

  if config.do_train:
    utils.heading("Running training")
    estimator.train(input_fn=pretrain_data.get_input_fn(config, True),
                    max_steps=config.num_train_steps)
  if config.do_eval:
    utils.heading("Running evaluation")
    result = estimator.evaluate(
        input_fn=pretrain_data.get_input_fn(config, False),
        steps=config.num_eval_steps)
    for key in sorted(result.keys()):
      utils.log("  {:} = {:}".format(key, str(result[key])))
    return result


def train_one_step(config: configure_pretraining.PretrainingConfig):
  """Builds an ELECTRA model an trains it for one step; useful for debugging."""
  train_input_fn = pretrain_data.get_input_fn(config, True)
  features = tf.data.make_one_shot_iterator(train_input_fn(dict(
      batch_size=config.train_batch_size))).get_next()
  model = PretrainingModel(config, features, True)
  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    utils.log(sess.run(model.total_loss))


def main():
  parser = argparse.ArgumentParser(description=__doc__)
  parser.add_argument("--data-dir", required=True,
                      help="Location of data files (model weights, etc).")
  parser.add_argument("--model-name", required=True,
                      help="The name of the model being fine-tuned.")
  parser.add_argument("--hparams", default="{}",
                      help="JSON dict of model hyperparameters.")
  args = parser.parse_args()
  if args.hparams.endswith(".json"):
    hparams = utils.load_json(args.hparams)
  else:
    hparams = json.loads(args.hparams)
  tf.logging.set_verbosity(tf.logging.ERROR)
  train_or_eval(configure_pretraining.PretrainingConfig(
      args.model_name, args.data_dir, **hparams))


if __name__ == "__main__":
  main()


================================================
FILE: code/electra-pretrain/util/__init__.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

================================================
FILE: code/electra-pretrain/util/training_utils.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Utilities for training the models."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import datetime
import re
import time
import tensorflow.compat.v1 as tf

from model import modeling
from util import utils


class ETAHook(tf.estimator.SessionRunHook):
  """Print out the time remaining during training/evaluation."""

  def __init__(self, to_log, n_steps, iterations_per_loop, on_tpu,
               log_every=1, is_training=True, model_name='base'):
    self._to_log = to_log
    self._n_steps = n_steps
    self._iterations_per_loop = iterations_per_loop
    self._on_tpu = on_tpu
    self._log_every = log_every
    self._is_training = is_training
    self._steps_run_so_far = 0
    self._global_step = None
    self._global_step_tensor = None
    self._start_step = None
    self._start_time = None
    self.log_file = open('/tmp/{}_pretrain.log'.format(model_name),'w')

  def begin(self):
    self._global_step_tensor = tf.train.get_or_create_global_step()

  def before_run(self, run_context):
    if self._start_time is None:
      self._start_time = time.time()
    return tf.estimator.SessionRunArgs(self._to_log)

  def after_run(self, run_context, run_values):
    self._global_step = run_context.session.run(self._global_step_tensor)
    self._steps_run_so_far += self._iterations_per_loop if self._on_tpu else 1
    if self._start_step is None:
      self._start_step = self._global_step - (self._iterations_per_loop
                                              if self._on_tpu else 1)
    self.log(run_values)

  def end(self, session):
    self._global_step = session.run(self._global_step_tensor)
    self.log()
    self.log_file.close()

  def log(self, run_values=None):
    step = self._global_step if self._is_training else self._steps_run_so_far
    if step % self._log_every != 0:
      return
    msg = "{:}/{:} = {:.1f}%".format(step, self._n_steps,
                                     100.0 * step / self._n_steps)
    time_elapsed = time.time() - self._start_time
    time_per_step = time_elapsed / (
        (step - self._start_step) if self._is_training else step)
    msg += ", SPS: {:.1f}".format(1 / time_per_step)
    msg += ", ELAP: " + secs_to_str(time_elapsed)
    msg += ", ETA: " + secs_to_str(
        (self._n_steps - step) * time_per_step)
    if run_values is not None:
      for tag, value in run_values.results.items():
        msg += " - " + str(tag) + (": {:.4f}".format(value))
    utils.log(msg)
    self.log_file.write(msg + '\n')
    self.log_file.flush()


def secs_to_str(secs):
  s = str(datetime.timedelta(seconds=int(round(secs))))
  s = re.sub("^0:", "", s)
  s = re.sub("^0", "", s)
  s = re.sub("^0:", "", s)
  s = re.sub("^0", "", s)
  return s


def get_bert_config(config):
  """Get model hyperparameters based on a pretraining/finetuning config"""
  if config.model_size == "large":
    args = {"hidden_size": 1024, "num_hidden_layers": 24}
  elif config.model_size == "base":
    args = {"hidden_size": 768, "num_hidden_layers": 12}
  elif config.model_size == "small":
    args = {"hidden_size": 256, "num_hidden_layers": 24}
  else:
    raise ValueError("Unknown model size", config.model_size)
  args["vocab_size"] = config.vocab_size
  args.update(**config.model_hparam_overrides)
  # by default the ff size and num attn heads are determined by the hidden size
  args["num_attention_heads"] = max(1, args["hidden_size"] // 64)
  args["intermediate_size"] = 4 * args["hidden_size"]
  args.update(**config.model_hparam_overrides)
  return modeling.BertConfig.from_dict(args)


================================================
FILE: code/electra-pretrain/util/utils.py
================================================
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""A collection of general utility functions."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import pickle
import sys

import tensorflow.compat.v1 as tf


def load_json(path):
  with tf.io.gfile.GFile(path, "r") as f:
    return json.load(f)


def write_json(o, path):
  if "/" in path:
    tf.io.gfile.makedirs(path.rsplit("/", 1)[0])
  with tf.io.gfile.GFile(path, "w") as f:
    json.dump(o, f)


def load_pickle(path):
  with tf.io.gfile.GFile(path, "rb") as f:
    return pickle.load(f)


def write_pickle(o, path):
  if "/" in path:
    tf.io.gfile.makedirs(path.rsplit("/", 1)[0])
  with tf.io.gfile.GFile(path, "wb") as f:
    pickle.dump(o, f, -1)


def mkdir(path):
  if not tf.io.gfile.exists(path):
    tf.io.gfile.makedirs(path)


def rmrf(path):
  if tf.io.gfile.exists(path):
    tf.io.gfile.rmtree(path)


def rmkdir(path):
  rmrf(path)
  mkdir(path)


def log(*args):
  msg = " ".join(map(str, args))
  sys.stdout.write(msg + "\n")
  sys.stdout.flush()


def log_config(config):
  for key, value in sorted(config.__dict__.items()):
    log(key, value)
  log()


def heading(*args):
  log(80 * "=")
  log(*args)
  log(80 * "=")


def nest_dict(d, prefixes, delim="_"):
  """Go from {prefix_key: value} to {prefix: {key: value}}."""
  nested = {}
  for k, v in d.items():
    for prefix in prefixes:
      if k.startswith(prefix + delim):
        if prefix not in nested:
          nested[prefix] = {}
        nested[prefix][k.split(delim, 1)[1]] = v
      else:
        nested[k] = v
  return nested


def flatten_dict(d, delim="_"):
  """Go from {prefix: {key: value}} to {prefix_key: value}."""
  flattened = {}
  for k, v in d.items():
    if isinstance(v, dict):
      for k2, v2 in v.items():
        flattened[k + delim + k2] = v2
    else:
      flattened[k] = v
  return flattened


================================================
FILE: code/modeling.py
================================================
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""The main BERT model and related functions."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import copy
import json
import math
import re
import six
import tensorflow as tf
import numpy as np

def layer_norm(input_tensor, name=None):
  """Run layer normalization on the last dimension of the tensor."""
  return tf.contrib.layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)


def scale_l2(x, norm_length=1.0):
  # shape(x) = (batch, num_timesteps, d)
  # Divide x by max(abs(x)) for a numerically stable L2 norm.
  # 2norm(x) = a * 2norm(x/a)
  # Scale over the full sequence, dims (1, 2)
  alpha = tf.reduce_max(tf.abs(x), (1, 2), keep_dims=True) + 1e-12
  l2_norm = alpha * tf.sqrt(
      tf.reduce_sum(tf.pow(x / alpha, 2), (1, 2), keep_dims=True) + 1e-8)
  x_unit = x / l2_norm
  return norm_length * x_unit


class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


class BertModel(object):
  """BERT model ("Bidirectional Encoder Representations from Transformers").

  Example usage:

  ```python
  # Already been converted into WordPiece token ids
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
  token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

  config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

  model = modeling.BertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

  label_embeddings = tf.get_variable(...)
  pooled_output = model.get_pooled_output()
  logits = tf.matmul(pooled_output, label_embeddings)
  ...
  ```
  """

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=False,
               use_fgm=False,
               perturbation=None,
               spatial_dropout=None,
               electra=False,
               embedding_dropout=0.0,
               embedding_file=None,
               scope='bert'):
    """Constructor for BertModel.

    Args:
      config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
        it is much faster if this is True, on the CPU or GPU, it is faster if
        this is False.
      use_fgm: whether to use FGM
      perturbation: FGM perturbation
      scope: (optional) variable scope. Defaults to "bert".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
    """
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert", reuse=tf.AUTO_REUSE if use_fgm else None):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings,
            embedding_file=embedding_file,
            embedding_dropout=embedding_dropout if (is_training and embedding_dropout >0.0) else 0.0)

        if use_fgm and perturbation is not None:
          embedding_output = embedding_output + perturbation
        else:
          embedding_output = embedding_output

        if is_training and spatial_dropout is not None:
          embedding_output = spatial_dropout(embedding_output,is_training)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output, self.position_embeddings = embedding_postprocessor(
            input_tensor=embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      if electra:
        # electra没有pooler层
        self.pooled_output = self.sequence_output[:,0]
      else:
        with tf.variable_scope("pooler"):
          # We "pool" the model by simply taking the hidden state corresponding
          # to the first token. We assume that this has been pre-trained
          first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
          self.pooled_output = tf.layers.dense(
              first_token_tensor,
              config.hidden_size,
              activation=tf.tanh,
              kernel_initializer=create_initializer(config.initializer_range))

  def get_pooled_output(self):
    return self.pooled_output

  def get_sequence_output(self):
    """Gets final hidden layer of encoder.

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the final hidden of the transformer encoder.
    """
    return self.sequence_output

  def get_all_encoder_layers(self):
    return self.all_encoder_layers

  def get_position_embedding_output(self):
    return self.position_embeddings

  def get_embedding_output(self):
    """Gets output of the embedding lookup (i.e., input to the transformer).

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the output of the embedding layer, after summing the word
      embeddings with the positional embeddings and the token type embeddings,
      then performing layer normalization. This is the input to the transformer.
    """
    return self.embedding_output

  def get_embedding_table(self):
    return self.embedding_table


def gelu(input_tensor):
  """Gaussian Error Linear Unit.

  This is a smoother version of the RELU.
  Original paper: https://arxiv.org/abs/1606.08415

  Args:
    input_tensor: float Tensor to perform activation.

  Returns:
    `input_tensor` with the GELU activation applied.
  """
  cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
  return input_tensor * cdf


def get_activation(activation_string):
  """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.

  Args:
    activation_string: String name of the activation function.

  Returns:
    A Python function corresponding to the activation function. If
    `activation_string` is None, empty, or "linear", this will return None.
    If `activation_string` is not a string, it will return `activation_string`.

  Raises:
    ValueError: The `activation_string` does not correspond to a known
      activation.
  """

  # We assume that anything that"s not a string is already an activation
  # function, so we just return it.
  if not isinstance(activation_string, six.string_types):
    return activation_string

  if not activation_string:
    return None

  act = activation_string.lower()
  if act == "linear":
    return None
  elif act == "relu":
    return tf.nn.relu
  elif act == "gelu":
    return gelu
  elif act == "tanh":
    return tf.tanh
  else:
    raise ValueError("Unsupported activation: %s" % act)


def get_assignment_map_from_checkpoint(tvars, init_checkpoint, ignore_names=[], convert_electra=False):
  """Compute the union of the current variables and checkpoint variables."""
  assignment_map = {}
  initialized_variable_names = {}

  name_to_variable = collections.OrderedDict()
  for var in tvars:
    name = var.name
    m = re.match("^(.*):\\d+$", name)
    if m is not None:
      name = m.group(1)
    if ignore_names and name in ignore_names:
      continue
    name_to_variable[name] = var

  init_vars = tf.train.list_variables(init_checkpoint)

  assignment_map = collections.OrderedDict()
  for x in init_vars:
    (name, var) = (x[0], x[1])
    new_name = name
    if convert_electra:
      new_name = name.replace('electra','bert')
    if new_name not in name_to_variable:
      continue
    assignment_map[name] = new_name
    initialized_variable_names[new_name] = 1
    initialized_variable_names[new_name + ":0"] = 1

  return (assignment_map, initialized_variable_names)


def dropout(input_tensor, dropout_prob):
  """Perform dropout.

  Args:
    input_tensor: float Tensor.
    dropout_prob: Python float. The probability of dropping out a value (NOT of
      *keeping* a dimension as in `tf.nn.dropout`).

  Returns:
    A version of `input_tensor` with dropout applied.
  """
  if dropout_prob is None or dropout_prob == 0.0:
    return input_tensor

  output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
  return output


def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
  """Runs layer normalization followed by dropout."""
  output_tensor = layer_norm(input_tensor, name)
  output_tensor = dropout(output_tensor, dropout_prob)
  return output_tensor

def create_initializer(initializer_range=0.02):
  """Creates a `truncated_normal_initializer` with the given range."""
  return tf.truncated_normal_initializer(stddev=initializer_range)

def load_pretrained_embedding(embedding_file, vocab_size, embedding_size):
  pretrained = np.random.normal(size=(vocab_size,embedding_size))
  for i,line in enumerate(open(embedding_file)):
    fields = line.strip().split()
    word = fields[0]
    ebd = np.asarray(fields[1:])
    if len(ebd) != embedding_size:
      tf.logging.warning(f'第{i}行embedding大小为{len(ebd)} != {embedding_size}')
      return None
    else:
      pretrained[i] = ebd
  return pretrained

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False,
                     embedding_file=None,
                     embedding_dropout=0.0):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  if embedding_file:
    tf.logging.info(f'从{embedding_file}加载预训练词向量')
    pretrained = load_pretrained_embedding(embedding_file,vocab_size,embedding_size)
    if pretrained is not None:
      initializer = tf.constant_initializer(value=pretrained)
      embedding_table = tf.get_variable(
        name=word_embedding_name,
        initializer=lambda : initializer([vocab_size,embedding_size]))
    else:
      raise Exception('初始化词向量失败')
  else:
    embedding_table = tf.get_variable(
        name=word_embedding_name,
        shape=[vocab_size, embedding_size],
        initializer=create_initializer(initializer_range))

  if embedding_dropout > 0.0:
    mask = tf.nn.dropout(tf.ones([vocab_size]),keep_prob=1-embedding_dropout) * (1-embedding_dropout)
    mask = tf.expand_dims(mask,1)
    embedding_table = mask * embedding_table

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)


def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output, position_embeddings


def create_attention_mask_from_input_mask(from_tensor, to_mask):
  """Create 3D attention mask from a 2D tensor mask.

  Args:
    from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
    to_mask: int32 Tensor of shape [batch_size, to_seq_length].

  Returns:
    float Tensor of shape [batch_size, from_seq_length, to_seq_length].
  """
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # We don't assume that `from_tensor` is a mask (although it could be). We
  # don't actually care if we attend *from* padding tokens (only *to* padding)
  # tokens so we create a tensor of all ones.
  #
  # `broadcast_ones` = [batch_size, from_seq_length, 1]
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # Here we broadcast along two dimensions to create the mask.
  mask = broadcast_ones * to_mask

  return mask


def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the same, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-with vector.

  This function first projects `from_tensor` into a "query" tensor and
  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
  of tensors of length `num_attention_heads`, where each tensor is of shape
  [batch_size, seq_length, size_per_head].

  Then, the query and key tensors are dot-producted and scaled. These are
  softmaxed to obtain attention probabilities. The value tensors are then
  interpolated by these probabilities, then concatenated back to a single
  tensor and returned.

  In practice, the multi-headed attention are done with transposes and
  reshapes rather than actual separate tensors.

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
    float Tensor of shape [batch_size, from_seq_length,
      num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
      true, this will be of shape [batch_size * from_seq_length,
      num_attention_heads * size_per_head]).

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = g

Download .txt

gitextract_8qq4ysmv/

├── .dockerignore
├── .gitignore
├── .vscode/
│   └── settings.json
├── Dockerfile
├── README.md
├── code/
│   ├── assemble.py
│   ├── conlleval.py
│   ├── create_raw_text.py
│   ├── electra-pretrain/
│   │   ├── .gitignore
│   │   ├── LICENSE
│   │   ├── README.md
│   │   ├── build_pretraining_dataset.py
│   │   ├── config/
│   │   │   ├── base_discriminator_config.json
│   │   │   ├── base_generator_config.json
│   │   │   ├── large_discriminator_config.json
│   │   │   └── large_generator_config.json
│   │   ├── configure_pretraining.py
│   │   ├── model/
│   │   │   ├── __init__.py
│   │   │   ├── modeling.py
│   │   │   ├── optimization.py
│   │   │   └── tokenization.py
│   │   ├── pretrain/
│   │   │   ├── __init__.py
│   │   │   ├── pretrain_data.py
│   │   │   └── pretrain_helpers.py
│   │   ├── pretrain.sh
│   │   ├── run_pretraining.py
│   │   └── util/
│   │       ├── __init__.py
│   │       ├── training_utils.py
│   │       └── utils.py
│   ├── modeling.py
│   ├── optimization.py
│   ├── pipeline.py
│   ├── prepare.sh
│   ├── pretrain.sh
│   ├── run.sh
│   ├── run_biaffine_ner.py
│   ├── simple_run.sh
│   ├── tokenization.py
│   └── utils.py
└── user_data/
    └── extra_data/
        ├── dev.txt
        ├── test.txt
        └── train.txt

Download .txt

SYMBOL INDEX (283 symbols across 19 files)

FILE: code/assemble.py
  function refine_entity (line 11) | def refine_entity(w,s,e):
  function convert (line 25) | def convert(entity, refine=False):
  function get_entities (line 37) | def get_entities(text,tags):
  function check_special (line 66) | def check_special(text):
  function merge_by_4_tuple (line 74) | def merge_by_4_tuple(raw_texts,data,weights,threshold=3.0, refine=False):
  function assemble_fake (line 129) | def assemble_fake():
  function assemble_final (line 173) | def assemble_final():

FILE: code/conlleval.py
  class FormatError (line 20) | class FormatError(Exception):
  class EvalCounts (line 26) | class EvalCounts(object):
    method __init__ (line 27) | def __init__(self):
  function parse_args (line 40) | def parse_args(argv):
  function parse_tag (line 57) | def parse_tag(t):
  function evaluate (line 62) | def evaluate(iterable, options=None):
  function uniq (line 144) | def uniq(iterable):
  function calculate_metrics (line 149) | def calculate_metrics(correct, guessed, total):
  function metrics (line 157) | def metrics(counts):
  function report (line 170) | def report(counts, out=None):
  function report_notprint (line 196) | def report_notprint(counts, out=None):
  function end_of_chunk (line 230) | def end_of_chunk(prev_tag, tag, prev_type, type_):
  function start_of_chunk (line 255) | def start_of_chunk(prev_tag, tag, prev_type, type_):
  function return_report (line 280) | def return_report(input_file):
  function main (line 286) | def main(argv):

FILE: code/create_raw_text.py
  function read_conll (line 15) | def read_conll(fname):
  function read_track3 (line 28) | def read_track3(fname):
  function create_preatrain_data (line 37) | def create_preatrain_data():
  function convert_distance (line 63) | def convert_distance(item,tags):
  function convert_village (line 76) | def convert_village(item,tags):
  function convert_intersection (line 89) | def convert_intersection(item,tags,pattern):
  function get_intersection_pattern (line 108) | def get_intersection_pattern():
  function check_devzone (line 126) | def check_devzone(name):
  function convert_data_format_v2 (line 132) | def convert_data_format_v2(sentence):
  function _get_refine_entity (line 169) | def _get_refine_entity(raw_files):
  function _fix_data (line 194) | def _fix_data(ent_tp_cnt, update_files, iob=False):
  function fix_data (line 225) | def fix_data():
  function create_extra_train_data (line 248) | def create_extra_train_data():

FILE: code/electra-pretrain/build_pretraining_dataset.py
  function create_int_feature (line 29) | def create_int_feature(values):
  class ExampleBuilder (line 34) | class ExampleBuilder(object):
    method __init__ (line 37) | def __init__(self, tokenizer, max_length):
    method add_line (line 42) | def add_line(self, line):
    method _create_example (line 50) | def _create_example(self):
    method _make_tf_example (line 63) | def _make_tf_example(self, first_segment, second_segment):
  class ExampleWriter (line 83) | class ExampleWriter(object):
    method __init__ (line 86) | def __init__(self, job_id, vocab_file, output_dir, max_seq_length,
    method write_examples (line 101) | def write_examples(self, input_file):
    method finish (line 115) | def finish(self):
  function write_examples (line 120) | def write_examples(job_id, args):
  function main (line 155) | def main():

FILE: code/electra-pretrain/configure_pretraining.py
  class PretrainingConfig (line 25) | class PretrainingConfig(object):
    method __init__ (line 28) | def __init__(self, model_name, data_dir, **kwargs):
    method update (line 128) | def update(self, kwargs):

FILE: code/electra-pretrain/model/modeling.py
  class BertConfig (line 36) | class BertConfig(object):
    method __init__ (line 39) | def __init__(self,
    method from_dict (line 88) | def from_dict(cls, json_object):
    method from_json_file (line 96) | def from_json_file(cls, json_file):
    method to_dict (line 102) | def to_dict(self):
    method to_json_string (line 107) | def to_json_string(self):
  class BertModel (line 112) | class BertModel(object):
    method __init__ (line 137) | def __init__(self,
    method get_pooled_output (line 258) | def get_pooled_output(self):
    method get_sequence_output (line 261) | def get_sequence_output(self):
    method get_all_encoder_layers (line 270) | def get_all_encoder_layers(self):
    method get_embedding_output (line 273) | def get_embedding_output(self):
    method get_embedding_table (line 284) | def get_embedding_table(self):
  function gelu (line 288) | def gelu(input_tensor):
  function get_activation (line 304) | def get_activation(activation_string):
  function get_assignment_map_from_checkpoint (line 341) | def get_assignment_map_from_checkpoint(tvars, init_checkpoint, prefix=""...
  function dropout (line 367) | def dropout(input_tensor, dropout_prob):
  function layer_norm (line 385) | def layer_norm(input_tensor, name=None):
  function layer_norm_and_dropout (line 391) | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
  function create_initializer (line 398) | def create_initializer(initializer_range=0.02):
  function load_pretrained_embedding (line 402) | def load_pretrained_embedding(embedding_file, vocab_size, embedding_size):
  function embedding_lookup (line 415) | def embedding_lookup(input_ids,
  function embedding_postprocessor (line 484) | def embedding_postprocessor(input_tensor,
  function create_attention_mask_from_input_mask (line 580) | def create_attention_mask_from_input_mask(from_tensor, to_mask):
  function attention_layer (line 614) | def attention_layer(from_tensor,
  function transformer_model (line 810) | def transformer_model(input_tensor,
  function get_shape_list (line 947) | def get_shape_list(tensor, expected_rank=None, name=None):
  function reshape_to_matrix (line 992) | def reshape_to_matrix(input_tensor):
  function reshape_from_matrix (line 1006) | def reshape_from_matrix(output_tensor, orig_shape_list):
  function assert_rank (line 1019) | def assert_rank(tensor, expected_rank, name=None):

FILE: code/electra-pretrain/model/optimization.py
  function create_optimizer (line 30) | def create_optimizer(
  class AdamWeightDecayOptimizer (line 151) | class AdamWeightDecayOptimizer(tf.train.Optimizer):
    method __init__ (line 154) | def __init__(self,
    method _apply_gradients (line 172) | def _apply_gradients(self, grads_and_vars, learning_rate):
    method apply_gradients (line 223) | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    method _do_use_weight_decay (line 244) | def _do_use_weight_decay(self, param_name):
    method _get_variable_name (line 254) | def _get_variable_name(self, param_name):
  function _get_layer_lrs (line 262) | def _get_layer_lrs(learning_rate, layer_decay, n_layers):

FILE: code/electra-pretrain/model/tokenization.py
  function convert_to_unicode (line 29) | def convert_to_unicode(text):
  function printable_text (line 49) | def printable_text(text):
  function load_vocab (line 72) | def load_vocab(vocab_file):
  function convert_by_vocab (line 87) | def convert_by_vocab(vocab, items):
  function convert_tokens_to_ids (line 95) | def convert_tokens_to_ids(vocab, tokens):
  function convert_ids_to_tokens (line 99) | def convert_ids_to_tokens(inv_vocab, ids):
  function whitespace_tokenize (line 103) | def whitespace_tokenize(text):
  class SimpleTokenizer (line 111) | class SimpleTokenizer(object):
    method __init__ (line 112) | def __init__(self, vocab_file):
    method tokenize (line 116) | def tokenize(self, text):
    method convert_tokens_to_ids (line 120) | def convert_tokens_to_ids(self, tokens):
    method convert_ids_to_tokens (line 123) | def convert_ids_to_tokens(self, ids):
  class FullTokenizer (line 127) | class FullTokenizer(object):
    method __init__ (line 130) | def __init__(self, vocab_file, do_lower_case=True):
    method tokenize (line 136) | def tokenize(self, text):
    method convert_tokens_to_ids (line 144) | def convert_tokens_to_ids(self, tokens):
    method convert_ids_to_tokens (line 147) | def convert_ids_to_tokens(self, ids):
  class BasicTokenizer (line 151) | class BasicTokenizer(object):
    method __init__ (line 154) | def __init__(self, do_lower_case=True):
    method tokenize (line 162) | def tokenize(self, text):
    method _run_strip_accents (line 186) | def _run_strip_accents(self, text):
    method _run_split_on_punc (line 197) | def _run_split_on_punc(self, text):
    method _tokenize_chinese_chars (line 217) | def _tokenize_chinese_chars(self, text):
    method _is_chinese_char (line 230) | def _is_chinese_char(self, cp):
    method _clean_text (line 252) | def _clean_text(self, text):
  class WordpieceTokenizer (line 266) | class WordpieceTokenizer(object):
    method __init__ (line 269) | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=...
    method tokenize (line 274) | def tokenize(self, text):
  function _is_whitespace (line 328) | def _is_whitespace(char):
  function _is_control (line 340) | def _is_control(char):
  function _is_punctuation (line 352) | def _is_punctuation(char):

FILE: code/electra-pretrain/pretrain/pretrain_data.py
  function get_input_fn (line 32) | def get_input_fn(config: configure_pretraining.PretrainingConfig, is_tra...
  function _decode_record (line 83) | def _decode_record(record, name_to_features):
  function features_to_inputs (line 105) | def features_to_inputs(features):
  function get_updated_inputs (line 119) | def get_updated_inputs(inputs, **kwargs):
  function print_tokens (line 134) | def print_tokens(inputs: Inputs, inv_vocab, updates_mask=None):

FILE: code/electra-pretrain/pretrain/pretrain_helpers.py
  function gather_positions (line 33) | def gather_positions(sequence, positions):
  function scatter_update (line 62) | def scatter_update(sequence, updates, positions):
  function _get_candidates_mask (line 118) | def _get_candidates_mask(inputs: pretrain_data.Inputs, vocab,
  function mask (line 131) | def mask(config: configure_pretraining.PretrainingConfig,
  function unmask (line 203) | def unmask(inputs: pretrain_data.Inputs):
  function sample_from_softmax (line 209) | def sample_from_softmax(logits, disallow=None):

FILE: code/electra-pretrain/run_pretraining.py
  class PretrainingModel (line 37) | class PretrainingModel(object):
    method __init__ (line 40) | def __init__(self, config: configure_pretraining.PretrainingConfig,
    method _get_masked_lm_output (line 145) | def _get_masked_lm_output(self, inputs: pretrain_data.Inputs, model):
    method _get_discriminator_output (line 193) | def _get_discriminator_output(self, inputs, discriminator, labels):
    method _get_fake_data (line 220) | def _get_fake_data(self, inputs, mlm_logits):
    method _build_transformer (line 240) | def _build_transformer(self, inputs: pretrain_data.Inputs, is_training,
  function get_generator_config (line 258) | def get_generator_config(config: configure_pretraining.PretrainingConfig,
  function model_fn_builder (line 271) | def model_fn_builder(config: configure_pretraining.PretrainingConfig):
  function train_or_eval (line 332) | def train_or_eval(config: configure_pretraining.PretrainingConfig):
  function train_one_step (line 395) | def train_one_step(config: configure_pretraining.PretrainingConfig):
  function main (line 406) | def main():

FILE: code/electra-pretrain/util/training_utils.py
  class ETAHook (line 31) | class ETAHook(tf.estimator.SessionRunHook):
    method __init__ (line 34) | def __init__(self, to_log, n_steps, iterations_per_loop, on_tpu,
    method begin (line 49) | def begin(self):
    method before_run (line 52) | def before_run(self, run_context):
    method after_run (line 57) | def after_run(self, run_context, run_values):
    method end (line 65) | def end(self, session):
    method log (line 70) | def log(self, run_values=None):
  function secs_to_str (line 91) | def secs_to_str(secs):
  function get_bert_config (line 100) | def get_bert_config(config):

FILE: code/electra-pretrain/util/utils.py
  function load_json (line 29) | def load_json(path):
  function write_json (line 34) | def write_json(o, path):
  function load_pickle (line 41) | def load_pickle(path):
  function write_pickle (line 46) | def write_pickle(o, path):
  function mkdir (line 53) | def mkdir(path):
  function rmrf (line 58) | def rmrf(path):
  function rmkdir (line 63) | def rmkdir(path):
  function log (line 68) | def log(*args):
  function log_config (line 74) | def log_config(config):
  function heading (line 80) | def heading(*args):
  function nest_dict (line 86) | def nest_dict(d, prefixes, delim="_"):
  function flatten_dict (line 100) | def flatten_dict(d, delim="_"):

FILE: code/modeling.py
  function layer_norm (line 30) | def layer_norm(input_tensor, name=None):
  function scale_l2 (line 36) | def scale_l2(x, norm_length=1.0):
  class BertConfig (line 48) | class BertConfig(object):
    method __init__ (line 51) | def __init__(self,
    method from_dict (line 100) | def from_dict(cls, json_object):
    method from_json_file (line 108) | def from_json_file(cls, json_file):
    method to_dict (line 114) | def to_dict(self):
    method to_json_string (line 119) | def to_json_string(self):
  class BertModel (line 124) | class BertModel(object):
    method __init__ (line 148) | def __init__(self,
    method get_pooled_output (line 275) | def get_pooled_output(self):
    method get_sequence_output (line 278) | def get_sequence_output(self):
    method get_all_encoder_layers (line 287) | def get_all_encoder_layers(self):
    method get_position_embedding_output (line 290) | def get_position_embedding_output(self):
    method get_embedding_output (line 293) | def get_embedding_output(self):
    method get_embedding_table (line 304) | def get_embedding_table(self):
  function gelu (line 308) | def gelu(input_tensor):
  function get_activation (line 324) | def get_activation(activation_string):
  function get_assignment_map_from_checkpoint (line 361) | def get_assignment_map_from_checkpoint(tvars, init_checkpoint, ignore_na...
  function dropout (line 393) | def dropout(input_tensor, dropout_prob):
  function layer_norm_and_dropout (line 411) | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
  function create_initializer (line 417) | def create_initializer(initializer_range=0.02):
  function load_pretrained_embedding (line 421) | def load_pretrained_embedding(embedding_file, vocab_size, embedding_size):
  function embedding_lookup (line 434) | def embedding_lookup(input_ids,
  function embedding_postprocessor (line 501) | def embedding_postprocessor(input_tensor,
  function create_attention_mask_from_input_mask (line 597) | def create_attention_mask_from_input_mask(from_tensor, to_mask):
  function attention_layer (line 631) | def attention_layer(from_tensor,
  function transformer_model (line 827) | def transformer_model(input_tensor,
  function get_shape_list (line 968) | def get_shape_list(tensor, expected_rank=None, name=None):
  function reshape_to_matrix (line 1005) | def reshape_to_matrix(input_tensor):
  function reshape_from_matrix (line 1019) | def reshape_from_matrix(output_tensor, orig_shape_list):
  function assert_rank (line 1032) | def assert_rank(tensor, expected_rank, name=None):

FILE: code/optimization.py
  function create_optimizer (line 24) | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, h...
  class AdamWeightDecayOptimizer (line 169) | class AdamWeightDecayOptimizer(tf.train.Optimizer):
    method __init__ (line 172) | def __init__(self,
    method _apply_gradients (line 192) | def _apply_gradients(self, grads_and_vars, learning_rate):
    method apply_gradients (line 243) | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    method _do_use_weight_decay (line 262) | def _do_use_weight_decay(self, param_name):
    method _get_variable_name (line 272) | def _get_variable_name(self, param_name):

FILE: code/pipeline.py
  class Timer (line 21) | class Timer(object):
    method __init__ (line 22) | def __init__(self):
    method get_current_time (line 25) | def get_current_time(self):
  function train_model (line 28) | def train_model(args,cmd):

FILE: code/run_biaffine_ner.py
  class InputExample (line 175) | class InputExample(object):
    method __init__ (line 178) | def __init__(self, guid, text, label=None, raw_text=None):
  class InputFeatures (line 194) | class InputFeatures(object):
    method __init__ (line 197) | def __init__(self, input_ids, input_mask, segment_ids, span_mask, gold...
  function data_enhance (line 205) | def data_enhance(sentences, num=10):
  class DataProcessor (line 246) | class DataProcessor(object):
    method get_train_examples (line 249) | def get_train_examples(self, data_dir):
    method get_dev_examples (line 253) | def get_dev_examples(self, data_dir):
    method get_labels (line 257) | def get_labels(self):
  class NERProcessor (line 262) | class NERProcessor(DataProcessor):
    method __init__ (line 263) | def __init__(self, fold_id=0, fold_num=0):
    method get_train_examples (line 267) | def get_train_examples(self, data_dir, file_name='train.conll'):
    method get_dev_examples (line 307) | def get_dev_examples(self, data_dir, file_name="dev.conll"):
    method get_test_examples (line 325) | def get_test_examples(self, data_dir, file_name="final_test.txt"):
    method get_labels (line 339) | def get_labels(self):
    method check (line 344) | def check(self, text, label):
  function convert_single_example (line 358) | def convert_single_example(ex_index, example, label_list, max_seq_length...
  function filed_based_convert_examples_to_features (line 425) | def filed_based_convert_examples_to_features(
  function file_based_input_fn_builder (line 452) | def file_based_input_fn_builder(input_file, batch_size, seq_length, is_t...
  function biaffine_mapping (line 506) | def biaffine_mapping(vector_set_1,
  function create_model (line 590) | def create_model(bert_config, is_training, input_ids, input_mask,
  function focal_loss (line 672) | def focal_loss(logits, labels, gamma=2.0):
  function model_fn_builder (line 680) | def model_fn_builder(bert_config, num_labels, init_checkpoint=None, lear...
  function main (line 841) | def main(_):

FILE: code/tokenization.py
  function convert_to_unicode (line 29) | def convert_to_unicode(text):
  function printable_text (line 49) | def printable_text(text):
  function load_vocab (line 72) | def load_vocab(vocab_file):
  function convert_by_vocab (line 87) | def convert_by_vocab(vocab, items):
  function convert_tokens_to_ids (line 95) | def convert_tokens_to_ids(vocab, tokens):
  function convert_ids_to_tokens (line 99) | def convert_ids_to_tokens(inv_vocab, ids):
  function whitespace_tokenize (line 103) | def whitespace_tokenize(text):
  class SimpleTokenizer (line 111) | class SimpleTokenizer(object):
    method __init__ (line 112) | def __init__(self, vocab_file, do_lower_case=True):
    method tokenize (line 117) | def tokenize(self, text):
    method convert_tokens_to_ids (line 122) | def convert_tokens_to_ids(self, tokens):
    method convert_ids_to_tokens (line 125) | def convert_ids_to_tokens(self, ids):
  class FullTokenizer (line 129) | class FullTokenizer(object):
    method __init__ (line 132) | def __init__(self, vocab_file, do_lower_case=True):
    method tokenize (line 138) | def tokenize(self, text):
    method convert_tokens_to_ids (line 146) | def convert_tokens_to_ids(self, tokens):
    method convert_ids_to_tokens (line 149) | def convert_ids_to_tokens(self, ids):
  class BasicTokenizer (line 153) | class BasicTokenizer(object):
    method __init__ (line 156) | def __init__(self, do_lower_case=True):
    method tokenize (line 164) | def tokenize(self, text):
    method _run_strip_accents (line 188) | def _run_strip_accents(self, text):
    method _run_split_on_punc (line 199) | def _run_split_on_punc(self, text):
    method _tokenize_chinese_chars (line 219) | def _tokenize_chinese_chars(self, text):
    method _is_chinese_char (line 232) | def _is_chinese_char(self, cp):
    method _clean_text (line 254) | def _clean_text(self, text):
  class WordpieceTokenizer (line 268) | class WordpieceTokenizer(object):
    method __init__ (line 271) | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=...
    method tokenize (line 276) | def tokenize(self, text):
  function _is_whitespace (line 330) | def _is_whitespace(char):
  function _is_control (line 342) | def _is_control(char):
  function _is_punctuation (line 354) | def _is_punctuation(char):

FILE: code/utils.py
  function normalize (line 16) | def normalize(text):
  function convert_data_format (line 21) | def convert_data_format(sentence):
  function convert_back_to_bio (line 53) | def convert_back_to_bio(entities,text):
  function iobes_iob (line 62) | def iobes_iob(tags):
  function iob_iobes (line 82) | def iob_iobes(tags):
  function read_data (line 106) | def read_data(fnames, zeros=False, lower=False):
  function iob2 (line 151) | def iob2(tags):
  function update_tag_scheme (line 172) | def update_tag_scheme(sentences, tag_scheme='iobes', convert_to_iob=False):
  function eval_ner (line 198) | def eval_ner(results, path, name):
  function convert_to_bio (line 217) | def convert_to_bio(tags):
  function get_biaffine_pred_prob (line 234) | def get_biaffine_pred_prob(text, span_scores, label_list):
  function get_biaffine_pred_ner (line 260) | def get_biaffine_pred_ner(text, span_scores, is_flat_ner=True):
  function get_biaffine_pred_ner_with_dp (line 297) | def get_biaffine_pred_ner_with_dp(text, span_scores, with_logits=True, t...
  class SWAHook (line 348) | class SWAHook(tf.train.SessionRunHook):
    method __init__ (line 349) | def __init__(self, swa_steps, start_swa_step, checkpoint_path):
    method begin (line 355) | def begin(self):
    method after_run (line 375) | def after_run(self, run_context, run_values):
    method end (line 386) | def end(self, session):
  class BestF1Exporter (line 390) | class BestF1Exporter(tf.estimator.Exporter):
    method __init__ (line 391) | def __init__(self, input_fn, examples, label_list, max_seq_length, dp=...
    method name (line 401) | def name(self):
    method get_biaffine_result (line 404) | def get_biaffine_result(self,estimator):
    method export (line 440) | def export(self, estimator, export_path, checkpoint_path, eval_result,...

Download .json

Condensed preview — 42 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (3,622K chars).

[
  {
    "path": ".dockerignore",
    "chars": 184,
    "preview": ".git/\ncode/__pycache__\n__pycache__/\nuser_data/models/\nuser_data/pretrain_tfrecords/\nuser_data/texts/\nuser_data/tcdata/\nu"
  },
  {
    "path": ".gitignore",
    "chars": 1860,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": ".vscode/settings.json",
    "chars": 74,
    "preview": "{\n  \"python.pythonPath\": \"/home/xueyou/.conda/envs/jason_py3/bin/python\"\n}"
  },
  {
    "path": "Dockerfile",
    "chars": 559,
    "preview": "FROM nvcr.io/nvidia/tensorflow:19.10-py3\n\n# set noninteractive installation\nENV DEBIAN_FRONTEND=noninteractive\n\n# instal"
  },
  {
    "path": "README.md",
    "chars": 6121,
    "preview": "# CCKS2021-赛道二-中文NLP地址要素解析\n\n团队：xueyouluo\n\n初赛：1 - 93.63\n\n复赛：3 - 91.32\n\n> 这里的代码是复赛的全流程代码，需要在32G显存的卡上才能正常跑通，如果没有这么大的显存，可以考虑"
  },
  {
    "path": "code/assemble.py",
    "chars": 5146,
    "preview": "'''\n模型结果融合\n'''\nimport re\nfrom collections import Counter, defaultdict\nfrom glob import glob\n\nfrom utils import convert_d"
  },
  {
    "path": "code/conlleval.py",
    "chars": 10116,
    "preview": "# Python version of the evaluation script from CoNLL'00-\n# Originates from: https://github.com/spyysalo/conlleval.py\n\n\n#"
  },
  {
    "path": "code/create_raw_text.py",
    "chars": 11074,
    "preview": "import re\nimport json\nimport random\n\nfrom collections import Counter,defaultdict\n\nfrom utils import normalize, read_data"
  },
  {
    "path": "code/electra-pretrain/.gitignore",
    "chars": 1799,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": "code/electra-pretrain/LICENSE",
    "chars": 11358,
    "preview": "\n                                 Apache License\n                           Version 2.0, January 2004\n                  "
  },
  {
    "path": "code/electra-pretrain/README.md",
    "chars": 805,
    "preview": "# Electra Pretrain\n\n在哈工大训练的electra基础上使用领域数据继续进行预训练，一般能够提升下游任务效果。\n\n## 改动\n\n- 由于我们的语料是单句粒度，修改数据构建方法，只构建单句的语料\n- 针对中文，使用更简单的t"
  },
  {
    "path": "code/electra-pretrain/build_pretraining_dataset.py",
    "chars": 6750,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/config/base_discriminator_config.json",
    "chars": 558,
    "preview": "{\n  \"attention_probs_dropout_prob\": 0.1,\n  \"directionality\": \"bidi\",\n  \"embedding_size\": 768,\n  \"hidden_act\": \"gelu\",\n  "
  },
  {
    "path": "code/electra-pretrain/config/base_generator_config.json",
    "chars": 556,
    "preview": "{\n  \"attention_probs_dropout_prob\": 0.1,\n  \"directionality\": \"bidi\",\n  \"embedding_size\": 768,\n  \"hidden_act\": \"gelu\",\n  "
  },
  {
    "path": "code/electra-pretrain/config/large_discriminator_config.json",
    "chars": 532,
    "preview": "{\n  \"attention_probs_dropout_prob\": 0.1,\n  \"embedding_size\": 1024,\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,"
  },
  {
    "path": "code/electra-pretrain/config/large_generator_config.json",
    "chars": 530,
    "preview": "{\n  \"attention_probs_dropout_prob\": 0.1,\n  \"embedding_size\": 1024,\n  \"hidden_act\": \"gelu\",\n  \"hidden_dropout_prob\": 0.1,"
  },
  {
    "path": "code/electra-pretrain/configure_pretraining.py",
    "chars": 5224,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/model/__init__.py",
    "chars": 606,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/model/modeling.py",
    "chars": 40541,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/model/optimization.py",
    "chars": 11000,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/model/tokenization.py",
    "chars": 11074,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/pretrain/__init__.py",
    "chars": 606,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/pretrain/pretrain_data.py",
    "chars": 5309,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/pretrain/pretrain_helpers.py",
    "chars": 8668,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/pretrain.sh",
    "chars": 1199,
    "preview": "export DATA_DIR=../../user_data\nexport ELECTRA_DIR=../../user_data/electra\n\necho 'Prepare pretraining data...'\npython bu"
  },
  {
    "path": "code/electra-pretrain/run_pretraining.py",
    "chars": 18142,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/util/__init__.py",
    "chars": 606,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/util/training_utils.py",
    "chars": 4219,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/electra-pretrain/util/utils.py",
    "chars": 2492,
    "preview": "# coding=utf-8\n# Copyright 2020 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "code/modeling.py",
    "chars": 40981,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "code/optimization.py",
    "chars": 11488,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "code/pipeline.py",
    "chars": 8697,
    "preview": "import subprocess\nimport time\nimport os\nimport logging\nimport copy\n\nimport threading\nfrom multiprocessing import Process"
  },
  {
    "path": "code/prepare.sh",
    "chars": 157,
    "preview": "#!/usr/bin/env bash\n\nmkdir -p ../user_data/tcdata\nmkdir -p ../user_data/texts\n\ncp -r ../tcdata ../user_data\n\necho \"Data "
  },
  {
    "path": "code/pretrain.sh",
    "chars": 65,
    "preview": "#!/usr/bin/env bash\n\ncd electra-pretrain\n\nbash pretrain.sh\n\ncd .."
  },
  {
    "path": "code/run.sh",
    "chars": 38,
    "preview": "#!/usr/bin/env bash\npython pipeline.py"
  },
  {
    "path": "code/run_biaffine_ner.py",
    "chars": 41937,
    "preview": "#! usr/bin/env python3\n# -*- coding:utf-8 -*-\n\"\"\"\nCopyright 2018 The Google AI Language Team Authors.\nBASED ON Google_BE"
  },
  {
    "path": "code/simple_run.sh",
    "chars": 3355,
    "preview": "#!/usr/bin/env bash\n\n# 数据预处理\nmkdir -p ../user_data/tcdata\nmkdir -p ../user_data/texts\n\ncp -r ../tcdata ../user_data\n\nech"
  },
  {
    "path": "code/tokenization.py",
    "chars": 11139,
    "preview": "# coding=utf-8\n# Copyright 2018 The Google AI Language Team Authors.\n#\n# Licensed under the Apache License, Version 2.0 "
  },
  {
    "path": "code/utils.py",
    "chars": 16725,
    "preview": "import tensorflow as tf\nimport numpy as np\nimport re\nimport random\nimport json\nimport glob\nimport codecs\nimport os\nimpor"
  },
  {
    "path": "user_data/extra_data/dev.txt",
    "chars": 519756,
    "preview": "宁 B-city\n波 I-city\n市 I-city\n江 B-district\n东 I-district\n区 I-district\n金 B-road\n家 I-road\n一 I-road\n路 I-road\n_ B-redundant\n寰 B-"
  },
  {
    "path": "user_data/extra_data/test.txt",
    "chars": 512188,
    "preview": "龙 B-town\n港 I-town\n镇 I-town\n泰 B-poi\n和 I-poi\n小 I-poi\n区 I-poi\nB B-houseno\n懂 I-houseno\n1097 B-roomno\n\n浙 B-prov\n江 I-prov\n省 I-"
  },
  {
    "path": "user_data/extra_data/train.txt",
    "chars": 1552409,
    "preview": "龙 B-town\n山 I-town\n镇 I-town\n慈 B-community\n东 I-community\n滨 B-redundant\n海 I-redundant\n区 I-redundant\n海 B-road\n丰 I-road\n北 I-r"
  }
]

About this extraction

This page contains the full source code of the xueyouluo/ccks2021-track2-code GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 42 files (2.8 MB), approximately 723.7k tokens, and a symbol index with 283 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo