Repository: ZhuiyiTechnology/t5-pegasus
Branch: main
Commit: a5211b7d4c6a
Files: 5
Total size: 29.9 KB

Directory structure:
t5-pegasus/

├── LICENSE
├── README.md
├── finetune.py
├── train.py
└── train.tsv

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
# T5 PEGASUS

A Chinese generative pre-trained model. It takes mT5 as the base architecture and initial weights, and is pre-trained in a PEGASUS-like fashion.

Details: https://kexue.fm/archives/8209

## Tokenizer

We replaced the T5 PEGASUS tokenizer with BERT's tokenizer, which is friendlier to Chinese. We also rebuilt the vocabulary so that its characters and words are more complete: the current vocab.txt contains 50,000 tokens and genuinely covers the characters and words in common Chinese use.
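
For concreteness, this is the tokenizer setup used by finetune.py below: a BERT-style vocabulary combined with jieba word-level pre-segmentation (the vocab.txt path is a placeholder for the released file):

```python
# Word-level tokenizer over the released 50k-token vocab.txt,
# mirroring the construction in finetune.py below.
import jieba
from bert4keras.tokenizers import Tokenizer

jieba.initialize()
tokenizer = Tokenizer(
    'vocab.txt',  # placeholder path to the released vocabulary
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False)  # segment into words first
)
token_ids, segment_ids = tokenizer.encode(u'中文生成式预训练模型')
```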

## Pre-training task

The pre-training task imitates PEGASUS's summarization-style pre-training. Concretely, suppose a document has n sentences. We pick roughly n/4 of them (not necessarily contiguous) such that the longest common subsequence between the text formed by concatenating those n/4 sentences and the text formed by concatenating the remaining 3n/4 sentences is as long as possible. The 3n/4-sentence text is then treated as the source and the n/4-sentence text as the summary, which yields one pseudo "(source, summary)" pair.

<img src="https://raw.githubusercontent.com/ZhuiyiTechnology/t5-pegasus/main/data-sample.png" width=500>
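
The selection is greedy: at each step, the sentence whose transfer into the summary maximizes the LCS between the two texts is moved over, until the summary reaches about a quarter of the source length. A condensed sketch of that loop (the full version, including a final length swap, is `pseudo_summary` in train.py below):

```python
# Greedy LCS-based sentence selection, condensed from train.py's pseudo_summary.
import numpy as np
import pylcs  # pip install pylcs

def pseudo_summary(texts, summary_rate=0.25):
    """Split a list of sentences into a pseudo (source, summary) pair."""
    source_idxs, target_idxs = list(range(len(texts))), []
    while True:
        # Score each candidate move by the LCS between the resulting texts.
        sims = []
        for i in source_idxs:
            source = ''.join(texts[j] for j in source_idxs if j != i)
            target = ''.join(texts[j] for j in sorted(target_idxs + [i]))
            sims.append(pylcs.lcs(source, target))
        best = source_idxs[int(np.argmax(sims))]
        source_idxs.remove(best)
        target_idxs = sorted(target_idxs + [best])
        source = ''.join(texts[i] for i in source_idxs)
        target = ''.join(texts[i] for i in target_idxs)
        # Stop once the summary is ~1/4 of the source, or almost nothing is left.
        if len(source_idxs) == 1 or len(target) / len(source) > summary_rate:
            return source, target
```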

## Model download

The currently released T5 PEGASUS is the base version, with 275M parameters in total. It was trained with a maximum length of 512, a batch_size of 96, and a learning rate of 10<sup>-4</sup> for 1M steps on 6 RTX 3090 cards (about 13 days), on 30+ GB of carefully cleaned general-domain corpus; training accuracy is about 47% and training loss about 2.97. The model is written, trained, and tested with <a href="https://github.com/bojone/bert4keras" target="_blank">bert4keras</a>.

Environment: tensorflow 1.15 + keras 2.3.1 + bert4keras 0.10.0

Download: [chinese_t5_pegasus_base.zip](https://open.zhuiyi.ai/releases/nlp/models/zhuiyi/chinese_t5_pegasus_base.zip)
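
A minimal restore sketch with bert4keras, assuming the archive unpacks to config.json, model.ckpt, and vocab.txt with the same layout that finetune.py below expects:

```python
# Sketch: restore the released base checkpoint with bert4keras.
# The directory layout is an assumption based on the paths in finetune.py.
from bert4keras.models import build_transformer_model

t5 = build_transformer_model(
    config_path='chinese_t5_pegasus_base/config.json',
    checkpoint_path='chinese_t5_pegasus_base/model.ckpt',
    model='mt5.1.1',          # the checkpoint uses the mT5.1.1 architecture
    return_keras_model=False,
    name='T5',
)
encoder, decoder = t5.encoder, t5.decoder
# For generation, wrap these in an AutoRegressiveDecoder as finetune.py's AutoTitle does.
```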

**2021-03-16:** Added a small version of T5 PEGASUS, with 95M parameters, which is friendlier to GPU memory. Its training setup matches the base version (maximum length 512, batch_size 96, learning rate 10<sup>-4</sup>, 1M steps on 3 TITAN cards over about 12 days, on the same 30+ GB of cleaned general-domain corpus; training accuracy is about 42.3%, training loss about 3.40). Its Chinese performance is slightly below the base version, but better than mT5 small.

Download: [chinese_t5_pegasus_small.zip](https://open.zhuiyi.ai/releases/nlp/models/zhuiyi/chinese_t5_pegasus_small.zip)

## Other frameworks

A PyTorch port by renmada: https://github.com/renmada/t5-pegasus-pytorch

## Selected evaluations

Summarization results:

<img src="https://raw.githubusercontent.com/ZhuiyiTechnology/t5-pegasus/main/csl-lcsts.png" width=500>

Few-shot learning:

<img src="https://raw.githubusercontent.com/ZhuiyiTechnology/t5-pegasus/main/few-shot.png" width=500>

## Citation

BibTeX:

```latex
@techreport{zhuiyit5pegasus,
  title={T5 PEGASUS - ZhuiyiAI},
  author={Jianlin Su},
  year={2021},
  url="https://github.com/ZhuiyiTechnology/t5-pegasus",
}
```

## Contact us

Email: ai@wezhuiyi.com. Zhuiyi Technology: https://zhuiyi.ai

================================================
FILE: finetune.py
================================================
#! -*- coding: utf-8 -*-
# Fine-tune T5 PEGASUS on a seq2seq (title generation) task
# Introduction: https://kexue.fm/archives/8209

from __future__ import print_function
import json
import numpy as np
from tqdm import tqdm
from bert4keras.backend import keras, K
from bert4keras.layers import Loss
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, open
from bert4keras.snippets import DataGenerator, AutoRegressiveDecoder
from keras.models import Model
from rouge import Rouge  # pip install rouge
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import jieba
jieba.initialize()

# Basic hyperparameters
max_c_len = 256
max_t_len = 32
batch_size = 32
epochs = 40

# Model paths
config_path = '/root/kg/bert/chinese_t5_pegasus_base/config.json'
checkpoint_path = '/root/kg/bert/chinese_t5_pegasus_base/model.ckpt'
dict_path = '/root/kg/bert/chinese_t5_pegasus_base/vocab.txt'


def load_data(filename):
    """加载数据
    单条格式:(标题, 正文)
    """
    D = []
    with open(filename, encoding='utf-8') as f:
        for l in f:
            title, content = l.strip().split('\t')
            D.append((title, content))
    return D


# Load the datasets
train_data = load_data('/root/csl/train.tsv')
valid_data = load_data('/root/csl/val.tsv')
test_data = load_data('/root/csl/test.tsv')

# Build the tokenizer (word-level: jieba pre-segmentation over a BERT-style vocab)
tokenizer = Tokenizer(
    dict_path,
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False)
)


class data_generator(DataGenerator):
    """数据生成器
    """
    def __iter__(self, random=False):
        batch_c_token_ids, batch_t_token_ids = [], []
        for is_end, (title, content) in self.sample(random):
            c_token_ids, _ = tokenizer.encode(content, maxlen=max_c_len)
            t_token_ids, _ = tokenizer.encode(title, maxlen=max_t_len)
            batch_c_token_ids.append(c_token_ids)
            batch_t_token_ids.append(t_token_ids)
            if len(batch_c_token_ids) == self.batch_size or is_end:
                batch_c_token_ids = sequence_padding(batch_c_token_ids)
                batch_t_token_ids = sequence_padding(batch_t_token_ids)
                yield [batch_c_token_ids, batch_t_token_ids], None
                batch_c_token_ids, batch_t_token_ids = [], []


class CrossEntropy(Loss):
    """交叉熵作为loss,并mask掉输入部分
    """
    def compute_loss(self, inputs, mask=None):
        y_true, y_pred = inputs
        y_true = y_true[:, 1:]  # 目标token_ids
        y_mask = K.cast(mask[1], K.floatx())[:, 1:]  # 解码器自带mask
        y_pred = y_pred[:, :-1]  # 预测序列,错开一位
        loss = K.sparse_categorical_crossentropy(y_true, y_pred)
        loss = K.sum(loss * y_mask) / K.sum(y_mask)
        return loss


t5 = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    model='mt5.1.1',
    return_keras_model=False,
    name='T5',
)

encoder = t5.encoder
decoder = t5.decoder
model = t5.model
model.summary()

output = CrossEntropy(1)([model.inputs[1], model.outputs[0]])

model = Model(model.inputs, output)
model.compile(optimizer=Adam(2e-4))


class AutoTitle(AutoRegressiveDecoder):
    """seq2seq解码器
    """
    @AutoRegressiveDecoder.wraps(default_rtype='probas')
    def predict(self, inputs, output_ids, states):
        c_encoded = inputs[0]
        return self.last_token(decoder).predict([c_encoded, output_ids])

    def generate(self, text, topk=1):
        c_token_ids, _ = tokenizer.encode(text, maxlen=max_c_len)
        c_encoded = encoder.predict(np.array([c_token_ids]))[0]
        output_ids = self.beam_search([c_encoded], topk=topk)  # beam search decoding
        return tokenizer.decode(output_ids)


autotitle = AutoTitle(
    start_id=tokenizer._token_start_id,
    end_id=tokenizer._token_end_id,
    maxlen=max_t_len
)


class Evaluator(keras.callbacks.Callback):
    """评估与保存
    """
    def __init__(self):
        self.rouge = Rouge()
        self.smooth = SmoothingFunction().method1
        self.best_bleu = 0.

    def on_epoch_end(self, epoch, logs=None):
        metrics = self.evaluate(valid_data)  # evaluate on the validation set
        if metrics['bleu'] > self.best_bleu:
            self.best_bleu = metrics['bleu']
            model.save_weights('./best_model.weights')  # save the best model so far
        metrics['best_bleu'] = self.best_bleu
        print('valid_data:', metrics)

    def evaluate(self, data, topk=1):
        total = 0
        rouge_1, rouge_2, rouge_l, bleu = 0, 0, 0, 0
        for title, content in tqdm(data):
            total += 1
            title = ' '.join(title).lower()
            pred_title = ' '.join(autotitle.generate(content,
                                                     topk=topk)).lower()
            if pred_title.strip():
                scores = self.rouge.get_scores(hyps=pred_title, refs=title)
                rouge_1 += scores[0]['rouge-1']['f']
                rouge_2 += scores[0]['rouge-2']['f']
                rouge_l += scores[0]['rouge-l']['f']
                bleu += sentence_bleu(
                    references=[title.split(' ')],
                    hypothesis=pred_title.split(' '),
                    smoothing_function=self.smooth
                )
        rouge_1 /= total
        rouge_2 /= total
        rouge_l /= total
        bleu /= total
        return {
            'rouge-1': rouge_1,
            'rouge-2': rouge_2,
            'rouge-l': rouge_l,
            'bleu': bleu,
        }


if __name__ == '__main__':

    evaluator = Evaluator()
    train_generator = data_generator(train_data, batch_size)

    model.fit(
        train_generator.forfit(),
        steps_per_epoch=len(train_generator),
        epochs=epochs,
        callbacks=[evaluator]
    )

else:

    model.load_weights('./best_model.weights')


================================================
FILE: train.py
================================================
#! -*- coding: utf-8 -*-
# Word-level Chinese PEGASUS pre-training

import os
os.environ['TF_KERAS'] = '1'  # tf.keras is required (tf.distribute is used below)

import json
import numpy as np
import tensorflow as tf
from bert4keras.backend import keras, K
from bert4keras.layers import Loss
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer, SpTokenizer
from bert4keras.tokenizers import load_vocab, save_vocab
from bert4keras.optimizers import Adam
from bert4keras.optimizers import extend_with_weight_decay
from bert4keras.optimizers import extend_with_piecewise_linear_lr
from bert4keras.snippets import sequence_padding, open
from bert4keras.snippets import DataGenerator
from bert4keras.snippets import text_segmentate
import pylcs
import jieba
jieba.initialize()

# Basic hyperparameters
maxlen = 512
batch_size = 96
epochs = 100000
summary_rate = 0.25
t_maxlen = maxlen // 4
s_maxlen = maxlen - t_maxlen

# mT5 configuration (initial weights)
config_path = '/root/kg/bert/mt5/mt5_base/mt5_base_config.json'
checkpoint_path = '/root/kg/bert/mt5/mt5_base/model.ckpt-1000000'
spm_path = '/root/kg/bert/mt5/sentencepiece.model'

# PEGASUS vocabularies: dict_path_1 is the source BERT-style vocab;
# the rebuilt word-level vocab is written to dict_path_2
dict_path_1 = '/root/kg/bert/chinese_pegasus_L-12_H-768_A-12/vocab.txt'
dict_path_2 = '/root/kg/bert/chinese_t5_pegasus_base/vocab.txt'

# Build the vocabulary: map each BERT-vocab token to an mT5 sentencepiece id
# (the first 106 slots are reserved; sentencepiece id 2 is <unk>, i.e. "not found")
sp_tokenizer = SpTokenizer(spm_path, token_start=None, token_end=None)
token_dict = load_vocab(dict_path_1)
keep_tokens, new_token_dict, n = [], {}, 0
for t, _ in sorted(token_dict.items(), key=lambda s: s[1]):
    if n < 106:
        new_token_dict[t] = n
        n += 1
        continue
    if t.startswith('##'):
        i = sp_tokenizer.token_to_id(t[2:])
        if i == 2:
            i = sp_tokenizer.token_to_id(u'\u2581' + t)
    else:
        i = sp_tokenizer.token_to_id(u'\u2581' + t)
        if i == 2:
            i = sp_tokenizer.token_to_id(t)
    if i != 2:
        keep_tokens.append(i)
        new_token_dict[t] = len(new_token_dict)

keep_tokens = [2] * 106 + keep_tokens  # reserve the 106 leading slots
keep_tokens_inv = {j: i for i, j in enumerate(keep_tokens)}

# Tokens with no direct sentencepiece id are composed from sequences of kept ids
compound_tokens = []
for t, _ in sorted(token_dict.items(), key=lambda s: s[1]):
    if t not in new_token_dict:
        new_token_dict[t] = len(new_token_dict)
        ids = [keep_tokens_inv.get(i, 0) for i in sp_tokenizer.encode(t)[0]]
        compound_tokens.append(ids)

save_vocab(dict_path_2, new_token_dict)

# Build the tokenizer over the new vocabulary
tokenizer = Tokenizer(
    new_token_dict,
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False)
)


def corpus():
    """语料生成器
    """
    while True:
        f = '/root/data_pretrain/data_shuf.json'
        with open(f) as f:
            for l in f:
                l = json.loads(l)
                for texts in text_process(l['text']):
                    yield texts


def text_process(text):
    """分割文本
    """
    texts = text_segmentate(text, 32, u'\n。')
    result, length = [], 0
    for text in texts:
        if length + len(text) > maxlen * 1.5 and len(result) >= 3:
            yield result
            result, length = [], 0
        result.append(text)
        length += len(text)
    if result and len(result) >= 3:
        yield result


def gather_join(texts, idxs):
    """取出对应的text,然后拼接起来
    """
    return ''.join([texts[i] for i in idxs])


def pseudo_summary(texts):
    """构建伪标签摘要数据集
    """
    source_idxs, target_idxs = list(range(len(texts))), []
    while True:
        sims = []
        for i in source_idxs:
            new_source_idxs = [j for j in source_idxs if j != i]
            new_target_idxs = sorted(target_idxs + [i])
            new_source = gather_join(texts, new_source_idxs)
            new_target = gather_join(texts, new_target_idxs)
            sim = pylcs.lcs(new_source, new_target)
            sims.append(sim)
        new_idx = source_idxs[np.argmax(sims)]
        source_idxs.remove(new_idx)
        target_idxs = sorted(target_idxs + [new_idx])
        source = gather_join(texts, source_idxs)
        target = gather_join(texts, target_idxs)
        if (
            len(source_idxs) == 1 or
            1.0 * len(target) / len(source) > summary_rate
        ):
            break
    if len(source) < len(target):
        source, target = target, source
    return source, target


class data_generator(DataGenerator):
    """数据生成器
    """
    def __iter__(self, random=False):
        for is_end, texts in self.sample(random):
            source, target = pseudo_summary(texts)
            source_ids, _ = tokenizer.encode(source, maxlen=s_maxlen)
            target_ids, _ = tokenizer.encode(target, maxlen=t_maxlen)
            yield source_ids, target_ids


class CrossEntropy(Loss):
    """交叉熵作为loss,并mask掉输入部分
    """
    def compute_loss(self, inputs, mask=None):
        y_true, y_pred = inputs
        y_mask = K.cast(K.not_equal(y_true, 0), K.floatx())
        y_true = y_true[:, 1:]  # 目标token_ids
        y_mask = y_mask[:, 1:]  # segment_ids,刚好指示了要预测的部分
        y_pred = y_pred[:, :-1]  # 预测序列,错开一位
        acc = keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
        acc = K.sum(acc * y_mask) / K.sum(y_mask)
        self.add_metric(acc, name='accuracy', aggregation='mean')
        loss = K.sparse_categorical_crossentropy(
            y_true, y_pred, from_logits=True
        )
        loss = K.sum(loss * y_mask) / K.sum(y_mask)
        return loss


strategy = tf.distribute.MirroredStrategy()  # data-parallel training across all visible GPUs

with strategy.scope():

    t5 = build_transformer_model(
        config_path,
        checkpoint_path=None,
        model='t5.1.1',
        with_lm='linear',
        keep_tokens=keep_tokens,
        compound_tokens=compound_tokens,
        return_keras_model=False,
    )

    model = t5.model
    output = CrossEntropy(1)(model.inputs[1:] + model.outputs)
    model = keras.models.Model(model.inputs, output)

    AdamW = extend_with_weight_decay(Adam, name='AdamW')
    AdamWLR = extend_with_piecewise_linear_lr(AdamW, name='AdamWLR')
    optimizer = AdamWLR(
        learning_rate=1e-4,
        weight_decay_rate=0.01,
        exclude_from_weight_decay=['Norm', 'bias'],
        lr_schedule={10000: 1}  # linear warmup to the full learning rate over the first 10k steps
    )
    model.compile(optimizer=optimizer)
    model.summary()
    t5.load_weights_from_checkpoint(checkpoint_path)


class Evaluator(keras.callbacks.Callback):
    """训练回调
    """
    def on_epoch_end(self, epoch, logs=None):
        model.save_weights('t5_pegasus_model.weights')  # save the model


if __name__ == '__main__':

    # Start training
    evaluator = Evaluator()
    train_generator = data_generator(corpus(), batch_size, 10**5)
    dataset = train_generator.to_dataset(
        types=('float32', 'float32'),
        shapes=([None], [None]),
        names=('Encoder-Input-Token', 'Decoder-Input-Token'),
        padded_batch=True
    )

    model.fit(
        dataset, steps_per_epoch=1000, epochs=epochs, callbacks=[evaluator]
    )

else:

    model.load_weights('t5_pegasus_model.weights')


================================================
FILE: train.tsv
================================================
交换超立方体网络容错路由研究	为了研究交换超立方体网络容错路由问题,引入了相邻结点集合类的概念,提出了相邻结点集的求解公式。对于满足任意子连通性条件的交换超立方体网络,给出了基于相邻结点集合类的自适应容错路由算法及算法的步长上界。仿真实验结果表明算法是有效的。
一种基于通讯痕迹的社会网络团伙分析模型	研究在已知目标团伙中某节点以及目标团伙特征的前提下,基于通讯痕迹特征寻找社会网络团伙。研究过程中引入了社会圈、节点中心度和事件集合关联矩阵等概念,重点将聚类分析方法与社会团伙发现相结合,以期得到一种基于通讯痕迹的社会网络团伙分析模型。
基于Hadoop平台的XML文档重复数据检测	XML数据越来越广泛地被用于信息交换与集成中,其数据质量问题引起了人们的关注.解决由数据质量引发的问题,实体识别技术非常关键.为了克服现有方法的不足,在海量XML数据上进行高效的重复对象检测,以实体识别技术为基础提出了基于Hadoop平台的XML文档重复检测算法,它将所有标签节点统称为属性,用实体来描述属性,通过属性的比较,快速地找到在某些属性上相同的所有实体对象,并利用Hadoop应用框架处理海量数据的优势实现并行处理.经过试验验证该方法良好的扩展性,伸缩性和高效性.
快速码字搜索算法中一维特征量的最佳选择方法	矢量量化编码过程中的最近邻码字搜索需要进行大量的矢量间距离的计算,这个过程的计算复杂度极高,严重限制了其实际使用.为了加速矢量量化的编码过程,许多文献提出了各种不同组合的基于均值、2-范数、方差和角度的矢量一维特征量的快速最近邻矢量量化码字搜索算法.通过实验给出了这四个一维特征量单独使用以及相互组合的所有情况下各算法的搜索范围和编码时间,并对它们进行了比较和分析,进而提出了在实际进行编码时如何最优地进行一维特征量选取的准则.
海量病例CT图像的快速查找检索模型仿真	在海量病例CT图像的快速查找检索过程中,采用传统算法进行检索,由于计算复杂、计算量大等原因,造成病例CT图像查找检索效率过低的问题。为解决上述问题,提出了一种改进高阶统计量算法的海量病例CT图像的快速查找检索方法。通过Radon变换方法将病例CT图像代入到一维空间中,获取病例CT图像投影数据的双谱信息,将高阶统计量算法与亚像素边缘特征算法相融合,将亚像素级精度位置搜索的问题变为最小化函数,对病例CT图像的亚像素边缘特征进行有效的提取。采用奇异值-迭代最近点法(SVD-ICP)和小波极大值完成病例CT图像轮廓间配准融合,进而实现了海量病例CT图像的快速查找与检索。实验结果表明,提出的改进高阶统计量算法的海量病例CT图像的快速查找检索方法精确度高,实用性强。
基于像素分解的圆形标志点亚像素定位研究	影像中圆形标志点的定位对于数字摄影测量具有重要作用.通过对圆形标志点边缘处的混合像素进行亚像素定位,提取出标志点的亚像素级边缘,再基于最小二乘原理进行椭圆拟合得到圆形标志的中心坐标.运用三种实验表明,与直接采用像素级边缘进行拟合定位相比,该方法的精度明显提高.
基于稀疏低秩描述的图像检索方法	使用颜色、形状、纹理等特征的基于内容的图像检索技术,将图像看作向量空间中的点,通过计算两点之间的某种距离来衡量图像间的相似度,然而在提取图像特征时相同类型的图像会出现不一致的特征,极大地影响了检索算法的准确率。针对该问题,提出一种稀疏低秩描述的多特征图像检索方法。通过对图像集的稀疏低秩描述,保持了相同类别特征的全局结构,同时也降低了对于局部噪声的敏感度,增强了检索算法的鲁棒性。在Corel图像集上的检索实验结果表明,该方法较已有的基于内容的图像检索方法有更好的检索效果。
基于神经网络的铁水KR脱硫预报模型	将神经网络理论应用于铁水脱硫过程,研究工艺参数与其影响因子之间的关系,建立预报模型,为生产过程中工艺参数(搅拌时间、搅拌次数和加入剂量)的设定选择提供准确的预报。研究分析表明,该预报模型可以应用于实际生产,提高铁水的脱硫成功的命中率,降低铁水的脱硫成本。
VANET安全技术综述	随着车载自组织网络技术的不断发展,研究者对车载自组织网络系统安全进行了深入研究.论文阐述了车载自组织网络领域中安全研究的重要性;介绍了该领域中目前最新研究进展和存在的主要问题;讨论并比较了各种安全协议应用于车载自组织网络的优缺点;分析总结了系统中安全协议的设计要素;最后展望了车载自组织网络安全技术的未来研究方向.
基于DSpace构建传统蒙古文学科机构知识库平台	本文主要阐述了基于DSpace构建传统蒙古文学科机构知识库的难点以及解决的技术路线,包括蒙古文数字资料的采集、存储、检索以及显示等。针对蒙古文的构词和语法等方面的特点,对开源搜索引擎Lucene进行改进——采用B树管理Term、简化了特征词权值的计算、采用EC方法确定了蒙古文停用词表,实现了基于Lucene的蒙古文检索。
一种基于DWT和HVS的图像版权保护研究	提出一种基于离散小波变换(DWT)的数字水印进行图像版权保护的新方法。将一幅有意义的二值图像作为水印来隐藏,先将水印图像使用推广Arnold的变换进行置乱后和图像同时进行多尺度分解,然后将分解后的水印系数根据人类视觉系统(HVS)特性自适应地嵌入到具有相同尺寸的低中频系数中,重构得到水印图像。实验结果表明,该算法具有较好的不可见性、鲁棒性和安全性。
用块稀疏贝叶斯学习算法重构识别体域网步态模式	针对低功耗体域网步态远程监测终端非稀疏加速度数据重构和步态模式识别性能优化问题,提出了一种基于块稀疏贝叶斯学习的体域网远程步态模式重构识别新方法,该方法基于体域网远程步态监测系统架构和压缩感知框架,在体域网传感节点利用线性稀疏矩阵压缩原始加速度数据,减少传输数据量,降低其功耗,同时在远程终端基于块稀疏贝叶斯学习算法充分利用加速度数据块结构内在相关性,获取加速度数据内在稀疏性,有效提高非稀疏加速度数据重构性能,为准确识别步态模式提供可靠的数据支撑。采用USC-HAD数据库中行走、跑、跳、上楼、下楼五种步态运动的加速度数据验证新方法的有效性,实验结果表明,基于所提算法的加速度数据重构性能明显优于传统压缩感知重构算法性能,使基于支持向量机多步态分类器识别准确率可达98%,显著提高体域网远程步态模式识别性能。所提新方法不仅有效提高非稀疏加速度数据重构和步态模式识别性能,并且也有助于设计低功耗、低成本的体域网加速度数据采集系统,为体域网远程监测步态模式变化提供一个新方法和新思路。
考虑后视和最优速度记忆的跟驰模型及仿真	为提高交通流的稳定性,在考虑后视效应和速度差信息(Backward Looking and Velocity Difference,BLVD)模型的基础上,综合考虑后视和最优速度记忆效应,提出了一个扩展的跟驰模型。采用线性稳定性分析,推导出该模型的交通流稳定判据,发现在模型中引入后视和最优速度记忆效应的共同作用后,交通流的稳定区域有明显增大。通过数值仿真验证了理论分析,仿真结果表明:在初始扰动相同的条件下,与BLVD模型相比,新提出的扩展模型具有更好的交通流致稳性能。最后,使用NGSIM数据对所提出的跟驰模型进行参数标定和评价,证明其能更准确地刻画车流演变规律。
砂轮位置对成形磨齿齿廓偏差的补偿	为提高成形磨齿加工的精度,提出一种通过调整砂轮位置实现齿廓偏差补偿的方法。应用包络理论,建立已知砂轮轴向廓形和砂轮位置误差计算齿轮端面廓形的数学模型。通过数值研究发现,齿廓倾斜偏差与砂轮径向位置误差和切向位置误差成正比例关系而且满足叠加原理。应用这些规律,依据测量的齿廓偏差可以方便地计算出砂轮位置调整量。试验结果表明,该方法可以将齿廓倾斜偏差由7级精度(ISO1328-1:1997)提高到2级精度。
基于物联网智能的独居老人自动监控方法研究	研究基于物联网框架下的独居老年人智能看护的问题。独居老人在家的行为存在较大突发性和随机性,关键反映特征受到手臂、角度、房屋结构等遮挡,存在监控死角。传统的智能监控方法缺少独立行为识别能力,框架下的设备无法对突发特征进行报警,由于遮挡的存在,对一些疑似行为缺少准确的识别。提出一种物联网框架下的人工智能独居老年人自动看护方法。在物联网的框架下,对老年人活动空间中视觉传感器采集的信号进行增强处理,为了适应物联网设备众多的需要,利用混沌粒子群算法,根据上述监控信号,完成老年人行为的寻优识别,克服死角、遮挡、异常无行为运动的干扰,实现老年人智能看护。实验结果表明,运用该算法进行人工智能独居老年人自动看护,能够极大的降低看护过程中的误识别率,从而保证独居老年人的安全。
动能橡胶圆球弹优化选择及外弹道仿真	为了为某型动能防暴发射器选择最优橡胶弹丸,建立了橡胶圆球弹外弹道模型,从国内外典型橡胶圆球弹的大小、空气阻力对弹道的影响和终点效应三方面分析了对橡胶圆球弹选择的影响,利用MATLAB仿真软件计算了8mm、10mm、15mm3种直径不同质量条件下的空气阻力对弹道的影响、最大射程、终点速度、飞行时间、动能和比动能;重点分析比动能和K值对弹道特性及终点效应的影响,通过分析和比较,得出了直径为10mm,质量为2g的橡胶圆球弹最适合作为某型动能防暴发射器的战斗弹丸;通过分析直径10mm,质量2g的橡胶圆球弹不同发射角度情况下的弹道特性和终点效应,结果表明,该橡胶圆球弹存速能力强,发射距离远,安全性高,为某型动能防暴发射器的弹丸制造提供了理论支撑。
一种自适应调制的鱼群优化部分传输序列算法	针对现有OFDM系统单一调制方式下峰均功率比过高问题,提出采用自适应调制的AFSA-PTS算法。该算法基于子载波信道增益对子载波进行自适应比特分配,根据比特数确定各子载波的调制方式,实现自适应调制,并采用AFSA快速寻优到PTS算法中的最佳序列,从而在降低系统计算复杂度的同时,实现峰均功率比的有效降低。仿真结果表明,采用自适应调制AFSA-PTS算法在降低峰均比的同时可有效降低系统的计算复杂度,证明了其优越性。
基于排队论的升降横移立体车库控制策略研究	关于升降横移自动化立体车库的结构优化设计问题,为车库结构安全可靠,操作方便,给出了三种不同存取车辆策略的定义,以排队论为理论依据,结合最优车位计算原则,对实际情况中遇到的存(取)车高峰期和非高峰期时间段建立了使用存车优先、原地复位、交叉存取不同控制策略车库相应的数学模型,对特定参数的模型进行了Matlab仿真,并给出分析结果,综合分析结果表明采用原地复位存取车策略车库的平均存取车时间较短,车位利用率较高,对车库设计前期和管理者决策具有重要的现实意义。
在网络流量中搜索恶意输入并自动修复验证	"为了真正实施自我修复技术,提高它们在系统中信任级别,在它自动发展后,""修复""的功效必须进行测试和验证。但在实际部署之前,由于攻击的特性,这种验证必须是自动进行,该问题称为自动修复验证(automatic repair validation,ARV)。为了说明ARV所面临的困难,提出了一种系统的设计,该系统跟踪和存储恶意的网络流量,为自我修复软件在验证阶段后重放提供条件。实例验证了该方法的可行性。"
云计算中虚拟资源的智能多代理设计	针对随着网络数据传输速度和复杂性的不断增加,网络管理变得更加困难的现状,提出了一种虚拟资源的智能多代理模型。描述了虚拟资源的智能多代理的处理过程,讨论了不同代理的处理机制。通过分析用户上下文和系统状态,可实时地分析社会媒体资源。根据虚拟资源的使用类型,对用户上下信息的需求进行分析和推断,自动地给用户分配资源。采用云计算中虚拟资源动态调度方法及MovieLens系统评估该模型,结果证明所提出的模型具有较好的性能,可实现虚拟资源的动态调度,动态地实现负载均衡,使云计算中的虚拟资源得到高效的利用。

================================================
SYMBOL INDEX (22 symbols across 2 files)
================================================

FILE: finetune.py
  function load_data (line 34) | def load_data(filename):
  class data_generator (line 59) | class data_generator(DataGenerator):
    method __iter__ (line 62) | def __iter__(self, random=False):
  class CrossEntropy (line 76) | class CrossEntropy(Loss):
    method compute_loss (line 79) | def compute_loss(self, inputs, mask=None):
  class AutoTitle (line 108) | class AutoTitle(AutoRegressiveDecoder):
    method predict (line 112) | def predict(self, inputs, output_ids, states):
    method generate (line 116) | def generate(self, text, topk=1):
  class Evaluator (line 130) | class Evaluator(keras.callbacks.Callback):
    method __init__ (line 133) | def __init__(self):
    method on_epoch_end (line 138) | def on_epoch_end(self, epoch, logs=None):
    method evaluate (line 146) | def evaluate(self, data, topk=1):

FILE: train.py
  function corpus (line 83) | def corpus():
  function text_process (line 95) | def text_process(text):
  function gather_join (line 110) | def gather_join(texts, idxs):
  function pseudo_summary (line 116) | def pseudo_summary(texts):
  class data_generator (line 144) | class data_generator(DataGenerator):
    method __iter__ (line 147) | def __iter__(self, random=False):
  class CrossEntropy (line 155) | class CrossEntropy(Loss):
    method compute_loss (line 158) | def compute_loss(self, inputs, mask=None):
  class Evaluator (line 205) | class Evaluator(keras.callbacks.Callback):
    method on_epoch_end (line 208) | def on_epoch_end(self, epoch, logs=None):
