Repository: Happy-zyy/Competition Branch: master Commit: bceb130a7261 Files: 45 Total size: 220.1 KB Directory structure: gitextract_uf73lln7/ ├── ReadMe.md └── zhihu-text-classification-master/ ├── data_process/ │ ├── .idea/ │ │ ├── .name │ │ ├── data_process.iml │ │ ├── deployment.xml │ │ ├── encodings.xml │ │ ├── misc.xml │ │ ├── modules.xml │ │ └── workspace.xml │ ├── README.md │ ├── char2id.py │ ├── creat_batch_data.py │ ├── creat_batch_seg.py │ ├── embed2ndarray.py │ ├── question_and_topic_2id.py │ ├── run_all_data_process.sh │ ├── test.py │ └── word2id.py └── models/ ├── wd_1_1_cnn_concat/ │ ├── __init__.py │ ├── network.py │ ├── predict.py │ └── train.py ├── wd_1_2_cnn_max/ │ ├── __init__.py │ ├── network.py │ ├── predict.py │ └── train.py ├── wd_2_hcnn/ │ ├── __init__.py │ ├── network.py │ ├── predict.py │ └── train.py ├── wd_3_bigru/ │ ├── __init__.py │ ├── network.py │ ├── predict.py │ └── train.py ├── wd_4_han/ │ ├── __init__.py │ ├── network.py │ ├── predict.py │ └── train.py ├── wd_5_bigru_cnn/ │ ├── __init__.py │ ├── network.py │ ├── predict.py │ └── train.py └── wd_6_rcnn/ ├── __init__.py ├── network.py ├── predict.py └── train.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: ReadMe.md ================================================ # 竞赛列表 + [2017 知乎看山杯机器学习挑战赛](https://www.biendata.com/competition/zhihu/) ================================================ FILE: zhihu-text-classification-master/data_process/.idea/.name ================================================ data_process ================================================ FILE: zhihu-text-classification-master/data_process/.idea/data_process.iml ================================================ ================================================ FILE: zhihu-text-classification-master/data_process/.idea/deployment.xml ================================================ ================================================ FILE: zhihu-text-classification-master/data_process/.idea/encodings.xml ================================================ ================================================ FILE: zhihu-text-classification-master/data_process/.idea/misc.xml ================================================ ================================================ FILE: zhihu-text-classification-master/data_process/.idea/modules.xml ================================================ ================================================ FILE: zhihu-text-classification-master/data_process/.idea/workspace.xml ================================================

true

1504013687331

================================================ FILE: zhihu-text-classification-master/data_process/README.md ================================================ ## 数据处理 1.把比赛提供的所有数据解压到 raw_data/ 目录下。
2.按照顺序依次执行各个 .py，不带任何参数。
或者在当前目录下输入下面命令运行所有文件：
dos2unix run_all_data_process.sh # 使用cygwin工具dos2unix将script改为unix格式
sh run_all_data_process.sh
3.环境依赖(下面是我使用的版本)
- numpy 1.12.1 - pandas 0.19.2 - word2vec 0.9.1 - tqdm 4.11.2 ### embed2ndarray.py 赛方提供了txt格式的词向量和字向量，这里把embedding矩阵转成 np.ndarray 形式，分别保存为 data/word_embedding.npy 和 data/char_embedding.npy。在赛方提供的词向量基础上，添加 '\' 和 '\' 两个特殊符号。其中 '\' 用于将序列补全到固定长度， '\' 用于替换低频词（字）。用 pd.Series 保存词(字)对应 embedding 中的行号(id),存储在 data/sr_word2id.pkl 和 data/sr_char2id.pkl 中。 ### question_and_topic_2id.py 把问题和话题转为id形式，保存在 data/sr_question2id.pkl 和 data/sr_id2question.pkl 中。 ### char2id.py 利用上面得到的 sr_char2id，把所有问题的字转为对应的id, 存储为 data/ch_train_title.npy data/ch_train_content.npy data/ch_eval_title.npy data/ch_eval_content.npy ### word2id.py 同 char2id.py ### creat_batch_data.py 把所有的数据按照 batch_size(128) 进行打包，固定seed，随机取 10 万样本作为验证集。每个batch存储为一个 npz 文件，包括 X, y 两部分。这里所有的序列都进行了截断，长度不足的用0进行padding到固定长度。保存位置： wd_train_path = '../data/wd-data/data_train/' wd_valid_path = '../data/wd-data/data_valid/' wd_test_path = '../data/wd-data/data_test/' ch_train_path = '../data/ch-data/data_train/' ch_valid_path = '../data/ch-data/data_valid/' ch_test_path = '../data/ch-data/data_test/' ### creat_batch_seg.py 和 creat_batch_data.py 相同，只是对 content 部分进行句子划分。用于分层模型。划分句子长度： wd_title_len = 30, wd_sent_len = 30, wd_doc_len = 10.(即content划分为10个句子，每个句子长度为30个词) ch_title_len = 52, ch_sent_len = 52, ch_doc_len = 10. 不划分句子： wd_title_len = 30, wd_content_len = 150. ch_title_len = 52, ch_content_len = 300. ### To do - 在数据读取中使用 tfrecord 文件进行数据读取。这样能够随时改变 batch_size，而且 shuffle 会比使用 numpy 更加均匀。 - 添加序列长度信息。在这里所有的序列都截断或者padding为固定长度，在误差计算中没有处理padding部分，可能会使准确率下降。在使用 dynamic_rnn 的时候加上 sequence_length 信息，在计算的时候忽略 padding 部分。同时结合 tf.train.SequenceExample() 和 tf.train.batch() 自动 padding，也可以减少数据量。 ================================================ FILE: zhihu-text-classification-master/data_process/char2id.py ================================================ # -*- coding:utf-8 -*- from __future__ import division from __future__ import print_function import numpy as np import pandas as pd import pickle from multiprocessing import Pool from tqdm import tqdm import time save_path = '../data/' with open(save_path + 'sr_char2id.pkl', 'rb') as inp: sr_id2char = pickle.load(inp) sr_char2id = pickle.load(inp) dict_char2id = dict() for i in range(len(sr_char2id)): dict_char2id[sr_char2id.index[i]] = sr_char2id.values[i] def get_id(char): """获取 char 所对应的 id. 如果该字不在字典中，用1进行替换。 """ if char not in dict_char2id: return 1 else: return dict_char2id[char] def get_id4chars(chars): """把 chars 转为对应的 id""" chars = chars.strip().split(',') # 先分开字 ids = list(map(get_id, chars)) # 获取id return ids def test_char2id(): """把测试集的所有字转成对应的id。""" time0 = time.time() print('Processing eval data.') df_eval = pd.read_csv('../raw_data/question_eval_set.txt', sep='\t', usecols=[0, 1, 3], names=['question_id', 'char_title', 'char_content'], dtype={'question_id': object}) print('test question number %d' % len(df_eval)) # 没有 title 的问题用 content 来替换 na_title_indexs = list() for i in range(len(df_eval)): char_title = df_eval.char_title.values[i] if type(char_title) is float: na_title_indexs.append(i) print('There are %d test questions without title.' % len(na_title_indexs)) for na_index in na_title_indexs: df_eval.at[na_index, 'char_title'] = df_eval.at[na_index, 'char_content'] # 没有 content 的问题用 title 来替换 na_content_indexs = list() for i in tqdm(range(len(df_eval))): char_content = df_eval.char_content.values[i] if type(char_content) is float: na_content_indexs.append(i) print('There are %d test questions without content.' % len(na_content_indexs)) for na_index in tqdm(na_content_indexs): df_eval.at[na_index, 'char_content'] = df_eval.at[na_index, 'char_title'] # 转为 id 形式 p = Pool() eval_title = np.asarray(p.map(get_id4chars, df_eval.char_title.values)) np.save('../data/ch_eval_title.npy', eval_title) eval_content = np.asarray(p.map(get_id4chars, df_eval.char_content.values)) np.save('../data/ch_eval_content.npy', eval_content) p.close() p.join() print('Finished changing the eval chars to ids. Costed time %g s' % (time.time()-time0)) def train_char2id(): """把训练集的所有字转成对应的id。""" time0 = time.time() print('Processing train data.') df_train = pd.read_csv('../raw_data/question_train_set.txt', sep='\t', usecols=[0, 1, 3], names=['question_id', 'char_title', 'char_content'], dtype={'question_id': object}) print('training question number %d ' % len(df_train)) # 没有 content 的问题用 title 来替换 na_content_indexs = list() for i in tqdm(range(len(df_train))): char_content = df_train.char_content.values[i] if type(char_content) is float: na_content_indexs.append(i) print('There are %d train questions without content.' % len(na_content_indexs)) for na_index in tqdm(na_content_indexs): df_train.at[na_index, 'char_content'] = df_train.at[na_index, 'char_title'] # 没有 title 的问题，与词一样丢弃下面样本 na_title_indexs = [328877, 422123, 633584, 768738, 818616, 876828, 1273673, 1527297, 1636237, 1682969, 2052477, 2628516, 2657464, 2904162, 2993517] for i in range(len(df_train)): char_title = df_train.char_title.values[i] if type(char_title) is float: na_title_indexs.append(i) print('There are %d train questions without title.' % len(na_title_indexs)) df_train = df_train.drop(na_title_indexs) print('After dropping, training question number(should be 2999952) = %d' % len(df_train)) # 转为 id 形式 p = Pool() train_title = np.asarray(list(p.map(get_id4chars, df_train.char_title.values))) np.save('../data/ch_train_title.npy', train_title) train_content = np.asarray(p.map(get_id4chars, df_train.char_content.values)) np.save('../data/ch_train_content.npy', train_content) p.close() p.join() print('Finished changing the training chars to ids. Costed time %g s' % (time.time() - time0)) if __name__ == '__main__': test_char2id() train_char2id() ================================================ FILE: zhihu-text-classification-master/data_process/creat_batch_data.py ================================================ # -*- coding:utf-8 -*- from __future__ import division from __future__ import print_function import numpy as np import pandas as pd import pickle from multiprocessing import Pool import sys import os sys.path.append('../') from data_helpers import pad_X30 from data_helpers import pad_X150 from data_helpers import pad_X52 from data_helpers import pad_X300 from data_helpers import train_batch from data_helpers import eval_batch """ 把所有的数据按照 batch_size(128) 进行打包。取 10万样本作为验证集。 word_title_len = 30. word_content_len = 150. char_title_len = 52. char_content_len = 300. """ wd_train_path = '../data/wd-data/data_train/' wd_valid_path = '../data/wd-data/data_valid/' wd_test_path = '../data/wd-data/data_test/' ch_train_path = '../data/ch-data/data_train/' ch_valid_path = '../data/ch-data/data_valid/' ch_test_path = '../data/ch-data/data_test/' paths = [wd_train_path, wd_valid_path, wd_test_path, ch_train_path, ch_valid_path, ch_test_path] for each in paths: if not os.path.exists(each): os.makedirs(each) with open('../data/sr_topic2id.pkl', 'rb') as inp: sr_topic2id = pickle.load(inp) dict_topic2id = dict() for i in range(len(sr_topic2id)): dict_topic2id[sr_topic2id.index[i]] = sr_topic2id.values[i] def topics2ids(topics): """把 chars 转为对应的 id""" topics = topics.split(',') ids = list(map(lambda topic: dict_topic2id[topic], topics)) # 获取id return ids def get_lables(): """获取训练集所有样本的标签。注意之前在处理数据时丢弃了部分没有 title 的样本。""" df_question_topic = pd.read_csv('../raw_data/question_topic_train_set.txt', sep='\t', names=['questions', 'topics'], dtype={'questions': object, 'topics': object}) na_title_indexs = [328877, 422123, 633584, 768738, 818616, 876828, 1273673, 1527297, 1636237, 1682969, 2052477, 2628516, 2657464, 2904162, 2993517] df_question_topic = df_question_topic.drop(na_title_indexs) p = Pool() y = p.map(topics2ids, df_question_topic.topics.values) p.close() p.join() return np.asarray(y) # word 数据打包 def wd_train_get_batch(title_len=30, content_len=150, batch_size=128): print('loading word train_title and train_content.') train_title = np.load('../data/wd_train_title.npy') train_content = np.load('../data/wd_train_content.npy') p = Pool() X_title = np.asarray(p.map(pad_X30, train_title)) X_content = np.asarray(p.map(pad_X150, train_content)) p.close() p.join() X = np.hstack([X_title, X_content]) print('getting labels, this should cost minutes, please wait.') y = get_lables() print('y.shape=', y.shape) np.save('../data/y_tr.npy', y) # 划分验证集 sample_num = X.shape[0] np.random.seed(13) valid_num = 100000 new_index = np.random.permutation(sample_num) X = X[new_index] y = y[new_index] X_valid = X[:valid_num] y_valid = y[:valid_num] X_train = X[valid_num:] y_train = y[valid_num:] print('X_train.shape=', X_train.shape, 'y_train.shape=', y_train.shape) print('X_valid.shape=', X_valid.shape, 'y_valid.shape=', y_valid.shape) print('creating batch data.') # 验证集打batch sample_num = len(X_valid) print('valid_sample_num=%d' % sample_num) train_batch(X_valid, y_valid, wd_valid_path, batch_size) # 训练集打batch sample_num = len(X_train) print('train_sample_num=%d' % sample_num) train_batch(X_train, y_train, wd_train_path, batch_size) def wd_test_get_batch(title_len=30, content_len=150, batch_size=128): eval_title = np.load('../data/wd_eval_title.npy') eval_content = np.load('../data/wd_eval_content.npy') p = Pool() X_title = np.asarray(p.map(pad_X30, eval_title)) X_content = np.asarray(p.map(pad_X150, eval_content)) p.close() p.join() X = np.hstack([X_title, X_content]) sample_num = len(X) print('eval_sample_num=%d' % sample_num) eval_batch(X, wd_test_path, batch_size) # char 数据打包 def ch_train_get_batch(title_len=52, content_len=300, batch_size=128): print('loading char train_title and train_content.') train_title = np.load('../data/ch_train_title.npy') train_content = np.load('../data/ch_train_content.npy') p = Pool() X_title = np.asarray(p.map(pad_X52, train_title)) X_content = np.asarray(p.map(pad_X300, train_content)) p.close() p.join() X = np.hstack([X_title, X_content]) y = np.load('../data/y_tr.npy') # 划分验证集 sample_num = X.shape[0] np.random.seed(13) valid_num = 100000 new_index = np.random.permutation(sample_num) X = X[new_index] y = y[new_index] X_valid = X[:valid_num] y_valid = y[:valid_num] X_train = X[valid_num:] y_train = y[valid_num:] print('X_train.shape=', X_train.shape, 'y_train.shape=', y_train.shape) print('X_valid.shape=', X_valid.shape, 'y_valid.shape=', y_valid.shape) # 验证集打batch print('creating batch data.') sample_num = len(X_valid) print('valid_sample_num=%d' % sample_num) train_batch(X_valid, y_valid, ch_valid_path, batch_size) # 训练集打batch sample_num = len(X_train) print('train_sample_num=%d' % sample_num) train_batch(X_train, y_train, ch_train_path, batch_size) def ch_test_get_batch(title_len=52, content_len=300, batch_size=128): eval_title = np.load('../data/ch_eval_title.npy') eval_content = np.load('../data/ch_eval_content.npy') p = Pool() X_title = np.asarray(p.map(pad_X52, eval_title)) X_content = np.asarray(p.map(pad_X300, eval_content)) p.close() p.join() X = np.hstack([X_title, X_content]) sample_num = len(X) print('eval_sample_num=%d' % sample_num) eval_batch(X, ch_test_path, batch_size) if __name__ == '__main__': wd_train_get_batch() wd_test_get_batch() ch_train_get_batch() ch_test_get_batch() ================================================ FILE: zhihu-text-classification-master/data_process/creat_batch_seg.py ================================================ # -*- coding:utf-8 -*- from __future__ import division from __future__ import print_function import numpy as np from multiprocessing import Pool import sys import os sys.path.append('../') from data_helpers import pad_X30 from data_helpers import pad_X52 from data_helpers import wd_pad_cut_docs from data_helpers import ch_pad_cut_docs from data_helpers import train_batch from data_helpers import eval_batch wd_train_path = '../data/wd-data/seg_train/' wd_valid_path = '../data/wd-data/seg_valid/' wd_test_path = '../data/wd-data/seg_test/' ch_train_path = '../data/ch-data/seg_train/' ch_valid_path = '../data/ch-data/seg_valid/' ch_test_path = '../data/ch-data/seg_test/' paths = [wd_train_path, wd_valid_path, wd_test_path, ch_train_path, ch_valid_path, ch_test_path] for each in paths: if not os.path.exists(each): os.makedirs(each) # word 数据打包 def wd_train_get_batch(title_len=30, batch_size=128): print('loading word train_title and train_content, this should cost minutes, please wait.') train_title = np.load('../data/wd_train_title.npy') train_content = np.load('../data/wd_train_content.npy') p = Pool(6) X_title = np.asarray(p.map(pad_X30, train_title)) X_content = np.asarray(p.map(wd_pad_cut_docs, train_content)) p.close() p.join() X_content.shape = [-1, 30*10] X = np.hstack([X_title, X_content]) y = np.load('../data/y_tr.npy') # 划分验证集 sample_num = X.shape[0] np.random.seed(13) valid_num = 100000 new_index = np.random.permutation(sample_num) X = X[new_index] y = y[new_index] X_valid = X[:valid_num] y_valid = y[:valid_num] X_train = X[valid_num:] y_train = y[valid_num:] print('X_train.shape=', X_train.shape, 'y_train.shape=', y_train.shape) print('X_valid.shape=', X_valid.shape, 'y_valid.shape=', y_valid.shape) # 验证集打 batch print('creating batch data.') sample_num = len(X_valid) print('valid_sample_num=%d' % sample_num) train_batch(X_valid, y_valid, wd_valid_path, batch_size) # 训练集打 batch sample_num = len(X_train) print('train_sample_num=%d' % sample_num) train_batch(X_train, y_train, wd_train_path, batch_size) def wd_test_get_batch(title_len=30, batch_size=128): print('loading word eval_title and eval_content.') eval_title = np.load('../data/wd_eval_title.npy') eval_content = np.load('../data/wd_eval_content.npy') p = Pool(6) X_title = np.asarray(p.map(pad_X30, eval_title)) X_content = np.asarray(p.map(wd_pad_cut_docs, eval_content)) p.close() p.join() X_content.shape = [-1, 30*10] X = np.hstack([X_title, X_content]) sample_num = len(X) print('eval_sample_num=%d' % sample_num) eval_batch(X, wd_test_path, batch_size) # char 数据打包 def ch_train_get_batch(title_len=52, batch_size=128): print('loading char train_title and train_content, this should cost minutes, please wait.') train_title = np.load('../data/ch_train_title.npy') train_content = np.load('../data/ch_train_content.npy') p = Pool(8) X_title = np.asarray(p.map(pad_X52, train_title)) X_content = np.asarray(p.map(ch_pad_cut_docs, train_content)) p.close() p.join() X_content.shape = [-1, 52*10] X = np.hstack([X_title, X_content]) y = np.load('../data/y_tr.npy') # 划分验证集 sample_num = X.shape[0] np.random.seed(13) valid_num = 100000 new_index = np.random.permutation(sample_num) X = X[new_index] y = y[new_index] X_valid = X[:valid_num] y_valid = y[:valid_num] X_train = X[valid_num:] y_train = y[valid_num:] print('X_train.shape=', X_train.shape, 'y_train.shape=', y_train.shape) print('X_valid.shape=', X_valid.shape, 'y_valid.shape=', y_valid.shape) # 验证集打batch print('creating batch data.') sample_num = len(X_valid) print('valid_sample_num=%d' % sample_num) train_batch(X_valid, y_valid, ch_valid_path, batch_size) # 训练集打batch sample_num = len(X_train) print('train_sample_num=%d' % sample_num) train_batch(X_train, y_train, ch_train_path, batch_size) def ch_test_get_batch(title_len=52, batch_size=128): print('loading char eval_title and eval_content.') eval_title = np.load('../data/ch_eval_title.npy') eval_content = np.load('../data/ch_eval_content.npy') p = Pool() X_title = np.asarray(p.map(pad_X52, eval_title)) X_content = np.asarray(p.map(ch_pad_cut_docs, eval_content)) p.close() p.join() X_content.shape = [-1, 52*10] X = np.hstack([X_title, X_content]) sample_num = len(X) print('eval_sample_num=%d' % sample_num) eval_batch(X, ch_test_path, batch_size) if __name__ == '__main__': wd_train_get_batch() wd_test_get_batch() ch_train_get_batch() ch_test_get_batch() ================================================ FILE: zhihu-text-classification-master/data_process/embed2ndarray.py ================================================ # -*- coding:utf-8 -*- from __future__ import division from __future__ import print_function import numpy as np import pandas as pd import word2vec import pickle import os SPECIAL_SYMBOL = ['', ''] # add these special symbols to word(char) embeddings. def get_word_embedding(): """提取词向量，并保存至 ../data/word_embedding.npy""" print('getting the word_embedding.npy') wv = word2vec.load('../raw_data/word_embedding.txt') word_embedding = wv.vectors words = wv.vocab n_special_sym = len(SPECIAL_SYMBOL) sr_id2word = pd.Series(words, index=range(n_special_sym, n_special_sym + len(words))) sr_word2id = pd.Series(range(n_special_sym, n_special_sym + len(words)), index=words) # 添加特殊符号：:0, :1 embedding_size = 256 vec_special_sym = np.random.randn(n_special_sym, embedding_size) for i in range(n_special_sym): sr_id2word[i] = SPECIAL_SYMBOL[i] sr_word2id[SPECIAL_SYMBOL[i]] = i word_embedding = np.vstack([vec_special_sym, word_embedding]) # 保存词向量 save_path = '../data/' if not os.path.exists(save_path): os.makedirs(save_path) np.save(save_path + 'word_embedding.npy', word_embedding) # 保存词与id的对应关系 with open(save_path + 'sr_word2id.pkl', 'wb') as outp: pickle.dump(sr_id2word, outp) pickle.dump(sr_word2id, outp) print('Saving the word_embedding.npy to ../data/word_embedding.npy') def get_char_embedding(): """提取字向量，并保存至 ../data/char_embedding.npy""" print('getting the char_embedding.npy') wv = word2vec.load('../raw_data/char_embedding.txt') char_embedding = wv.vectors chars = wv.vocab n_special_sym = len(SPECIAL_SYMBOL) sr_id2char = pd.Series(chars, index=range(n_special_sym, n_special_sym + len(chars))) sr_char2id = pd.Series(range(n_special_sym, n_special_sym + len(chars)), index=chars) # 添加特殊符号：:0, :1 embedding_size = 256 vec_special_sym = np.random.randn(n_special_sym, embedding_size) for i in range(n_special_sym): sr_id2char[i] = SPECIAL_SYMBOL[i] sr_char2id[SPECIAL_SYMBOL[i]] = i char_embedding = np.vstack([vec_special_sym, char_embedding]) # 保存字向量 save_path = '../data/' if not os.path.exists(save_path): os.makedirs(save_path) np.save(save_path + 'char_embedding.npy', char_embedding) # 保存字与id的对应关系 with open(save_path + 'sr_char2id.pkl', 'wb') as outp: pickle.dump(sr_id2char, outp) pickle.dump(sr_char2id, outp) print('Saving the char_embedding.npy to ../data/char_embedding.npy') if __name__ == '__main__': get_word_embedding() get_char_embedding() ================================================ FILE: zhihu-text-classification-master/data_process/question_and_topic_2id.py ================================================ # -*- coding:utf-8 -*- import pandas as pd import pickle from itertools import chain def question_and_topic_2id(): """把question和topic转成id形式并保存至 ../data/目录下。""" print('Changing the quetion and topic to id and save in sr_question2.pkl and sr_topic2id.pkl in ../data/') df_question_topic = pd.read_csv('../raw_data/question_topic_train_set.txt', sep='\t', names=['question', 'topics'], dtype={'question': object, 'topics': object}) df_question_topic.topics = df_question_topic.topics.apply(lambda tps: tps.split(',')) save_path = '../data/' print('questino number = %d ' % len(df_question_topic)) # 问题 id 按照给出的问题顺序编号 questions = df_question_topic.question.values sr_question2id = pd.Series(range(len(questions)), index=questions) sr_id2question = pd.Series(questions, index=range(len(questions))) # topic 按照数量从大到小进行编号 topics = df_question_topic.topics.values topics = list(chain(*topics)) sr_topics = pd.Series(topics) topics_count = sr_topics.value_counts() topics = topics_count.index sr_topic2id = pd.Series(range(len(topics)),index=topics) sr_id2topic = pd.Series(topics, index=range(len(topics))) with open(save_path + 'sr_question2id.pkl', 'wb') as outp: pickle.dump(sr_question2id, outp) pickle.dump(sr_id2question, outp) with open(save_path + 'sr_topic2id.pkl', 'wb') as outp: pickle.dump(sr_topic2id, outp) pickle.dump(sr_id2topic, outp) print('Finished changing.') if __name__ == '__main__': question_and_topic_2id() ================================================ FILE: zhihu-text-classification-master/data_process/run_all_data_process.sh ================================================ #!/usr/bin/env bash echo -e "\033[44;37;5m RUNNING embed2ndarray.py\033[0m "; python embed2ndarray.py; echo -e "\033[44;37;5m RUNNING question_and_topic_2id.py\033[0m "; python question_and_topic_2id.py; echo -e "\033[44;37;5m RUNNING char2id.py\033[0m "; python char2id.py; echo -e "\033[44;37;5m RUNNING word2id.py\033[0m "; python word2id.py; echo -e "\033[44;37;5m RUNNING creat_batch_data.py\033[0m "; python creat_batch_data.py; echo -e "\033[44;37;5m RUNNING creat_batch_seg.py\033[0m "; python creat_batch_seg.py; ================================================ FILE: zhihu-text-classification-master/data_process/test.py ================================================ # -*- coding:utf-8 -*- from multiprocessing import Pool import numpy as np def func(a, b): return a+b p = Pool() a = [1,2,3] b = [4,5,6] para = zip(a,b) result = p.map(func, para) p.close() p.join() print result ================================================ FILE: zhihu-text-classification-master/data_process/word2id.py ================================================ # -*- coding:utf-8 -*- from __future__ import division from __future__ import print_function import numpy as np import pandas as pd import pickle from multiprocessing import Pool from tqdm import tqdm import time save_path = '../data/' with open(save_path + 'sr_word2id.pkl', 'rb') as inp: sr_id2word = pickle.load(inp) sr_word2id = pickle.load(inp) dict_word2id = dict() for i in range(len(sr_word2id)): dict_word2id[sr_word2id.index[i]] = sr_word2id.values[i] def get_id(word): """获取 word 所对应的 id. 如果该词不在词典中，用（对应的 ID 为 1 ）进行替换。 """ if word not in dict_word2id: return 1 else: return dict_word2id[word] def get_id4words(words): """把 words 转为对应的 id""" words = words.strip().split(',') # 先分开词 ids = list(map(get_id, words)) # 获取id return ids def test_word2id(): """把测试集的所有词转成对应的id。""" time0 = time.time() print('Processing eval data.') df_eval = pd.read_csv('../raw_data/question_eval_set.txt', sep='\t', usecols=[0, 2, 4], names=['question_id', 'word_title', 'word_content'], dtype={'question_id': object}) print('test question number %d' % len(df_eval)) # 没有 title 的问题用 content 来替换 na_title_indexs = list() for i in range(len(df_eval)): word_title = df_eval.word_title.values[i] if type(word_title) is float: na_title_indexs.append(i) print('There are %d test questions without title.' % len(na_title_indexs)) for na_index in na_title_indexs: df_eval.at[na_index, 'word_title'] = df_eval.at[na_index, 'word_content'] # 没有 content 的问题用 title 来替换 na_content_indexs = list() for i in tqdm(range(len(df_eval))): word_content = df_eval.word_content.values[i] if type(word_content) is float: na_content_indexs.append(i) print('There are %d test questions without content.' % len(na_content_indexs)) for na_index in tqdm(na_content_indexs): df_eval.at[na_index, 'word_content'] = df_eval.at[na_index, 'word_title'] # 转为 id 形式 p = Pool() eval_title = np.asarray(p.map(get_id4words, df_eval.word_title.values)) np.save('../data/wd_eval_title.npy', eval_title) eval_content = np.asarray(p.map(get_id4words, df_eval.word_content.values)) np.save('../data/wd_eval_content.npy', eval_content) p.close() p.join() print('Finished changing the eval words to ids. Costed time %g s' % (time.time() - time0)) def train_word2id(): """把训练集的所有词转成对应的id。""" time0 = time.time() print('Processing train data.') df_train = pd.read_csv('../raw_data/question_train_set.txt', sep='\t', usecols=[0, 2, 4], names=['question_id', 'word_title', 'word_content'], dtype={'question_id': object}) print('training question number %d ' % len(df_train)) # 没有 content 的问题用 title 来替换 na_content_indexs = list() for i in tqdm(range(len(df_train))): word_content = df_train.word_content.values[i] if type(word_content) is float: na_content_indexs.append(i) print('There are %d train questions without content.' % len(na_content_indexs)) for na_index in tqdm(na_content_indexs): df_train.at[na_index, 'word_content'] = df_train.at[na_index, 'word_title'] # 没有 title 的问题，丢弃 na_title_indexs = list() for i in range(len(df_train)): word_title = df_train.word_title.values[i] if type(word_title) is float: na_title_indexs.append(i) print('There are %d train questions without title.' % len(na_title_indexs)) df_train = df_train.drop(na_title_indexs) print('After dropping, training question number(should be 2999952) = %d' % len(df_train)) # 转为 id 形式 p = Pool() train_title = np.asarray(p.map(get_id4words, df_train.word_title.values)) np.save('../data/wd_train_title.npy', train_title) train_content = np.asarray(p.map(get_id4words, df_train.word_content.values)) np.save('../data/wd_train_content.npy', train_content) p.close() p.join() print('Finished changing the training words to ids. Costed time %g s' % (time.time() - time0)) if __name__ == '__main__': test_word2id() train_word2id() ================================================ FILE: zhihu-text-classification-master/models/wd_1_1_cnn_concat/__init__.py ================================================ # -*- coding:utf-8 -*- ================================================ FILE: zhihu-text-classification-master/models/wd_1_1_cnn_concat/network.py ================================================ # -*- coding:utf-8 -*- import tensorflow as tf """wd_1_1_cnn_concat title 部分使用 TextCNN；content 部分使用 TextCNN；两部分输出直接 concat。 """ class Settings(object): def __init__(self): self.model_name = 'wd_1_1_cnn_concat' self.title_len = 30 self.content_len = 150 self.filter_sizes = [2, 3, 4, 5, 7] self.n_filter = 256 self.fc_hidden_size = 1024 self.n_class = 1999 self.summary_path = '../../summary/' + self.model_name + '/' self.ckpt_path = '../../ckpt/' + self.model_name + '/' class TextCNN(object): """ title: inputs->textcnn->output_title content: inputs->textcnn->output_content concat[output_title, output_content] -> fc+bn+relu -> sigmoid_entropy. """ def __init__(self, W_embedding, settings): self.model_name = settings.model_name self.title_len = settings.title_len self.content_len = settings.content_len self.filter_sizes = settings.filter_sizes self.n_filter = settings.n_filter self.n_filter_total = self.n_filter * len(self.filter_sizes) self.n_class = settings.n_class self.fc_hidden_size = settings.fc_hidden_size self._global_step = tf.Variable(0, trainable=False, name='Global_Step') self.update_emas = list() # placeholders self._tst = tf.placeholder(tf.bool) self._keep_prob = tf.placeholder(tf.float32, []) self._batch_size = tf.placeholder(tf.int32, []) with tf.name_scope('Inputs'): self._X1_inputs = tf.placeholder(tf.int64, [None, self.title_len], name='X1_inputs') self._X2_inputs = tf.placeholder(tf.int64, [None, self.content_len], name='X2_inputs') self._y_inputs = tf.placeholder(tf.float32, [None, self.n_class], name='y_input') with tf.variable_scope('embedding'): self.embedding = tf.get_variable(name='embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.embedding_size = W_embedding.shape[1] with tf.variable_scope('cnn_text'): output_title = self.cnn_inference(self._X1_inputs, self.title_len) with tf.variable_scope('hcnn_content'): output_content = self.cnn_inference(self._X2_inputs, self.content_len) with tf.variable_scope('fc-bn-layer'): output = tf.concat([output_title, output_content], axis=1) W_fc = self.weight_variable([self.n_filter_total * 2, self.fc_hidden_size], name='Weight_fc') tf.summary.histogram('W_fc', W_fc) h_fc = tf.matmul(output, W_fc, name='h_fc') beta_fc = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.fc_hidden_size], name="beta_fc")) tf.summary.histogram('beta_fc', beta_fc) fc_bn, update_ema_fc = self.batchnorm(h_fc, beta_fc, convolutional=False) self.update_emas.append(update_ema_fc) self.fc_bn_relu = tf.nn.relu(fc_bn, name="relu") fc_bn_drop = tf.nn.dropout(self.fc_bn_relu, self.keep_prob) with tf.variable_scope('out_layer'): W_out = self.weight_variable([self.fc_hidden_size, self.n_class], name='Weight_out') tf.summary.histogram('Weight_out', W_out) b_out = self.bias_variable([self.n_class], name='bias_out') tf.summary.histogram('bias_out', b_out) self._y_pred = tf.nn.xw_plus_b(fc_bn_drop, W_out, b_out, name='y_pred') # 每个类别的分数 scores with tf.name_scope('loss'): self._loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(logits=self._y_pred, labels=self._y_inputs)) tf.summary.scalar('loss', self._loss) self.saver = tf.train.Saver(max_to_keep=2) @property def tst(self): return self._tst @property def keep_prob(self): return self._keep_prob @property def batch_size(self): return self._batch_size @property def global_step(self): return self._global_step @property def X1_inputs(self): return self._X1_inputs @property def X2_inputs(self): return self._X2_inputs @property def y_inputs(self): return self._y_inputs @property def y_pred(self): return self._y_pred @property def loss(self): return self._loss def weight_variable(self, shape, name): """Create a weight variable with appropriate initialization.""" initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial, name=name) def bias_variable(self, shape, name): """Create a bias variable with appropriate initialization.""" initial = tf.constant(0.1, shape=shape) return tf.Variable(initial, name=name) def batchnorm(self, Ylogits, offset, convolutional=False): """batchnormalization. Args: Ylogits: 1D向量或者是3D的卷积结果。 num_updates: 迭代的global_step offset：表示beta，全局均值；在 RELU 激活中一般初始化为 0.1。 scale：表示lambda，全局方差；在 sigmoid 激活中需要，这 RELU 激活中作用不大。 m: 表示batch均值；v:表示batch方差。 bnepsilon：一个很小的浮点数，防止除以 0. Returns: Ybn: 和 Ylogits 的维度一样，就是经过 Batch Normalization 处理的结果。 update_moving_everages：更新mean和variance，主要是给最后的 test 使用。 """ exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, self._global_step) # adding the iteration prevents from averaging across non-existing iterations bnepsilon = 1e-5 if convolutional: mean, variance = tf.nn.moments(Ylogits, [0, 1, 2]) else: mean, variance = tf.nn.moments(Ylogits, [0]) update_moving_everages = exp_moving_avg.apply([mean, variance]) m = tf.cond(self.tst, lambda: exp_moving_avg.average(mean), lambda: mean) v = tf.cond(self.tst, lambda: exp_moving_avg.average(variance), lambda: variance) Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon) return Ybn, update_moving_everages def cnn_inference(self, X_inputs, n_step): """TextCNN 模型。 Args: X_inputs: tensor.shape=(batch_size, n_step) Returns: title_outputs: tensor.shape=(batch_size, self.n_filter_total) """ inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) inputs = tf.expand_dims(inputs, -1) pooled_outputs = list() for i, filter_size in enumerate(self.filter_sizes): with tf.variable_scope("conv-maxpool-%s" % filter_size): # Convolution Layer filter_shape = [filter_size, self.embedding_size, 1, self.n_filter] W_filter = self.weight_variable(shape=filter_shape, name='W_filter') beta = self.bias_variable(shape=[self.n_filter], name='beta_filter') tf.summary.histogram('beta', beta) conv = tf.nn.conv2d(inputs, W_filter, strides=[1, 1, 1, 1], padding="VALID", name="conv") conv_bn, update_ema = self.batchnorm(conv, beta, convolutional=True) # 在激活层前面加 BN # Apply nonlinearity, batch norm scaling is not useful with relus # batch norm offsets are used instead of biases,使用 BN 层的 offset，不要 biases h = tf.nn.relu(conv_bn, name="relu") # Maxpooling over the outputs pooled = tf.nn.max_pool(h, ksize=[1, n_step - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool") pooled_outputs.append(pooled) self.update_emas.append(update_ema) h_pool = tf.concat(pooled_outputs, 3) h_pool_flat = tf.reshape(h_pool, [-1, self.n_filter_total]) return h_pool_flat # shape = [batch_size, self.n_filter_total] # test the model # def test(): # import numpy as np # print('Begin testing...') # settings = Settings() # W_embedding = np.random.randn(50, 10) # config = tf.ConfigProto() # config.gpu_options.allow_growth = True # batch_size = 128 # with tf.Session(config=config) as sess: # model = TextCNN(W_embedding, settings) # optimizer = tf.train.AdamOptimizer(0.001) # train_op = optimizer.minimize(model.loss) # update_op = tf.group(*model.update_emas) # sess.run(tf.global_variables_initializer()) # fetch = [model.loss, model.y_pred, train_op, update_op] # loss_list = list() # for i in xrange(100): # X1_batch = np.zeros((batch_size, 30), dtype=float) # X2_batch = np.zeros((batch_size, 150), dtype=float) # y_batch = np.zeros((batch_size, 1999), dtype=int) # _batch_size = len(y_batch) # feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, # model.batch_size: _batch_size, model.tst: False, model.keep_prob: 0.5} # loss, y_pred, _, _ = sess.run(fetch, feed_dict=feed_dict) # loss_list.append(loss) # print(i, loss) # # if __name__ == '__main__': # test() ================================================ FILE: zhihu-text-classification-master/models/wd_1_1_cnn_concat/predict.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import time import network sys.path.append('../..') from evaluator import score_eval settings = network.Settings() title_len = settings.title_len model_name = settings.model_name ckpt_path = settings.ckpt_path local_scores_path = '../../local_scores/' scores_path = '../../scores/' if not os.path.exists(local_scores_path): os.makedirs(local_scores_path) if not os.path.exists(scores_path): os.makedirs(scores_path) embedding_path = '../../data/word_embedding.npy' data_valid_path = '../../data/wd-data/data_valid/' data_test_path = '../../data/wd-data/data_test/' va_batches = os.listdir(data_valid_path) te_batches = os.listdir(data_test_path) # batch 文件名列表 n_va_batches = len(va_batches) n_te_batches = len(te_batches) def get_batch(batch_id): """get a batch from valid data""" new_batch = np.load(data_valid_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def get_test_batch(batch_id): """get a batch from test data""" X_batch = np.load(data_test_path + str(batch_id) + '.npy') X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch] def local_predict(sess, model): """Test on the valid data.""" time0 = time.time() predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() predict_scores = list() for i in tqdm(xrange(n_va_batches)): [X1_batch, X2_batch, y_batch] = get_batch(i) marked_labels_list.extend(y_batch) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_labels = map(lambda label: label.argsort()[-1:-6:-1], predict_labels) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) print('Local valid p=%g, r=%g, f1=%g' % (precision, recall, f1)) predict_scores = np.vstack(np.asarray(predict_scores)) local_scores_name = local_scores_path + model_name + '.npy' np.save(local_scores_name, predict_scores) print('local_scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (local_scores_name, time.time() - time0)) def predict(sess, model): """Test on the test data.""" time0 = time.time() predict_scores = list() for i in tqdm(xrange(n_te_batches)): [X1_batch, X2_batch] = get_test_batch(i) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_scores = np.vstack(np.asarray(predict_scores)) scores_name = scores_path + model_name + '.npy' np.save(scores_name, predict_scores) print('scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (scores_name, time.time() - time0)) def main(_): if not os.path.exists(ckpt_path + 'checkpoint'): print('there is not saved model, please check the ckpt path') exit() print('Loading model...') W_embedding = np.load(embedding_path) config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.TextCNN(W_embedding, settings) model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) print('Local predicting...') local_predict(sess, model) print('Test predicting...') predict(sess, model) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_1_1_cnn_concat/train.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import shutil import time import network sys.path.append('../..') from data_helpers import to_categorical from evaluator import score_eval flags = tf.flags flags.DEFINE_bool('is_retrain', False, 'if is_retrain is true, not rebuild the summary') flags.DEFINE_integer('max_epoch', 1, 'update the embedding after max_epoch, default: 1') flags.DEFINE_integer('max_max_epoch', 6, 'all training epoches, default: 6') flags.DEFINE_float('lr', 1e-3, 'initial learning rate, default: 1e-3') flags.DEFINE_float('decay_rate', 0.65, 'decay rate, default: 0.65') flags.DEFINE_float('keep_prob', 0.5, 'keep_prob for training, default: 0.5') # 正式 flags.DEFINE_integer('decay_step', 15000, 'decay_step, default: 15000') flags.DEFINE_integer('valid_step', 10000, 'valid_step, default: 10000') flags.DEFINE_float('last_f1', 0.40, 'if valid_f1 > last_f1, save new model. default: 0.40') # 测试 # flags.DEFINE_integer('decay_step', 1000, 'decay_step, default: 1000') # flags.DEFINE_integer('valid_step', 500, 'valid_step, default: 500') # flags.DEFINE_float('last_f1', 0.10, 'if valid_f1 > last_f1, save new model. default: 0.10') FLAGS = flags.FLAGS lr = FLAGS.lr last_f1 = FLAGS.last_f1 settings = network.Settings() title_len = settings.title_len summary_path = settings.summary_path ckpt_path = settings.ckpt_path model_path = ckpt_path + 'model.ckpt' embedding_path = '../../data/word_embedding.npy' data_train_path = '../../data/wd-data/data_train/' data_valid_path = '../../data/wd-data/data_valid/' tr_batches = os.listdir(data_train_path) # batch 文件名列表 va_batches = os.listdir(data_valid_path) n_tr_batches = len(tr_batches) n_va_batches = len(va_batches) # 测试 # n_tr_batches = 1000 # n_va_batches = 50 def get_batch(data_path, batch_id): """get a batch from data_path""" new_batch = np.load(data_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def valid_epoch(data_path, sess, model): """Test on the valid data.""" va_batches = os.listdir(data_path) n_va_batches = len(va_batches) _costs = 0.0 predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() for i in range(n_va_batches): [X1_batch, X2_batch, y_batch] = get_batch(data_path, i) marked_labels_list.extend(y_batch) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) fetches = [model.loss, model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} _cost, predict_labels = sess.run(fetches, feed_dict) _costs += _cost predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) mean_cost = _costs / n_va_batches return mean_cost, precision, recall, f1 def train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer): global last_f1 global lr time0 = time.time() batch_indexs = np.random.permutation(n_tr_batches) # shuffle the training data for batch in tqdm(range(n_tr_batches)): global_step = sess.run(model.global_step) if 0 == (global_step + 1) % FLAGS.valid_step: valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g, time=%g s' % ( global_step, valid_cost, precision, recall, f1, time.time() - time0)) time0 = time.time() if f1 > last_f1: last_f1 = f1 saving_path = model.saver.save(sess, model_path, global_step+1) print('saved new model to %s ' % saving_path) # training batch_id = batch_indexs[batch] [X1_batch, X2_batch, y_batch] = get_batch(data_train_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: FLAGS.keep_prob} summary, _cost, _, _ = sess.run(train_fetches, feed_dict) # the cost is the mean cost of one batch # valid per 500 steps if 0 == (global_step + 1) % 500: train_writer.add_summary(summary, global_step) batch_id = np.random.randint(0, n_va_batches) # 随机选一个验证batch [X1_batch, X2_batch, y_batch] = get_batch(data_valid_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} summary, _cost = sess.run(valid_fetches, feed_dict) test_writer.add_summary(summary, global_step) def main(_): global ckpt_path global last_f1 if not os.path.exists(ckpt_path): os.makedirs(ckpt_path) if not os.path.exists(summary_path): os.makedirs(summary_path) elif not FLAGS.is_retrain: # 重新训练本模型，删除以前的 summary shutil.rmtree(summary_path) os.makedirs(summary_path) if not os.path.exists(summary_path): os.makedirs(summary_path) print('1.Loading data...') W_embedding = np.load(embedding_path) print('training sample_num = %d' % n_tr_batches) print('valid sample_num = %d' % n_va_batches) # Initial or restore the model print('2.Building model...') config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.TextCNN(W_embedding, settings) with tf.variable_scope('training_ops') as vs: learning_rate = tf.train.exponential_decay(FLAGS.lr, model.global_step, FLAGS.decay_step, FLAGS.decay_rate, staircase=True) # two optimizer: op1, update embedding; op2, do not update embedding. with tf.variable_scope('Optimizer1'): tvars1 = tf.trainable_variables() train_op1 = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(model.loss, global_step=model.global_step, var_list=tvars1) with tf.variable_scope('Optimizer2'): tvars2 = [tvar for tvar in tvars1 if 'embedding' not in tvar.name] train_op2 = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(model.loss, global_step=model.global_step, var_list=tvars2) update_op = tf.group(*model.update_emas) merged = tf.summary.merge_all() # summary train_writer = tf.summary.FileWriter(summary_path + 'train', sess.graph) test_writer = tf.summary.FileWriter(summary_path + 'test') training_ops = [v for v in tf.global_variables() if v.name.startswith(vs.name+'/')] # 如果已经保存过模型，导入上次的模型 if os.path.exists(ckpt_path + "checkpoint"): print("Restoring Variables from Checkpoint...") model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) last_valid_cost, precision, recall, last_f1 = valid_epoch(data_valid_path, sess, model) print(' valid cost=%g; p=%g, r=%g, f1=%g' % (last_valid_cost, precision, recall, last_f1)) sess.run(tf.variables_initializer(training_ops)) else: print('Initializing Variables...') sess.run(tf.global_variables_initializer()) print('3.Begin training...') print('max_epoch=%d, max_max_epoch=%d' % (FLAGS.max_epoch, FLAGS.max_max_epoch)) for epoch in range(FLAGS.max_max_epoch): global_step = sess.run(model.global_step) print('Global step %d, lr=%g' % (global_step, sess.run(learning_rate))) if epoch == FLAGS.max_epoch: # update the embedding train_op = train_op1 else: train_op = train_op2 train_fetches = [merged, model.loss, train_op, update_op] valid_fetches = [merged, model.loss] train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer) # 最后再做一次验证 valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('END.Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g' % ( sess.run(model.global_step), valid_cost, precision, recall, f1)) if f1 > last_f1: # save the better model saving_path = model.saver.save(sess, model_path, sess.run(model.global_step)+1) print('saved new model to %s ' % saving_path) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_1_2_cnn_max/__init__.py ================================================ # -*- coding:utf-8 -*- ================================================ FILE: zhihu-text-classification-master/models/wd_1_2_cnn_max/network.py ================================================ # -*- coding:utf-8 -*- import tensorflow as tf """wd_1_2_cnn_max title 部分使用 TextCNN；content 部分使用 TextCNN；两部分输出按位取 max。 """ class Settings(object): def __init__(self): self.model_name = 'wd_1_2_cnn_max' self.title_len = 30 self.content_len = 150 self.filter_sizes = [2, 3, 4, 5, 7] self.n_filter = 256 self.fc_hidden_size = 1024 self.n_class = 1999 self.summary_path = '../../summary/' + self.model_name + '/' self.ckpt_path = '../../ckpt/' + self.model_name + '/' class TextCNN(object): """ title: inputs->textcnn->output_title content: inputs->textcnn->output_content max[output_title, output_content] -> fc+bn+relu -> sigmoid_entropy. """ def __init__(self, W_embedding, settings): self.model_name = settings.model_name self.title_len = settings.title_len self.content_len = settings.content_len self.filter_sizes = settings.filter_sizes self.n_filter = settings.n_filter self.n_filter_total = self.n_filter * len(self.filter_sizes) self.n_class = settings.n_class self.fc_hidden_size = settings.fc_hidden_size self._global_step = tf.Variable(0, trainable=False, name='Global_Step') self.update_emas = list() # placeholders self._tst = tf.placeholder(tf.bool) self._keep_prob = tf.placeholder(tf.float32, []) self._batch_size = tf.placeholder(tf.int32, []) with tf.name_scope('Inputs'): self._X1_inputs = tf.placeholder(tf.int64, [None, self.title_len], name='X1_inputs') self._X2_inputs = tf.placeholder(tf.int64, [None, self.content_len], name='X2_inputs') self._y_inputs = tf.placeholder(tf.float32, [None, self.n_class], name='y_input') with tf.variable_scope('embedding'): self.embedding = tf.get_variable(name='embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.embedding_size = W_embedding.shape[1] with tf.variable_scope('cnn_text'): output_title = self.cnn_inference(self._X1_inputs, self.title_len) output_title = tf.expand_dims(output_title, 0) with tf.variable_scope('hcnn_content'): output_content = self.cnn_inference(self._X2_inputs, self.content_len) output_content = tf.expand_dims(output_content, 0) with tf.variable_scope('fc-bn-layer'): output = tf.concat([output_title, output_content], axis=0) output = tf.reduce_max(output, axis=0) W_fc = self.weight_variable([self.n_filter_total, self.fc_hidden_size], name='Weight_fc') tf.summary.histogram('W_fc', W_fc) h_fc = tf.matmul(output, W_fc, name='h_fc') beta_fc = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.fc_hidden_size], name="beta_fc")) tf.summary.histogram('beta_fc', beta_fc) fc_bn, update_ema_fc = self.batchnorm(h_fc, beta_fc, convolutional=False) self.update_emas.append(update_ema_fc) self.fc_bn_relu = tf.nn.relu(fc_bn, name="relu") fc_bn_drop = tf.nn.dropout(self.fc_bn_relu, self.keep_prob) with tf.variable_scope('out_layer'): W_out = self.weight_variable([self.fc_hidden_size, self.n_class], name='Weight_out') tf.summary.histogram('Weight_out', W_out) b_out = self.bias_variable([self.n_class], name='bias_out') tf.summary.histogram('bias_out', b_out) self._y_pred = tf.nn.xw_plus_b(fc_bn_drop, W_out, b_out, name='y_pred') # 每个类别的分数 scores with tf.name_scope('loss'): self._loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(logits=self._y_pred, labels=self._y_inputs)) tf.summary.scalar('loss', self._loss) self.saver = tf.train.Saver(max_to_keep=2) @property def tst(self): return self._tst @property def keep_prob(self): return self._keep_prob @property def batch_size(self): return self._batch_size @property def global_step(self): return self._global_step @property def X1_inputs(self): return self._X1_inputs @property def X2_inputs(self): return self._X2_inputs @property def y_inputs(self): return self._y_inputs @property def y_pred(self): return self._y_pred @property def loss(self): return self._loss def weight_variable(self, shape, name): """Create a weight variable with appropriate initialization.""" initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial, name=name) def bias_variable(self, shape, name): """Create a bias variable with appropriate initialization.""" initial = tf.constant(0.1, shape=shape) return tf.Variable(initial, name=name) def batchnorm(self, Ylogits, offset, convolutional=False): """batchnormalization. Args: Ylogits: 1D向量或者是3D的卷积结果。 num_updates: 迭代的global_step offset：表示beta，全局均值；在 RELU 激活中一般初始化为 0.1。 scale：表示lambda，全局方差；在 sigmoid 激活中需要，这 RELU 激活中作用不大。 m: 表示batch均值；v:表示batch方差。 bnepsilon：一个很小的浮点数，防止除以 0. Returns: Ybn: 和 Ylogits 的维度一样，就是经过 Batch Normalization 处理的结果。 update_moving_everages：更新mean和variance，主要是给最后的 test 使用。 """ exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, self._global_step) # adding the iteration prevents from averaging across non-existing iterations bnepsilon = 1e-5 if convolutional: mean, variance = tf.nn.moments(Ylogits, [0, 1, 2]) else: mean, variance = tf.nn.moments(Ylogits, [0]) update_moving_everages = exp_moving_avg.apply([mean, variance]) m = tf.cond(self.tst, lambda: exp_moving_avg.average(mean), lambda: mean) v = tf.cond(self.tst, lambda: exp_moving_avg.average(variance), lambda: variance) Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon) return Ybn, update_moving_everages def cnn_inference(self, X_inputs, n_step): """TextCNN 模型。 Args: X_inputs: tensor.shape=(batch_size, n_step) Returns: title_outputs: tensor.shape=(batch_size, self.n_filter_total) """ inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) inputs = tf.expand_dims(inputs, -1) pooled_outputs = list() for i, filter_size in enumerate(self.filter_sizes): with tf.variable_scope("conv-maxpool-%s" % filter_size): # Convolution Layer filter_shape = [filter_size, self.embedding_size, 1, self.n_filter] W_filter = self.weight_variable(shape=filter_shape, name='W_filter') beta = self.bias_variable(shape=[self.n_filter], name='beta_filter') tf.summary.histogram('beta', beta) conv = tf.nn.conv2d(inputs, W_filter, strides=[1, 1, 1, 1], padding="VALID", name="conv") conv_bn, update_ema = self.batchnorm(conv, beta, convolutional=True) # 在激活层前面加 BN # Apply nonlinearity, batch norm scaling is not useful with relus # batch norm offsets are used instead of biases,使用 BN 层的 offset，不要 biases h = tf.nn.relu(conv_bn, name="relu") # Maxpooling over the outputs pooled = tf.nn.max_pool(h, ksize=[1, n_step - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool") pooled_outputs.append(pooled) self.update_emas.append(update_ema) h_pool = tf.concat(pooled_outputs, 3) h_pool_flat = tf.reshape(h_pool, [-1, self.n_filter_total]) return h_pool_flat # shape = [batch_size, self.n_filter_total] # test the model # def test(): # import numpy as np # print('Begin testing...') # settings = Settings() # W_embedding = np.random.randn(50, 10) # config = tf.ConfigProto() # config.gpu_options.allow_growth = True # batch_size = 128 # with tf.Session(config=config) as sess: # model = TextCNN(W_embedding, settings) # optimizer = tf.train.AdamOptimizer(0.001) # train_op = optimizer.minimize(model.loss) # update_op = tf.group(*model.update_emas) # sess.run(tf.global_variables_initializer()) # fetch = [model.loss, model.y_pred, train_op, update_op] # loss_list = list() # for i in xrange(100): # X1_batch = np.zeros((batch_size, 30), dtype=float) # X2_batch = np.zeros((batch_size, 150), dtype=float) # y_batch = np.zeros((batch_size, 1999), dtype=int) # _batch_size = len(y_batch) # feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, # model.batch_size: _batch_size, model.tst: False, model.keep_prob: 0.5} # loss, y_pred, _, _ = sess.run(fetch, feed_dict=feed_dict) # loss_list.append(loss) # print(i, loss) # # if __name__ == '__main__': # test() ================================================ FILE: zhihu-text-classification-master/models/wd_1_2_cnn_max/predict.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import time import network sys.path.append('../..') from evaluator import score_eval settings = network.Settings() title_len = settings.title_len model_name = settings.model_name ckpt_path = settings.ckpt_path local_scores_path = '../../local_scores/' scores_path = '../../scores/' if not os.path.exists(local_scores_path): os.makedirs(local_scores_path) if not os.path.exists(scores_path): os.makedirs(scores_path) embedding_path = '../../data/word_embedding.npy' data_valid_path = '../../data/wd-data/data_valid/' data_test_path = '../../data/wd-data/data_test/' va_batches = os.listdir(data_valid_path) te_batches = os.listdir(data_test_path) # batch 文件名列表 n_va_batches = len(va_batches) n_te_batches = len(te_batches) def get_batch(batch_id): """get a batch from valid data""" new_batch = np.load(data_valid_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def get_test_batch(batch_id): """get a batch from test data""" X_batch = np.load(data_test_path + str(batch_id) + '.npy') X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch] def local_predict(sess, model): """Test on the valid data.""" time0 = time.time() predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() predict_scores = list() for i in tqdm(xrange(n_va_batches)): [X1_batch, X2_batch, y_batch] = get_batch(i) marked_labels_list.extend(y_batch) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) print('Local valid p=%g, r=%g, f1=%g' % (precision, recall, f1)) predict_scores = np.vstack(np.asarray(predict_scores)) local_scores_name = local_scores_path + model_name + '.npy' np.save(local_scores_name, predict_scores) print('local_scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (local_scores_name, time.time() - time0)) def predict(sess, model): """Test on the test data.""" time0 = time.time() predict_scores = list() for i in tqdm(xrange(n_te_batches)): [X1_batch, X2_batch] = get_test_batch(i) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_scores = np.vstack(np.asarray(predict_scores)) scores_name = scores_path + model_name + '.npy' np.save(scores_name, predict_scores) print('scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (scores_name, time.time() - time0)) def main(_): if not os.path.exists(ckpt_path + 'checkpoint'): print('there is not saved model, please check the ckpt path') exit() print('Loading model...') W_embedding = np.load(embedding_path) config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.TextCNN(W_embedding, settings) model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) print('Local predicting...') local_predict(sess, model) print('Test predicting...') predict(sess, model) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_1_2_cnn_max/train.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import shutil import time import network sys.path.append('../..') from data_helpers import to_categorical from evaluator import score_eval flags = tf.flags flags.DEFINE_bool('is_retrain', False, 'if is_retrain is true, not rebuild the summary') flags.DEFINE_integer('max_epoch', 1, 'update the embedding after max_epoch, default: 1') flags.DEFINE_integer('max_max_epoch', 6, 'all training epoches, default: 6') flags.DEFINE_float('lr', 1e-3, 'initial learning rate, default: 1e-3') flags.DEFINE_float('decay_rate', 0.65, 'decay rate, default: 0.65') flags.DEFINE_float('keep_prob', 0.5, 'keep_prob for training, default: 0.5') # 正式 flags.DEFINE_integer('decay_step', 15000, 'decay_step, default: 15000') flags.DEFINE_integer('valid_step', 10000, 'valid_step, default: 10000') flags.DEFINE_float('last_f1', 0.35, 'if valid_f1 > last_f1, save new model. default: 0.40') # 测试 # flags.DEFINE_integer('decay_step', 1000, 'decay_step, default: 1000') # flags.DEFINE_integer('valid_step', 500, 'valid_step, default: 500') # flags.DEFINE_float('last_f1', 0.10, 'if valid_f1 > last_f1, save new model. default: 0.10') FLAGS = flags.FLAGS lr = FLAGS.lr last_f1 = FLAGS.last_f1 settings = network.Settings() title_len = settings.title_len summary_path = settings.summary_path ckpt_path = settings.ckpt_path model_path = ckpt_path + 'model.ckpt' embedding_path = '../../data/word_embedding.npy' data_train_path = '../../data/wd-data/data_train/' data_valid_path = '../../data/wd-data/data_valid/' tr_batches = os.listdir(data_train_path) # batch 文件名列表 va_batches = os.listdir(data_valid_path) n_tr_batches = len(tr_batches) n_va_batches = len(va_batches) # 测试 # n_tr_batches = 1000 # n_va_batches = 50 def get_batch(data_path, batch_id): """get a batch from data_path""" new_batch = np.load(data_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def valid_epoch(data_path, sess, model): """Test on the valid data.""" va_batches = os.listdir(data_path) n_va_batches = len(va_batches) _costs = 0.0 predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() for i in range(n_va_batches): [X1_batch, X2_batch, y_batch] = get_batch(data_path, i) marked_labels_list.extend(y_batch) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) fetches = [model.loss, model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} _cost, predict_labels = sess.run(fetches, feed_dict) _costs += _cost predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) mean_cost = _costs / n_va_batches return mean_cost, precision, recall, f1 def train_epoch(data_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer): global last_f1 global lr time0 = time.time() batch_indexs = np.random.permutation(n_tr_batches) # shuffle the training data for batch in tqdm(range(n_tr_batches)): global_step = sess.run(model.global_step) if 0 == (global_step + 1) % FLAGS.valid_step: valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g, time=%g s' % ( global_step, valid_cost, precision, recall, f1, time.time() - time0)) time0 = time.time() if f1 > last_f1: last_f1 = f1 saving_path = model.saver.save(sess, model_path, global_step+1) print('saved new model to %s ' % saving_path) # training batch_id = batch_indexs[batch] [X1_batch, X2_batch, y_batch] = get_batch(data_train_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: FLAGS.keep_prob} summary, _cost, _, _ = sess.run(train_fetches, feed_dict) # the cost is the mean cost of one batch # valid per 500 steps if 0 == (global_step + 1) % 500: train_writer.add_summary(summary, global_step) batch_id = np.random.randint(0, n_va_batches) # 随机选一个验证batch [X1_batch, X2_batch, y_batch] = get_batch(data_valid_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} summary, _cost = sess.run(valid_fetches, feed_dict) test_writer.add_summary(summary, global_step) def main(_): global ckpt_path global last_f1 if not os.path.exists(ckpt_path): os.makedirs(ckpt_path) if not os.path.exists(summary_path): os.makedirs(summary_path) elif not FLAGS.is_retrain: # 重新训练本模型，删除以前的 summary shutil.rmtree(summary_path) os.makedirs(summary_path) if not os.path.exists(summary_path): os.makedirs(summary_path) print('1.Loading data...') W_embedding = np.load(embedding_path) print('training sample_num = %d' % n_tr_batches) print('valid sample_num = %d' % n_va_batches) # Initial or restore the model print('2.Building model...') config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.TextCNN(W_embedding, settings) with tf.variable_scope('training_ops') as vs: learning_rate = tf.train.exponential_decay(FLAGS.lr, model.global_step, FLAGS.decay_step, FLAGS.decay_rate, staircase=True) # two optimizer: op1, update embedding; op2, do not update embedding. with tf.variable_scope('Optimizer1'): tvars1 = tf.trainable_variables() grads1 = tf.gradients(model.loss, tvars1) optimizer1 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op1 = optimizer1.apply_gradients(zip(grads1, tvars1), global_step=model.global_step) with tf.variable_scope('Optimizer2'): tvars2 = [tvar for tvar in tvars1 if 'embedding' not in tvar.name] grads2 = tf.gradients(model.loss, tvars2) optimizer2 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op2 = optimizer2.apply_gradients(zip(grads2, tvars2), global_step=model.global_step) update_op = tf.group(*model.update_emas) merged = tf.summary.merge_all() # summary train_writer = tf.summary.FileWriter(summary_path + 'train', sess.graph) test_writer = tf.summary.FileWriter(summary_path + 'test') training_ops = [v for v in tf.global_variables() if v.name.startswith(vs.name+'/')] # 如果已经保存过模型，导入上次的模型 if os.path.exists(ckpt_path + "checkpoint"): print("Restoring Variables from Checkpoint...") model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) last_valid_cost, precision, recall, last_f1 = valid_epoch(data_valid_path, sess, model) print(' valid cost=%g; p=%g, r=%g, f1=%g' % (last_valid_cost, precision, recall, last_f1)) sess.run(tf.variables_initializer(training_ops)) train_op2 = train_op1 else: print('Initializing Variables...') sess.run(tf.global_variables_initializer()) print('3.Begin training...') print('max_epoch=%d, max_max_epoch=%d' % (FLAGS.max_epoch, FLAGS.max_max_epoch)) train_op = train_op2 for epoch in range(FLAGS.max_max_epoch): global_step = sess.run(model.global_step) print('Global step %d, lr=%g' % (global_step, sess.run(learning_rate))) if epoch == FLAGS.max_epoch: # update the embedding train_op = train_op1 train_fetches = [merged, model.loss, train_op, update_op] valid_fetches = [merged, model.loss] train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer) # 最后再做一次验证 valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('END.Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g' % ( sess.run(model.global_step), valid_cost, precision, recall, f1)) if f1 > last_f1: # save the better model saving_path = model.saver.save(sess, model_path, sess.run(model.global_step)+1) print('saved new model to %s ' % saving_path) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_2_hcnn/__init__.py ================================================ # -*- coding:utf-8 -*- ================================================ FILE: zhihu-text-classification-master/models/wd_2_hcnn/network.py ================================================ # -*- coding:utf-8 -*- import tensorflow as tf """wd_2_hcnn title 部分使用 TextCNN；content 部分使用分层的 TextCNN。 """ class Settings(object): def __init__(self): self.model_name = 'wd_2_hcnn' self.title_len = self.sent_len = 30 self.doc_len = 10 self.sent_filter_sizes = [2, 3, 4, 5] self.doc_filter_sizes = [2, 3, 4] self.n_filter = 256 self.fc_hidden_size = 1024 self.n_class = 1999 self.summary_path = '../../summary/' + self.model_name + '/' self.ckpt_path = '../../ckpt/' + self.model_name + '/' class HCNN(object): """ title: inputs->textcnn->output_title content: inputs->hcnn->output_content concat[output_title, output_content] -> fc+bn+relu -> sigmoid_entropy. """ def __init__(self, W_embedding, settings): self.model_name = settings.model_name self.sent_len = settings.sent_len self.doc_len = settings.doc_len self.sent_filter_sizes = settings.sent_filter_sizes self.doc_filter_sizes = settings.doc_filter_sizes self.n_filter = settings.n_filter self.n_class = settings.n_class self.fc_hidden_size = settings.fc_hidden_size self._global_step = tf.Variable(0, trainable=False, name='Global_Step') self.update_emas = list() # placeholders self._tst = tf.placeholder(tf.bool) self._keep_prob = tf.placeholder(tf.float32, []) self._batch_size = tf.placeholder(tf.int32, []) with tf.name_scope('Inputs'): self._X1_inputs = tf.placeholder(tf.int64, [None, self.sent_len], name='X1_inputs') self._X2_inputs = tf.placeholder(tf.int64, [None, self.doc_len * self.sent_len], name='X2_inputs') self._y_inputs = tf.placeholder(tf.float32, [None, self.n_class], name='y_input') with tf.variable_scope('embedding'): self.embedding = tf.get_variable(name='embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.embedding_size = W_embedding.shape[1] with tf.variable_scope('cnn_text'): output_title = self.cnn_inference(self._X1_inputs) with tf.variable_scope('hcnn_content'): output_content = self.hcnn_inference(self._X2_inputs) with tf.variable_scope('fc-bn-layer'): output = tf.concat([output_title, output_content], axis=1) output_size = self.n_filter * (len(self.sent_filter_sizes) + len(self.doc_filter_sizes)) W_fc = self.weight_variable([output_size, self.fc_hidden_size], name='Weight_fc') tf.summary.histogram('W_fc', W_fc) h_fc = tf.matmul(output, W_fc, name='h_fc') beta_fc = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.fc_hidden_size], name="beta_fc")) tf.summary.histogram('beta_fc', beta_fc) fc_bn, update_ema_fc = self.batchnorm(h_fc, beta_fc, convolutional=False) self.update_emas.append(update_ema_fc) self.fc_bn_relu = tf.nn.relu(fc_bn, name="relu") fc_bn_drop = tf.nn.dropout(self.fc_bn_relu, self.keep_prob) with tf.variable_scope('out_layer'): W_out = self.weight_variable([self.fc_hidden_size, self.n_class], name='Weight_out') tf.summary.histogram('Weight_out', W_out) b_out = self.bias_variable([self.n_class], name='bias_out') tf.summary.histogram('bias_out', b_out) self._y_pred = tf.nn.xw_plus_b(fc_bn_drop, W_out, b_out, name='y_pred') # 每个类别的分数 scores with tf.name_scope('loss'): self._loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(logits=self._y_pred, labels=self._y_inputs)) tf.summary.scalar('loss', self._loss) self.saver = tf.train.Saver(max_to_keep=2) @property def tst(self): return self._tst @property def keep_prob(self): return self._keep_prob @property def batch_size(self): return self._batch_size @property def global_step(self): return self._global_step @property def X1_inputs(self): return self._X1_inputs @property def X2_inputs(self): return self._X2_inputs @property def y_inputs(self): return self._y_inputs @property def y_pred(self): return self._y_pred @property def loss(self): return self._loss def weight_variable(self, shape, name): """Create a weight variable with appropriate initialization.""" initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial, name=name) def bias_variable(self, shape, name): """Create a bias variable with appropriate initialization.""" initial = tf.constant(0.1, shape=shape) return tf.Variable(initial, name=name) def batchnorm(self, Ylogits, offset, convolutional=False): """batchnormalization. Args: Ylogits: 1D向量或者是3D的卷积结果。 num_updates: 迭代的global_step offset：表示beta，全局均值；在 RELU 激活中一般初始化为 0.1。 scale：表示lambda，全局方差；在 sigmoid 激活中需要，这 RELU 激活中作用不大。 m: 表示batch均值；v:表示batch方差。 bnepsilon：一个很小的浮点数，防止除以 0. Returns: Ybn: 和 Ylogits 的维度一样，就是经过 Batch Normalization 处理的结果。 update_moving_everages：更新mean和variance，主要是给最后的 test 使用。 """ exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, self._global_step) # adding the iteration prevents from averaging across non-existing iterations bnepsilon = 1e-5 if convolutional: mean, variance = tf.nn.moments(Ylogits, [0, 1, 2]) else: mean, variance = tf.nn.moments(Ylogits, [0]) update_moving_everages = exp_moving_avg.apply([mean, variance]) m = tf.cond(self.tst, lambda: exp_moving_avg.average(mean), lambda: mean) v = tf.cond(self.tst, lambda: exp_moving_avg.average(variance), lambda: variance) Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon) return Ybn, update_moving_everages def textcnn(self, X_inputs, n_step, filter_sizes, embed_size): """build the TextCNN network. n_step: the sentence len.""" inputs = tf.expand_dims(X_inputs, -1) pooled_outputs = list() for i, filter_size in enumerate(filter_sizes): with tf.name_scope("conv-maxpool-%s" % filter_size): # Convolution Layer filter_shape = [filter_size, embed_size, 1, self.n_filter] W_filter = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W_filter") beta = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.n_filter], name="beta")) tf.summary.histogram('beta', beta) conv = tf.nn.conv2d(inputs, W_filter, strides=[1, 1, 1, 1], padding="VALID", name="conv") conv_bn, update_ema = self.batchnorm(conv, beta, convolutional=True) # 在激活层前面加 BN # Apply nonlinearity, batch norm scaling is not useful with relus # batch norm offsets are used instead of biases,使用 BN 层的 offset，不要 biases h = tf.nn.relu(conv_bn, name="relu") # Maxpooling over the outputs pooled = tf.nn.max_pool(h, ksize=[1, n_step - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool") pooled_outputs.append(pooled) self.update_emas.append(update_ema) h_pool = tf.concat(pooled_outputs, 3) n_filter_total = self.n_filter * len(filter_sizes) h_pool_flat = tf.reshape(h_pool, [-1, n_filter_total]) return h_pool_flat # shape = [-1, n_filter_total] def cnn_inference(self, X_inputs): """TextCNN 模型。title部分。 Args: X_inputs: tensor.shape=(batch_size, title_len) Returns: title_outputs: tensor.shape=(batch_size, n_filter*filter_num_sent) """ inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) with tf.variable_scope('title_encoder'): # 生成 title 的向量表示 title_outputs = self.textcnn(inputs, self.sent_len, self.sent_filter_sizes, embed_size=self.embedding_size) return title_outputs # shape = [batch_size, n_filter*filter_num_sent] def hcnn_inference(self, X_inputs): """分层 TextCNN 模型。content部分。 Args: X_inputs: tensor.shape=(batch_size, doc_len*sent_len) Returns: doc_attn_outputs: tensor.shape=(batch_size, n_filter*filter_num_doc) """ inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) # inputs.shape=[batch_size, doc_len*sent_len, embedding_size] sent_inputs = tf.reshape(inputs, [self.batch_size * self.doc_len, self.sent_len, self.embedding_size]) # [batch_size*doc_len, sent_len, embedding_size] with tf.variable_scope('sentence_encoder'): # 生成句向量 sent_outputs = self.textcnn(sent_inputs, self.sent_len, self.sent_filter_sizes, self.embedding_size) with tf.variable_scope('doc_encoder'): # 生成文档向量 doc_inputs = tf.reshape(sent_outputs, [self.batch_size, self.doc_len, self.n_filter * len( self.sent_filter_sizes)]) # [batch_size, doc_len, n_filter*len(filter_sizes_sent)] doc_outputs = self.textcnn(doc_inputs, self.doc_len, self.doc_filter_sizes, self.n_filter * len( self.sent_filter_sizes)) # [batch_size, doc_len, n_filter*filter_num_doc] return doc_outputs # [batch_size, n_filter*len(doc_filter_sizes)] # test the model # def test(): # import numpy as np # print('Begin testing...') # settings = Settings() # W_embedding = np.random.randn(50, 10) # config = tf.ConfigProto() # config.gpu_options.allow_growth = True # batch_size = 128 # with tf.Session(config=config) as sess: # model = HCNN(W_embedding, settings) # optimizer = tf.train.AdamOptimizer(0.001) # train_op = optimizer.minimize(model.loss) # update_op = tf.group(*model.update_emas) # sess.run(tf.global_variables_initializer()) # fetch = [model.loss, model.y_pred, train_op, update_op] # loss_list = list() # for i in xrange(100): # X1_batch = np.zeros((batch_size, 30), dtype=float) # X2_batch = np.zeros((batch_size, 10 * 30), dtype=float) # y_batch = np.zeros((batch_size, 1999), dtype=int) # _batch_size = len(y_batch) # feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, # model.batch_size: _batch_size, model.tst: False, model.keep_prob: 0.5} # loss, y_pred, _, _ = sess.run(fetch, feed_dict=feed_dict) # loss_list.append(loss) # print(i, loss) # test() ================================================ FILE: zhihu-text-classification-master/models/wd_2_hcnn/predict.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import time import network sys.path.append('../..') from evaluator import score_eval settings = network.Settings() title_len = settings.title_len model_name = settings.model_name ckpt_path = settings.ckpt_path local_scores_path = '../../local_scores/' scores_path = '../../scores/' if not os.path.exists(local_scores_path): os.makedirs(local_scores_path) if not os.path.exists(scores_path): os.makedirs(scores_path) embedding_path = '../../data/word_embedding.npy' data_valid_path = '../../data/wd-data/seg_valid/' data_test_path = '../../data/wd-data/seg_test/' va_batches = os.listdir(data_valid_path) te_batches = os.listdir(data_test_path) # batch 文件名列表 n_va_batches = len(va_batches) n_te_batches = len(te_batches) def get_batch(batch_id): """get a batch from valid data""" new_batch = np.load(data_valid_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def get_test_batch(batch_id): """get a batch from test data""" X_batch = np.load(data_test_path + str(batch_id) + '.npy') X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch] def local_predict(sess, model): """Test on the valid data.""" time0 = time.time() predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() predict_scores = list() for i in tqdm(xrange(n_va_batches)): [X1_batch, X2_batch, y_batch] = get_batch(i) marked_labels_list.extend(y_batch) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) print('Local valid p=%g, r=%g, f1=%g' % (precision, recall, f1)) predict_scores = np.vstack(np.asarray(predict_scores)) local_scores_name = local_scores_path + model_name + '.npy' np.save(local_scores_name, predict_scores) print('local_scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (local_scores_name, time.time() - time0)) def predict(sess, model): """Test on the test data.""" time0 = time.time() predict_scores = list() for i in tqdm(xrange(n_te_batches)): [X1_batch, X2_batch] = get_test_batch(i) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_scores = np.vstack(np.asarray(predict_scores)) scores_name = scores_path + model_name + '.npy' np.save(scores_name, predict_scores) print('scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (scores_name, time.time() - time0)) def main(_): if not os.path.exists(ckpt_path + 'checkpoint'): print('there is not saved model, please check the ckpt path') exit() print('Loading model...') W_embedding = np.load(embedding_path) config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.HCNN(W_embedding, settings) model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) print('Local predicting...') local_predict(sess, model) print('Test predicting...') predict(sess, model) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_2_hcnn/train.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import shutil import time import network sys.path.append('../..') from data_helpers import to_categorical from evaluator import score_eval flags = tf.flags flags.DEFINE_bool('is_retrain', False, 'if is_retrain is true, not rebuild the summary') flags.DEFINE_integer('max_epoch', 1, 'update the embedding after max_epoch, default: 1') flags.DEFINE_integer('max_max_epoch', 6, 'all training epoches, default: 6') flags.DEFINE_float('lr', 1e-3, 'initial learning rate, default: 1e-3') flags.DEFINE_float('decay_rate', 0.65, 'decay rate, default: 0.65') flags.DEFINE_float('keep_prob', 0.5, 'keep_prob for training, default: 0.5') # 正式 flags.DEFINE_integer('decay_step', 15000, 'decay_step, default: 15000') flags.DEFINE_integer('valid_step', 10000, 'valid_step, default: 10000') flags.DEFINE_float('last_f1', 0.38, 'if valid_f1 > last_f1, save new model. default: 0.40') # 测试 # flags.DEFINE_integer('decay_step', 1000, 'decay_step, default: 1000') # flags.DEFINE_integer('valid_step', 500, 'valid_step, default: 500') # flags.DEFINE_float('last_f1', 0.10, 'if valid_f1 > last_f1, save new model. default: 0.10') FLAGS = flags.FLAGS lr = FLAGS.lr last_f1 = FLAGS.last_f1 settings = network.Settings() title_len = settings.title_len summary_path = settings.summary_path ckpt_path = settings.ckpt_path model_path = ckpt_path + 'model.ckpt' embedding_path = '../../data/word_embedding.npy' data_train_path = '../../data/wd-data/seg_train/' data_valid_path = '../../data/wd-data/seg_valid/' tr_batches = os.listdir(data_train_path) # batch 文件名列表 va_batches = os.listdir(data_valid_path) n_tr_batches = len(tr_batches) n_va_batches = len(va_batches) # 测试 # n_tr_batches = 1000 # n_va_batches = 50 def get_batch(data_path, batch_id): """get a batch from data_path""" new_batch = np.load(data_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def valid_epoch(data_path, sess, model): """Test on the valid data.""" va_batches = os.listdir(data_path) n_va_batches = len(va_batches) _costs = 0.0 predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() for i in range(n_va_batches): [X1_batch, X2_batch, y_batch] = get_batch(data_path, i) marked_labels_list.extend(y_batch) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) fetches = [model.loss, model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} _cost, predict_labels = sess.run(fetches, feed_dict) _costs += _cost predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) mean_cost = _costs / n_va_batches return mean_cost, precision, recall, f1 def train_epoch(data_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer): global last_f1 global lr time0 = time.time() batch_indexs = np.random.permutation(n_tr_batches) # shuffle the training data for batch in tqdm(range(n_tr_batches)): global_step = sess.run(model.global_step) if 0 == (global_step + 1) % FLAGS.valid_step: valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g, time=%g s' % ( global_step, valid_cost, precision, recall, f1, time.time() - time0)) time0 = time.time() if f1 > last_f1: last_f1 = f1 saving_path = model.saver.save(sess, model_path, global_step+1) print('saved new model to %s ' % saving_path) # training batch_id = batch_indexs[batch] [X1_batch, X2_batch, y_batch] = get_batch(data_train_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: FLAGS.keep_prob} summary, _cost, _, _ = sess.run(train_fetches, feed_dict) # the cost is the mean cost of one batch # valid per 500 steps if 0 == (global_step + 1) % 500: train_writer.add_summary(summary, global_step) batch_id = np.random.randint(0, n_va_batches) # 随机选一个验证batch [X1_batch, X2_batch, y_batch] = get_batch(data_valid_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} summary, _cost = sess.run(valid_fetches, feed_dict) test_writer.add_summary(summary, global_step) def main(_): global ckpt_path global last_f1 if not os.path.exists(ckpt_path): os.makedirs(ckpt_path) if not os.path.exists(summary_path): os.makedirs(summary_path) elif not FLAGS.is_retrain: # 重新训练本模型，删除以前的 summary shutil.rmtree(summary_path) os.makedirs(summary_path) if not os.path.exists(summary_path): os.makedirs(summary_path) print('1.Loading data...') W_embedding = np.load(embedding_path) print('training sample_num = %d' % n_tr_batches) print('valid sample_num = %d' % n_va_batches) # Initial or restore the model print('2.Building model...') config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.HCNN(W_embedding, settings) with tf.variable_scope('training_ops') as vs: learning_rate = tf.train.exponential_decay(FLAGS.lr, model.global_step, FLAGS.decay_step, FLAGS.decay_rate, staircase=True) # two optimizer: op1, update embedding; op2, do not update embedding. with tf.variable_scope('Optimizer1'): tvars1 = tf.trainable_variables() grads1 = tf.gradients(model.loss, tvars1) optimizer1 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op1 = optimizer1.apply_gradients(zip(grads1, tvars1), global_step=model.global_step) with tf.variable_scope('Optimizer2'): tvars2 = [tvar for tvar in tvars1 if 'embedding' not in tvar.name] grads2 = tf.gradients(model.loss, tvars2) optimizer2 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op2 = optimizer2.apply_gradients(zip(grads2, tvars2), global_step=model.global_step) update_op = tf.group(*model.update_emas) merged = tf.summary.merge_all() # summary train_writer = tf.summary.FileWriter(summary_path + 'train', sess.graph) test_writer = tf.summary.FileWriter(summary_path + 'test') training_ops = [v for v in tf.global_variables() if v.name.startswith(vs.name+'/')] # 如果已经保存过模型，导入上次的模型 if os.path.exists(ckpt_path + "checkpoint"): print("Restoring Variables from Checkpoint...") model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) last_valid_cost, precision, recall, last_f1 = valid_epoch(data_valid_path, sess, model) print(' valid cost=%g; p=%g, r=%g, f1=%g' % (last_valid_cost, precision, recall, last_f1)) sess.run(tf.variables_initializer(training_ops)) train_op2 = train_op1 else: print('Initializing Variables...') sess.run(tf.global_variables_initializer()) print('3.Begin training...') print('max_epoch=%d, max_max_epoch=%d' % (FLAGS.max_epoch, FLAGS.max_max_epoch)) train_op = train_op2 for epoch in range(FLAGS.max_max_epoch): global_step = sess.run(model.global_step) print('Global step %d, lr=%g' % (global_step, sess.run(learning_rate))) if epoch == FLAGS.max_epoch: # update the embedding train_op = train_op1 train_fetches = [merged, model.loss, train_op, update_op] valid_fetches = [merged, model.loss] train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer) # 最后再做一次验证 valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('END.Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g' % ( sess.run(model.global_step), valid_cost, precision, recall, f1)) if f1 > last_f1: # save the better model saving_path = model.saver.save(sess, model_path, sess.run(model.global_step)+1) print('saved new model to %s ' % saving_path) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_3_bigru/__init__.py ================================================ # -*- coding:utf-8 -*- ================================================ FILE: zhihu-text-classification-master/models/wd_3_bigru/network.py ================================================ # -*- coding:utf-8 -*- import tensorflow as tf from tensorflow.contrib import rnn import tensorflow.contrib.layers as layers """wd_3_bigru title 部分使用 bigru+attention；content 部分使用 bigru+attention；两部分输出直接 concat。 """ class Settings(object): def __init__(self): self.model_name = 'wd_3_bigru' self.title_len = 30 self.content_len = 150 self.hidden_size = 256 self.n_layer = 1 self.fc_hidden_size = 1024 self.n_class = 1999 self.summary_path = '../../summary/' + self.model_name + '/' self.ckpt_path = '../../ckpt/' + self.model_name + '/' class BiGRU(object): """ title: inputs->bigru+attention->output_title content: inputs->bigru+attention->output_content concat[output_title, output_content] -> fc+bn+relu -> sigmoid_entropy. """ def __init__(self, W_embedding, settings): self.model_name = settings.model_name self.title_len = settings.title_len self.content_len = settings.content_len self.hidden_size = settings.hidden_size self.n_layer = settings.n_layer self.n_class = settings.n_class self.fc_hidden_size = settings.fc_hidden_size self._global_step = tf.Variable(0, trainable=False, name='Global_Step') self.update_emas = list() # placeholders self._tst = tf.placeholder(tf.bool) self._keep_prob = tf.placeholder(tf.float32, []) self._batch_size = tf.placeholder(tf.int32, []) with tf.name_scope('Inputs'): self._X1_inputs = tf.placeholder(tf.int64, [None, self.title_len], name='X1_inputs') self._X2_inputs = tf.placeholder(tf.int64, [None, self.content_len], name='X2_inputs') self._y_inputs = tf.placeholder(tf.float32, [None, self.n_class], name='y_input') with tf.variable_scope('embedding'): self.embedding = tf.get_variable(name='embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.embedding_size = W_embedding.shape[1] with tf.variable_scope('bigru_text'): output_title = self.bigru_inference(self._X1_inputs) with tf.variable_scope('bigru_content'): output_content = self.bigru_inference(self._X2_inputs) with tf.variable_scope('fc-bn-layer'): output = tf.concat([output_title, output_content], axis=1) W_fc = self.weight_variable([self.hidden_size * 4, self.fc_hidden_size], name='Weight_fc') tf.summary.histogram('W_fc', W_fc) h_fc = tf.matmul(output, W_fc, name='h_fc') beta_fc = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.fc_hidden_size], name="beta_fc")) tf.summary.histogram('beta_fc', beta_fc) fc_bn, update_ema_fc = self.batchnorm(h_fc, beta_fc, convolutional=False) self.update_emas.append(update_ema_fc) self.fc_bn_relu = tf.nn.relu(fc_bn, name="relu") with tf.variable_scope('out_layer'): W_out = self.weight_variable([self.fc_hidden_size, self.n_class], name='Weight_out') tf.summary.histogram('Weight_out', W_out) b_out = self.bias_variable([self.n_class], name='bias_out') tf.summary.histogram('bias_out', b_out) self._y_pred = tf.nn.xw_plus_b(self.fc_bn_relu, W_out, b_out, name='y_pred') # 每个类别的分数 scores with tf.name_scope('loss'): self._loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(logits=self._y_pred, labels=self._y_inputs)) tf.summary.scalar('loss', self._loss) self.saver = tf.train.Saver(max_to_keep=1) @property def tst(self): return self._tst @property def keep_prob(self): return self._keep_prob @property def batch_size(self): return self._batch_size @property def global_step(self): return self._global_step @property def X1_inputs(self): return self._X1_inputs @property def X2_inputs(self): return self._X2_inputs @property def y_inputs(self): return self._y_inputs @property def y_pred(self): return self._y_pred @property def loss(self): return self._loss def weight_variable(self, shape, name): """Create a weight variable with appropriate initialization.""" initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial, name=name) def bias_variable(self, shape, name): """Create a bias variable with appropriate initialization.""" initial = tf.constant(0.1, shape=shape) return tf.Variable(initial, name=name) def batchnorm(self, Ylogits, offset, convolutional=False): """batchnormalization. Args: Ylogits: 1D向量或者是3D的卷积结果。 num_updates: 迭代的global_step offset：表示beta，全局均值；在 RELU 激活中一般初始化为 0.1。 scale：表示lambda，全局方差；在 sigmoid 激活中需要，这 RELU 激活中作用不大。 m: 表示batch均值；v:表示batch方差。 bnepsilon：一个很小的浮点数，防止除以 0. Returns: Ybn: 和 Ylogits 的维度一样，就是经过 Batch Normalization 处理的结果。 update_moving_everages：更新mean和variance，主要是给最后的 test 使用。 """ exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, self._global_step) # adding the iteration prevents from averaging across non-existing iterations bnepsilon = 1e-5 if convolutional: mean, variance = tf.nn.moments(Ylogits, [0, 1, 2]) else: mean, variance = tf.nn.moments(Ylogits, [0]) update_moving_everages = exp_moving_avg.apply([mean, variance]) m = tf.cond(self.tst, lambda: exp_moving_avg.average(mean), lambda: mean) v = tf.cond(self.tst, lambda: exp_moving_avg.average(variance), lambda: variance) Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon) return Ybn, update_moving_everages def gru_cell(self): with tf.name_scope('gru_cell'): cell = rnn.GRUCell(self.hidden_size, reuse=tf.get_variable_scope().reuse) return rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) def bi_gru(self, inputs): """build the bi-GRU network. 返回个所有层的隐含状态。""" cells_fw = [self.gru_cell() for _ in range(self.n_layer)] cells_bw = [self.gru_cell() for _ in range(self.n_layer)] initial_states_fw = [cell_fw.zero_state(self.batch_size, tf.float32) for cell_fw in cells_fw] initial_states_bw = [cell_bw.zero_state(self.batch_size, tf.float32) for cell_bw in cells_bw] outputs, _, _ = rnn.stack_bidirectional_dynamic_rnn(cells_fw, cells_bw, inputs, initial_states_fw=initial_states_fw, initial_states_bw=initial_states_bw, dtype=tf.float32) return outputs def task_specific_attention(self, inputs, output_size, initializer=layers.xavier_initializer(), activation_fn=tf.tanh, scope=None): """ Performs task-specific attention reduction, using learned attention context vector (constant within task of interest). Args: inputs: Tensor of shape [batch_size, units, input_size] `input_size` must be static (known) `units` axis will be attended over (reduced from output) `batch_size` will be preserved output_size: Size of output's inner (feature) dimension Returns: outputs: Tensor of shape [batch_size, output_dim]. """ assert len(inputs.get_shape()) == 3 and inputs.get_shape()[-1].value is not None with tf.variable_scope(scope or 'attention') as scope: # u_w, attention 向量 attention_context_vector = tf.get_variable(name='attention_context_vector', shape=[output_size], initializer=initializer, dtype=tf.float32) # 全连接层，把 h_i 转为 u_i ， shape= [batch_size, units, input_size] -> [batch_size, units, output_size] input_projection = layers.fully_connected(inputs, output_size, activation_fn=activation_fn, scope=scope) # 输出 [batch_size, units] vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector), axis=2, keep_dims=True) attention_weights = tf.nn.softmax(vector_attn, dim=1) tf.summary.histogram('attention_weigths', attention_weights) weighted_projection = tf.multiply(inputs, attention_weights) outputs = tf.reduce_sum(weighted_projection, axis=1) return outputs # 输出 [batch_size, hidden_size*2] def bigru_inference(self, X_inputs): inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) output_bigru = self.bi_gru(inputs) output_att = self.task_specific_attention(output_bigru, self.hidden_size*2) return output_att # test the model def test(): import numpy as np print('Begin testing...') settings = Settings() W_embedding = np.random.randn(50, 10) config = tf.ConfigProto() config.gpu_options.allow_growth = True batch_size = 128 with tf.Session(config=config) as sess: model = BiGRU(W_embedding, settings) optimizer = tf.train.AdamOptimizer(0.001) train_op = optimizer.minimize(model.loss) update_op = tf.group(*model.update_emas) sess.run(tf.global_variables_initializer()) fetch = [model.loss, model.y_pred, train_op, update_op] loss_list = list() for i in xrange(100): X1_batch = np.zeros((batch_size, 30), dtype=float) X2_batch = np.zeros((batch_size, 150), dtype=float) y_batch = np.zeros((batch_size, 1999), dtype=int) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: 0.5} loss, y_pred, _, _ = sess.run(fetch, feed_dict=feed_dict) loss_list.append(loss) print(i, loss) if __name__ == '__main__': test() ================================================ FILE: zhihu-text-classification-master/models/wd_3_bigru/predict.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import time import network sys.path.append('../..') from evaluator import score_eval settings = network.Settings() title_len = settings.title_len model_name = settings.model_name ckpt_path = settings.ckpt_path local_scores_path = '../../local_scores/' scores_path = '../../scores/' if not os.path.exists(local_scores_path): os.makedirs(local_scores_path) if not os.path.exists(scores_path): os.makedirs(scores_path) embedding_path = '../../data/word_embedding.npy' data_valid_path = '../../data/wd-data/data_valid/' data_test_path = '../../data/wd-data/data_test/' va_batches = os.listdir(data_valid_path) te_batches = os.listdir(data_test_path) # batch 文件名列表 n_va_batches = len(va_batches) n_te_batches = len(te_batches) def get_batch(batch_id): """get a batch from valid data""" new_batch = np.load(data_valid_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def get_test_batch(batch_id): """get a batch from test data""" X_batch = np.load(data_test_path + str(batch_id) + '.npy') X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch] def local_predict(sess, model): """Test on the valid data.""" time0 = time.time() predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() predict_scores = list() for i in tqdm(range(n_va_batches)): [X1_batch, X2_batch, y_batch] = get_batch(i) marked_labels_list.extend(y_batch) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) print('Local valid p=%g, r=%g, f1=%g' % (precision, recall, f1)) predict_scores = np.vstack(np.asarray(predict_scores)) local_scores_name = local_scores_path + model_name + '.npy' np.save(local_scores_name, predict_scores) print('local_scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (local_scores_name, time.time() - time0)) def predict(sess, model): """Test on the test data.""" time0 = time.time() predict_scores = list() for i in tqdm(range(n_te_batches)): [X1_batch, X2_batch] = get_test_batch(i) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_scores = np.vstack(np.asarray(predict_scores)) scores_name = scores_path + model_name + '.npy' np.save(scores_name, predict_scores) print('scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (scores_name, time.time() - time0)) def main(_): if not os.path.exists(ckpt_path + 'checkpoint'): print('there is not saved model, please check the ckpt path') exit() print('Loading model...') W_embedding = np.load(embedding_path) config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.BiGRU(W_embedding, settings) model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) print('Local predicting...') local_predict(sess, model) print('Test predicting...') predict(sess, model) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_3_bigru/train.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import shutil import time import network sys.path.append('../..') from data_helpers import to_categorical from evaluator import score_eval flags = tf.flags flags.DEFINE_bool('is_retrain', False, 'if is_retrain is true, not rebuild the summary') flags.DEFINE_integer('max_epoch', 1, 'update the embedding after max_epoch, default: 1') flags.DEFINE_integer('max_max_epoch', 6, 'all training epoches, default: 6') flags.DEFINE_float('lr', 8e-4, 'initial learning rate, default: 8e-4') flags.DEFINE_float('decay_rate', 0.85, 'decay rate, default: 0.85') flags.DEFINE_float('keep_prob', 0.5, 'keep_prob for training, default: 0.5') # 正式 flags.DEFINE_integer('decay_step', 15000, 'decay_step, default: 15000') flags.DEFINE_integer('valid_step', 10000, 'valid_step, default: 10000') flags.DEFINE_float('last_f1', 0.40, 'if valid_f1 > last_f1, save new model. default: 0.40') # 测试 # flags.DEFINE_integer('decay_step', 1000, 'decay_step, default: 1000') # flags.DEFINE_integer('valid_step', 500, 'valid_step, default: 500') # flags.DEFINE_float('last_f1', 0.10, 'if valid_f1 > last_f1, save new model. default: 0.10') FLAGS = flags.FLAGS lr = FLAGS.lr last_f1 = FLAGS.last_f1 settings = network.Settings() title_len = settings.title_len summary_path = settings.summary_path ckpt_path = settings.ckpt_path model_path = ckpt_path + 'model.ckpt' embedding_path = '../../data/word_embedding.npy' data_train_path = '../../data/wd-data/data_train/' data_valid_path = '../../data/wd-data/data_valid/' tr_batches = os.listdir(data_train_path) # batch 文件名列表 va_batches = os.listdir(data_valid_path) n_tr_batches = len(tr_batches) n_va_batches = len(va_batches) # 测试 # n_tr_batches = 1000 # n_va_batches = 50 def get_batch(data_path, batch_id): """get a batch from data_path""" new_batch = np.load(data_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def valid_epoch(data_path, sess, model): """Test on the valid data.""" va_batches = os.listdir(data_path) n_va_batches = len(va_batches) _costs = 0.0 predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() for i in range(n_va_batches): [X1_batch, X2_batch, y_batch] = get_batch(data_path, i) marked_labels_list.extend(y_batch) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) fetches = [model.loss, model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} _cost, predict_labels = sess.run(fetches, feed_dict) _costs += _cost predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) mean_cost = _costs / n_va_batches return mean_cost, precision, recall, f1 def train_epoch(data_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer): global last_f1 global lr time0 = time.time() batch_indexs = np.random.permutation(n_tr_batches) # shuffle the training data for batch in tqdm(range(n_tr_batches)): global_step = sess.run(model.global_step) if 0 == (global_step + 1) % FLAGS.valid_step: valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g, time=%g s' % ( global_step, valid_cost, precision, recall, f1, time.time() - time0)) time0 = time.time() if f1 > last_f1: last_f1 = f1 saving_path = model.saver.save(sess, model_path, global_step+1) print('saved new model to %s ' % saving_path) # training batch_id = batch_indexs[batch] [X1_batch, X2_batch, y_batch] = get_batch(data_train_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: FLAGS.keep_prob} summary, _cost, _, _ = sess.run(train_fetches, feed_dict) # the cost is the mean cost of one batch # valid per 500 steps if 0 == (global_step + 1) % 500: train_writer.add_summary(summary, global_step) batch_id = np.random.randint(0, n_va_batches) # 随机选一个验证batch [X1_batch, X2_batch, y_batch] = get_batch(data_valid_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} summary, _cost = sess.run(valid_fetches, feed_dict) test_writer.add_summary(summary, global_step) def main(_): global ckpt_path global last_f1 if not os.path.exists(ckpt_path): os.makedirs(ckpt_path) if not os.path.exists(summary_path): os.makedirs(summary_path) elif not FLAGS.is_retrain: # 重新训练本模型，删除以前的 summary shutil.rmtree(summary_path) os.makedirs(summary_path) if not os.path.exists(summary_path): os.makedirs(summary_path) print('1.Loading data...') W_embedding = np.load(embedding_path) print('training sample_num = %d' % n_tr_batches) print('valid sample_num = %d' % n_va_batches) # Initial or restore the model print('2.Building model...') config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.BiGRU(W_embedding, settings) with tf.variable_scope('training_ops') as vs: learning_rate = tf.train.exponential_decay(FLAGS.lr, model.global_step, FLAGS.decay_step, FLAGS.decay_rate, staircase=True) # two optimizer: op1, update embedding; op2, do not update embedding. with tf.variable_scope('Optimizer1'): tvars1 = tf.trainable_variables() grads1 = tf.gradients(model.loss, tvars1) optimizer1 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op1 = optimizer1.apply_gradients(zip(grads1, tvars1), global_step=model.global_step) with tf.variable_scope('Optimizer2'): tvars2 = [tvar for tvar in tvars1 if 'embedding' not in tvar.name] grads2 = tf.gradients(model.loss, tvars2) optimizer2 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op2 = optimizer2.apply_gradients(zip(grads2, tvars2), global_step=model.global_step) update_op = tf.group(*model.update_emas) merged = tf.summary.merge_all() # summary train_writer = tf.summary.FileWriter(summary_path + 'train', sess.graph) test_writer = tf.summary.FileWriter(summary_path + 'test') training_ops = [v for v in tf.global_variables() if v.name.startswith(vs.name+'/')] # 如果已经保存过模型，导入上次的模型 if os.path.exists(ckpt_path + "checkpoint"): print("Restoring Variables from Checkpoint...") model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) last_valid_cost, precision, recall, last_f1 = valid_epoch(data_valid_path, sess, model) print(' valid cost=%g; p=%g, r=%g, f1=%g' % (last_valid_cost, precision, recall, last_f1)) sess.run(tf.variables_initializer(training_ops)) train_op2 = train_op1 else: print('Initializing Variables...') sess.run(tf.global_variables_initializer()) print('3.Begin training...') train_op = train_op2 print('max_epoch=%d, max_max_epoch=%d' % (FLAGS.max_epoch, FLAGS.max_max_epoch)) for epoch in range(FLAGS.max_max_epoch): global_step = sess.run(model.global_step) print('Global step %d, lr=%g' % (global_step, sess.run(learning_rate))) if epoch == FLAGS.max_epoch: # update the embedding train_op = train_op1 train_fetches = [merged, model.loss, train_op, update_op] valid_fetches = [merged, model.loss] train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer) # 最后再做一次验证 valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('END.Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g' % ( sess.run(model.global_step), valid_cost, precision, recall, f1)) if f1 > last_f1: # save the better model saving_path = model.saver.save(sess, model_path, sess.run(model.global_step)+1) print('saved new model to %s ' % saving_path) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_4_han/__init__.py ================================================ # -*- coding:utf-8 -*- ================================================ FILE: zhihu-text-classification-master/models/wd_4_han/network.py ================================================ # -*- coding:utf-8 -*- import tensorflow as tf from tensorflow.contrib import rnn import tensorflow.contrib.layers as layers """wd_4_han title 部分使用 bigru+attention；content 部分使用 han；两部分输出直接 concat。 """ class Settings(object): def __init__(self): self.model_name = 'wd_4_han' self.title_len = self.sent_len = 30 self.doc_len = 10 self.hidden_size = 256 self.n_layer = 1 self.fc_hidden_size = 1024 self.n_class = 1999 self.summary_path = '../../summary/' + self.model_name + '/' self.ckpt_path = '../../ckpt/' + self.model_name + '/' class HAN(object): """ title: inputs->bigru+attention->output_title content: inputs->sent_encoder(bigru+attention)->doc_encoder(bigru+attention)->output_content concat[output_title, output_content] -> fc+bn+relu -> sigmoid_entropy. """ def __init__(self, W_embedding, settings): self.model_name = settings.model_name self.title_len = self.sent_len = settings.sent_len self.doc_len = settings.doc_len self.hidden_size = settings.hidden_size self.n_layer = settings.n_layer self.n_class = settings.n_class self.fc_hidden_size = settings.fc_hidden_size self._global_step = tf.Variable(0, trainable=False, name='Global_Step') self.update_emas = list() # placeholders self._tst = tf.placeholder(tf.bool) self._keep_prob = tf.placeholder(tf.float32, []) self._batch_size = tf.placeholder(tf.int32, []) with tf.name_scope('Inputs'): self._X1_inputs = tf.placeholder(tf.int64, [None, self.title_len], name='X1_inputs') self._X2_inputs = tf.placeholder(tf.int64, [None, self.doc_len * self.sent_len], name='X2_inputs') self._y_inputs = tf.placeholder(tf.float32, [None, self.n_class], name='y_input') with tf.variable_scope('embedding'): self.embedding = tf.get_variable(name='embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.embedding_size = W_embedding.shape[1] with tf.variable_scope('bigru_text'): output_title = self.bigru_inference(self._X1_inputs) with tf.variable_scope('han_content'): output_content = self.han_inference(self._X2_inputs) with tf.variable_scope('fc-bn-layer'): output = tf.concat([output_title, output_content], axis=1) W_fc = self.weight_variable([self.hidden_size * 4, self.fc_hidden_size], name='Weight_fc') tf.summary.histogram('W_fc', W_fc) h_fc = tf.matmul(output, W_fc, name='h_fc') beta_fc = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.fc_hidden_size], name="beta_fc")) tf.summary.histogram('beta_fc', beta_fc) fc_bn, update_ema_fc = self.batchnorm(h_fc, beta_fc, convolutional=False) self.update_emas.append(update_ema_fc) self.fc_bn_relu = tf.nn.relu(fc_bn, name="relu") fc_bn_drop = tf.nn.dropout(self.fc_bn_relu, self.keep_prob) with tf.variable_scope('out_layer'): W_out = self.weight_variable([self.fc_hidden_size, self.n_class], name='Weight_out') tf.summary.histogram('Weight_out', W_out) b_out = self.bias_variable([self.n_class], name='bias_out') tf.summary.histogram('bias_out', b_out) self._y_pred = tf.nn.xw_plus_b(fc_bn_drop, W_out, b_out, name='y_pred') # 每个类别的分数 scores with tf.name_scope('loss'): self._loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(logits=self._y_pred, labels=self._y_inputs)) tf.summary.scalar('loss', self._loss) self.saver = tf.train.Saver(max_to_keep=1) @property def tst(self): return self._tst @property def keep_prob(self): return self._keep_prob @property def batch_size(self): return self._batch_size @property def global_step(self): return self._global_step @property def X1_inputs(self): return self._X1_inputs @property def X2_inputs(self): return self._X2_inputs @property def y_inputs(self): return self._y_inputs @property def y_pred(self): return self._y_pred @property def loss(self): return self._loss def weight_variable(self, shape, name): """Create a weight variable with appropriate initialization.""" initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial, name=name) def bias_variable(self, shape, name): """Create a bias variable with appropriate initialization.""" initial = tf.constant(0.1, shape=shape) return tf.Variable(initial, name=name) def batchnorm(self, Ylogits, offset, convolutional=False): """batchnormalization. Args: Ylogits: 1D向量或者是3D的卷积结果。 num_updates: 迭代的global_step offset：表示beta，全局均值；在 RELU 激活中一般初始化为 0.1。 scale：表示lambda，全局方差；在 sigmoid 激活中需要，这 RELU 激活中作用不大。 m: 表示batch均值；v:表示batch方差。 bnepsilon：一个很小的浮点数，防止除以 0. Returns: Ybn: 和 Ylogits 的维度一样，就是经过 Batch Normalization 处理的结果。 update_moving_everages：更新mean和variance，主要是给最后的 test 使用。 """ exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, self._global_step) # adding the iteration prevents from averaging across non-existing iterations bnepsilon = 1e-5 if convolutional: mean, variance = tf.nn.moments(Ylogits, [0, 1, 2]) else: mean, variance = tf.nn.moments(Ylogits, [0]) update_moving_everages = exp_moving_avg.apply([mean, variance]) m = tf.cond(self.tst, lambda: exp_moving_avg.average(mean), lambda: mean) v = tf.cond(self.tst, lambda: exp_moving_avg.average(variance), lambda: variance) Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon) return Ybn, update_moving_everages def gru_cell(self): with tf.name_scope('gru_cell'): cell = rnn.GRUCell(self.hidden_size, reuse=tf.get_variable_scope().reuse) return rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) def bi_gru(self, inputs, seg_num): """build the bi-GRU network. Return the encoder represented vector. n_step: 句子的词数量；或者文档的句子数。 seg_num: 序列的数量，原本应该为 batch_size, 但是这里将 batch_size 个 doc展开成很多个句子。 """ cells_fw = [self.gru_cell() for _ in range(self.n_layer)] cells_bw = [self.gru_cell() for _ in range(self.n_layer)] initial_states_fw = [cell_fw.zero_state(seg_num, tf.float32) for cell_fw in cells_fw] initial_states_bw = [cell_bw.zero_state(seg_num, tf.float32) for cell_bw in cells_bw] outputs, _, _ = rnn.stack_bidirectional_dynamic_rnn(cells_fw, cells_bw, inputs, initial_states_fw = initial_states_fw, initial_states_bw = initial_states_bw, dtype=tf.float32) # outputs: Output Tensor shaped: seg_num, max_time, layers_output]，其中layers_output=hidden_size * 2 在这里。 return outputs def task_specific_attention(self, inputs, output_size, initializer=layers.xavier_initializer(), activation_fn=tf.tanh, scope=None): """ Performs task-specific attention reduction, using learned attention context vector (constant within task of interest). Args: inputs: Tensor of shape [batch_size, units, input_size] `input_size` must be static (known) `units` axis will be attended over (reduced from output) `batch_size` will be preserved output_size: Size of output's inner (feature) dimension Returns: outputs: Tensor of shape [batch_size, output_dim]. """ assert len(inputs.get_shape()) == 3 and inputs.get_shape()[-1].value is not None with tf.variable_scope(scope or 'attention') as scope: # u_w, attention 向量 attention_context_vector = tf.get_variable(name='attention_context_vector', shape=[output_size], initializer=initializer, dtype=tf.float32) # 全连接层，把 h_i 转为 u_i ， shape= [batch_size, units, input_size] -> [batch_size, units, output_size] input_projection = layers.fully_connected(inputs, output_size, activation_fn=activation_fn, scope=scope) # 输出 [batch_size, units] vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector), axis=2, keep_dims=True) attention_weights = tf.nn.softmax(vector_attn, dim=1) tf.summary.histogram('attention_weigths', attention_weights) weighted_projection = tf.multiply(inputs, attention_weights) outputs = tf.reduce_sum(weighted_projection, axis=1) return outputs # 输出 [batch_size, hidden_size*2] def bigru_inference(self, X_inputs): inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) output_bigru = self.bi_gru(inputs, self.batch_size) output_att = self.task_specific_attention(output_bigru, self.hidden_size*2) return output_att # 输出 [batch_size, hidden_size*2] def han_inference(self, X_inputs): """分层 attention 模型。content部分。 Args: X_inputs: tensor.shape=(batch_size, doc_len*sent_len) Returns: doc_attn_outputs: tensor.shape=(batch_size, hidden_size(*2 for bigru)) """ inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) # inputs.shape=[batch_size, doc_len*sent_len, embedding_size] sent_inputs = tf.reshape(inputs,[self.batch_size*self.doc_len, self.sent_len, self.embedding_size]) # shape=(?, 40, 256) with tf.variable_scope('sentence_encoder'): # 生成句向量 sent_outputs = self.bi_gru(sent_inputs, seg_num=self.batch_size*self.doc_len) sent_attn_outputs = self.task_specific_attention(sent_outputs, self.hidden_size*2) # [batch_size*doc_len, hidden_size*2] with tf.variable_scope('dropout'): sent_attn_outputs = tf.nn.dropout(sent_attn_outputs, self.keep_prob) with tf.variable_scope('doc_encoder'): # 生成文档向量 doc_inputs = tf.reshape(sent_attn_outputs, [self.batch_size, self.doc_len, self.hidden_size*2]) doc_outputs = self.bi_gru(doc_inputs, self.batch_size) # [batch_size, doc_len, hidden_size*2] doc_attn_outputs = self.task_specific_attention(doc_outputs, self.hidden_size*2) # [batch_size, hidden_size*2] return doc_attn_outputs # [batch_size, hidden_size*2] # test the model def test(): import numpy as np print('Begin testing...') settings = Settings() W_embedding = np.random.randn(50, 10) config = tf.ConfigProto() config.gpu_options.allow_growth = True batch_size = 128 with tf.Session(config=config) as sess: model = HAN(W_embedding, settings) optimizer = tf.train.AdamOptimizer(0.001) train_op = optimizer.minimize(model.loss) update_op = tf.group(*model.update_emas) sess.run(tf.global_variables_initializer()) fetch = [model.loss, model.y_pred, train_op, update_op] loss_list = list() for i in xrange(100): X1_batch = np.zeros((batch_size, 30), dtype=float) X2_batch = np.zeros((batch_size, 10 * 30), dtype=float) y_batch = np.zeros((batch_size, 1999), dtype=int) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: 0.5} loss, y_pred, _, _ = sess.run(fetch, feed_dict=feed_dict) loss_list.append(loss) print(i, loss) if __name__ == '__main__': test() ================================================ FILE: zhihu-text-classification-master/models/wd_4_han/predict.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import time import network sys.path.append('../..') from evaluator import score_eval settings = network.Settings() title_len = settings.title_len model_name = settings.model_name ckpt_path = settings.ckpt_path local_scores_path = '../../local_scores/' scores_path = '../../scores/' if not os.path.exists(local_scores_path): os.makedirs(local_scores_path) if not os.path.exists(scores_path): os.makedirs(scores_path) embedding_path = '../../data/word_embedding.npy' data_valid_path = '../../data/wd-data/seg_valid/' data_test_path = '../../data/wd-data/seg_test/' va_batches = os.listdir(data_valid_path) te_batches = os.listdir(data_test_path) # batch 文件名列表 n_va_batches = len(va_batches) n_te_batches = len(te_batches) def get_batch(batch_id): """get a batch from valid data""" new_batch = np.load(data_valid_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def get_test_batch(batch_id): """get a batch from test data""" X_batch = np.load(data_test_path + str(batch_id) + '.npy') X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch] def local_predict(sess, model): """Test on the valid data.""" time0 = time.time() predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() predict_scores = list() for i in tqdm(xrange(n_va_batches)): [X1_batch, X2_batch, y_batch] = get_batch(i) marked_labels_list.extend(y_batch) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) print('Local valid p=%g, r=%g, f1=%g' % (precision, recall, f1)) predict_scores = np.vstack(np.asarray(predict_scores)) local_scores_name = local_scores_path + model_name + '.npy' np.save(local_scores_name, predict_scores) print('local_scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (local_scores_name, time.time() - time0)) def predict(sess, model): """Test on the test data.""" time0 = time.time() predict_scores = list() for i in tqdm(xrange(n_te_batches)): [X1_batch, X2_batch] = get_test_batch(i) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_scores = np.vstack(np.asarray(predict_scores)) scores_name = scores_path + model_name + '.npy' np.save(scores_name, predict_scores) print('scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (scores_name, time.time() - time0)) def main(_): if not os.path.exists(ckpt_path + 'checkpoint'): print('there is not saved model, please check the ckpt path') exit() print('Loading model...') W_embedding = np.load(embedding_path) config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.HAN(W_embedding, settings) model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) print('Local predicting...') local_predict(sess, model) print('Test predicting...') predict(sess, model) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_4_han/train.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import shutil import time import network sys.path.append('../..') from data_helpers import to_categorical from evaluator import score_eval flags = tf.flags flags.DEFINE_bool('is_retrain', False, 'if is_retrain is true, not rebuild the summary') flags.DEFINE_integer('max_epoch', 2, 'update the embedding after max_epoch, default: 2') flags.DEFINE_integer('max_max_epoch', 6, 'all training epoches, default: 6') flags.DEFINE_float('lr', 8e-4, 'initial learning rate, default: 8e-4') flags.DEFINE_float('decay_rate', 0.85, 'decay rate, default: 0.85') flags.DEFINE_float('keep_prob', 0.5, 'keep_prob for training, default: 0.5') # 正式 flags.DEFINE_integer('decay_step', 15000, 'decay_step, default: 15000') flags.DEFINE_integer('valid_step', 10000, 'valid_step, default: 10000') flags.DEFINE_float('last_f1', 0.38, 'if valid_f1 > last_f1, save new model. default: 0.40') # 测试 # flags.DEFINE_integer('decay_step', 1000, 'decay_step, default: 1000') # flags.DEFINE_integer('valid_step', 500, 'valid_step, default: 500') # flags.DEFINE_float('last_f1', 0.10, 'if valid_f1 > last_f1, save new model. default: 0.10') FLAGS = flags.FLAGS lr = FLAGS.lr last_f1 = FLAGS.last_f1 settings = network.Settings() title_len = settings.title_len summary_path = settings.summary_path ckpt_path = settings.ckpt_path model_path = ckpt_path + 'model.ckpt' embedding_path = '../../data/word_embedding.npy' data_train_path = '../../data/wd-data/seg_train/' data_valid_path = '../../data/wd-data/seg_valid/' tr_batches = os.listdir(data_train_path) # batch 文件名列表 va_batches = os.listdir(data_valid_path) n_tr_batches = len(tr_batches) n_va_batches = len(va_batches) # 测试 # n_tr_batches = 1000 # n_va_batches = 50 def get_batch(data_path, batch_id): """get a batch from data_path""" new_batch = np.load(data_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def valid_epoch(data_path, sess, model): """Test on the valid data.""" va_batches = os.listdir(data_path) n_va_batches = len(va_batches) _costs = 0.0 predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() for i in range(n_va_batches): [X1_batch, X2_batch, y_batch] = get_batch(data_path, i) marked_labels_list.extend(y_batch) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) fetches = [model.loss, model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} _cost, predict_labels = sess.run(fetches, feed_dict) _costs += _cost predict_labels = (map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) mean_cost = _costs / n_va_batches return mean_cost, precision, recall, f1 def train_epoch(data_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer): global last_f1 global lr time0 = time.time() batch_indexs = np.random.permutation(n_tr_batches) # shuffle the training data for batch in tqdm(range(n_tr_batches)): global_step = sess.run(model.global_step) if 0 == (global_step + 1) % FLAGS.valid_step: valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g, time=%g s' % ( global_step, valid_cost, precision, recall, f1, time.time() - time0)) time0 = time.time() if f1 > last_f1: last_f1 = f1 saving_path = model.saver.save(sess, model_path, global_step+1) print('saved new model to %s ' % saving_path) # training batch_id = batch_indexs[batch] [X1_batch, X2_batch, y_batch] = get_batch(data_train_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: FLAGS.keep_prob} summary, _cost, _, _ = sess.run(train_fetches, feed_dict) # the cost is the mean cost of one batch # valid per 500 steps if 0 == (global_step + 1) % 500: train_writer.add_summary(summary, global_step) batch_id = np.random.randint(0, n_va_batches) # 随机选一个验证batch [X1_batch, X2_batch, y_batch] = get_batch(data_valid_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} summary, _cost = sess.run(valid_fetches, feed_dict) test_writer.add_summary(summary, global_step) def main(_): global ckpt_path global last_f1 if not os.path.exists(ckpt_path): os.makedirs(ckpt_path) if not os.path.exists(summary_path): os.makedirs(summary_path) elif not FLAGS.is_retrain: # 重新训练本模型，删除以前的 summary shutil.rmtree(summary_path) os.makedirs(summary_path) if not os.path.exists(summary_path): os.makedirs(summary_path) print('1.Loading data...') W_embedding = np.load(embedding_path) print('training sample_num = %d' % n_tr_batches) print('valid sample_num = %d' % n_va_batches) # Initial or restore the model print('2.Building model...') config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.HAN(W_embedding, settings) with tf.variable_scope('training_ops') as vs: learning_rate = tf.train.exponential_decay(FLAGS.lr, model.global_step, FLAGS.decay_step, FLAGS.decay_rate, staircase=True) # two optimizer: op1, update embedding; op2, do not update embedding. with tf.variable_scope('Optimizer1'): tvars1 = tf.trainable_variables() grads1 = tf.gradients(model.loss, tvars1) optimizer1 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op1 = optimizer1.apply_gradients(zip(grads1, tvars1), global_step=model.global_step) with tf.variable_scope('Optimizer2'): tvars2 = [tvar for tvar in tvars1 if 'embedding' not in tvar.name] grads2 = tf.gradients(model.loss, tvars2) optimizer2 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op2 = optimizer2.apply_gradients(zip(grads2, tvars2), global_step=model.global_step) update_op = tf.group(*model.update_emas) merged = tf.summary.merge_all() # summary train_writer = tf.summary.FileWriter(summary_path + 'train', sess.graph) test_writer = tf.summary.FileWriter(summary_path + 'test') training_ops = [v for v in tf.global_variables() if v.name.startswith(vs.name+'/')] # 如果已经保存过模型，导入上次的模型 if os.path.exists(ckpt_path + "checkpoint"): print("Restoring Variables from Checkpoint...") model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) last_valid_cost, precision, recall, last_f1 = valid_epoch(data_valid_path, sess, model) print(' valid cost=%g; p=%g, r=%g, f1=%g' % (last_valid_cost, precision, recall, last_f1)) sess.run(tf.variables_initializer(training_ops)) train_op2 = train_op1 else: print('Initializing Variables...') sess.run(tf.global_variables_initializer()) print('3.Begin training...') print('max_epoch=%d, max_max_epoch=%d' % (FLAGS.max_epoch, FLAGS.max_max_epoch)) train_op = train_op2 for epoch in range(FLAGS.max_max_epoch): global_step = sess.run(model.global_step) print('Global step %d, lr=%g' % (global_step, sess.run(learning_rate))) if epoch == FLAGS.max_epoch: # update the embedding train_op = train_op1 train_fetches = [merged, model.loss, train_op, update_op] valid_fetches = [merged, model.loss] train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer) # 最后再做一次验证 valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('END.Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g' % ( sess.run(model.global_step), valid_cost, precision, recall, f1)) if f1 > last_f1: # save the better model saving_path = model.saver.save(sess, model_path, sess.run(model.global_step)+1) print('saved new model to %s ' % saving_path) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_5_bigru_cnn/__init__.py ================================================ # -*- coding:utf-8 -*- ================================================ FILE: zhihu-text-classification-master/models/wd_5_bigru_cnn/network.py ================================================ # -*- coding:utf-8 -*- import tensorflow as tf from tensorflow.contrib import rnn import tensorflow.contrib.layers as layers """wd_5_bigru_cnn 两部分使用不同的 embedding，因为RNN与CNN结构完全不同，共用embedding会降低性能。 title 部分使用 bigru+attention；content 部分使用 textcnn；两部分输出直接 concat。 """ class Settings(object): def __init__(self): self.model_name = 'wd_5_bigru_cnn' self.title_len = 30 self.content_len = 150 self.hidden_size = 256 self.n_layer = 1 self.filter_sizes = [2, 3, 4, 5, 7] self.n_filter = 256 self.fc_hidden_size = 1024 self.n_class = 1999 self.summary_path = '../../summary/' + self.model_name + '/' self.ckpt_path = '../../ckpt/' + self.model_name + '/' class BiGRU_CNN(object): """ title: inputs->bigru+attention->output_title content: inputs->textcnn->output_content concat[output_title, output_content] -> fc+bn+relu -> sigmoid_entropy. """ def __init__(self, W_embedding, settings): self.model_name = settings.model_name self.title_len = settings.title_len self.content_len = settings.content_len self.hidden_size = settings.hidden_size self.n_layer = settings.n_layer self.filter_sizes = settings.filter_sizes self.n_filter = settings.n_filter self.n_filter_total = self.n_filter * len(self.filter_sizes) self.n_class = settings.n_class self.fc_hidden_size = settings.fc_hidden_size self._global_step = tf.Variable(0, trainable=False, name='Global_Step') self.update_emas = list() # placeholders self._tst = tf.placeholder(tf.bool) self._keep_prob = tf.placeholder(tf.float32, []) self._batch_size = tf.placeholder(tf.int32, []) with tf.name_scope('Inputs'): self._X1_inputs = tf.placeholder(tf.int64, [None, self.title_len], name='X1_inputs') self._X2_inputs = tf.placeholder(tf.int64, [None, self.content_len], name='X2_inputs') self._y_inputs = tf.placeholder(tf.float32, [None, self.n_class], name='y_input') with tf.variable_scope('embedding'): self.title_embedding = tf.get_variable(name='title_embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.content_embedding = tf.get_variable(name='content_embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.embedding_size = W_embedding.shape[1] with tf.variable_scope('bigru_text'): output_title = self.bigru_inference(self._X1_inputs) with tf.variable_scope('cnn_content'): output_content = self.cnn_inference(self._X2_inputs, self.content_len) with tf.variable_scope('fc-bn-layer'): output = tf.concat([output_title, output_content], axis=1) W_fc = self.weight_variable([self.hidden_size*2 + self.n_filter_total, self.fc_hidden_size], name='Weight_fc') tf.summary.histogram('W_fc', W_fc) h_fc = tf.matmul(output, W_fc, name='h_fc') beta_fc = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.fc_hidden_size], name="beta_fc")) tf.summary.histogram('beta_fc', beta_fc) fc_bn, update_ema_fc = self.batchnorm(h_fc, beta_fc, convolutional=False) self.update_emas.append(update_ema_fc) self.fc_bn_relu = tf.nn.relu(fc_bn, name="relu") fc_bn_drop = tf.nn.dropout(self.fc_bn_relu, self.keep_prob) with tf.variable_scope('out_layer'): W_out = self.weight_variable([self.fc_hidden_size, self.n_class], name='Weight_out') tf.summary.histogram('Weight_out', W_out) b_out = self.bias_variable([self.n_class], name='bias_out') tf.summary.histogram('bias_out', b_out) self._y_pred = tf.nn.xw_plus_b(fc_bn_drop, W_out, b_out, name='y_pred') # 每个类别的分数 scores with tf.name_scope('loss'): self._loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(logits=self._y_pred, labels=self._y_inputs)) tf.summary.scalar('loss', self._loss) self.saver = tf.train.Saver(max_to_keep=1) @property def tst(self): return self._tst @property def keep_prob(self): return self._keep_prob @property def batch_size(self): return self._batch_size @property def global_step(self): return self._global_step @property def X1_inputs(self): return self._X1_inputs @property def X2_inputs(self): return self._X2_inputs @property def y_inputs(self): return self._y_inputs @property def y_pred(self): return self._y_pred @property def loss(self): return self._loss def weight_variable(self, shape, name): """Create a weight variable with appropriate initialization.""" initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial, name=name) def bias_variable(self, shape, name): """Create a bias variable with appropriate initialization.""" initial = tf.constant(0.1, shape=shape) return tf.Variable(initial, name=name) def batchnorm(self, Ylogits, offset, convolutional=False): """batchnormalization. Args: Ylogits: 1D向量或者是3D的卷积结果。 num_updates: 迭代的global_step offset：表示beta，全局均值；在 RELU 激活中一般初始化为 0.1。 scale：表示lambda，全局方差；在 sigmoid 激活中需要，这 RELU 激活中作用不大。 m: 表示batch均值；v:表示batch方差。 bnepsilon：一个很小的浮点数，防止除以 0. Returns: Ybn: 和 Ylogits 的维度一样，就是经过 Batch Normalization 处理的结果。 update_moving_everages：更新mean和variance，主要是给最后的 test 使用。 """ exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, self._global_step) # adding the iteration prevents from averaging across non-existing iterations bnepsilon = 1e-5 if convolutional: mean, variance = tf.nn.moments(Ylogits, [0, 1, 2]) else: mean, variance = tf.nn.moments(Ylogits, [0]) update_moving_everages = exp_moving_avg.apply([mean, variance]) m = tf.cond(self.tst, lambda: exp_moving_avg.average(mean), lambda: mean) v = tf.cond(self.tst, lambda: exp_moving_avg.average(variance), lambda: variance) Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon) return Ybn, update_moving_everages def gru_cell(self): with tf.name_scope('gru_cell'): cell = rnn.GRUCell(self.hidden_size, reuse=tf.get_variable_scope().reuse) return rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) def bi_gru(self, inputs): """build the bi-GRU network. 返回个所有层的隐含状态。""" cells_fw = [self.gru_cell() for _ in range(self.n_layer)] cells_bw = [self.gru_cell() for _ in range(self.n_layer)] initial_states_fw = [cell_fw.zero_state(self.batch_size, tf.float32) for cell_fw in cells_fw] initial_states_bw = [cell_bw.zero_state(self.batch_size, tf.float32) for cell_bw in cells_bw] outputs, _, _ = rnn.stack_bidirectional_dynamic_rnn(cells_fw, cells_bw, inputs, initial_states_fw=initial_states_fw, initial_states_bw=initial_states_bw, dtype=tf.float32) return outputs def task_specific_attention(self, inputs, output_size, initializer=layers.xavier_initializer(), activation_fn=tf.tanh, scope=None): """ Performs task-specific attention reduction, using learned attention context vector (constant within task of interest). Args: inputs: Tensor of shape [batch_size, units, input_size] `input_size` must be static (known) `units` axis will be attended over (reduced from output) `batch_size` will be preserved output_size: Size of output's inner (feature) dimension Returns: outputs: Tensor of shape [batch_size, output_dim]. """ assert len(inputs.get_shape()) == 3 and inputs.get_shape()[-1].value is not None with tf.variable_scope(scope or 'attention') as scope: # u_w, attention 向量 attention_context_vector = tf.get_variable(name='attention_context_vector', shape=[output_size], initializer=initializer, dtype=tf.float32) # 全连接层，把 h_i 转为 u_i ， shape= [batch_size, units, input_size] -> [batch_size, units, output_size] input_projection = layers.fully_connected(inputs, output_size, activation_fn=activation_fn, scope=scope) # 输出 [batch_size, units] vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector), axis=2, keep_dims=True) attention_weights = tf.nn.softmax(vector_attn, dim=1) tf.summary.histogram('attention_weigths', attention_weights) weighted_projection = tf.multiply(inputs, attention_weights) outputs = tf.reduce_sum(weighted_projection, axis=1) return outputs # 输出 [batch_size, hidden_size*2] def bigru_inference(self, X_inputs): inputs = tf.nn.embedding_lookup(self.title_embedding, X_inputs) output_bigru = self.bi_gru(inputs) output_att = self.task_specific_attention(output_bigru, self.hidden_size*2) return output_att def cnn_inference(self, X_inputs, n_step): """TextCNN 模型。 Args: X_inputs: tensor.shape=(batch_size, n_step) Returns: title_outputs: tensor.shape=(batch_size, self.n_filter_total) """ inputs = tf.nn.embedding_lookup(self.content_embedding, X_inputs) inputs = tf.expand_dims(inputs, -1) pooled_outputs = list() for i, filter_size in enumerate(self.filter_sizes): with tf.variable_scope("conv-maxpool-%s" % filter_size): # Convolution Layer filter_shape = [filter_size, self.embedding_size, 1, self.n_filter] W_filter = self.weight_variable(shape=filter_shape, name='W_filter') beta = self.bias_variable(shape=[self.n_filter], name='beta_filter') tf.summary.histogram('beta', beta) conv = tf.nn.conv2d(inputs, W_filter, strides=[1, 1, 1, 1], padding="VALID", name="conv") conv_bn, update_ema = self.batchnorm(conv, beta, convolutional=True) # Apply nonlinearity, batch norm scaling is not useful with relus h = tf.nn.relu(conv_bn, name="relu") # Maxpooling over the outputs pooled = tf.nn.max_pool(h, ksize=[1, n_step - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool") pooled_outputs.append(pooled) self.update_emas.append(update_ema) h_pool = tf.concat(pooled_outputs, 3) h_pool_flat = tf.reshape(h_pool, [-1, self.n_filter_total]) return h_pool_flat # shape = [batch_size, self.n_filter_total] # test the model def test(): import numpy as np print('Begin testing...') settings = Settings() W_embedding = np.random.randn(50, 10) config = tf.ConfigProto() config.gpu_options.allow_growth = True batch_size = 128 with tf.Session(config=config) as sess: model = BiGRU_CNN(W_embedding, settings) optimizer = tf.train.AdamOptimizer(0.001) train_op = optimizer.minimize(model.loss) update_op = tf.group(*model.update_emas) sess.run(tf.global_variables_initializer()) fetch = [model.loss, model.y_pred, train_op, update_op] loss_list = list() for i in xrange(100): X1_batch = np.zeros((batch_size, 30), dtype=float) X2_batch = np.zeros((batch_size, 150), dtype=float) y_batch = np.zeros((batch_size, 1999), dtype=int) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: 0.5} loss, y_pred, _, _ = sess.run(fetch, feed_dict=feed_dict) loss_list.append(loss) print(i, loss) if __name__ == '__main__': test() ================================================ FILE: zhihu-text-classification-master/models/wd_5_bigru_cnn/predict.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import time import network sys.path.append('../..') from evaluator import score_eval settings = network.Settings() title_len = settings.title_len model_name = settings.model_name ckpt_path = settings.ckpt_path local_scores_path = '../../local_scores/' scores_path = '../../scores/' if not os.path.exists(local_scores_path): os.makedirs(local_scores_path) if not os.path.exists(scores_path): os.makedirs(scores_path) embedding_path = '../../data/word_embedding.npy' data_valid_path = '../../data/wd-data/data_valid/' data_test_path = '../../data/wd-data/data_test/' va_batches = os.listdir(data_valid_path) te_batches = os.listdir(data_test_path) # batch 文件名列表 n_va_batches = len(va_batches) n_te_batches = len(te_batches) def get_batch(batch_id): """get a batch from valid data""" new_batch = np.load(data_valid_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def get_test_batch(batch_id): """get a batch from test data""" X_batch = np.load(data_test_path + str(batch_id) + '.npy') X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch] def local_predict(sess, model): """Test on the valid data.""" time0 = time.time() predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() predict_scores = list() for i in tqdm(xrange(n_va_batches)): [X1_batch, X2_batch, y_batch] = get_batch(i) marked_labels_list.extend(y_batch) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) print('Local valid p=%g, r=%g, f1=%g' % (precision, recall, f1)) predict_scores = np.vstack(np.asarray(predict_scores)) local_scores_name = local_scores_path + model_name + '.npy' np.save(local_scores_name, predict_scores) print('local_scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (local_scores_name, time.time() - time0)) def predict(sess, model): """Test on the test data.""" time0 = time.time() predict_scores = list() for i in tqdm(xrange(n_te_batches)): [X1_batch, X2_batch] = get_test_batch(i) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_scores = np.vstack(np.asarray(predict_scores)) scores_name = scores_path + model_name + '.npy' np.save(scores_name, predict_scores) print('scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (scores_name, time.time() - time0)) def main(_): if not os.path.exists(ckpt_path + 'checkpoint'): print('there is not saved model, please check the ckpt path') exit() print('Loading model...') W_embedding = np.load(embedding_path) config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.BiGRU_CNN(W_embedding, settings) model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) print('Local predicting...') local_predict(sess, model) print('Test predicting...') predict(sess, model) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_5_bigru_cnn/train.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import shutil import time import network sys.path.append('../..') from data_helpers import to_categorical from evaluator import score_eval flags = tf.flags flags.DEFINE_bool('is_retrain', False, 'if is_retrain is true, not rebuild the summary') flags.DEFINE_integer('max_epoch', 1, 'update the embedding after max_epoch, default: 1') flags.DEFINE_integer('max_max_epoch', 6, 'all training epoches, default: 6') flags.DEFINE_float('lr', 8e-4, 'initial learning rate, default: 8e-4') flags.DEFINE_float('decay_rate', 0.75, 'decay rate, default: 0.75') flags.DEFINE_float('keep_prob', 0.5, 'keep_prob for training, default: 0.5') # 正式 flags.DEFINE_integer('decay_step', 15000, 'decay_step, default: 15000') flags.DEFINE_integer('valid_step', 10000, 'valid_step, default: 10000') flags.DEFINE_float('last_f1', 0.38, 'if valid_f1 > last_f1, save new model. default: 0.40') # 测试 # flags.DEFINE_integer('decay_step', 1000, 'decay_step, default: 1000') # flags.DEFINE_integer('valid_step', 500, 'valid_step, default: 500') # flags.DEFINE_float('last_f1', 0.10, 'if valid_f1 > last_f1, save new model. default: 0.10') FLAGS = flags.FLAGS lr = FLAGS.lr last_f1 = FLAGS.last_f1 settings = network.Settings() title_len = settings.title_len summary_path = settings.summary_path ckpt_path = settings.ckpt_path model_path = ckpt_path + 'model.ckpt' embedding_path = '../../data/word_embedding.npy' data_train_path = '../../data/wd-data/data_train/' data_valid_path = '../../data/wd-data/data_valid/' tr_batches = os.listdir(data_train_path) # batch 文件名列表 va_batches = os.listdir(data_valid_path) n_tr_batches = len(tr_batches) n_va_batches = len(va_batches) # 测试 # n_tr_batches = 1000 # n_va_batches = 50 def get_batch(data_path, batch_id): """get a batch from data_path""" new_batch = np.load(data_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def valid_epoch(data_path, sess, model): """Test on the valid data.""" va_batches = os.listdir(data_path) n_va_batches = len(va_batches) _costs = 0.0 predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() for i in range(n_va_batches): [X1_batch, X2_batch, y_batch] = get_batch(data_path, i) marked_labels_list.extend(y_batch) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) fetches = [model.loss, model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} _cost, predict_labels = sess.run(fetches, feed_dict) _costs += _cost predict_labels = (map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) mean_cost = _costs / n_va_batches return mean_cost, precision, recall, f1 def train_epoch(data_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer): global last_f1 global lr time0 = time.time() batch_indexs = np.random.permutation(n_tr_batches) # shuffle the training data for batch in tqdm(range(n_tr_batches)): global_step = sess.run(model.global_step) if 0 == (global_step + 1) % FLAGS.valid_step: valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g, time=%g s' % ( global_step, valid_cost, precision, recall, f1, time.time() - time0)) time0 = time.time() if f1 > last_f1: last_f1 = f1 saving_path = model.saver.save(sess, model_path, global_step+1) print('saved new model to %s ' % saving_path) # training batch_id = batch_indexs[batch] [X1_batch, X2_batch, y_batch] = get_batch(data_train_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: FLAGS.keep_prob} summary, _cost, _, _ = sess.run(train_fetches, feed_dict) # the cost is the mean cost of one batch # valid per 500 steps if 0 == (global_step + 1) % 500: train_writer.add_summary(summary, global_step) batch_id = np.random.randint(0, n_va_batches) # 随机选一个验证batch [X1_batch, X2_batch, y_batch] = get_batch(data_valid_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} summary, _cost = sess.run(valid_fetches, feed_dict) test_writer.add_summary(summary, global_step) def main(_): global ckpt_path global last_f1 if not os.path.exists(ckpt_path): os.makedirs(ckpt_path) if not os.path.exists(summary_path): os.makedirs(summary_path) elif not FLAGS.is_retrain: # 重新训练本模型，删除以前的 summary shutil.rmtree(summary_path) os.makedirs(summary_path) if not os.path.exists(summary_path): os.makedirs(summary_path) print('1.Loading data...') W_embedding = np.load(embedding_path) print('training sample_num = %d' % n_tr_batches) print('valid sample_num = %d' % n_va_batches) # Initial or restore the model print('2.Building model...') config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.BiGRU_CNN(W_embedding, settings) with tf.variable_scope('training_ops') as vs: learning_rate = tf.train.exponential_decay(FLAGS.lr, model.global_step, FLAGS.decay_step, FLAGS.decay_rate, staircase=True) # two optimizer: op1, update embedding; op2, do not update embedding. with tf.variable_scope('Optimizer1'): tvars1 = tf.trainable_variables() grads1 = tf.gradients(model.loss, tvars1) optimizer1 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op1 = optimizer1.apply_gradients(zip(grads1, tvars1), global_step=model.global_step) with tf.variable_scope('Optimizer2'): tvars2 = [tvar for tvar in tvars1 if 'embedding' not in tvar.name] grads2 = tf.gradients(model.loss, tvars2) optimizer2 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op2 = optimizer2.apply_gradients(zip(grads2, tvars2), global_step=model.global_step) update_op = tf.group(*model.update_emas) merged = tf.summary.merge_all() # summary train_writer = tf.summary.FileWriter(summary_path + 'train', sess.graph) test_writer = tf.summary.FileWriter(summary_path + 'test') training_ops = [v for v in tf.global_variables() if v.name.startswith(vs.name+'/')] # 如果已经保存过模型，导入上次的模型 if os.path.exists(ckpt_path + "checkpoint"): print("Restoring Variables from Checkpoint...") model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) last_valid_cost, precision, recall, last_f1 = valid_epoch(data_valid_path, sess, model) print(' valid cost=%g; p=%g, r=%g, f1=%g' % (last_valid_cost, precision, recall, last_f1)) sess.run(tf.variables_initializer(training_ops)) train_op2 = train_op1 else: print('Initializing Variables...') sess.run(tf.global_variables_initializer()) print('3.Begin training...') print('max_epoch=%d, max_max_epoch=%d' % (FLAGS.max_epoch, FLAGS.max_max_epoch)) train_op = train_op1 for epoch in range(FLAGS.max_max_epoch): global_step = sess.run(model.global_step) print('Global step %d, lr=%g' % (global_step, sess.run(learning_rate))) if epoch == FLAGS.max_epoch: # update the embedding train_op = train_op1 train_fetches = [merged, model.loss, train_op, update_op] valid_fetches = [merged, model.loss] train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer) # 最后再做一次验证 valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('END.Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g' % ( sess.run(model.global_step), valid_cost, precision, recall, f1)) if f1 > last_f1: # save the better model saving_path = model.saver.save(sess, model_path, sess.run(model.global_step)+1) print('saved new model to %s ' % saving_path) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_6_rcnn/__init__.py ================================================ # -*- coding:utf-8 -*- ================================================ FILE: zhihu-text-classification-master/models/wd_6_rcnn/network.py ================================================ # -*- coding:utf-8 -*- import tensorflow as tf from tensorflow.contrib import rnn import tensorflow.contrib.layers as layers """wd_6_rcnn 在论文 Recurrent Convolutional Neural Networks for Text Classification 中。使用 BiRNN 处理，将每个时刻的隐藏状态和原输入拼起来，在进行 max_pooling 操作。这里有些不同，首先也是使用 bigru 得到每个时刻的，将每个时刻的隐藏状态和原输入拼起来；然后使用输入到 TextCNN 网络中。 """ class Settings(object): def __init__(self): self.model_name = "wd_6_rcnn" self.title_len = 30 self.content_len = 150 self.hidden_size = 256 self.n_layer = 1 self.filter_sizes = [2, 3, 4, 5, 7] self.n_filter = 256 self.fc_hidden_size = 1024 self.n_class = 1999 self.summary_path = '../../summary/' + self.model_name + '/' self.ckpt_path = '../../ckpt/' + self.model_name + '/' class RCNN(object): def __init__(self, W_embedding, settings): self.model_name = settings.model_name self.title_len = settings.title_len self.content_len = settings.content_len self.hidden_size = settings.hidden_size self.n_layer = settings.n_layer self.filter_sizes = settings.filter_sizes self.n_filter = settings.n_filter self.n_filter_total = self.n_filter * len(self.filter_sizes) self.n_class = settings.n_class self.fc_hidden_size = settings.fc_hidden_size self._global_step = tf.Variable(0, trainable=False, name='Global_Step') self.update_emas = list() # placeholders self._tst = tf.placeholder(tf.bool) self._keep_prob = tf.placeholder(tf.float32, []) self._batch_size = tf.placeholder(tf.int32, []) with tf.name_scope('Inputs'): self._X1_inputs = tf.placeholder(tf.int64, [None, self.title_len], name='X1_inputs') self._X2_inputs = tf.placeholder(tf.int64, [None, self.content_len], name='X2_inputs') self._y_inputs = tf.placeholder(tf.float32, [None, self.n_class], name='y_input') with tf.variable_scope('embedding'): self.embedding = tf.get_variable(name='embedding', shape=W_embedding.shape, initializer=tf.constant_initializer(W_embedding), trainable=True) self.embedding_size = W_embedding.shape[1] with tf.variable_scope('rcnn_text'): output_title = self.rcnn_inference(self._X1_inputs, self.title_len) with tf.variable_scope('rcnn_content'): output_content = self.rcnn_inference(self._X2_inputs, self.content_len) with tf.variable_scope('fc-bn-layer'): output = tf.concat([output_title, output_content], axis=1) W_fc = self.weight_variable([self.n_filter_total*2, self.fc_hidden_size], name='Weight_fc') tf.summary.histogram('W_fc', W_fc) h_fc = tf.matmul(output, W_fc, name='h_fc') beta_fc = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.fc_hidden_size], name="beta_fc")) tf.summary.histogram('beta_fc', beta_fc) fc_bn, update_ema_fc = self.batchnorm(h_fc, beta_fc, convolutional=False) self.update_emas.append(update_ema_fc) self.fc_bn_relu = tf.nn.relu(fc_bn, name="relu") fc_bn_drop = tf.nn.dropout(self.fc_bn_relu, self.keep_prob) with tf.variable_scope('out_layer'): W_out = self.weight_variable([self.fc_hidden_size, self.n_class], name='Weight_out') tf.summary.histogram('Weight_out', W_out) b_out = self.bias_variable([self.n_class], name='bias_out') tf.summary.histogram('bias_out', b_out) self._y_pred = tf.nn.xw_plus_b(fc_bn_drop, W_out, b_out, name='y_pred') # 每个类别的分数 scores with tf.name_scope('loss'): self._loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(logits=self._y_pred, labels=self._y_inputs)) tf.summary.scalar('loss', self._loss) self.saver = tf.train.Saver(max_to_keep=1) @property def tst(self): return self._tst @property def keep_prob(self): return self._keep_prob @property def batch_size(self): return self._batch_size @property def global_step(self): return self._global_step @property def X1_inputs(self): return self._X1_inputs @property def X2_inputs(self): return self._X2_inputs @property def y_inputs(self): return self._y_inputs @property def y_pred(self): return self._y_pred @property def loss(self): return self._loss def weight_variable(self, shape, name): """Create a weight variable with appropriate initialization.""" initial = tf.truncated_normal(shape, stddev=0.1) return tf.Variable(initial, name=name) def bias_variable(self, shape, name): """Create a bias variable with appropriate initialization.""" initial = tf.constant(0.1, shape=shape) return tf.Variable(initial, name=name) def batchnorm(self, Ylogits, offset, convolutional=False): """batchnormalization. Args: Ylogits: 1D向量或者是3D的卷积结果。 num_updates: 迭代的global_step offset：表示beta，全局均值；在 RELU 激活中一般初始化为 0.1。 scale：表示lambda，全局方差；在 sigmoid 激活中需要，这 RELU 激活中作用不大。 m: 表示batch均值；v:表示batch方差。 bnepsilon：一个很小的浮点数，防止除以 0. Returns: Ybn: 和 Ylogits 的维度一样，就是经过 Batch Normalization 处理的结果。 update_moving_everages：更新mean和variance，主要是给最后的 test 使用。 """ exp_moving_avg = tf.train.ExponentialMovingAverage(0.999, self._global_step) # adding the iteration prevents from averaging across non-existing iterations bnepsilon = 1e-5 if convolutional: mean, variance = tf.nn.moments(Ylogits, [0, 1, 2]) else: mean, variance = tf.nn.moments(Ylogits, [0]) update_moving_everages = exp_moving_avg.apply([mean, variance]) m = tf.cond(self.tst, lambda: exp_moving_avg.average(mean), lambda: mean) v = tf.cond(self.tst, lambda: exp_moving_avg.average(variance), lambda: variance) Ybn = tf.nn.batch_normalization(Ylogits, m, v, offset, None, bnepsilon) return Ybn, update_moving_everages def gru_cell(self): with tf.name_scope('gru_cell'): cell = rnn.GRUCell(self.hidden_size, reuse=tf.get_variable_scope().reuse) return rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob) def bi_gru(self, X_inputs): """build the bi-GRU network. Return the encoder represented vector. X_inputs: [batch_size, n_step] n_step: 句子的词数量；或者文档的句子数。 outputs: [fw_state, embeddings, bw_state], shape=[batch_size, hidden_size+embedding_size+hidden_size] """ inputs = tf.nn.embedding_lookup(self.embedding, X_inputs) # [batch_size, n_step, embedding_size] cells_fw = [self.gru_cell() for _ in range(self.n_layer)] cells_bw = [self.gru_cell() for _ in range(self.n_layer)] initial_states_fw = [cell_fw.zero_state(self.batch_size, tf.float32) for cell_fw in cells_fw] initial_states_bw = [cell_bw.zero_state(self.batch_size, tf.float32) for cell_bw in cells_bw] outputs, _, _ = rnn.stack_bidirectional_dynamic_rnn(cells_fw, cells_bw, inputs, initial_states_fw = initial_states_fw, initial_states_bw = initial_states_bw, dtype=tf.float32) hidden_outputs = tf.concat([outputs, inputs], axis=2) return hidden_outputs # shape =[seg_num, n_steps, hidden_size*2+embedding_size] def textcnn(self, cnn_inputs, n_step): """build the TextCNN network. Return the h_drop""" # cnn_inputs.shape = [batchsize, n_step, hidden_size*2+embedding_size] inputs = tf.expand_dims(cnn_inputs, -1) pooled_outputs = list() for i, filter_size in enumerate(self.filter_sizes): with tf.variable_scope("conv-maxpool-%s" % filter_size): # Convolution Layer filter_shape = [filter_size, self.hidden_size*2+self.embedding_size, 1, self.n_filter] W_filter = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W_filter") beta = tf.Variable(tf.constant(0.1, tf.float32, shape=[self.n_filter], name="beta")) tf.summary.histogram('beta', beta) conv = tf.nn.conv2d(inputs, W_filter, strides=[1, 1, 1, 1], padding="VALID", name="conv") conv_bn, update_ema = self.batchnorm(conv, beta, convolutional=True) # 在激活层前面加 BN # Apply nonlinearity, batch norm scaling is not useful with relus h = tf.nn.relu(conv_bn, name="relu") # Maxpooling over the outputs pooled = tf.nn.max_pool(h,ksize=[1, n_step - filter_size + 1, 1, 1], strides=[1, 1, 1, 1],padding='VALID',name="pool") pooled_outputs.append(pooled) self.update_emas.append(update_ema) h_pool = tf.concat(pooled_outputs, 3) h_pool_flat = tf.reshape(h_pool, [-1, self.n_filter_total]) return h_pool_flat # shape = [batch_size, n_filter_total] def rcnn_inference(self, X_inputs, n_step): output_bigru = self.bi_gru(X_inputs) output_cnn = self.textcnn(output_bigru, n_step) return output_cnn # shape = [batch_size, n_filter_total] # test the model def test(): import numpy as np print('Begin testing...') settings = Settings() W_embedding = np.random.randn(50, 10) config = tf.ConfigProto() config.gpu_options.allow_growth = True batch_size = 128 with tf.Session(config=config) as sess: model = RCNN(W_embedding, settings) optimizer = tf.train.AdamOptimizer(0.001) train_op = optimizer.minimize(model.loss) update_op = tf.group(*model.update_emas) sess.run(tf.global_variables_initializer()) fetch = [model.loss, model.y_pred, train_op, update_op] loss_list = list() for i in xrange(100): X1_batch = np.zeros((batch_size, 30), dtype=float) X2_batch = np.zeros((batch_size, 150), dtype=float) y_batch = np.zeros((batch_size, 1999), dtype=int) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: 0.5} loss, y_pred, _, _ = sess.run(fetch, feed_dict=feed_dict) loss_list.append(loss) print(i, loss) if __name__ == '__main__': test() ================================================ FILE: zhihu-text-classification-master/models/wd_6_rcnn/predict.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import time import network sys.path.append('../..') from evaluator import score_eval settings = network.Settings() title_len = settings.title_len model_name = settings.model_name ckpt_path = settings.ckpt_path local_scores_path = '../../local_scores/' scores_path = '../../scores/' if not os.path.exists(local_scores_path): os.makedirs(local_scores_path) if not os.path.exists(scores_path): os.makedirs(scores_path) embedding_path = '../../data/word_embedding.npy' data_valid_path = '../../data/wd-data/data_valid/' data_test_path = '../../data/wd-data/data_test/' va_batches = os.listdir(data_valid_path) te_batches = os.listdir(data_test_path) # batch 文件名列表 n_va_batches = len(va_batches) n_te_batches = len(te_batches) def get_batch(batch_id): """get a batch from valid data""" new_batch = np.load(data_valid_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def get_test_batch(batch_id): """get a batch from test data""" X_batch = np.load(data_test_path + str(batch_id) + '.npy') X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch] def local_predict(sess, model): """Test on the valid data.""" time0 = time.time() predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() predict_scores = list() for i in tqdm(xrange(n_va_batches)): [X1_batch, X2_batch, y_batch] = get_batch(i) marked_labels_list.extend(y_batch) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) print('Local valid p=%g, r=%g, f1=%g' % (precision, recall, f1)) predict_scores = np.vstack(np.asarray(predict_scores)) local_scores_name = local_scores_path + model_name + '.npy' np.save(local_scores_name, predict_scores) print('local_scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (local_scores_name, time.time() - time0)) def predict(sess, model): """Test on the test data.""" time0 = time.time() predict_scores = list() for i in tqdm(xrange(n_te_batches)): [X1_batch, X2_batch] = get_test_batch(i) _batch_size = len(X1_batch) fetches = [model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} predict_labels = sess.run(fetches, feed_dict)[0] predict_scores.append(predict_labels) predict_scores = np.vstack(np.asarray(predict_scores)) scores_name = scores_path + model_name + '.npy' np.save(scores_name, predict_scores) print('scores.shape=', predict_scores.shape) print('Writed the scores into %s, time %g s' % (scores_name, time.time() - time0)) def main(_): if not os.path.exists(ckpt_path + 'checkpoint'): print('there is not saved model, please check the ckpt path') exit() print('Loading model...') W_embedding = np.load(embedding_path) config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.RCNN(W_embedding, settings) model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) print('Local predicting...') local_predict(sess, model) print('Test predicting...') predict(sess, model) if __name__ == '__main__': tf.app.run() ================================================ FILE: zhihu-text-classification-master/models/wd_6_rcnn/train.py ================================================ # -*- coding:utf-8 -*- from __future__ import print_function from __future__ import division import tensorflow as tf import numpy as np from tqdm import tqdm import os import sys import shutil import time import network sys.path.append('../..') from data_helpers import to_categorical from evaluator import score_eval flags = tf.flags flags.DEFINE_bool('is_retrain', False, 'if is_retrain is true, not rebuild the summary') flags.DEFINE_integer('max_epoch', 1, 'update the embedding after max_epoch, default: 1') flags.DEFINE_integer('max_max_epoch', 6, 'all training epoches, default: 6') flags.DEFINE_float('lr', 8e-4, 'initial learning rate, default: 8e-4') flags.DEFINE_float('decay_rate', 0.75, 'decay rate, default: 0.75') flags.DEFINE_float('keep_prob', 0.5, 'keep_prob for training, default: 0.5') # 正式 flags.DEFINE_integer('decay_step', 15000, 'decay_step, default: 15000') flags.DEFINE_integer('valid_step', 10000, 'valid_step, default: 10000') flags.DEFINE_float('last_f1', 0.38, 'if valid_f1 > last_f1, save new model. default: 0.40') # 测试 # flags.DEFINE_integer('decay_step', 1000, 'decay_step, default: 1000') # flags.DEFINE_integer('valid_step', 500, 'valid_step, default: 500') # flags.DEFINE_float('last_f1', 0.10, 'if valid_f1 > last_f1, save new model. default: 0.10') FLAGS = flags.FLAGS lr = FLAGS.lr last_f1 = FLAGS.last_f1 settings = network.Settings() title_len = settings.title_len summary_path = settings.summary_path ckpt_path = settings.ckpt_path model_path = ckpt_path + 'model.ckpt' embedding_path = '../../data/word_embedding.npy' data_train_path = '../../data/wd-data/data_train/' data_valid_path = '../../data/wd-data/data_valid/' tr_batches = os.listdir(data_train_path) # batch 文件名列表 va_batches = os.listdir(data_valid_path) n_tr_batches = len(tr_batches) n_va_batches = len(va_batches) # 测试 # n_tr_batches = 1000 # n_va_batches = 50 def get_batch(data_path, batch_id): """get a batch from data_path""" new_batch = np.load(data_path + str(batch_id) + '.npz') X_batch = new_batch['X'] y_batch = new_batch['y'] X1_batch = X_batch[:, :title_len] X2_batch = X_batch[:, title_len:] return [X1_batch, X2_batch, y_batch] def valid_epoch(data_path, sess, model): """Test on the valid data.""" va_batches = os.listdir(data_path) n_va_batches = len(va_batches) _costs = 0.0 predict_labels_list = list() # 所有的预测结果 marked_labels_list = list() for i in range(n_va_batches): [X1_batch, X2_batch, y_batch] = get_batch(data_path, i) marked_labels_list.extend(y_batch) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) fetches = [model.loss, model.y_pred] feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} _cost, predict_labels = sess.run(fetches, feed_dict) _costs += _cost predict_labels = list(map(lambda label: label.argsort()[-1:-6:-1], predict_labels)) # 取最大的5个下标 predict_labels_list.extend(predict_labels) predict_label_and_marked_label_list = zip(predict_labels_list, marked_labels_list) precision, recall, f1 = score_eval(predict_label_and_marked_label_list) mean_cost = _costs / n_va_batches return mean_cost, precision, recall, f1 def train_epoch(data_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer): global last_f1 global lr time0 = time.time() batch_indexs = np.random.permutation(n_tr_batches) # shuffle the training data for batch in tqdm(range(n_tr_batches)): global_step = sess.run(model.global_step) if 0 == (global_step + 1) % FLAGS.valid_step: valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g, time=%g s' % ( global_step, valid_cost, precision, recall, f1, time.time() - time0)) time0 = time.time() if f1 > last_f1: last_f1 = f1 saving_path = model.saver.save(sess, model_path, global_step+1) print('saved new model to %s ' % saving_path) # training batch_id = batch_indexs[batch] [X1_batch, X2_batch, y_batch] = get_batch(data_train_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: False, model.keep_prob: FLAGS.keep_prob} summary, _cost, _, _ = sess.run(train_fetches, feed_dict) # the cost is the mean cost of one batch # valid per 500 steps if 0 == (global_step + 1) % 500: train_writer.add_summary(summary, global_step) batch_id = np.random.randint(0, n_va_batches) # 随机选一个验证batch [X1_batch, X2_batch, y_batch] = get_batch(data_valid_path, batch_id) y_batch = to_categorical(y_batch) _batch_size = len(y_batch) feed_dict = {model.X1_inputs: X1_batch, model.X2_inputs: X2_batch, model.y_inputs: y_batch, model.batch_size: _batch_size, model.tst: True, model.keep_prob: 1.0} summary, _cost = sess.run(valid_fetches, feed_dict) test_writer.add_summary(summary, global_step) def main(_): global ckpt_path global last_f1 if not os.path.exists(ckpt_path): os.makedirs(ckpt_path) if not os.path.exists(summary_path): os.makedirs(summary_path) elif not FLAGS.is_retrain: # 重新训练本模型，删除以前的 summary shutil.rmtree(summary_path) os.makedirs(summary_path) if not os.path.exists(summary_path): os.makedirs(summary_path) print('1.Loading data...') W_embedding = np.load(embedding_path) print('training sample_num = %d' % n_tr_batches) print('valid sample_num = %d' % n_va_batches) # Initial or restore the model print('2.Building model...') config = tf.ConfigProto() config.gpu_options.allow_growth = True with tf.Session(config=config) as sess: model = network.RCNN(W_embedding, settings) with tf.variable_scope('training_ops') as vs: learning_rate = tf.train.exponential_decay(FLAGS.lr, model.global_step, FLAGS.decay_step, FLAGS.decay_rate, staircase=True) # two optimizer: op1, update embedding; op2, do not update embedding. with tf.variable_scope('Optimizer1'): tvars1 = tf.trainable_variables() grads1 = tf.gradients(model.loss, tvars1) optimizer1 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op1 = optimizer1.apply_gradients(zip(grads1, tvars1), global_step=model.global_step) with tf.variable_scope('Optimizer2'): tvars2 = [tvar for tvar in tvars1 if 'embedding' not in tvar.name] grads2 = tf.gradients(model.loss, tvars2) optimizer2 = tf.train.AdamOptimizer(learning_rate=learning_rate) train_op2 = optimizer2.apply_gradients(zip(grads2, tvars2), global_step=model.global_step) update_op = tf.group(*model.update_emas) merged = tf.summary.merge_all() # summary train_writer = tf.summary.FileWriter(summary_path + 'train', sess.graph) test_writer = tf.summary.FileWriter(summary_path + 'test') training_ops = [v for v in tf.global_variables() if v.name.startswith(vs.name+'/')] # 如果已经保存过模型，导入上次的模型 if os.path.exists(ckpt_path + "checkpoint"): print("Restoring Variables from Checkpoint...") model.saver.restore(sess, tf.train.latest_checkpoint(ckpt_path)) last_valid_cost, precision, recall, last_f1 = valid_epoch(data_valid_path, sess, model) print(' valid cost=%g; p=%g, r=%g, f1=%g' % (last_valid_cost, precision, recall, last_f1)) sess.run(tf.variables_initializer(training_ops)) train_op2 = train_op1 else: print('Initializing Variables...') sess.run(tf.global_variables_initializer()) print('3.Begin training...') print('max_epoch=%d, max_max_epoch=%d' % (FLAGS.max_epoch, FLAGS.max_max_epoch)) train_op = train_op2 for epoch in range(FLAGS.max_max_epoch): global_step = sess.run(model.global_step) print('Global step %d, lr=%g' % (global_step, sess.run(learning_rate))) if epoch == FLAGS.max_epoch: # update the embedding train_op = train_op1 train_fetches = [merged, model.loss, train_op, update_op] valid_fetches = [merged, model.loss] train_epoch(data_train_path, sess, model, train_fetches, valid_fetches, train_writer, test_writer) # 最后再做一次验证 valid_cost, precision, recall, f1 = valid_epoch(data_valid_path, sess, model) print('END.Global_step=%d: valid cost=%g; p=%g, r=%g, f1=%g' % ( sess.run(model.global_step), valid_cost, precision, recall, f1)) if f1 > last_f1: # save the better model saving_path = model.saver.save(sess, model_path, sess.run(model.global_step)+1) print('saved new model to %s ' % saving_path) if __name__ == '__main__': tf.app.run()