Abstractive summarization with BERT as the encoder and a Transformer decoder
I have used a text generation library called Texar. It is a beautiful library with many useful abstractions; I would
call it the scikit-learn of text generation problems.
The main idea behind this architecture is transfer learning from BERT, a pretrained masked language model.
The encoder is replaced with the BERT encoder, while the decoder is trained from scratch.
One advantage of Transformer networks is that training is much faster than with LSTM-based models, since Transformer models eliminate the sequential behaviour.
Transformer-based models also generate more grammatically correct and coherent sentences.
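Because the decoder is trained from scratch, only the non-BERT variables receive gradient updates; model.py does this with a simple variable-name filter. A minimal, self-contained sketch of that idea:

```python
def non_bert_variables(variable_names):
    """Keep only variables whose name does not contain 'bert'.

    Mirrors the filter used in model.py to pass decoder-only
    variables to the training op.
    """
    return [name for name in variable_names if 'bert' not in name]

names = [
    'bert/word_embeddings/w',
    'bert/encoder/layer_0/attention/self/query/kernel',
    'transformer_decoder/layer_0/multihead_attention/query/kernel',
]
print(non_bert_variables(names))
```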
To run the model, first download and unzip the pretrained BERT-Base (uncased) checkpoint:
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
Place the story and summary files under the data folder with the following names:
-train_story.txt
-train_summ.txt
-eval_story.txt
-eval_summ.txt
Each story and each summary must be on a single line (see the sample text provided).
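For illustration, paired one-line-per-example files like these can be read as follows (`load_pairs` is a hypothetical helper, not part of the repo; preprocess.py does the equivalent while building examples and skips empty lines):

```python
def load_pairs(story_path, summ_path):
    """Pair each story line with its summary line, skipping empty pairs."""
    pairs = []
    with open(story_path, encoding="utf-8") as fs, \
         open(summ_path, encoding="utf-8") as ft:
        for story, summ in zip(fs, ft):
            story, summ = story.strip(), summ.strip()
            if story and summ:  # preprocess.py also skips empty examples
                pairs.append((story, summ))
    return pairs
```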
Step 1:
Run preprocessing
python preprocess.py
This creates two TFRecord files under the data folder.
Step 2:
python main.py
Configurations for the model can be changed in config.py.
Step 3:
Inference
Run the command python inference.py
This starts a Flask server.
Use Postman to send a POST request to http://your_ip_address:1118/results
with two form parameters: story and summary.
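Alternatively, the endpoint can be called from Python using only the standard library. A sketch, assuming the host/port above; the response format depends on inference.py:

```python
from urllib import parse, request

def build_payload(story, summary):
    """Encode the two form fields the Flask server expects."""
    return parse.urlencode({"story": story, "summary": summary}).encode()

def get_summary(story, summary, host="localhost", port=1118):
    req = request.Request(
        "http://%s:%d/results" % (host, port),
        data=build_payload(story, summary), method="POST")
    with request.urlopen(req) as resp:
        return resp.read().decode()
```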
================================================
FILE: config.py
================================================
import texar as tx

dcoder_config = {
    'dim': 768,
    'num_blocks': 6,
    'multihead_attention': {
        'num_heads': 8,
        'output_dim': 768
        # See documentation for more optional hyperparameters
    },
    'position_embedder_hparams': {
        'dim': 768
    },
    'initializer': {
        'type': 'variance_scaling_initializer',
        'kwargs': {
            'scale': 1.0,
            'mode': 'fan_avg',
            'distribution': 'uniform',
        },
    },
    'poswise_feedforward': tx.modules.default_transformer_poswise_net_hparams(
        output_dim=768)
}

loss_label_confidence = 0.9
random_seed = 1234
beam_width = 5
alpha = 0.6
hidden_dim = 768

opt = {
    'optimizer': {
        'type': 'AdamOptimizer',
        'kwargs': {
            'beta1': 0.9,
            'beta2': 0.997,
            'epsilon': 1e-9
        }
    }
}
# Warmup steps are typically ~10% of the total number of iterations
# (here 10,000 of max_train_steps = 100,000).
lr = {
    'learning_rate_schedule': 'constant.linear_warmup.rsqrt_decay.rsqrt_depth',
    'lr_constant': 2 * (hidden_dim ** -0.5),
    'static_lr': 1e-3,
    'warmup_steps': 10000,
}
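For intuition, the schedule named above behaves roughly like the following pure-Python sketch (an approximation of the Texar transformer example's `utils.get_lr`, not the exact implementation):

```python
def approx_lr(step, warmup_steps=10000, hidden_dim=768):
    # Linear warmup to lr_constant / sqrt(warmup_steps), then
    # inverse-square-root decay in the step number.
    step = max(step, 1)
    lr_constant = 2 * hidden_dim ** -0.5
    return lr_constant * min(1.0, step / warmup_steps) * max(step, warmup_steps) ** -0.5
```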
bos_token_id = 101  # BERT [CLS]
eos_token_id = 102  # BERT [SEP]
model_dir = "./models"
run_mode = "train_and_evaluate"
batch_size = 1
eval_batch_size = 1
max_train_steps = 100000
display_steps = 1
checkpoint_steps = 1000
eval_steps = 50000
max_decoding_length = 400
max_seq_length_src = 512
max_seq_length_tgt = 400
epochs = 10
is_distributed = False
data_dir = "data/"
train_out_file = "data/train.tf_record"
eval_out_file = "data/eval.tf_record"
train_story = "data/train_story.txt"
train_summ = "data/train_summ.txt"
eval_story = "data/eval_story.txt"
eval_summ = "data/eval_summ.txt"
bert_pretrain_dir = "./uncased_L-12_H-768_A-12"
================================================
FILE: data/eval_story.txt
================================================
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
================================================
FILE: data/eval_summ.txt
================================================
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
================================================
FILE: data/train_story.txt
================================================
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
================================================
FILE: data/train_summ.txt
================================================
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
The new question is that we know how many class data includes, but what if number of class is unknow in data. This is kind of like hyperparameter in KNN or regressions.
================================================
FILE: main.py
================================================
import sys
if 'texar_repo' not in sys.path:
    sys.path += ['texar_repo']

import os
import tensorflow as tf
import texar as tx
import numpy as np
from config import *
from model import *


def _train_epoch(sess, epoch, step, smry_writer):
    fetches = {
        'step': global_step,
        'train_op': train_op,
        'smry': summary_merged,
        'loss': mle_loss,
    }
    while True:
        try:
            feed_dict = {
                iterator.handle: iterator.get_handle(sess, 'train'),
                tx.global_mode(): tf.estimator.ModeKeys.TRAIN,
            }
            op = sess.run([batch], feed_dict)
            feed_dict = {
                src_input_ids: op[0]['src_input_ids'],
                src_segment_ids: op[0]['src_segment_ids'],
                tgt_input_ids: op[0]['tgt_input_ids'],
                labels: op[0]['tgt_labels'],
                learning_rate: utils.get_lr(step, lr),
                tx.global_mode(): tf.estimator.ModeKeys.TRAIN,
            }
            fetches_ = sess.run(fetches, feed_dict=feed_dict)
            step, loss = fetches_['step'], fetches_['loss']
            if step and step % display_steps == 0:
                logger.info('step: %d, loss: %.4f', step, loss)
                print('step: %d, loss: %.4f' % (step, loss))
                smry_writer.add_summary(fetches_['smry'], global_step=step)
            if step and step % checkpoint_steps == 0:
                model_path = model_dir + "/model_" + str(step) + ".ckpt"
                logger.info('saving model to %s', model_path)
                print('saving model to %s' % model_path)
                saver.save(sess, model_path)
            if step and step % eval_steps == 0:
                _eval_epoch(sess, epoch, mode='eval')
        except tf.errors.OutOfRangeError:
            break
    return step


def _eval_epoch(sess, epoch, mode):
    references, hypotheses = [], []
    fetches = {
        'inferred_ids': inferred_ids,
    }
    bno = 0
    while True:
        try:
            print("Batch", bno)
            feed_dict = {
                iterator.handle: iterator.get_handle(sess, 'eval'),
                tx.global_mode(): tf.estimator.ModeKeys.EVAL,
            }
            op = sess.run([batch], feed_dict)
            feed_dict = {
                src_input_ids: op[0]['src_input_ids'],
                src_segment_ids: op[0]['src_segment_ids'],
                tx.global_mode(): tf.estimator.ModeKeys.EVAL,
            }
            fetches_ = sess.run(fetches, feed_dict=feed_dict)
            labels = op[0]['tgt_labels']
            hypotheses.extend(h.tolist() for h in fetches_['inferred_ids'])
            references.extend(r.tolist() for r in labels)
            hypotheses = utils.list_strip_eos(hypotheses, eos_token_id)
            references = utils.list_strip_eos(references, eos_token_id)
            bno = bno + 1
        except tf.errors.OutOfRangeError:
            break

    if mode == 'eval':
        # Writes results to files to evaluate BLEU.
        # In 'eval' mode, BLEU is computed on token ids (rather than text
        # tokens) and serves only as a surrogate metric to monitor training.
        fname = os.path.join(model_dir, 'tmp.eval')
        hypotheses = tx.utils.str_join(hypotheses)
        references = tx.utils.str_join(references)
        hyp_fn, ref_fn = tx.utils.write_paired_text(
            hypotheses, references, fname, mode='s')
        eval_bleu = bleu_wrapper(ref_fn, hyp_fn, case_sensitive=True)
        eval_bleu = 100. * eval_bleu
        logger.info('epoch: %d, eval_bleu %.4f', epoch, eval_bleu)
        print('epoch: %d, eval_bleu %.4f' % (epoch, eval_bleu))
        if eval_bleu > best_results['score']:
            logger.info('epoch: %d, best bleu: %.4f', epoch, eval_bleu)
            best_results['score'] = eval_bleu
            best_results['epoch'] = epoch
            model_path = os.path.join(model_dir, 'best-model.ckpt')
            logger.info('saving model to %s', model_path)
            print('saving model to %s' % model_path)
            saver.save(sess, model_path)


tx.utils.maybe_create_dir(model_dir)
logging_file = os.path.join(model_dir, "logging.txt")
logger = utils.get_logger(logging_file)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    sess.run(tf.tables_initializer())
    smry_writer = tf.summary.FileWriter(model_dir, graph=sess.graph)

    if run_mode == 'train_and_evaluate':
        logger.info('Begin running with train_and_evaluate mode')
        if tf.train.latest_checkpoint(model_dir) is not None:
            logger.info('Restore latest checkpoint in %s' % model_dir)
            saver.restore(sess, tf.train.latest_checkpoint(model_dir))
        iterator.initialize_dataset(sess)
        step = 0
        for epoch in range(epochs):
            iterator.restart_dataset(sess, 'train')
            step = _train_epoch(sess, epoch, step, smry_writer)
    else:
        raise ValueError('Unknown mode: {}'.format(run_mode))
================================================
FILE: model.py
================================================
import sys
if 'texar_repo' not in sys.path:
    sys.path += ['texar_repo']

import os
import csv
import collections

import tensorflow as tf
import texar as tx

from config import *
from preprocess import file_based_input_fn_builder
from texar_repo.examples.bert.utils import model_utils, tokenization
from texar_repo.examples.bert import config_classifier as config_downstream
from texar_repo.texar.utils import transformer_utils
from texar_repo.examples.transformer.utils import data_utils, utils
from texar_repo.examples.transformer.bleu_tool import bleu_wrapper

train_dataset = file_based_input_fn_builder(
    input_file=train_out_file,
    max_seq_length_src=max_seq_length_src,
    max_seq_length_tgt=max_seq_length_tgt,
    is_training=True,
    drop_remainder=True,
    is_distributed=is_distributed)({'batch_size': batch_size})

eval_dataset = file_based_input_fn_builder(
    input_file=eval_out_file,
    max_seq_length_src=max_seq_length_src,
    max_seq_length_tgt=max_seq_length_tgt,
    is_training=True,
    drop_remainder=True,
    is_distributed=is_distributed)({'batch_size': eval_batch_size})

bert_config = model_utils.transform_bert_to_texar_config(
    os.path.join(bert_pretrain_dir, 'bert_config.json'))
tokenizer = tokenization.FullTokenizer(
    vocab_file=os.path.join(bert_pretrain_dir, 'vocab.txt'),
    do_lower_case=True)
vocab_size = len(tokenizer.vocab)

src_input_ids = tf.placeholder(tf.int64, shape=(None, None))
src_segment_ids = tf.placeholder(tf.int64, shape=(None, None))
tgt_input_ids = tf.placeholder(tf.int64, shape=(None, None))
tgt_segment_ids = tf.placeholder(tf.int64, shape=(None, None))

batch_size = tf.shape(src_input_ids)[0]
src_input_length = tf.reduce_sum(
    1 - tf.to_int32(tf.equal(src_input_ids, 0)), axis=1)
tgt_input_length = tf.reduce_sum(
    1 - tf.to_int32(tf.equal(tgt_input_ids, 0)), axis=1)

labels = tf.placeholder(tf.int64, shape=(None, None))
is_target = tf.to_float(tf.not_equal(labels, 0))

global_step = tf.Variable(0, dtype=tf.int64, trainable=False)
learning_rate = tf.placeholder(tf.float64, shape=(), name='lr')

iterator = tx.data.FeedableDataIterator({
    'train': train_dataset, 'eval': eval_dataset})
batch = iterator.get_next()

# Encoder: the BERT model
print("Initializing the BERT encoder graph")
with tf.variable_scope('bert'):
    embedder = tx.modules.WordEmbedder(
        vocab_size=bert_config.vocab_size,
        hparams=bert_config.embed)
    word_embeds = embedder(src_input_ids)

    # Creates segment embeddings for each type of token.
    segment_embedder = tx.modules.WordEmbedder(
        vocab_size=bert_config.type_vocab_size,
        hparams=bert_config.segment_embed)
    segment_embeds = segment_embedder(src_segment_ids)
    input_embeds = word_embeds + segment_embeds

    # The BERT model (a TransformerEncoder)
    encoder = tx.modules.TransformerEncoder(hparams=bert_config.encoder)
    encoder_output = encoder(input_embeds, src_input_length)

    # Builds layers for downstream classification, which are also
    # initialized from the BERT pre-trained checkpoint.
    with tf.variable_scope("pooler"):
        # Uses the projection of the first-step hidden vector of the
        # BERT output as the representation of the sentence.
        bert_sent_hidden = tf.squeeze(encoder_output[:, 0:1, :], axis=1)
        bert_sent_output = tf.layers.dense(
            bert_sent_hidden, config_downstream.hidden_dim,
            activation=tf.tanh)
        output = tf.layers.dropout(
            bert_sent_output, rate=0.1, training=tx.global_mode_train())

print("Loading the BERT pretrained weights")
# Loads pretrained BERT model parameters
init_checkpoint = os.path.join(bert_pretrain_dir, 'bert_model.ckpt')
model_utils.init_bert_checkpoint(init_checkpoint)

tgt_embedding = tf.concat(
    [tf.zeros(shape=[1, embedder.dim]), embedder.embedding[1:, :]], axis=0)

decoder = tx.modules.TransformerDecoder(embedding=tgt_embedding,
                                        hparams=dcoder_config)

# For training
outputs = decoder(
    memory=encoder_output,
    memory_sequence_length=src_input_length,
    inputs=embedder(tgt_input_ids),
    sequence_length=tgt_input_length,
    decoding_strategy='train_greedy',
    mode=tf.estimator.ModeKeys.TRAIN
)

mle_loss = transformer_utils.smoothing_cross_entropy(
    outputs.logits, labels, vocab_size, loss_label_confidence)
mle_loss = tf.reduce_sum(mle_loss * is_target) / tf.reduce_sum(is_target)

# Only the decoder (non-BERT) variables are trained from scratch.
tvars = tf.trainable_variables()
non_bert_vars = [var for var in tvars if 'bert' not in var.name]

train_op = tx.core.get_train_op(
    mle_loss,
    learning_rate=learning_rate,
    variables=non_bert_vars,
    global_step=global_step,
    hparams=opt)

tf.summary.scalar('lr', learning_rate)
tf.summary.scalar('mle_loss', mle_loss)
summary_merged = tf.summary.merge_all()

saver = tf.train.Saver(max_to_keep=5)
best_results = {'score': 0, 'epoch': -1}

# For inference
start_tokens = tf.fill([tx.utils.get_batch_size(src_input_ids)],
                       bos_token_id)
predictions = decoder(
    memory=encoder_output,
    memory_sequence_length=src_input_length,
    decoding_strategy='infer_greedy',
    beam_width=beam_width,
    alpha=alpha,
    start_tokens=start_tokens,
    end_token=eos_token_id,
    max_decoding_length=max_decoding_length,
    mode=tf.estimator.ModeKeys.PREDICT
)
if beam_width <= 1:
    inferred_ids = predictions[0].sample_id
else:
    # Uses the best sample from beam search
    inferred_ids = predictions['sample_id'][:, :, 0]
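The beam-search output has shape [batch, time, beam_width] with beams sorted best-first, so taking index 0 on the last axis selects the best hypothesis per step. A toy illustration with plain lists:

```python
# sample_id[batch][time][beam]: beams are sorted best-first.
sample_id = [
    [[7, 9, 4], [8, 5, 3], [102, 6, 2]],  # one example, 3 steps, beam_width=3
]
best = [[step[0] for step in example] for example in sample_id]
print(best)  # [[7, 8, 102]]
```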
================================================
FILE: models/logging.txt
================================================
2019-03-08 20:02:04,048:INFO:Begin running with train_and_evaluate mode
2019-03-08 20:03:50,512:INFO:Begin running with train_and_evaluate mode
2019-03-08 20:05:00,060:INFO:Begin running with train_and_evaluate mode
2019-03-08 20:08:14,915:INFO:Begin running with train_and_evaluate mode
2019-03-08 20:12:42,894:INFO:Begin running with train_and_evaluate mode
2019-03-08 20:22:29,211:INFO:Begin running with train_and_evaluate mode
2019-03-08 20:22:39,003:INFO:step: 1, loss: 11.7971
2019-03-08 20:22:42,072:INFO:step: 2, loss: 11.7444
2019-03-08 20:22:45,111:INFO:step: 3, loss: 11.6753
2019-03-08 20:22:48,523:INFO:step: 4, loss: 11.8856
2019-03-08 20:22:51,878:INFO:step: 5, loss: 11.7765
2019-03-08 20:22:55,144:INFO:step: 6, loss: 11.9311
2019-03-08 20:22:58,406:INFO:step: 7, loss: 11.8430
2019-03-08 20:23:01,664:INFO:step: 8, loss: 11.7669
2019-03-08 20:23:04,947:INFO:step: 9, loss: 11.7373
2019-03-08 20:23:08,286:INFO:step: 10, loss: 11.9579
2019-03-08 20:23:11,635:INFO:step: 11, loss: 11.5600
2019-03-08 20:23:15,028:INFO:step: 12, loss: 11.6753
2019-03-08 20:23:18,427:INFO:step: 13, loss: 11.5919
2019-03-08 20:23:21,835:INFO:step: 14, loss: 11.5611
2019-03-08 20:23:25,236:INFO:step: 15, loss: 11.3855
2019-03-08 20:23:28,616:INFO:step: 16, loss: 11.3497
2019-03-08 20:23:31,978:INFO:step: 17, loss: 11.3501
2019-03-08 20:23:35,314:INFO:step: 18, loss: 11.5671
2019-03-08 20:23:38,646:INFO:step: 19, loss: 11.3275
2019-03-08 20:23:41,964:INFO:step: 20, loss: 11.1347
2019-03-08 20:23:45,385:INFO:step: 21, loss: 11.2892
2019-03-08 20:23:48,854:INFO:step: 22, loss: 10.9162
2019-03-08 20:23:52,347:INFO:step: 23, loss: 11.0379
2019-03-08 20:23:55,775:INFO:step: 24, loss: 11.0149
2019-03-08 20:23:59,314:INFO:step: 25, loss: 10.7168
2019-03-08 20:24:02,892:INFO:step: 26, loss: 10.9317
2019-03-08 20:24:06,576:INFO:step: 27, loss: 10.8448
2019-03-08 20:24:10,262:INFO:step: 28, loss: 10.7415
2019-03-08 20:24:13,979:INFO:step: 29, loss: 10.8425
2019-03-08 20:24:17,663:INFO:step: 30, loss: 10.7316
2019-03-08 20:24:21,293:INFO:step: 31, loss: 10.6841
2019-03-08 20:24:24,977:INFO:step: 32, loss: 10.4804
2019-03-08 20:24:28,662:INFO:step: 33, loss: 10.2873
================================================
FILE: preprocess.py
================================================
import sys
if 'texar_repo' not in sys.path:
    sys.path += ['texar_repo']

import os
import csv
import collections

import tensorflow as tf

from config import *
from texar_repo.examples.bert.utils import model_utils, tokenization
from texar_repo.examples.transformer.utils import data_utils, utils


class InputExample():
    """A single training/test example for sequence-to-sequence summarization."""

    def __init__(self, guid, text_a, text_b=None):
        """Constructs an InputExample.

        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized source (story) text.
            text_b: (Optional) string. The untokenized target (summary) text.
        """
        self.guid = guid
        self.src_txt = text_a
        self.tgt_txt = text_b


class InputFeatures():
    """A single set of features of data."""

    def __init__(self, src_input_ids, src_input_mask, src_segment_ids,
                 tgt_input_ids, tgt_input_mask, tgt_labels):
        self.src_input_ids = src_input_ids
        self.src_input_mask = src_input_mask
        self.src_segment_ids = src_segment_ids
        self.tgt_input_ids = tgt_input_ids
        self.tgt_input_mask = tgt_input_mask
        self.tgt_labels = tgt_labels


class DataProcessor(object):
    """Base class for data converters for sequence data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for prediction."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab-separated value file."""
        with tf.gfile.Open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            return [line for line in reader]

    @classmethod
    def _read_file(cls, input_file, quotechar=None):
        """Reads a newline-delimited text file."""
        with tf.gfile.Open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\n", quotechar=quotechar)
            return [line for line in reader]
class CNNDailymail(DataProcessor):
    """Processor for story/summary pairs (e.g. the CNN/DailyMail data set)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_file(os.path.join(data_dir, "train_story.txt")),
            self._read_file(os.path.join(data_dir, "train_summ.txt")),
            "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_file(os.path.join(data_dir, "eval_story.txt")),
            self._read_file(os.path.join(data_dir, "eval_summ.txt")),
            "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_file(os.path.join(data_dir, "test_story.txt")),
            self._read_file(os.path.join(data_dir, "test_summ.txt")),
            "test")

    def _create_examples(self, src_lines, tgt_lines, set_type):
        examples = []
        for i, data in enumerate(zip(src_lines, tgt_lines)):
            guid = "%s-%s" % (set_type, i)
            if set_type == "test" and i == 0:
                continue
            # Skips examples where either side is empty.
            if len(data[0]) == 0 or len(data[1]) == 0:
                continue
            src_text = tokenization.convert_to_unicode(data[0][0])
            tgt_text = tokenization.convert_to_unicode(data[1][0])
            examples.append(InputExample(guid=guid, text_a=src_text,
                                         text_b=tgt_text))
        return examples
def file_based_convert_examples_to_features(
        examples, max_seq_length_src, max_seq_length_tgt, tokenizer,
        output_file):
    """Converts a set of `InputExample`s to a TFRecord file."""
    writer = tf.python_io.TFRecordWriter(output_file)
    for (ex_index, example) in enumerate(examples):
        if (ex_index + 1) % 1000 == 0:
            print("------------processed..{}...examples".format(ex_index))
        feature = convert_single_example(ex_index, example,
                                         max_seq_length_src,
                                         max_seq_length_tgt, tokenizer)

        def create_int_feature(values):
            return tf.train.Feature(
                int64_list=tf.train.Int64List(value=list(values)))

        features = collections.OrderedDict()
        features["src_input_ids"] = create_int_feature(feature.src_input_ids)
        features["src_input_mask"] = create_int_feature(feature.src_input_mask)
        features["src_segment_ids"] = create_int_feature(feature.src_segment_ids)
        features["tgt_input_ids"] = create_int_feature(feature.tgt_input_ids)
        features["tgt_input_mask"] = create_int_feature(feature.tgt_input_mask)
        features["tgt_labels"] = create_int_feature(feature.tgt_labels)

        tf_example = tf.train.Example(
            features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())
def convert_single_example(ex_index, example, max_seq_length_src,
                           max_seq_length_tgt, tokenizer):
    """Converts a single `InputExample` into a single `InputFeatures`."""
    tokens_a = tokenizer.tokenize(example.src_txt)
    tokens_b = tokenizer.tokenize(example.tgt_txt)

    # Truncates `tokens_a` and `tokens_b` so that the total length,
    # accounting for [CLS] and [SEP] with "- 2", fits the maximum
    # sequence lengths.
    if len(tokens_a) > max_seq_length_src - 2:
        tokens_a = tokens_a[0:(max_seq_length_src - 2)]
    if len(tokens_b) > max_seq_length_tgt - 2:
        tokens_b = tokens_b[0:(max_seq_length_tgt - 2)]

    tokens_src = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids_src = [0] * len(tokens_src)

    tokens_tgt = ["[CLS]"] + tokens_b + ["[SEP]"]
    segment_ids_tgt = []

    input_ids_src = tokenizer.convert_tokens_to_ids(tokens_src)
    input_ids_tgt = tokenizer.convert_tokens_to_ids(tokens_tgt)

    # Shifts the target: the decoder input drops the final token and
    # the labels drop the leading [CLS].
    labels_tgt = input_ids_tgt[1:]
    input_ids_tgt = input_ids_tgt[:-1]

    input_mask_src = [1] * len(input_ids_src)
    input_mask_tgt = [1] * len(input_ids_tgt)

    # Zero-pads up to the maximum sequence lengths.
    while len(input_ids_src) < max_seq_length_src:
        input_ids_src.append(0)
        input_mask_src.append(0)
        segment_ids_src.append(0)
    while len(input_ids_tgt) < max_seq_length_tgt:
        input_ids_tgt.append(0)
        input_mask_tgt.append(0)
        segment_ids_tgt.append(0)
        labels_tgt.append(0)

    feature = InputFeatures(
        src_input_ids=input_ids_src, src_input_mask=input_mask_src,
        src_segment_ids=segment_ids_src, tgt_input_ids=input_ids_tgt,
        tgt_input_mask=input_mask_tgt, tgt_labels=labels_tgt)
    return feature
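The target-side shift performed above pairs each decoder input position with the token it should predict: the decoder input drops the final token, the labels drop the leading [CLS]. A toy illustration (101/102 are BERT's [CLS]/[SEP] ids; the other ids are arbitrary):

```python
ids = [101, 2023, 2003, 1037, 102]   # [CLS] ... [SEP]
decoder_input = ids[:-1]             # what the decoder sees
labels = ids[1:]                     # what it must predict
assert len(decoder_input) == len(labels)
```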
def file_based_input_fn_builder(input_file, max_seq_length_src,max_seq_length_tgt, is_training,
drop_remainder, is_distributed=False):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
name_to_features = {
"src_input_ids": tf.FixedLenFeature([max_seq_length_src], tf.int64),
"src_input_mask": tf.FixedLenFeature([max_seq_length_src], tf.int64),
"src_segment_ids": tf.FixedLenFeature([max_seq_length_src], tf.int64),
"tgt_input_ids": tf.FixedLenFeature([max_seq_length_tgt], tf.int64),
"tgt_input_mask": tf.FixedLenFeature([max_seq_length_tgt], tf.int64),
"tgt_labels" : tf.FixedLenFeature([max_seq_length_tgt], tf.int64),
}
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
print(example)
print(example.keys())
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
d = tf.data.TFRecordDataset(input_file)
if is_training:
if is_distributed:
import horovod.tensorflow as hvd
tf.logging.info('distributed mode is enabled. '
'size:{} rank:{}'.format(hvd.size(), hvd.rank()))
# https://github.com/uber/horovod/issues/223
d = d.shard(hvd.size(), hvd.rank())
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size//hvd.size(),
drop_remainder=drop_remainder))
else:
tf.logging.info('distributed mode is not enabled.')
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
else:
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
return d
return input_fn
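`file_based_input_fn_builder` follows the TF `Estimator` convention: rather than building the dataset directly, it returns a closure that receives a `params` dict and reads `batch_size` from it at call time (which is how `get_dataset` below invokes it, passing `{'batch_size': batch_size}`). A minimal pure-Python analogue of that closure pattern, with hypothetical names and no TensorFlow, looks like this:

```python
def make_input_fn(records):
    """Return an input_fn closure in the Estimator style: the caller
    supplies batch_size via the params dict at invocation time."""
    def input_fn(params):
        bs = params["batch_size"]
        # Group the records into fixed-size batches (last one may be short).
        return [records[i:i + bs] for i in range(0, len(records), bs)]
    return input_fn

input_fn = make_input_fn(list(range(10)))
batches = input_fn({"batch_size": 4})
# batches -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```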
def get_dataset(processor,
tokenizer,
data_dir,
max_seq_length_src,
max_seq_length_tgt,
batch_size,
mode,
output_dir,
is_distributed=False):
"""
Args:
processor: Data preprocessor; must define get_labels and
get_train/dev/test_examples methods.
tokenizer: The sentence tokenizer; here, BERT's
FullTokenizer (WordPiece).
data_dir: The input data directory.
max_seq_length_src: Max source (story) sequence length.
max_seq_length_tgt: Max target (summary) sequence length.
batch_size: Mini-batch size.
mode: `train`, `eval` or `test`.
output_dir: The directory to save the TFRecords in.
"""
#label_list = processor.get_labels()
if mode == 'train':
train_examples = processor.get_train_examples(data_dir)
train_file = os.path.join(output_dir, "train.tf_record")
file_based_convert_examples_to_features(
train_examples, max_seq_length_src,max_seq_length_tgt,
tokenizer, train_file)
dataset = file_based_input_fn_builder(
input_file=train_file,
max_seq_length_src=max_seq_length_src,
max_seq_length_tgt =max_seq_length_tgt,
is_training=True,
drop_remainder=True,
is_distributed=is_distributed)({'batch_size': batch_size})
elif mode == 'eval':
eval_examples = processor.get_dev_examples(data_dir)
eval_file = os.path.join(output_dir, "eval.tf_record")
file_based_convert_examples_to_features(
eval_examples, max_seq_length_src,max_seq_length_tgt,
tokenizer, eval_file)
dataset = file_based_input_fn_builder(
input_file=eval_file,
max_seq_length_src=max_seq_length_src,
max_seq_length_tgt =max_seq_length_tgt,
is_training=False,
drop_remainder=True,
is_distributed=is_distributed)({'batch_size': batch_size})
elif mode == 'test':
test_examples = processor.get_test_examples(data_dir)
test_file = os.path.join(output_dir, "predict.tf_record")
file_based_convert_examples_to_features(
test_examples, max_seq_length_src,max_seq_length_tgt,
tokenizer, test_file)
dataset = file_based_input_fn_builder(
input_file=test_file,
max_seq_length_src=max_seq_length_src,
max_seq_length_tgt =max_seq_length_tgt,
is_training=False,
drop_remainder=True,
is_distributed=is_distributed)({'batch_size': batch_size})
return dataset
if __name__=="__main__":
tokenizer = tokenization.FullTokenizer(
vocab_file=os.path.join(bert_pretrain_dir, 'vocab.txt'),
do_lower_case=True)
vocab_size = len(tokenizer.vocab)
processor = CNNDailymail()
train_dataset = get_dataset(processor,tokenizer,data_dir,max_seq_length_src,max_seq_length_tgt,batch_size,'train',data_dir)
eval_dataset = get_dataset(processor,tokenizer,data_dir,max_seq_length_src,max_seq_length_tgt,eval_batch_size,'eval',data_dir)
================================================
FILE: texar_repo/.gitignore
================================================
# Created by https://www.gitignore.io/api/python
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# dotenv
.env
# virtualenv
.venv
venv/
ENV/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
### Linux ###
*~
# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*
# KDE directory preferences
.directory
# Linux trash folder which might appear on any partition or disk
.Trash-*
# .nfs files are created when an open file is removed but is still being accessed
.nfs*
### macOS ###
*.DS_Store
.AppleDouble
.LSOverride
# Icon must end with two \r
Icon
# Thumbnails
._*
# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent
# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
### Vim ###
# swap
[._]*.s[a-v][a-z]
[._]*.sw[a-p]
[._]s[a-v][a-z]
[._]sw[a-p]
# session
Session.vim
# temporary
.netrwhist
# auto-generated tag files
tags
### Emacs ###
# -*- mode: gitignore; -*-
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc
auto-save-list
tramp
.\#*
# Org-mode
.org-id-locations
*_archive
# flymake-mode
*_flymake.*
# eshell files
/eshell/history
/eshell/lastdir
# elpa packages
/elpa/
# reftex files
*.rel
# AUCTeX auto folder
/auto/
# cask packages
.cask/
# Flycheck
flycheck_*.el
# server auth directory
/server/
# projectiles files
.projectile
# directory configuration
.dir-locals.el
# Editors
.idea
.vscode
docs/_build
### IntelliJ ###
*.iml
### pytest ###
/.pytest_cache/
### Project ###
/data/
checkpoints/
/language_models/
/examples/language_model_ptb/simple-examples/
simple-examples.tgz
/examples/hierarchical_dialog/data/
/examples/sequence_tagging/data/
/examples/sequence_tagging/tmp/
/examples/sentence_classifier/data/
/examples/seq2seq_attn/data/
/examples/seq2seq_attn/data.zip
/examples/seq2seq_attn/iwslt14.zip
/examples/seq2seq_attn/toy_copy.zip
/examples/seq2seq_rl/data/
/examples/seq2seq_rl/data.zip
/examples/seq2seq_rl/iwslt14.zip
/examples/seq2seq_rl/toy_copy.zip
/examples/seq2seq_configs/data/
/examples/seq2seq_configs/data.zip
/examples/seq2seq_config/iwslt14.zip
/examples/seq2seq_config/toy_copy.zip
/examples/seq2seq_exposure_bias/data/
/examples/text_style_transfer/checkpoints/
/examples/text_style_transfer/samples/
/examples/text_style_transfer/data/
/examples/text_style_transfer/yelp.zip
/examples/vae_text/simple-examples/
/examples/vae_text/data/
/examples/transformer/data/
/examples/transformer/temp/
/examples/transformer/outputs/
/examples/bert/data/
!/examples/bert/data/download_glue_data.py
!/examples/bert/data/README.md
/examples/bert/bert_pretrained_models/
!/examples/bert/bert_pretrained_models/download_model.sh
/examples/bert/output
================================================
FILE: texar_repo/.pylintrc
================================================
[MASTER]
# Specify a configuration file.
#rcfile=
# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=
# Add files or directories to the blacklist. They should be base names, not
# paths.
ignore=CVS
# Add files or directories matching the regex patterns to the blacklist. The
# regex matches against base names, not paths.
ignore-patterns=
# Pickle collected data for later comparisons.
persistent=yes
# List of plugins (as comma separated values of python modules names) to load,
# usually to register additional checkers.
load-plugins=
# Use multiple processes to speed up Pylint.
jobs=1
# Allow loading of arbitrary C extensions. Extensions are imported into the
# active Python interpreter and may run arbitrary code.
unsafe-load-any-extension=no
# A comma-separated list of package or module names from where C extensions may
# be loaded. Extensions are loading into the active Python interpreter and may
# run arbitrary code
extension-pkg-whitelist=
# Allow optimization of some AST trees. This will activate a peephole AST
# optimizer, which will apply various small optimizations. For instance, it can
# be used to obtain the result of joining multiple strings with the addition
# operator. Joining a lot of strings can lead to a maximum recursion error in
# Pylint and this flag can prevent that. It has one side effect, the resulting
# AST will be different than the one from reality. This option is deprecated
# and it will be removed in Pylint 2.0.
optimize-ast=no
[MESSAGES CONTROL]
# Only show warnings with the listed confidence levels. Leave empty to show
# all. Valid levels: HIGH, INFERENCE, INFERENCE_FAILURE, UNDEFINED
confidence=
# Enable the message, report, category or checker with the given id(s). You can
# either give multiple identifier separated by comma (,) or put this option
# multiple time (only on the command line, not in the configuration file where
# it should appear only once). See also the "--disable" option for examples.
#enable=
# Disable the message, report, category or checker with the given id(s). You
# can either give multiple identifiers separated by comma (,) or put this
# option multiple times (only on the command line, not in the configuration
# file where it should appear only once).You can also use "--disable=all" to
# disable everything first and then reenable specific checks. For example, if
# you want to run only the similarities checker, you can use "--disable=all
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use"--disable=all --enable=classes
# --disable=W"
disable=print-statement,parameter-unpacking,unpacking-in-except,old-raise-syntax,backtick,import-star-module-level,apply-builtin,basestring-builtin,buffer-builtin,cmp-builtin,coerce-builtin,execfile-builtin,file-builtin,long-builtin,raw_input-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,no-absolute-import,old-division,dict-iter-method,dict-view-method,next-method-called,metaclass-assignment,indexing-exception,raising-string,reload-builtin,oct-method,hex-method,nonzero-method,cmp-method,input-builtin,round-builtin,intern-builtin,unichr-builtin,map-builtin-not-iterating,zip-builtin-not-iterating,range-builtin-not-iterating,filter-builtin-not-iterating,using-cmp-argument,long-suffix,old-ne-operator,old-octal-literal,suppressed-message,useless-suppression
[REPORTS]
# Set the output format. Available formats are text, parseable, colorized, msvs
# (visual studio) and html. You can also give a reporter class, eg
# mypackage.mymodule.MyReporterClass.
output-format=text
# Put messages in a separate file for each module / package specified on the
# command line instead of printing them on stdout. Reports (if any) will be
# written in a file name "pylint_global.[txt|html]". This option is deprecated
# and it will be removed in Pylint 2.0.
files-output=no
# Tells whether to display a full report or only the messages
reports=yes
# Python expression which should return a note less than 10 (10 is the highest
# note). You have access to the variables errors warning, statement which
# respectively contain the number of errors / warnings messages and the total
# number of statements analyzed. This is used by the global evaluation report
# (RP0004).
evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)
# Template used to display messages. This is a python new-style format string
# used to format the message information. See doc for all details
#msg-template=
[BASIC]
# Good variable names which should always be accepted, separated by a comma
good-names=i,j,k,ex,Run,_
# Bad variable names which should always be refused, separated by a comma
bad-names=foo,bar,baz,toto,tutu,tata
# Colon-delimited sets of names that determine each other's naming style when
# the name regexes allow several styles.
name-group=
# Include a hint for the correct naming format with invalid-name
include-naming-hint=no
# List of decorators that produce properties, such as abc.abstractproperty. Add
# to this list to register other decorators that produce valid properties.
property-classes=abc.abstractproperty
# Regular expression matching correct module names
module-rgx=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$
# Naming hint for module names
module-name-hint=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$
# Regular expression matching correct constant names
const-rgx=(([A-Z_][A-Z0-9_]*)|(__.*__))$
# Naming hint for constant names
const-name-hint=(([A-Z_][A-Z0-9_]*)|(__.*__))$
# Regular expression matching correct class names
class-rgx=[A-Z_][a-zA-Z0-9]+$
# Naming hint for class names
class-name-hint=[A-Z_][a-zA-Z0-9]+$
# Regular expression matching correct function names
function-rgx=[a-z_][a-z0-9_]{2,30}$
# Naming hint for function names
function-name-hint=[a-z_][a-z0-9_]{2,30}$
# Regular expression matching correct method names
method-rgx=[a-z_][a-z0-9_]{2,30}$
# Naming hint for method names
method-name-hint=[a-z_][a-z0-9_]{2,30}$
# Regular expression matching correct attribute names
attr-rgx=[a-z_][a-z0-9_]{2,30}$
# Naming hint for attribute names
attr-name-hint=[a-z_][a-z0-9_]{2,30}$
# Regular expression matching correct argument names
argument-rgx=[a-z_][a-z0-9_]{2,30}$
# Naming hint for argument names
argument-name-hint=[a-z_][a-z0-9_]{2,30}$
# Regular expression matching correct variable names
variable-rgx=[a-z_][a-z0-9_]{2,30}$
# Naming hint for variable names
variable-name-hint=[a-z_][a-z0-9_]{2,30}$
# Regular expression matching correct class attribute names
class-attribute-rgx=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$
# Naming hint for class attribute names
class-attribute-name-hint=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$
# Regular expression matching correct inline iteration names
inlinevar-rgx=[A-Za-z_][A-Za-z0-9_]*$
# Naming hint for inline iteration names
inlinevar-name-hint=[A-Za-z_][A-Za-z0-9_]*$
# Regular expression which should only match function or class names that do
# not require a docstring.
no-docstring-rgx=^_
# Minimum line length for functions/classes that require docstrings, shorter
# ones are exempt.
docstring-min-length=-1
[ELIF]
# Maximum number of nested blocks for function / method body
max-nested-blocks=5
[TYPECHECK]
# Tells whether missing members accessed in mixin class should be ignored. A
# mixin class is detected if its name ends with "mixin" (case insensitive).
ignore-mixin-members=yes
# List of module names for which member attributes should not be checked
# (useful for modules/projects where namespaces are manipulated during runtime
# and thus existing member attributes cannot be deduced by static analysis. It
# supports qualified module names, as well as Unix pattern matching.
ignored-modules=
# List of class names for which member attributes should not be checked (useful
# for classes with dynamically set attributes). This supports the use of
# qualified names.
ignored-classes=optparse.Values,thread._local,_thread._local
# List of members which are set dynamically and missed by pylint inference
# system, and so shouldn't trigger E1101 when accessed. Python regular
# expressions are accepted.
generated-members=
# List of decorators that produce context managers, such as
# contextlib.contextmanager. Add to this list to register other decorators that
# produce valid context managers.
contextmanager-decorators=contextlib.contextmanager
[SPELLING]
# Spelling dictionary name. Available dictionaries: none. To make it working
# install python-enchant package.
spelling-dict=
# List of comma separated words that should not be checked.
spelling-ignore-words=
# A path to a file that contains private dictionary; one word per line.
spelling-private-dict-file=
# Tells whether to store unknown words to indicated private dictionary in
# --spelling-private-dict-file option instead of raising a message.
spelling-store-unknown-words=no
[MISCELLANEOUS]
# List of note tags to take in consideration, separated by a comma.
notes=FIXME,XXX,TODO
[SIMILARITIES]
# Minimum lines number of a similarity.
min-similarity-lines=4
# Ignore comments when computing similarities.
ignore-comments=yes
# Ignore docstrings when computing similarities.
ignore-docstrings=yes
# Ignore imports when computing similarities.
ignore-imports=no
[VARIABLES]
# Tells whether we should check for unused import in __init__ files.
init-import=no
# A regular expression matching the name of dummy variables (i.e. expectedly
# not used).
dummy-variables-rgx=(_+[a-zA-Z0-9]*?$)|dummy
# List of additional names supposed to be defined in builtins. Remember that
# you should avoid to define new builtins when possible.
additional-builtins=
# List of strings which can identify a callback function by name. A callback
# name must start or end with one of those strings.
callbacks=cb_,_cb
# List of qualified module names which can have objects that can redefine
# builtins.
redefining-builtins-modules=six.moves,future.builtins
[LOGGING]
# Logging modules to check that the string format arguments are in logging
# function parameter format
logging-modules=logging
[FORMAT]
# Maximum number of characters on a single line.
max-line-length=80
# Regexp for a line that is allowed to be longer than the limit.
ignore-long-lines=^\s*(# )?<?https?://\S+>?$
# Allow the body of an if to be on the same line as the test if there is no
# else.
single-line-if-stmt=no
# List of optional constructs for which whitespace checking is disabled. `dict-
# separator` is used to allow tabulation in dicts, etc.: {1 : 1,\n222: 2}.
# `trailing-comma` allows a space between comma and closing bracket: (a, ).
# `empty-line` allows space-only lines.
no-space-check=trailing-comma,dict-separator
# Maximum number of lines in a module
max-module-lines=1000
# String used as indentation unit. This is usually " " (4 spaces) or "\t" (1
# tab).
indent-string=' '
# Number of spaces of indent required inside a hanging or continued line.
indent-after-paren=4
# Expected format of line ending, e.g. empty (any line ending), LF or CRLF.
expected-line-ending-format=
[DESIGN]
# Maximum number of arguments for function / method
max-args=5
# Argument names that match this expression will be ignored. Default to name
# with leading underscore
ignored-argument-names=_.*
# Maximum number of locals for function / method body
max-locals=15
# Maximum number of return / yield for function / method body
max-returns=6
# Maximum number of branch for function / method body
max-branches=12
# Maximum number of statements in function / method body
max-statements=50
# Maximum number of parents for a class (see R0901).
max-parents=7
# Maximum number of attributes for a class (see R0902).
max-attributes=7
# Minimum number of public methods for a class (see R0903).
min-public-methods=2
# Maximum number of public methods for a class (see R0904).
max-public-methods=20
# Maximum number of boolean expressions in a if statement
max-bool-expr=5
[CLASSES]
# List of method names used to declare (i.e. assign) instance attributes.
defining-attr-methods=__init__,__new__,setUp
# List of valid names for the first argument in a class method.
valid-classmethod-first-arg=cls
# List of valid names for the first argument in a metaclass class method.
valid-metaclass-classmethod-first-arg=mcs
# List of member names, which should be excluded from the protected access
# warning.
exclude-protected=_asdict,_fields,_replace,_source,_make
[IMPORTS]
# Deprecated modules which should not be used, separated by a comma
deprecated-modules=optparse
# Create a graph of every (i.e. internal and external) dependencies in the
# given file (report RP0402 must not be disabled)
import-graph=
# Create a graph of external dependencies in the given file (report RP0402 must
# not be disabled)
ext-import-graph=
# Create a graph of internal dependencies in the given file (report RP0402 must
# not be disabled)
int-import-graph=
# Force import order to recognize a module as part of the standard
# compatibility libraries.
known-standard-library=
# Force import order to recognize a module as part of a third party library.
known-third-party=enchant
# Analyse import fallback blocks. This can be used to support both Python 2 and
# 3 compatible code, which means that the block might have code that exists
# only in one or another interpreter, leading to false positives when analysed.
analyse-fallback-blocks=no
[EXCEPTIONS]
# Exceptions that will emit a warning when being caught. Defaults to
# "Exception"
overgeneral-exceptions=Exception
================================================
FILE: texar_repo/.travis.yml
================================================
sudo: required
language: python
python:
- "2.7"
- "3.5"
- "3.6"
install:
- pip install -e .[tensorflow-cpu]
- pip install flake8
before_script:
# stop the build if there are Python syntax errors or undefined names
- flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
# exit-zero treats all errors as warnings. Texar limits lines to a maximum of 80 chars.
- flake8 . --count --exit-zero --max-complexity=10 --max-line-length=80 --statistics
script:
# units test
- pytest
notifications:
email: false
================================================
FILE: texar_repo/CHANGELOG.md
================================================
## [Unreleased]
### New features
* [2019-01-02] Support distributed-GPU training. See the [example](https://github.com/asyml/texar/tree/master/examples/distributed_gpu)
* [2018-11-29] Support pre-trained BERT model. See the [example](https://github.com/asyml/texar/tree/master/examples/bert)
================================================
FILE: texar_repo/LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: texar_repo/README.md
================================================
-----------------
[Build Status](https://travis-ci.org/asyml/texar)
[Documentation Status](https://texar.readthedocs.io/en/latest/?badge=latest)
[License](https://github.com/asyml/texar/blob/master/LICENSE)
**Texar** is an open-source toolkit based on TensorFlow, aiming to support a broad set of machine learning tasks, especially **text generation tasks** such as machine translation, dialog, summarization, content manipulation, language modeling, and so on. Texar is designed for both researchers and practitioners for fast prototyping and experimentation.
With the design goals of **modularity, versatility, and extensibility** in mind, Texar extracts the common patterns underlying the diverse tasks and methodologies, creates a library of highly reusable modules and functionalities, and facilitates **arbitrary model architectures and algorithmic paradigms**, e.g.,
* encoder(s) to decoder(s), sequential- and self-attentions, memory, hierarchical models, classifiers...
* maximum likelihood learning, reinforcement learning, adversarial learning, probabilistic modeling, ...
With Texar, cutting-edge complex models can be easily constructed, freely enriched with best modeling/training practices, readily fitted into standard training/evaluation pipelines, and rapidly experimented with and evolved by, e.g., plugging in and swapping out different modules.
### Key Features
* **Versatility**. Texar contains a wide range of modules and functionalities for composing arbitrary model architectures and implementing various learning algorithms, as well as for data processing, evaluation, prediction, etc.
* **Modularity**. Texar decomposes diverse complex machine learning models/algorithms into a set of highly-reusable modules. In particular, model **architecture, losses, and learning processes** are fully decomposed.
Users can construct their own models at a high conceptual level, just like assembling building blocks. It is convenient to plug in or swap out modules, and to configure rich options for each module. For example, switching between maximum likelihood learning and reinforcement learning involves changing only a few lines of code.
* **Extensibility**. It is straightforward to integrate any user-customized, external modules. Also, Texar is fully compatible with the native TensorFlow interfaces and can take advantage of the rich TensorFlow features, and resources from the vibrant open-source community.
* Interfaces with different functionality levels. Users can customize a model through 1) simple **Python/YAML configuration files** of provided model templates/examples; 2) programming with **Python Library APIs** for maximal customizability.
* Easy-to-use APIs: 1) Convenient automatic variable re-use---no worry about the complicated TF variable scopes; 2) PyTorch-like callable modules; 3) Rich configuration options for each module, all with default values; ...
* Well-structured high-quality code of uniform design patterns and consistent styles.
* Clean, detailed [documentation](https://texar.readthedocs.io) and rich [examples](./examples).
* **Distributed model training** with multiple GPUs.
### Library API Example
Builds a (self-)attentional sequence encoder-decoder model, with different learning algorithms:
```python
import texar as tx
# Data
data = tx.data.PairedTextData(hparams=hparams_data) # Hyperparameter configs in `hparams`
iterator = tx.data.DataIterator(data)
batch = iterator.get_next() # A data mini-batch
# Model architecture
embedder = tx.modules.WordEmbedder(data.target_vocab.size, hparams=hparams_emb)
encoder = tx.modules.TransformerEncoder(hparams=hparams_encoder)
outputs_enc = encoder(inputs=embedder(batch['source_text_ids']),
sequence_length=batch['source_length'])
decoder = tx.modules.AttentionRNNDecoder(memory=outputs_enc,
memory_sequence_length=batch['source_length'],
hparams=hparams_decoder)
outputs, _, _ = decoder(inputs=embedder(batch['target_text_ids']),
sequence_length=batch['target_length']-1)
# Loss for maximum likelihood learning
loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=batch['target_text_ids'][:, 1:],
logits=outputs.logits,
sequence_length=batch['target_length']-1) # Automatic masks
# Beam search decoding
outputs_bs, _, _ = tx.modules.beam_search_decode(
decoder,
embedding=embedder,
start_tokens=[data.target_vocab.bos_token_id]*num_samples,
end_token=data.target_vocab.eos_token_id)
```
```python
# Policy gradient agent for RL learning
agent = tx.agents.SeqPGAgent(samples=outputs.sample_id,
logits=outputs.logits,
sequence_length=batch['target_length']-1,
hparams=config_model.agent)
```
Many more examples are available [here](./examples).
### Installation
```
git clone https://github.com/asyml/texar.git
cd texar
pip install -e .
```
### Getting Started
* [Examples](./examples)
* [Documentation](https://texar.readthedocs.io)
### Reference
If you use Texar, please cite the [report](https://arxiv.org/abs/1809.00794) with the following BibTeX entry:
```
Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation
Zhiting Hu, Haoran Shi, Zichao Yang, Bowen Tan, Tiancheng Zhao, Junxian He, Wentao Wang, Lianhui Qin, Di Wang, Xuezhe Ma, Hector Liu, Xiaodan Liang, Wanrong Zhu, Devendra Singh Sachan, Eric P. Xing
2018
@article{hu2018texar,
title={Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation},
author={Hu, Zhiting and Shi, Haoran and Yang, Zichao and Tan, Bowen and Zhao, Tiancheng and He, Junxian and Wang, Wentao and Qin, Lianhui and Wang, Di and others},
journal={arXiv preprint arXiv:1809.00794},
year={2018}
}
```
### License
[Apache License 2.0](./LICENSE)
================================================
FILE: texar_repo/bin/average_checkpoints.py
================================================
"""Checkpoint averaging script."""
# This script is a modified version of
# https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t_avg_all.py
# which comes with the following license and copyright notice:
# Copyright 2017 The Tensor2Tensor Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import six
import tensorflow as tf
import numpy as np
def main():
tf.logging.set_verbosity(tf.logging.INFO)
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--model_dir", required=True,
help="The model directory containing the checkpoints.")
parser.add_argument("--output_dir", required=True,
help="The output directory where the averaged checkpoint will be saved.")
parser.add_argument("--max_count", type=int, default=8,
help="The maximal number of checkpoints to average.")
args = parser.parse_args()
if args.model_dir == args.output_dir:
raise ValueError("Model and output directory must be different")
checkpoints_path = tf.train.get_checkpoint_state(args.model_dir).all_model_checkpoint_paths
if len(checkpoints_path) > args.max_count:
checkpoints_path = checkpoints_path[-args.max_count:]
num_checkpoints = len(checkpoints_path)
tf.logging.info("Averaging %d checkpoints..." % num_checkpoints)
tf.logging.info("Listing variables...")
var_list = tf.train.list_variables(checkpoints_path[0])
avg_values = {}
for name, shape in var_list:
if not name.startswith("global_step"):
avg_values[name] = np.zeros(shape)
for checkpoint_path in checkpoints_path:
tf.logging.info("Loading checkpoint %s" % checkpoint_path)
reader = tf.train.load_checkpoint(checkpoint_path)
for name in avg_values:
avg_values[name] += reader.get_tensor(name) / num_checkpoints
tf_vars = []
for name, value in six.iteritems(avg_values):
tf_vars.append(tf.get_variable(name, shape=value.shape))
placeholders = [tf.placeholder(v.dtype, shape=v.shape) for v in tf_vars]
assign_ops = [tf.assign(v, p) for (v, p) in zip(tf_vars, placeholders)]
latest_step = int(checkpoints_path[-1].split("-")[-1])
out_base_file = os.path.join(args.output_dir, "model.ckpt")
global_step = tf.get_variable(
"global_step",
initializer=tf.constant(latest_step, dtype=tf.int64),
trainable=False)
saver = tf.train.Saver(tf.global_variables())
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for p, assign_op, (name, value) in zip(placeholders, assign_ops, six.iteritems(avg_values)):
sess.run(assign_op, {p: value})
tf.logging.info("Saving averaged checkpoint to %s-%d" % (out_base_file, latest_step))
saver.save(sess, out_base_file, global_step=global_step)
if __name__ == "__main__":
main()
================================================
FILE: texar_repo/bin/train.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Main script for model training.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tempfile
import yaml
import tensorflow as tf
from texar import utils
from texar.run import Executor
tf.flags.DEFINE_string("config_paths", "",
"Paths to configuration files. This can be a path to a "
"directory in which all files are loaded, or paths to "
"multiple files separated by commas. Setting a key in "
"these files is equivalent to setting the FLAG value "
"with the same name. If a key is set in both config "
"files and FLAG, the value in config files is used.")
tf.flags.DEFINE_string("model", "",
"Name of the model class.")
tf.flags.DEFINE_string("model_hparams", "{}",
"YAML configuration string for the model "
"hyper-parameters.")
tf.flags.DEFINE_string("data_hparams_train", "{}",
"YAML configuration string for the training data "
"hyper-parameters.")
tf.flags.DEFINE_string("data_hparams_eval", "{}",
"YAML configuration string for the evaluation data "
"hyper-parameters.")
tf.flags.DEFINE_integer("max_train_steps", None,
"Maximum number of training steps to run. "
"If None, train forever or until the train data "
"generates the OutOfRange exception. If OutOfRange "
"occurs in the middle, training stops before "
"max_train_steps steps.")
tf.flags.DEFINE_integer("eval_steps", None,
"Maximum number of evaluation steps to run. "
"If None, evaluate until the eval data raises an "
"OutOfRange exception.")
# RunConfig
tf.flags.DEFINE_string("model_dir", None,
"The directory where model parameters, graph, "
"summaries, etc. are saved. If None, a local temporary "
"directory is created.")
tf.flags.DEFINE_integer("tf_random_seed", None,
"Random seed for TensorFlow initializers. Setting "
"this value allows consistency between reruns.")
tf.flags.DEFINE_integer("save_summary_steps", 100,
"Save summaries every this many steps.")
tf.flags.DEFINE_integer("save_checkpoints_steps", None,
"Save checkpoints every this many steps. "
"Can not be specified with save_checkpoints_secs.")
tf.flags.DEFINE_integer("save_checkpoints_secs", None,
"Save checkpoints every this many seconds. "
"Can not be specified with save_checkpoints_steps. "
"Defaults to 600 seconds if both "
"save_checkpoints_steps and save_checkpoints_secs "
"are not set. If both are set to -1, then "
"checkpoints are disabled.")
tf.flags.DEFINE_integer("keep_checkpoint_max", 5,
"Maximum number of recent checkpoint files to keep. "
"As new files are created, older files are deleted. "
"If None or 0, all checkpoint files are kept.")
tf.flags.DEFINE_integer("keep_checkpoint_every_n_hours", 10000,
"Number of hours between each checkpoint to be saved. "
"The default value of 10,000 hours effectively "
"disables the feature.")
tf.flags.DEFINE_integer("log_step_count_steps", 100,
"The frequency, in number of global steps, that the "
"global step/sec and the loss will be logged during "
"training.")
# Session config
tf.flags.DEFINE_float("per_process_gpu_memory_fraction", 1.0,
"Fraction of the available GPU memory to allocate for "
"each process.")
tf.flags.DEFINE_boolean("gpu_allow_growth", False,
"If true, the allocator does not pre-allocate the "
"entire specified GPU memory region, instead starting "
"small and growing as needed.")
tf.flags.DEFINE_boolean("log_device_placement", False,
"Whether device placements should be logged.")
FLAGS = tf.flags.FLAGS
def _process_config():
# Loads configs
config = utils.load_config(FLAGS.config_paths)
# Parses YAML FLAGS
FLAGS.model_hparams = yaml.load(FLAGS.model_hparams)
FLAGS.data_hparams_train = yaml.load(FLAGS.data_hparams_train)
FLAGS.data_hparams_eval = yaml.load(FLAGS.data_hparams_eval)
# Merges
final_config = {}
for flag_key in dir(FLAGS):
if flag_key in {'h', 'help', 'helpshort'}: # Filters out help flags
continue
flag_value = getattr(FLAGS, flag_key)
config_value = config.get(flag_key, None)
if isinstance(flag_value, dict) and isinstance(config_value, dict):
final_config[flag_key] = utils.dict_patch(config_value, flag_value)
elif flag_key in config:
final_config[flag_key] = config_value
else:
final_config[flag_key] = flag_value
# Processes
if final_config['model_dir'] is None:
final_config['model_dir'] = tempfile.mkdtemp()
if final_config['save_checkpoints_steps'] is None \
and final_config['save_checkpoints_secs'] is None:
final_config['save_checkpoints_secs'] = 600
if final_config['save_checkpoints_steps'] == -1 \
and final_config['save_checkpoints_secs'] == -1:
final_config['save_checkpoints_steps'] = None
final_config['save_checkpoints_secs'] = None
tf.logging.info("Final Config:\n%s", yaml.dump(final_config))
return final_config
def _get_run_config(config):
gpu_options = tf.GPUOptions(
per_process_gpu_memory_fraction=\
config['per_process_gpu_memory_fraction'],
allow_growth=config['gpu_allow_growth'])
sess_config = tf.ConfigProto(
gpu_options=gpu_options,
log_device_placement=config['log_device_placement'])
run_config = tf.estimator.RunConfig(
model_dir=config['model_dir'],
tf_random_seed=config['tf_random_seed'],
save_summary_steps=config['save_summary_steps'],
save_checkpoints_steps=config['save_checkpoints_steps'],
save_checkpoints_secs=config['save_checkpoints_secs'],
keep_checkpoint_max=config['keep_checkpoint_max'],
keep_checkpoint_every_n_hours=config['keep_checkpoint_every_n_hours'],
log_step_count_steps=config['log_step_count_steps'],
session_config=sess_config)
return run_config
def main(_):
"""The entrypoint."""
config = _process_config()
run_config = _get_run_config(config)
kwargs = {
'data_hparams': config['data_hparams_train'],
'hparams': config['model_hparams']
}
model = utils.check_or_get_instance_with_redundant_kwargs(
config['model'], kwargs=kwargs,
module_paths=['texar.models', 'texar.custom'])
data_hparams = {
'train': config['data_hparams_train'],
'eval': config['data_hparams_eval']
}
exor = Executor(
model=model,
data_hparams=data_hparams,
config=run_config)
exor.train_and_evaluate(
max_train_steps=config['max_train_steps'],
eval_steps=config['eval_steps'])
if __name__ == "__main__":
tf.logging.set_verbosity(tf.logging.INFO)
tf.app.run(main=main)
================================================
FILE: texar_repo/bin/utils/README.md
================================================
This directory contains several utilities for, e.g., data pre-processing.
Instructions for using the BPE and WPM encodings are as follows.
See [examples/transformer](https://github.com/asyml/texar/tree/master/examples/transformer)
for a real example of using these encodings.
### *[Byte Pair Encoding (BPE)](https://arxiv.org/abs/1508.07909)* pipeline
* Add `bin` directory to `PATH` env variable
```bash
TEXAR=$(pwd)
export PATH=$PATH:$TEXAR/bin
```
* Learning BPE vocab on source and target combined
```bash
cat train.src train.trg | learn_bpe -s 32000 > bpe-codes.32000
```
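Conceptually, `learn_bpe` repeatedly merges the most frequent adjacent symbol pair in the training vocabulary. The following toy sketch shows that loop; it is illustrative only (the real script also handles end-of-word markers, frequency pruning, and incremental pair statistics), and `learn_bpe_toy` is a name invented here:

```python
from collections import Counter

def learn_bpe_toy(words, num_merges):
    """Toy BPE learner: greedily merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Rewrite every word with the merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe_toy(["ab", "ab", "ab", "bc"], 2))  # [('a', 'b'), ('b', 'c')]
```

The learned merge list plays the role of the `bpe-codes.32000` file produced by the real `learn_bpe` command above.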
* Applying BPE on source and target files
```bash
apply_bpe -c bpe-codes.32000 < train.src > train.src.bpe
apply_bpe -c bpe-codes.32000 < train.trg > train.trg.bpe
apply_bpe -c bpe-codes.32000 < dev.src > dev.src.bpe
apply_bpe -c bpe-codes.32000 < dev.trg > dev.trg.bpe
apply_bpe -c bpe-codes.32000 < test.src > test.src.bpe
```
* BPE decoding target to match with references
```bash
mv test.out test.out.bpe
cat test.out.bpe | sed -E 's/(@@ )|(@@ ?$)//g' > test.out
```
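If post-processing happens in Python rather than the shell, the same substitution can be written with `re` (the `bpe_decode` helper name is invented here; it mirrors the sed command above with the default `@@` separator):

```python
import re

def bpe_decode(line):
    """Remove BPE continuation markers; same regex as the sed command above."""
    return re.sub(r'(@@ )|(@@ ?$)', '', line)

print(bpe_decode('the un@@ reach@@ able city'))  # the unreachable city
```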
##### Evaluate Using t2t-bleu
```bash
t2t-bleu --translation=test.out --reference=test.tgt
```
### Word Piece Model (WPM) pipeline
* This requires installation of the [sentencepiece](https://github.com/google/sentencepiece#python-module) library
```bash
pip install sentencepiece
```
* Learning Word Piece on source and target combined
```bash
spm_train --input=train.src,train.tgt --vocab_size 32000 --model_prefix=wpm-codes
```
* Applying Word Piece on source and target
```bash
spm_encode --model wpm-codes.model --output_format=id < train.src > train.src.wpm
spm_encode --model wpm-codes.model --output_format=id < train.tgt > train.tgt.wpm
spm_encode --model wpm-codes.model --output_format=id < valid.src > valid.src.wpm
spm_encode --model wpm-codes.model --output_format=id < valid.tgt > valid.tgt.wpm
spm_encode --model wpm-codes.model --output_format=id < test.src > test.src.wpm
```
* WPM decoding/detokenising target to match with references
```bash
mv test.out test.out.wpm
spm_decode --model wpm-codes.model --input_format=id < test.out.wpm > test.out
```
```
================================================
FILE: texar_repo/bin/utils/apply_bpe
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich
# flake8: noqa
"""Use operations learned with learn_bpe to encode a new text.
The text will not be smaller, but use only a fixed vocabulary, with rare words
encoded as variable-length sequences of subword units.
Reference:
Rico Sennrich, Barry Haddow and Alexandra Birch (2015). Neural Machine Translation of Rare Words with Subword Units.
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
"""
# This file is retrieved from https://github.com/rsennrich/subword-nmt
from __future__ import unicode_literals, division
import sys
import codecs
import io
import argparse
import json
import re
from collections import defaultdict
# hack for python2/3 compatibility
from io import open
argparse.open = open
class BPE(object):
def __init__(self, codes, separator='@@', vocab=None, glossaries=None):
# check version information
firstline = codes.readline()
if firstline.startswith('#version:'):
self.version = tuple([int(x) for x in re.sub(r'(\.0+)*$','', firstline.split()[-1]).split(".")])
else:
self.version = (0, 1)
codes.seek(0)
self.bpe_codes = [tuple(item.split()) for item in codes]
# some hacking to deal with duplicates (only consider first instance)
self.bpe_codes = dict([(code,i) for (i,code) in reversed(list(enumerate(self.bpe_codes)))])
self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in self.bpe_codes.items()])
self.separator = separator
self.vocab = vocab
self.glossaries = glossaries if glossaries else []
self.cache = {}
def segment(self, sentence):
"""segment single sentence (whitespace-tokenized string) with BPE encoding"""
output = []
for word in sentence.split():
new_word = [out for segment in self._isolate_glossaries(word)
for out in encode(segment,
self.bpe_codes,
self.bpe_codes_reverse,
self.vocab,
self.separator,
self.version,
self.cache,
self.glossaries)]
for item in new_word[:-1]:
output.append(item + self.separator)
output.append(new_word[-1])
return ' '.join(output)
def _isolate_glossaries(self, word):
word_segments = [word]
for gloss in self.glossaries:
word_segments = [out_segments for segment in word_segments
for out_segments in isolate_glossary(segment, gloss)]
return word_segments
def create_parser():
parser = argparse.ArgumentParser(
formatter_class=argparse.RawDescriptionHelpFormatter,
description="apply BPE-based word segmentation")
parser.add_argument(
'--input', '-i', type=argparse.FileType('r'), default=sys.stdin,
metavar='PATH',
help="Input file (default: standard input).")
parser.add_argument(
'--codes', '-c', type=argparse.FileType('r'), metavar='PATH',
required=True,
help="File with BPE codes (created by learn_bpe).")
parser.add_argument(
'--output', '-o', type=argparse.FileType('w'), default=sys.stdout,
metavar='PATH',
help="Output file (default: standard output)")
parser.add_argument(
'--separator', '-s', type=str, default='@@', metavar='STR',
help="Separator between non-final subword units (default: '%(default)s'))")
parser.add_argument(
'--vocabulary', type=argparse.FileType('r'), default=None,
metavar="PATH",
help="Vocabulary file (built with get_vocab.py). If provided, this script reverts any merge operations that produce an OOV.")
parser.add_argument(
'--vocabulary-threshold', type=int, default=None,
metavar="INT",
help="Vocabulary threshold. If vocabulary is provided, any word with frequency < threshold will be treated as OOV")
parser.add_argument(
'--glossaries', type=str, nargs='+', default=None,
metavar="STR",
help="Glossaries. The strings provided in glossaries will not be affected "+
"by the BPE (i.e. they will neither be broken into subwords, nor concatenated with other subwords)")
return parser
def get_pairs(word):
"""Return set of symbol pairs in a word.
word is represented as tuple of symbols (symbols being variable-length strings)
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
return pairs
def encode(orig, bpe_codes, bpe_codes_reverse, vocab, separator, version, cache, glossaries=None):
"""Encode word based on list of BPE merge operations, which are applied consecutively
"""
if orig in cache:
return cache[orig]
if orig in glossaries:
cache[orig] = (orig,)
return (orig,)
if version == (0, 1):
word = tuple(orig) + ('</w>',)
elif version == (0, 2): # more consistent handling of word-final segments
word = tuple(orig[:-1]) + (orig[-1] + '</w>',)
else:
raise NotImplementedError
pairs = get_pairs(word)
if not pairs:
return orig
while True:
bigram = min(pairs, key = lambda pair: bpe_codes.get(pair, float('inf')))
if bigram not in bpe_codes:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word)-1 and word[i+1] == second:
new_word.append(first+second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
# don't print end-of-word symbols
if word[-1] == '</w>':
word = word[:-1]
elif word[-1].endswith('</w>'):
word = word[:-1] + (word[-1].replace('</w>',''),)
if vocab:
word = check_vocab_and_split(word, bpe_codes_reverse, vocab, separator)
cache[orig] = word
return word
def recursive_split(segment, bpe_codes, vocab, separator, final=False):
"""Recursively split segment into smaller units (by reversing BPE merges)
until all units are either in-vocabulary, or cannot be split futher."""
try:
if final:
left, right = bpe_codes[segment + '</w>']
right = right[:-4]
else:
left, right = bpe_codes[segment]
except:
#sys.stderr.write('cannot split {0} further.\n'.format(segment))
yield segment
return
if left + separator in vocab:
yield left
else:
for item in recursive_split(left, bpe_codes, vocab, separator, False):
yield item
if (final and right in vocab) or (not final and right + separator in vocab):
yield right
else:
for item in recursive_split(right, bpe_codes, vocab, separator, final):
yield item
def check_vocab_and_split(orig, bpe_codes, vocab, separator):
"""Check for each segment in word if it is in-vocabulary,
and segment OOV segments into smaller units by reversing the BPE merge operations"""
out = []
for segment in orig[:-1]:
if segment + separator in vocab:
out.append(segment)
else:
#sys.stderr.write('OOV: {0}\n'.format(segment))
for item in recursive_split(segment, bpe_codes, vocab, separator, False):
out.append(item)
segment = orig[-1]
if segment in vocab:
out.append(segment)
else:
#sys.stderr.write('OOV: {0}\n'.format(segment))
for item in recursive_split(segment, bpe_codes, vocab, separator, True):
out.append(item)
return out
def read_vocabulary(vocab_file, threshold):
"""read vocabulary file produced by get_vocab.py, and filter according to frequency threshold.
"""
vocabulary = set()
for line in vocab_file:
word, freq = line.split()
freq = int(freq)
if threshold == None or freq >= threshold:
vocabulary.add(word)
return vocabulary
def isolate_glossary(word, glossary):
"""
Isolate a glossary present inside a word.
Returns a list of subwords. In which all 'glossary' glossaries are isolated
For example, if 'USA' is the glossary and '1934USABUSA' the word, the return value is:
['1934', 'USA', 'B', 'USA']
"""
if word == glossary or glossary not in word:
return [word]
else:
splits = word.split(glossary)
segments = [segment.strip() for split in splits[:-1] for segment in [split, glossary] if segment != '']
return segments + [splits[-1].strip()] if splits[-1] != '' else segments
if __name__ == '__main__':
# python 2/3 compatibility
if sys.version_info < (3, 0):
sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
else:
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', write_through=True, line_buffering=True)
parser = create_parser()
args = parser.parse_args()
# read/write files as UTF-8
args.codes = codecs.open(args.codes.name, encoding='utf-8')
if args.input.name != '<stdin>':
args.input = codecs.open(args.input.name, encoding='utf-8')
if args.output.name != '<stdout>':
args.output = codecs.open(args.output.name, 'w', encoding='utf-8')
if args.vocabulary:
args.vocabulary = codecs.open(args.vocabulary.name, encoding='utf-8')
if args.vocabulary:
vocabulary = read_vocabulary(args.vocabulary, args.vocabulary_threshold)
else:
vocabulary = None
bpe = BPE(args.codes, args.separator, vocabulary, args.glossaries)
for line in args.input:
args.output.write(bpe.segment(line).strip())
args.output.write('\n')
================================================
FILE: texar_repo/bin/utils/learn_bpe
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich
# flake8: noqa
"""Use byte pair encoding (BPE) to learn a variable-length encoding of the vocabulary in a text.
Unlike the original BPE, it does not compress the plain text, but can be used to reduce the vocabulary
of a text to a configurable number of symbols, with only a small increase in the number of tokens.
Reference:
Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units.
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
"""
# This file is retrieved from https://github.com/rsennrich/subword-nmt
from __future__ import unicode_literals
import sys
import codecs
import re
import copy
import argparse
from collections import defaultdict, Counter
# hack for python2/3 compatibility
from io import open
argparse.open = open
def create_parser():
parser = argparse.ArgumentParser(
formatter_class=argparse.RawDescriptionHelpFormatter,
description="learn BPE-based word segmentation")
parser.add_argument(
'--input', '-i', type=argparse.FileType('r'), default=sys.stdin,
metavar='PATH',
help="Input text (default: standard input).")
parser.add_argument(
'--output', '-o', type=argparse.FileType('w'), default=sys.stdout,
metavar='PATH',
help="Output file for BPE codes (default: standard output)")
parser.add_argument(
'--symbols', '-s', type=int, default=10000,
help="Create this many new symbols (each representing a character n-gram) (default: %(default)s))")
parser.add_argument(
'--min-frequency', type=int, default=2, metavar='FREQ',
help='Stop if no symbol pair has frequency >= FREQ (default: %(default)s))')
parser.add_argument('--dict-input', action="store_true",
help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
parser.add_argument(
'--verbose', '-v', action="store_true",
help="verbose mode.")
return parser
def get_vocabulary(fobj, is_dict=False):
"""Read text and return dictionary that encodes vocabulary
"""
vocab = Counter()
for line in fobj:
if is_dict:
word, count = line.strip().split()
vocab[word] = int(count)
else:
for word in line.split():
vocab[word] += 1
return vocab
def update_pair_statistics(pair, changed, stats, indices):
"""Minimally update the indices and frequency of symbol pairs
if we merge a pair of symbols, only pairs that overlap with occurrences
of this pair are affected, and need to be updated.
"""
stats[pair] = 0
indices[pair] = defaultdict(int)
first, second = pair
new_pair = first+second
for j, word, old_word, freq in changed:
# find all instances of pair, and update frequency/indices around it
i = 0
while True:
# find first symbol
try:
i = old_word.index(first, i)
except ValueError:
break
# if first symbol is followed by second symbol, we've found an occurrence of pair (old_word[i:i+2])
if i < len(old_word)-1 and old_word[i+1] == second:
# assuming a symbol sequence "A B C", if "B C" is merged, reduce the frequency of "A B"
if i:
prev = old_word[i-1:i+1]
stats[prev] -= freq
indices[prev][j] -= 1
if i < len(old_word)-2:
# assuming a symbol sequence "A B C B", if "B C" is merged, reduce the frequency of "C B".
# however, skip this if the sequence is A B C B C, because the frequency of "C B" will be reduced by the previous code block
if old_word[i+2] != first or i >= len(old_word)-3 or old_word[i+3] != second:
nex = old_word[i+1:i+3]
stats[nex] -= freq
indices[nex][j] -= 1
i += 2
else:
i += 1
i = 0
while True:
try:
# find new pair
i = word.index(new_pair, i)
except ValueError:
break
# assuming a symbol sequence "A BC D", if "B C" is merged, increase the frequency of "A BC"
if i:
prev = word[i-1:i+1]
stats[prev] += freq
indices[prev][j] += 1
# assuming a symbol sequence "A BC B", if "B C" is merged, increase the frequency of "BC B"
# however, if the sequence is A BC BC, skip this step because the count of "BC BC" will be incremented by the previous code block
if i < len(word)-1 and word[i+1] != new_pair:
nex = word[i:i+2]
stats[nex] += freq
indices[nex][j] += 1
i += 1
def get_pair_statistics(vocab):
"""Count frequency of all symbol pairs, and create index"""
# data structure of pair frequencies
stats = defaultdict(int)
#index from pairs to words
indices = defaultdict(lambda: defaultdict(int))
for i, (word, freq) in enumerate(vocab):
prev_char = word[0]
for char in word[1:]:
stats[prev_char, char] += freq
indices[prev_char, char][i] += 1
prev_char = char
return stats, indices
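As a quick illustration (a standalone toy sketch, not part of the original script), the pair counting that `get_pair_statistics` performs looks like this on a two-word vocabulary:

```python
from collections import defaultdict

def toy_pair_statistics(vocab):
    # Count the frequency of each adjacent symbol pair, weighted by word
    # frequency, mirroring get_pair_statistics above (without the index).
    stats = defaultdict(int)
    for word, freq in vocab:
        for prev_char, char in zip(word, word[1:]):
            stats[prev_char, char] += freq
    return stats

# 'low' occurring 5 times and 'lower' occurring 2 times share ('l','o') and ('o','w')
toy_vocab = [(('l', 'o', 'w', '</w>'), 5), (('l', 'o', 'w', 'e', 'r', '</w>'), 2)]
toy_stats = toy_pair_statistics(toy_vocab)
print(toy_stats[('l', 'o')])  # 7
```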
def replace_pair(pair, vocab, indices):
"""Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'"""
first, second = pair
pair_str = ''.join(pair)
pair_str = pair_str.replace('\\','\\\\')
changes = []
    pattern = re.compile(r'(?<!\S)' + re.escape(first + ' ' + second) + r'(?!\S)')
    if sys.version_info < (3, 0):
        iterator = indices[pair].iteritems()
    else:
        iterator = indices[pair].items()
    for j, freq in iterator:
        if freq < 1:
            continue
        word, freq = vocab[j]
        new_word = ' '.join(word)
        new_word = pattern.sub(pair_str, new_word)
        new_word = tuple(new_word.split())
        vocab[j] = (new_word, freq)
        changes.append((j, new_word, word, freq))
    return changes
def prune_stats(stats, big_stats, threshold):
    """Prune statistics dict for efficiency of max()
    The frequency of a symbol pair never increases, so pruning is generally safe
    (until the most frequent pair is less frequent than a pair we previously pruned)
    big_stats keeps full statistics for when we need to access pruned items
    """
    for item, freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                big_stats[item] += freq
            else:
                big_stats[item] = freq
def main(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
    """Learn num_symbols BPE operations from vocabulary, and write to outfile."""
    # version numbering allows backward compatibility
    outfile.write('#version: 0.2\n')
vocab = get_vocabulary(infile, is_dict)
    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)
stats, indices = get_pair_statistics(sorted_vocab)
big_stats = copy.deepcopy(stats)
# threshold is inspired by Zipfian assumption, but should only affect speed
threshold = max(stats.values()) / 10
for i in range(num_symbols):
if stats:
most_frequent = max(stats, key=lambda x: (stats[x], x))
# we probably missed the best pair because of pruning; go back to full statistics
if not stats or (i and stats[most_frequent] < threshold):
prune_stats(stats, big_stats, threshold)
stats = copy.deepcopy(big_stats)
most_frequent = max(stats, key=lambda x: (stats[x], x))
# threshold is inspired by Zipfian assumption, but should only affect speed
threshold = stats[most_frequent] * i/(i+10000.0)
prune_stats(stats, big_stats, threshold)
if stats[most_frequent] < min_frequency:
sys.stderr.write('no pair has frequency >= {0}. Stopping\n'.format(min_frequency))
break
if verbose:
sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))
outfile.write('{0} {1}\n'.format(*most_frequent))
changes = replace_pair(most_frequent, sorted_vocab, indices)
update_pair_statistics(most_frequent, changes, stats, indices)
stats[most_frequent] = 0
if not i % 100:
prune_stats(stats, big_stats, threshold)
if __name__ == '__main__':
# python 2/3 compatibility
if sys.version_info < (3, 0):
sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
else:
sys.stderr = codecs.getwriter('UTF-8')(sys.stderr.buffer)
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout.buffer)
sys.stdin = codecs.getreader('UTF-8')(sys.stdin.buffer)
parser = create_parser()
args = parser.parse_args()
# read/write files as UTF-8
    if args.input.name != '<stdin>':
        args.input = codecs.open(args.input.name, encoding='utf-8')
    if args.output.name != '<stdout>':
        args.output = codecs.open(args.output.name, 'w', encoding='utf-8')
main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
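End to end, the learning loop above (minus the pruning and incremental index updates that make it fast) amounts to repeatedly merging the most frequent adjacent pair. A minimal self-contained sketch on hypothetical toy data:

```python
from collections import defaultdict

def learn_bpe_toy(vocab, num_merges):
    # Greedy BPE: recompute pair statistics from scratch each round and merge
    # the most frequent pair, with the same tie-breaking as the script above.
    merges = []
    vocab = dict(vocab)
    for _ in range(num_merges):
        stats = defaultdict(int)
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                stats[pair] += freq
        if not stats:
            break
        best = max(stats, key=lambda p: (stats[p], p))
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

toy_vocab = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2, ('n', 'e', 'w', 'e', 's', 't'): 6}
print(learn_bpe_toy(toy_vocab, 2))  # [('w', 'e'), ('l', 'o')]
```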
================================================
FILE: texar_repo/bin/utils/make_vocab.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Creates vocabulary from a set of data files.
Example usage:
$ python make_vocab.py --files './data/train*'
Note that if the file path is a pattern, wrap it with quotation masks.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=invalid-name
import sys
from io import open
import tensorflow as tf
import texar as tx
Py3 = sys.version_info[0] == 3
flags = tf.flags
flags.DEFINE_string("files", "./train.txt",
"Path to the data files. Can be a pattern, e.g., "
"'/path/to/train*', '/path/to/train[12]'. "
"Caution: If the path is a pattern, you must wrap the path "
"with quotation marks. e.g., "
"python make_vocab.py --files './data/*'")
flags.DEFINE_integer("max_vocab_size", -1,
"Maximum size of the vocabulary. Low frequency words "
"that exceed the limit will be discarded. "
"Set to `-1` if no truncation is wanted.")
flags.DEFINE_boolean("count", False, "Whether to print word count in the "
"output file. Note that Texar data modules require a "
"vocab file without word count. But the functionality "
"can be useful to decide vocab truncation.")
flags.DEFINE_string("output_path", "./vocab.txt",
"Path of the output vocab file.")
flags.DEFINE_string("newline_token", None,
"The token to replace the original newline token '\\n'. "
"For example, `--newline_token '<EOS>'`. If not "
"specified, no replacement is performed.")
FLAGS = flags.FLAGS
def main(_):
"""Makes vocab.
"""
filenames = tx.utils.get_files(FLAGS.files)
if FLAGS.count:
vocab, count = tx.data.make_vocab(
filenames,
max_vocab_size=FLAGS.max_vocab_size,
newline_token=FLAGS.newline_token,
return_count=True)
with open(FLAGS.output_path, "w", encoding="utf-8") as fout:
for v, c in zip(vocab, count):
fout.write('{}\t{}\n'.format(v, c))
else:
vocab = tx.data.make_vocab(
filenames,
max_vocab_size=FLAGS.max_vocab_size,
newline_token=FLAGS.newline_token)
with open(FLAGS.output_path, "w", encoding="utf-8") as fout:
fout.write('\n'.join(vocab))
if __name__ == "__main__":
tf.app.run()
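The vocabulary construction that `tx.data.make_vocab` performs here can be approximated with the standard library alone (a hedged sketch of the frequency-sorted behaviour; the real function also handles file patterns, counts, and newline replacement):

```python
from collections import Counter

def make_vocab_toy(lines, max_vocab_size=-1):
    # Words sorted by descending frequency, optionally truncated -- the same
    # shape of output the script writes to --output_path.
    counts = Counter(word for line in lines for word in line.split())
    vocab = [word for word, _ in counts.most_common()]
    if max_vocab_size >= 0:
        vocab = vocab[:max_vocab_size]
    return vocab

print(make_vocab_toy(["the cat sat", "the cat ran", "a dog ran"], max_vocab_size=3))
# ['the', 'cat', 'ran']
```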
================================================
FILE: texar_repo/bin/utils/multi-bleu.perl
================================================
#!/usr/bin/env perl
#
# This file is part of moses. Its use is licensed under the GNU Lesser General
# Public License version 2.1 or, at your option, any later version.
# $Id$
use warnings;
use strict;
my $lowercase = 0;
if ($ARGV[0] eq "-lc") {
$lowercase = 1;
shift;
}
my $stem = $ARGV[0];
if (!defined $stem) {
print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n";
print STDERR "Reads the references from reference or reference0, reference1, ...\n";
exit(1);
}
$stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0";
my @REF;
my $ref=0;
while(-e "$stem$ref") {
&add_to_ref("$stem$ref",\@REF);
$ref++;
}
&add_to_ref($stem,\@REF) if -e $stem;
die("ERROR: could not find reference file $stem") unless scalar @REF;
# add additional references explicitly specified on the command line
shift;
foreach my $stem (@ARGV) {
&add_to_ref($stem,\@REF) if -e $stem;
}
sub add_to_ref {
my ($file,$REF) = @_;
my $s=0;
if ($file =~ /.gz$/) {
open(REF,"gzip -dc $file|") or die "Can't read $file";
} else {
open(REF,$file) or die "Can't read $file";
}
while(<REF>) {
chop;
push @{$$REF[$s++]}, $_;
}
close(REF);
}
my(@CORRECT,@TOTAL,$length_translation,$length_reference);
my $s=0;
while(<STDIN>) {
chop;
$_ = lc if $lowercase;
my @WORD = split;
my %REF_NGRAM = ();
my $length_translation_this_sentence = scalar(@WORD);
my ($closest_diff,$closest_length) = (9999,9999);
foreach my $reference (@{$REF[$s]}) {
# print "$s $_ <=> $reference\n";
$reference = lc($reference) if $lowercase;
my @WORD = split(' ',$reference);
my $length = scalar(@WORD);
my $diff = abs($length_translation_this_sentence-$length);
if ($diff < $closest_diff) {
$closest_diff = $diff;
$closest_length = $length;
# print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n";
} elsif ($diff == $closest_diff) {
$closest_length = $length if $length < $closest_length;
# from two references with the same closeness to me
# take the *shorter* into account, not the "first" one.
}
for(my $n=1;$n<=4;$n++) {
my %REF_NGRAM_N = ();
for(my $start=0;$start<=$#WORD-($n-1);$start++) {
my $ngram = "$n";
for(my $w=0;$w<$n;$w++) {
$ngram .= " ".$WORD[$start+$w];
}
$REF_NGRAM_N{$ngram}++;
}
foreach my $ngram (keys %REF_NGRAM_N) {
if (!defined($REF_NGRAM{$ngram}) ||
$REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) {
$REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram};
# print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram} \n";
}
}
}
}
$length_translation += $length_translation_this_sentence;
$length_reference += $closest_length;
for(my $n=1;$n<=4;$n++) {
my %T_NGRAM = ();
for(my $start=0;$start<=$#WORD-($n-1);$start++) {
my $ngram = "$n";
for(my $w=0;$w<$n;$w++) {
$ngram .= " ".$WORD[$start+$w];
}
$T_NGRAM{$ngram}++;
}
foreach my $ngram (keys %T_NGRAM) {
$ngram =~ /^(\d+) /;
my $n = $1;
# my $corr = 0;
# print "$i e $ngram $T_NGRAM{$ngram} \n";
$TOTAL[$n] += $T_NGRAM{$ngram};
if (defined($REF_NGRAM{$ngram})) {
if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) {
$CORRECT[$n] += $T_NGRAM{$ngram};
# $corr = $T_NGRAM{$ngram};
# print "$i e correct1 $T_NGRAM{$ngram} \n";
}
else {
$CORRECT[$n] += $REF_NGRAM{$ngram};
# $corr = $REF_NGRAM{$ngram};
# print "$i e correct2 $REF_NGRAM{$ngram} \n";
}
}
# $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram};
# print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n"
}
}
$s++;
}
my $brevity_penalty = 1;
my $bleu = 0;
my @bleu=();
for(my $n=1;$n<=4;$n++) {
if (defined ($TOTAL[$n])){
$bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0;
# print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n";
}else{
$bleu[$n]=0;
}
}
if ($length_reference==0){
printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n";
exit(1);
}
if ($length_translation<$length_reference) {
$brevity_penalty = exp(1-$length_reference/$length_translation);
}
$bleu = $brevity_penalty * exp((my_log( $bleu[1] ) +
my_log( $bleu[2] ) +
my_log( $bleu[3] ) +
my_log( $bleu[4] ) ) / 4) ;
printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n",
100*$bleu,
100*$bleu[1],
100*$bleu[2],
100*$bleu[3],
100*$bleu[4],
$brevity_penalty,
$length_translation / $length_reference,
$length_translation,
$length_reference;
sub my_log {
return -9999999999 unless $_[0];
return log($_[0]);
}
================================================
FILE: texar_repo/bin/utils/spm_decode
================================================
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import sys
from argparse import ArgumentParser
import sentencepiece as spm
from io import open
parser = ArgumentParser(description='SentencePiece Train')
parser.add_argument('--input_format', type=str)
parser.add_argument('--model', type=str)
parser.add_argument('--infile', type=str)
parser.add_argument('--outfile', type=str)
args = parser.parse_args()
sp = spm.SentencePieceProcessor()
sp.Load("{}".format(args.model))
map_func = None
if args.input_format == 'piece':
func = sp.DecodePieces
else:
func = sp.DecodeIds
map_func = int
with open(args.infile, encoding='utf-8') as infile, \
open(args.outfile, 'w+', encoding='utf-8') as outfile:
for line in infile.readlines():
line = line.strip().split()
if map_func:
line = list(map(map_func, line))
outfile.write('{}\n'.format(func(line)))
================================================
FILE: texar_repo/bin/utils/spm_encode
================================================
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import sys
from io import open
from argparse import ArgumentParser
import sentencepiece as spm
Py3 = sys.version_info[0] == 3
parser = ArgumentParser(description='SentencePiece Train')
parser.add_argument('--output_format', type=str)
parser.add_argument('--model', type=str)
parser.add_argument('--infile', type=str)
parser.add_argument('--outfile', type=str)
args = parser.parse_args()
sp = spm.SentencePieceProcessor()
sp.Load("{}".format(args.model))
if args.output_format == 'piece':
func = sp.EncodeAsPieces
else:
func = sp.EncodeAsIds
with open(args.infile, encoding='utf-8') as infile, \
open(args.outfile, 'w+', encoding='utf-8') as outfile:
for line in infile.readlines():
line = line.strip()
if Py3:
encoded = map(str, func(line))
else:
encoded = map(unicode, func(line))
outfile.write('{}\n'.format(' '.join(encoded)))
================================================
FILE: texar_repo/bin/utils/spm_train
================================================
#!/usr/bin/env python
from argparse import ArgumentParser
import sentencepiece as spm
parser = ArgumentParser(description='SentencePiece Train')
parser.add_argument('--input', type=str)
parser.add_argument('--vocab_size', type=str)
parser.add_argument('--model_prefix', type=str)
args = parser.parse_args()
spm.SentencePieceTrainer.Train('--input={} --model_prefix={} --vocab_size={}'.format(args.input,
args.model_prefix,
args.vocab_size))
print(args)
================================================
FILE: texar_repo/config.py
================================================
import texar as tx
dcoder_config = {
'dim': 768,
'num_blocks': 6,
'multihead_attention': {
'num_heads': 8,
'output_dim': 768
# See documentation for more optional hyperparameters
},
'position_embedder_hparams': {
'dim': 768
},
'initializer': {
'type': 'variance_scaling_initializer',
'kwargs': {
'scale': 1.0,
'mode': 'fan_avg',
'distribution': 'uniform',
},
},
'poswise_feedforward': tx.modules.default_transformer_poswise_net_hparams(
output_dim=768)
}
loss_label_confidence = 0.9
random_seed = 1234
beam_width = 5
alpha = 0.6
hidden_dim = 768
opt = {
'optimizer': {
'type': 'AdamOptimizer',
'kwargs': {
'beta1': 0.9,
'beta2': 0.997,
'epsilon': 1e-9
}
}
}
# warmup steps should be roughly 10% of the total number of training steps
lr = {
'learning_rate_schedule': 'constant.linear_warmup.rsqrt_decay.rsqrt_depth',
'lr_constant': 2 * (hidden_dim ** -0.5),
'static_lr': 1e-3,
'warmup_steps': 10000,
}
bos_token_id =101
eos_token_id = 102
model_dir= "./models"
run_mode= "train_and_evaluate"
batch_size = 32
test_batch_size = 32
max_train_steps = 100000
display_steps = 100
eval_steps = 100000
max_decoding_length = 400
max_seq_length_src = 512
max_seq_length_tgt = 400
train_file = "data/train.tf_reccord"
eval_file = "data/eval.tf_record"
bert_pretrain_dir="./bert_uncased_model"
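The schedule named in `lr` above is the Noam-style warmup-then-decay rule. Assuming the standard formula for 'constant.linear_warmup.rsqrt_decay.rsqrt_depth' (worth checking against the library source before relying on exact values), it can be computed as:

```python
import math

def noam_lr(step, warmup_steps=10000, hidden_dim=768, constant=2.0):
    # Linear warmup followed by inverse-square-root decay, scaled by
    # hidden_dim ** -0.5 as in lr_constant above. This is an assumed
    # reading of the schedule name, not copied from the library.
    lr_constant = constant * hidden_dim ** -0.5
    step = max(step, 1)
    return lr_constant * min(1.0, step / warmup_steps) / math.sqrt(max(step, warmup_steps))

# the rate peaks at warmup_steps and decays as 1/sqrt(step) afterwards
print(noam_lr(10000) > noam_lr(5000) and noam_lr(10000) > noam_lr(40000))  # True
```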
================================================
FILE: texar_repo/docs/Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don\'t have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
.PHONY: help
help:
	@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " applehelp to make an Apple Help Book"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " epub3 to make an epub3"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
@echo " coverage to run coverage check of the documentation (if enabled)"
@echo " dummy to check syntax errors of document sources"
.PHONY: clean
clean:
rm -rf $(BUILDDIR)/*
.PHONY: html
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
.PHONY: dirhtml
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
.PHONY: singlehtml
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
.PHONY: pickle
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
.PHONY: json
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
.PHONY: htmlhelp
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
.PHONY: qthelp
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/texar.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/texar.qhc"
.PHONY: applehelp
applehelp:
$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
@echo
@echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
@echo "N.B. You won't be able to view it unless you put it in" \
"~/Library/Documentation/Help or install it in your application" \
"bundle."
.PHONY: devhelp
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/texar"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/texar"
@echo "# devhelp"
.PHONY: epub
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
.PHONY: epub3
epub3:
$(SPHINXBUILD) -b epub3 $(ALLSPHINXOPTS) $(BUILDDIR)/epub3
@echo
@echo "Build finished. The epub3 file is in $(BUILDDIR)/epub3."
.PHONY: latex
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
.PHONY: latexpdf
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
.PHONY: latexpdfja
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
.PHONY: text
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
.PHONY: man
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
.PHONY: texinfo
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
.PHONY: info
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
.PHONY: gettext
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
.PHONY: changes
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
.PHONY: linkcheck
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
.PHONY: doctest
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
.PHONY: coverage
coverage:
$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
@echo "Testing of coverage in the sources finished, look at the " \
"results in $(BUILDDIR)/coverage/python.txt."
.PHONY: xml
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
.PHONY: pseudoxml
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
.PHONY: dummy
dummy:
$(SPHINXBUILD) -b dummy $(ALLSPHINXOPTS) $(BUILDDIR)/dummy
@echo
@echo "Build finished. Dummy builder generates no files."
================================================
FILE: texar_repo/docs/_static/css/custom_theme.css
================================================
/* This style sheet is heavily inspired by the PyTorch docs. */
/* https://github.com/pytorch/pytorch/blob/master/docs/source/_static/css/pytorch_theme.css */
body {
font-family: "Lato","proxima-nova","Helvetica Neue",Arial,sans-serif;
}
h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend, p.caption {
font-family: "Lato","proxima-nova","Helvetica Neue",Arial,sans-serif;
}
/* Literal color */
.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
color: #DB2407;
}
/* Docs top-left background color */
.wy-side-nav-search {
background-color: #fff;
}
/* Fixes for mobile */
.wy-nav-top {
background-color: #fff;
background-image: url('../img/logo_h.png');
background-repeat: no-repeat;
background-position: center;
padding: 0;
margin: 0.6em 1em;
}
.wy-nav-top > a {
display: none;
}
@media screen and (max-width: 768px) {
.wy-side-nav-search>a img.logo {
height: 60px;
}
}
/* This is needed to ensure that logo above search scales properly */
.wy-side-nav-search a {
display: block;
}
.wy-side-nav-search>div.version {
color: #000;
}
/* For hidden headers that appear in TOC tree */
/* see http://stackoverflow.com/a/32363545/3343043 */
.rst-content .hidden-section {
display: none;
}
nav .hidden-section {
display: inherit;
}
================================================
FILE: texar_repo/docs/code/agents.rst
================================================
.. role:: hidden
:class: hidden-section
Agents
*******
Sequence Agents
=================
:hidden:`SeqPGAgent`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.agents.SeqPGAgent
:members:
:inherited-members:
Episodic Agents
=================
:hidden:`EpisodicAgentBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.agents.EpisodicAgentBase
:members:
:inherited-members:
:hidden:`PGAgent`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.agents.PGAgent
:members:
:inherited-members:
:hidden:`DQNAgent`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.agents.DQNAgent
:members:
:inherited-members:
:hidden:`ActorCriticAgent`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.agents.ActorCriticAgent
:members:
:inherited-members:
Agent Utils
============
:hidden:`Space`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.agents.Space
:members:
:hidden:`EnvConfig`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.agents.EnvConfig
:members:
:hidden:`convert_gym_space`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.agents.convert_gym_space
:hidden:`get_gym_env_config`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.agents.get_gym_env_config
================================================
FILE: texar_repo/docs/code/context.rst
================================================
.. role:: hidden
:class: hidden-section
Context
********
Global Mode
===========
:hidden:`global_mode`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.global_mode
:hidden:`global_mode_train`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.global_mode_train
:hidden:`global_mode_eval`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.global_mode_eval
:hidden:`global_mode_predict`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.global_mode_predict
:hidden:`valid_modes`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.valid_modes
================================================
FILE: texar_repo/docs/code/core.rst
================================================
.. role:: hidden
:class: hidden-section
Core
****
Cells
=====
:hidden:`default_rnn_cell_hparams`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.default_rnn_cell_hparams
:hidden:`get_rnn_cell`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_rnn_cell
:hidden:`get_rnn_cell_trainable_variables`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_rnn_cell_trainable_variables
Layers
======
:hidden:`get_layer`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_layer
:hidden:`MaxReducePooling1D`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.MaxReducePooling1D
:members:
:hidden:`AverageReducePooling1D`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.AverageReducePooling1D
:members:
:hidden:`get_pooling_layer_hparams`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_pooling_layer_hparams
:hidden:`MergeLayer`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.MergeLayer
:members:
:hidden:`SequentialLayer`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.SequentialLayer
:members:
:hidden:`default_regularizer_hparams`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.default_regularizer_hparams
:hidden:`get_regularizer`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_regularizer
:hidden:`get_initializer`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_initializer
:hidden:`get_activation_fn`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_activation_fn
:hidden:`get_constraint_fn`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_constraint_fn
:hidden:`default_conv1d_kwargs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.default_conv1d_kwargs
:hidden:`default_dense_kwargs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.default_dense_kwargs
Optimization
=============
:hidden:`default_optimization_hparams`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.default_optimization_hparams
:hidden:`get_train_op`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_train_op
:hidden:`get_optimizer_fn`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_optimizer_fn
:hidden:`get_optimizer`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_optimizer
:hidden:`get_learning_rate_decay_fn`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_learning_rate_decay_fn
:hidden:`get_gradient_clip_fn`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.core.get_gradient_clip_fn
Exploration
============
:hidden:`EpsilonLinearDecayExploration`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.EpsilonLinearDecayExploration
:members:
:hidden:`ExplorationBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.ExplorationBase
:members:
Replay Memories
================
:hidden:`DequeReplayMemory`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.DequeReplayMemory
:members:
:hidden:`ReplayMemoryBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.core.ReplayMemoryBase
:members:
================================================
FILE: texar_repo/docs/code/data.rst
================================================
.. role:: hidden
:class: hidden-section
Data
*******
Vocabulary
==========
:hidden:`SpecialTokens`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.SpecialTokens
:members:
:hidden:`Vocab`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.Vocab
:members:
Embedding
==========
:hidden:`Embedding`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.Embedding
:members:
:hidden:`load_word2vec`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.load_word2vec
:hidden:`load_glove`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.load_glove
Data
==========
:hidden:`DataBase`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.DataBase
:members:
:hidden:`MonoTextData`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.MonoTextData
:members:
:inherited-members:
:exclude-members: make_vocab,make_embedding
:hidden:`PairedTextData`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.PairedTextData
:members:
:inherited-members:
:exclude-members: make_vocab,make_embedding
:hidden:`ScalarData`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.ScalarData
:members:
:inherited-members:
:hidden:`MultiAlignedData`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.MultiAlignedData
:members:
:inherited-members:
:exclude-members: make_vocab,make_embedding
:hidden:`TextDataBase`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.TextDataBase
:members:
Data Iterators
===============
:hidden:`DataIteratorBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.DataIteratorBase
:members:
:hidden:`DataIterator`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.DataIterator
:members:
:hidden:`TrainTestDataIterator`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.TrainTestDataIterator
:members:
:hidden:`FeedableDataIterator`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.FeedableDataIterator
:members:
:hidden:`TrainTestFeedableDataIterator`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.data.TrainTestFeedableDataIterator
:members:
Data Utils
==========
:hidden:`random_shard_dataset`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.random_shard_dataset
:hidden:`maybe_tuple`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.maybe_tuple
:hidden:`make_partial`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.make_partial
:hidden:`maybe_download`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.maybe_download
:hidden:`read_words`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.read_words
:hidden:`make_vocab`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.make_vocab
:hidden:`count_file_lines`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.count_file_lines
:hidden:`make_chained_transformation`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.make_chained_transformation
:hidden:`make_combined_transformation`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.data.make_combined_transformation
================================================
FILE: texar_repo/docs/code/evals.rst
================================================
.. role:: hidden
:class: hidden-section
Evaluations
***********
BLEU
==========
:hidden:`sentence_bleu`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.evals.sentence_bleu
:hidden:`corpus_bleu`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.evals.corpus_bleu
:hidden:`sentence_bleu_moses`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.evals.sentence_bleu_moses
:hidden:`corpus_bleu_moses`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.evals.corpus_bleu_moses
Accuracy
========
:hidden:`accuracy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.evals.accuracy
:hidden:`binary_clas_accuracy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.evals.binary_clas_accuracy
================================================
FILE: texar_repo/docs/code/hyperparams.rst
================================================
.. role:: hidden
:class: hidden
HParams
*******
.. autoclass:: texar.HParams
:members:
================================================
FILE: texar_repo/docs/code/losses.rst
================================================
.. role:: hidden
:class: hidden-section
Loss Functions
**************
MLE Loss
==========
:hidden:`sequence_softmax_cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.sequence_softmax_cross_entropy
:hidden:`sequence_sparse_softmax_cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.sequence_sparse_softmax_cross_entropy
:hidden:`sequence_sigmoid_cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.sequence_sigmoid_cross_entropy
:hidden:`binary_sigmoid_cross_entropy`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.binary_sigmoid_cross_entropy
:hidden:`binary_sigmoid_cross_entropy_with_clas`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.binary_sigmoid_cross_entropy_with_clas
Policy Gradient Loss
=====================
:hidden:`pg_loss_with_logits`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.pg_loss_with_logits
:hidden:`pg_loss_with_log_probs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.pg_loss_with_log_probs
Reward
=============
:hidden:`discount_reward`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.discount_reward
Adversarial Loss
==================
:hidden:`binary_adversarial_losses`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.binary_adversarial_losses
Entropy
========
:hidden:`entropy_with_logits`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.entropy_with_logits
:hidden:`sequence_entropy_with_logits`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.sequence_entropy_with_logits
Loss Utils
===========
:hidden:`mask_and_reduce`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.mask_and_reduce
:hidden:`reduce_batch_time`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.reduce_batch_time
:hidden:`reduce_dimensions`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.losses.reduce_dimensions
================================================
FILE: texar_repo/docs/code/models.rst
================================================
.. role:: hidden
:class: hidden-section
Models
********
ModelBase
=============
.. autoclass:: texar.models.ModelBase
:members:
Seq2seqBase
===============
.. autoclass:: texar.models.Seq2seqBase
:members:
:inherited-members:
BasicSeq2seq
==============
.. autoclass:: texar.models.BasicSeq2seq
:members:
:inherited-members:
================================================
FILE: texar_repo/docs/code/modules.rst
================================================
.. role:: hidden
:class: hidden-section
Modules
*******
ModuleBase
===========
.. autoclass:: texar.ModuleBase
:members:
Embedders
=========
:hidden:`WordEmbedder`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.WordEmbedder
:members:
:hidden:`PositionEmbedder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.PositionEmbedder
:members:
:hidden:`SinusoidsPositionEmbedder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.SinusoidsPositionEmbedder
:members:
:hidden:`EmbedderBase`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.EmbedderBase
:members:
Encoders
========
:hidden:`UnidirectionalRNNEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.UnidirectionalRNNEncoder
:members:
:hidden:`BidirectionalRNNEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.BidirectionalRNNEncoder
:members:
:hidden:`HierarchicalRNNEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.HierarchicalRNNEncoder
:members:
:hidden:`MultiheadAttentionEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.MultiheadAttentionEncoder
:members:
:hidden:`TransformerEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.TransformerEncoder
:members:
:hidden:`Conv1DEncoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.Conv1DEncoder
:members:
:hidden:`EncoderBase`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.EncoderBase
:members:
:hidden:`RNNEncoderBase`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.RNNEncoderBase
:members:
:hidden:`default_transformer_poswise_net_hparams`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.modules.default_transformer_poswise_net_hparams
Decoders
========
:hidden:`RNNDecoderBase`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.RNNDecoderBase
:members:
:inherited-members:
:exclude-members: initialize,step,finalize,tracks_own_finished,output_size,output_dtype
:hidden:`BasicRNNDecoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.BasicRNNDecoder
:members:
:inherited-members:
:exclude-members: initialize,step,finalize,tracks_own_finished,output_size,output_dtype
:hidden:`BasicRNNDecoderOutput`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.BasicRNNDecoderOutput
:members:
:hidden:`AttentionRNNDecoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.AttentionRNNDecoder
:members:
:inherited-members:
:exclude-members: initialize,step,finalize,tracks_own_finished,output_size,output_dtype
:hidden:`AttentionRNNDecoderOutput`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.AttentionRNNDecoderOutput
:members:
:hidden:`beam_search_decode`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.modules.beam_search_decode
:hidden:`TransformerDecoder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.TransformerDecoder
:members:
:hidden:`TransformerDecoderOutput`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.TransformerDecoderOutput
:members:
:hidden:`SoftmaxEmbeddingHelper`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.SoftmaxEmbeddingHelper
:members:
:hidden:`GumbelSoftmaxEmbeddingHelper`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.GumbelSoftmaxEmbeddingHelper
:members:
:hidden:`get_helper`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.modules.get_helper
Connectors
==========
:hidden:`ConnectorBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.ConnectorBase
:members:
:inherited-members:
:hidden:`ConstantConnector`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.ConstantConnector
:members:
:inherited-members:
:hidden:`ForwardConnector`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.ForwardConnector
:members:
:inherited-members:
:hidden:`MLPTransformConnector`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.MLPTransformConnector
:members:
:inherited-members:
:hidden:`ReparameterizedStochasticConnector`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.ReparameterizedStochasticConnector
:members:
:inherited-members:
:hidden:`StochasticConnector`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.StochasticConnector
:members:
:inherited-members:
Classifiers
============
:hidden:`Conv1DClassifier`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.Conv1DClassifier
:members:
:inherited-members:
:hidden:`UnidirectionalRNNClassifier`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.UnidirectionalRNNClassifier
:members:
:inherited-members:
Networks
========
:hidden:`FeedForwardNetworkBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.FeedForwardNetworkBase
:members:
:inherited-members:
:hidden:`FeedForwardNetwork`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.FeedForwardNetwork
:members:
:inherited-members:
:hidden:`Conv1DNetwork`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.Conv1DNetwork
:members:
:inherited-members:
Memory
======
:hidden:`MemNetBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.MemNetBase
:members:
:inherited-members:
:hidden:`MemNetRNNLike`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.MemNetRNNLike
:members:
:inherited-members:
:exclude-members: get_default_embed_fn
:hidden:`default_memnet_embed_fn_hparams`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.modules.default_memnet_embed_fn_hparams
Policy
=========
:hidden:`PolicyNetBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.PolicyNetBase
:members:
:inherited-members:
:hidden:`CategoricalPolicyNet`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.CategoricalPolicyNet
:members:
:inherited-members:
Q-Nets
=========
:hidden:`QNetBase`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.QNetBase
:members:
:inherited-members:
:hidden:`CategoricalQNet`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.modules.CategoricalQNet
:members:
:inherited-members:
================================================
FILE: texar_repo/docs/code/run.rst
================================================
.. role:: hidden
:class: hidden-section
Executor
********
.. autoclass:: texar.run.Executor
:members:
================================================
FILE: texar_repo/docs/code/txtgen.rst
================================================
Texar
******
.. automodule:: texar
================================================
FILE: texar_repo/docs/code/utils.rst
================================================
.. role:: hidden
:class: hidden-section
Utils
**************
Frequent Use
============
:hidden:`AverageRecorder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: texar.utils.AverageRecorder
:members:
:hidden:`collect_trainable_variables`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.collect_trainable_variables
:hidden:`compat_as_text`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.compat_as_text
:hidden:`map_ids_to_strs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.map_ids_to_strs
:hidden:`write_paired_text`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.write_paired_text
:hidden:`straight_through`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.straight_through
Variables
=========
:hidden:`collect_trainable_variables`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.collect_trainable_variables
:hidden:`get_unique_named_variable_scope`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_unique_named_variable_scope
:hidden:`add_variable`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.add_variable
IO
===
:hidden:`write_paired_text`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.write_paired_text
:hidden:`load_config`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.load_config
:hidden:`maybe_create_dir`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.maybe_create_dir
:hidden:`get_files`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_files
DType
=====
:hidden:`compat_as_text`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.compat_as_text
:hidden:`get_tf_dtype`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_tf_dtype
:hidden:`is_callable`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_callable
:hidden:`is_str`
~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_str
:hidden:`is_placeholder`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_placeholder
:hidden:`maybe_hparams_to_dict`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.maybe_hparams_to_dict
Shape
=====
:hidden:`mask_sequences`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.mask_sequences
:hidden:`transpose_batch_time`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.transpose_batch_time
:hidden:`get_batch_size`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_batch_size
:hidden:`get_rank`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_rank
:hidden:`shape_list`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.shape_list
:hidden:`pad_and_concat`
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.pad_and_concat
:hidden:`flatten`
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.flatten
Dictionary
===========
:hidden:`dict_patch`
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.dict_patch
:hidden:`dict_lookup`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.dict_lookup
:hidden:`dict_fetch`
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.dict_fetch
:hidden:`dict_pop`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.dict_pop
:hidden:`flatten_dict`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.flatten_dict
String
=======
:hidden:`map_ids_to_strs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.map_ids_to_strs
:hidden:`strip_token`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.strip_token
:hidden:`strip_eos`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.strip_eos
:hidden:`strip_special_tokens`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.strip_special_tokens
:hidden:`str_join`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.str_join
:hidden:`default_str`
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.default_str
:hidden:`uniquify_str`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.uniquify_str
Meta
====
:hidden:`check_or_get_class`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.check_or_get_class
:hidden:`get_class`
~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_class
:hidden:`check_or_get_instance`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.check_or_get_instance
:hidden:`get_instance`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_instance
:hidden:`check_or_get_instance_with_redundant_kwargs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.check_or_get_instance_with_redundant_kwargs
:hidden:`get_instance_with_redundant_kwargs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_instance_with_redundant_kwargs
:hidden:`get_function`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_function
:hidden:`call_function_with_redundant_kwargs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.call_function_with_redundant_kwargs
:hidden:`get_args`
~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_args
:hidden:`get_default_arg_values`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_default_arg_values
:hidden:`get_instance_kwargs`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.get_instance_kwargs
Mode
====
:hidden:`switch_dropout`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.switch_dropout
:hidden:`maybe_global_mode`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.maybe_global_mode
:hidden:`is_train_mode`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_train_mode
:hidden:`is_eval_mode`
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_eval_mode
:hidden:`is_predict_mode`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_predict_mode
:hidden:`is_train_mode_py`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_train_mode_py
:hidden:`is_eval_mode_py`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_eval_mode_py
:hidden:`is_predict_mode_py`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.is_predict_mode_py
Misc
====
:hidden:`ceildiv`
~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.ceildiv
:hidden:`straight_through`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autofunction:: texar.utils.straight_through
AverageRecorder
==========================
.. autoclass:: texar.utils.AverageRecorder
:members:
================================================
FILE: texar_repo/docs/conf.py
================================================
# -*- coding: utf-8 -*-
#
# texar documentation build configuration file, created by
# sphinx-quickstart on Mon Sep 4 21:15:05 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import sys
import os
from recommonmark.parser import CommonMarkParser
#from unittest.mock import MagicMock
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('..'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.doctest',
'sphinx.ext.mathjax',
'sphinx.ext.viewcode',
'sphinx.ext.intersphinx',
'sphinx.ext.extlinks',
'sphinxcontrib.napoleon',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
source_parsers = {
'.md': CommonMarkParser,
}
source_suffix = ['.rst', '.md']
#source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'Texar'
copyright = u'2018, Texar'
author = u'Texar'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = u'v0.1'
# The full version, including alpha/beta/rc tags.
release = u'v0.1.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# html_theme = 'alabaster'
import sphinx_rtd_theme
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
html_theme_options = {
'collapse_navigation': False,
'display_version': True,
'logo_only': True,
}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# The name for this set of Sphinx documents.
# " v documentation" by default.
html_title = u'Texar v0.1'
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = '_static/img/logo_h.png'
# The name of an image file (relative to this directory) to use as a favicon of
# the docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_context = {
'css_files': [
'https://fonts.googleapis.com/css?family=Lato',
'_static/css/custom_theme.css'
],
}
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []
# If not None, a 'Last updated on:' timestamp is inserted at every page
# bottom, using the given strftime format.
# The empty string is equivalent to '%b %d, %Y'.
#html_last_updated_fmt = None
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
# 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
# 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh'
#html_search_language = 'en'
# A dictionary with options for the search language support, empty by default.
# 'ja' uses this config value.
# 'zh' user can custom change `jieba` dictionary path.
#html_search_options = {'type': 'default'}
# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
#html_search_scorer = 'scorer.js'
# Output file base name for HTML help builder.
htmlhelp_basename = 'texardoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#'preamble': '',
# Latex figure (float) alignment
#'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'texar.tex', u'Texar Documentation',
u'Texar', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'texar', u'Texar Documentation',
[author], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'texar', u'Texar Documentation',
author, 'Texar', 'One line description of project.',
'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# If false, no module index is generated.
#texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {
'python': ('https://docs.python.org/2.7/', None),
'numpy': ('https://docs.scipy.org/doc/numpy/', None),
}
extlinks = {'tf_main': (
'https://www.tensorflow.org/api_docs/python/tf/%s',
None),
'tf_r0.12': (
'https://www.tensorflow.org/versions/r0.12/api_docs/python/%s',
None),
'tf_hmpg': (
'https://www.tensorflow.org/%s',
None),
'gym': (
'https://gym.openai.com/docs/%s',
None),
}
##### Customize ######
autodoc_member_order = 'bysource'
# Addresses import errors. Refer to:
# https://docs.readthedocs.io/en/latest/faq.html#i-get-import-errors-on-libraries-that-depend-on-c-modules
#class Mock(MagicMock):
# @classmethod
# def __getattr__(cls, name):
# return MagicMock()
#MOCK_MODULES = ['gym']
#sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)
================================================
FILE: texar_repo/docs/examples.md
================================================
# Examples #
Rich examples are included to demonstrate the use of Texar. The implementations of cutting-edge models/algorithms also provide references for reproducibility and comparisons.
More examples are continuously added...
## Examples by Models/Algorithms ##
### RNN / Seq2seq ###
* [language_model_ptb](https://github.com/asyml/texar/tree/master/examples/language_model_ptb): Basic RNN language model
* [seq2seq_attn](https://github.com/asyml/texar/tree/master/examples/seq2seq_attn): Attentional seq2seq
* [seq2seq_configs](https://github.com/asyml/texar/tree/master/examples/seq2seq_configs): Seq2seq implemented with Texar model template.
* [seq2seq_rl](https://github.com/asyml/texar/tree/master/examples/seq2seq_rl): Attentional seq2seq trained with policy gradient.
* [hierarchical_dialog](https://github.com/asyml/texar/tree/master/examples/hierarchical_dialog): Hierarchical recurrent encoder-decoder model for conversation response generation.
* [torchtext](https://github.com/asyml/texar/tree/master/examples/torchtext): Use of torchtext data loader
### Transformer (Self-attention) ###
* [transformer](https://github.com/asyml/texar/tree/master/examples/transformer): Transformer for machine translation
* [vae_text](https://github.com/asyml/texar/tree/master/examples/vae_text): VAE with a transformer decoder for improved language modeling
### Variational Autoencoder (VAE) ###
* [vae_text](https://github.com/asyml/texar/tree/master/examples/vae_text): VAE language model
### GANs / Discriminator-supervision ###
* [seqGAN](https://github.com/asyml/texar/tree/master/examples/seqgan): GANs for text generation
* [text_style_transfer](https://github.com/asyml/texar/tree/master/examples/text_style_transfer): Discriminator supervision for controlled text generation
### Reinforcement Learning ###
* [seq2seq_rl](https://github.com/asyml/texar/tree/master/examples/seq2seq_rl): Attentional seq2seq trained with policy gradient.
* [seqGAN](https://github.com/asyml/texar/tree/master/examples/seqgan): Policy gradient for sequence generation
* [rl_gym](https://github.com/asyml/texar/tree/master/examples/rl_gym): Various RL algorithms for games on OpenAI Gym
### Memory Network ###
* [memory_network_lm](https://github.com/asyml/texar/tree/master/examples/memory_network_lm): End-to-end memory network for language modeling
### Classifier / Predictions ###
* [sentence_classifier](https://github.com/asyml/texar/tree/master/examples/sentence_classifier): Basic CNN-based sentence classifier
* [sequence_tagging](https://github.com/asyml/texar/tree/master/examples/sequence_tagging): BiLSTM-CNN model for Named Entity Recognition (NER)
---
## Examples by Tasks
### Language Modeling ###
* [language_model_ptb](https://github.com/asyml/texar/tree/master/examples/language_model_ptb): Basic RNN language model
* [vae_text](https://github.com/asyml/texar/tree/master/examples/vae_text): VAE language model
* [seqGAN](https://github.com/asyml/texar/tree/master/examples/seqgan): GAN + policy gradient
* [memory_network_lm](https://github.com/asyml/texar/tree/master/examples/memory_network_lm): End-to-end memory network for language modeling
### Machine Translation ###
* [seq2seq_attn](https://github.com/asyml/texar/tree/master/examples/seq2seq_attn): Attentional seq2seq
* [seq2seq_configs](https://github.com/asyml/texar/tree/master/examples/seq2seq_configs): Seq2seq implemented with Texar model template.
* [seq2seq_rl](https://github.com/asyml/texar/tree/master/examples/seq2seq_rl): Attentional seq2seq trained with policy gradient.
* [transformer](https://github.com/asyml/texar/tree/master/examples/transformer): Transformer for machine translation
### Dialog ###
* [hierarchical_dialog](https://github.com/asyml/texar/tree/master/examples/hierarchical_dialog): Hierarchical recurrent encoder-decoder model for conversation response generation.
### Text Style Transfer ###
* [text_style_transfer](https://github.com/asyml/texar/tree/master/examples/text_style_transfer): Discriminator supervision for controlled text generation
### Classification ###
* [sentence_classifier](https://github.com/asyml/texar/tree/master/examples/sentence_classifier): Basic CNN-based sentence classifier
### Sequence Tagging ###
* [sequence_tagging](https://github.com/asyml/texar/tree/master/examples/sequence_tagging): BiLSTM-CNN model for Named Entity Recognition (NER)
### Games ###
* [rl_gym](https://github.com/asyml/texar/tree/master/examples/rl_gym): Various RL algorithms for games on OpenAI Gym
================================================
FILE: texar_repo/docs/get_started.md
================================================
# Overview #
**Texar** is an open-source toolkit based on Tensorflow, aiming to support a broad set of machine learning, especially **text generation tasks**, such as machine translation, dialog, summarization, content manipulation, language modeling, and so on. Texar is designed for both researchers and practitioners for fast prototyping and experimentation.
With the design goals of **modularity, versatility, and extensibility** in mind, Texar extracts the common patterns underlying the diverse tasks and methodologies, creates a library of highly reusable modules and functionalities, and facilitates **arbitrary model architectures and algorithmic paradigms**, e.g.,
* encoder(s) to decoder(s), sequential- and self-attentions, memory, hierarchical models, classifiers...
* maximum likelihood learning, reinforcement learning, adversarial learning, probabilistic modeling, ...
With Texar, cutting-edge complex models can be easily constructed, freely enriched with best modeling/training practices, readily fitted into standard training/evaluation pipelines, and rapidly experimented with and evolved by, e.g., plugging in and swapping out different modules.
### Key Features
* **Versatility**. Texar contains a wide range of modules and functionalities for composing arbitrary model architectures and implementing various learning algorithms, as well as for data processing, evaluation, prediction, etc.
* **Modularity**. Texar decomposes diverse complex machine learning models/algorithms into a set of highly-reusable modules. In particular, model **architecture, losses, and learning processes** are fully decomposed.
Users can construct their own models at a high conceptual level just like assembling building blocks. It is convenient to plug in or swap out modules, and configure rich options of each module. For example, switching between maximum likelihood learning and reinforcement learning involves only changing several lines of code.
* **Extensibility**. It is straightforward to integrate any user-customized, external modules. Also, Texar is fully compatible with the native Tensorflow interfaces and can take advantage of the rich Tensorflow features, and resources from the vibrant open-source community.
* Interfaces with different functionality levels. Users can customize a model through 1) simple **Python/YAML configuration files** of provided model templates/examples; 2) programming with **Python Library APIs** for maximal customizability.
* Easy-to-use APIs: 1) Convenient automatic variable re-use---no need to worry about complicated TF variable scopes; 2) Pytorch-like callable modules; 3) Rich configuration options for each module, all with default values; ...
* Well-structured high-quality code of uniform design patterns and consistent styles.
* Clean, detailed [documentation](https://texar.readthedocs.io) and rich [examples](https://github.com/asyml/texar/tree/master/examples).
### Library API Example
Builds a (self-)attentional sequence encoder-decoder model, with different learning algorithms:
```python
import texar as tx
# Data
data = tx.data.PairedTextData(hparams=hparams_data) # Hyperparameter configs in `hparams`
iterator = tx.data.DataIterator(data)
batch = iterator.get_next() # A data mini-batch
# Model architecture
embedder = tx.modules.WordEmbedder(data.target_vocab.size, hparams=hparams_emb)
encoder = tx.modules.TransformerEncoder(hparams=hparams_encoder)
outputs_enc = encoder(inputs=embedder(batch['source_text_ids']),
sequence_length=batch['source_length'])
decoder = tx.modules.AttentionRNNDecoder(memory=outputs_enc,
memory_sequence_length=batch['source_length'],
hparams=hparams_decoder)
outputs, _, _ = decoder(inputs=embedder(batch['target_text_ids']),
sequence_length=batch['target_length']-1)
# Loss for maximum likelihood learning
loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=batch['target_text_ids'][:, 1:],
logits=outputs.logits,
sequence_length=batch['target_length']-1) # Automatic masks
# Beam search decoding
outputs_bs, _, _ = tx.modules.beam_search_decode(
decoder,
embedding=embedder,
start_tokens=[data.target_vocab.bos_token_id]*num_samples,
end_token=data.target_vocab.eos_token_id)
```
```python
# Policy gradient agent for RL learning
agent = tx.agents.SeqPGAgent(samples=outputs.sample_id,
logits=outputs.logits,
sequence_length=batch['target_length']-1,
hparams=config_model.agent)
```
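The loss above relies on automatic masking: timesteps beyond each sequence's length do not contribute to the cross entropy. A minimal pure-Python sketch of that masking idea (`masked_mean_nll` is a hypothetical helper for illustration, not part of the Texar API):

```python
import math

def masked_mean_nll(log_probs, lengths):
    """Average negative log-likelihood over valid (unmasked) timesteps only.

    log_probs: per-timestep log-probabilities of the gold token,
               as nested lists of shape [batch][max_time].
    lengths:   valid (unpadded) length of each sequence.
    """
    total, count = 0.0, 0
    for seq, length in zip(log_probs, lengths):
        for t in range(length):  # timesteps >= length are padding: skipped
            total += -seq[t]
            count += 1
    return total / count

# Two sequences padded to length 3; the second has only 2 valid steps.
lp = [[math.log(0.5)] * 3,
      [math.log(0.5), math.log(0.5), math.log(1.0)]]  # last entry is padding
print(round(masked_mean_nll(lp, [3, 2]), 4))  # → 0.6931 (= ln 2)
```

Note that the padded third step of the second sequence is ignored entirely, which is what the `sequence_length` argument accomplishes in the library call.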
Many more examples are available [here](https://github.com/asyml/texar/tree/master/examples).
### Installation
```
git clone https://github.com/asyml/texar.git
cd texar
pip install -e .
```
### Getting Started
* [Examples](https://github.com/asyml/texar/tree/master/examples)
* [Documentations](https://texar.readthedocs.io)
* [GitHub](https://github.com/asyml/texar)
### Reference
If you use Texar, please cite the [report](.) with the following BibTex entry:
```
Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation
Zhiting Hu, Haoran Shi, Zichao Yang, Bowen Tan, Tiancheng Zhao, Junxian He, Wentao Wang, Xingjiang Yu, Lianhui Qin, Di Wang, Xuezhe Ma, Hector Liu, Xiaodan Liang, Wanrong Zhu, Devendra Singh Sachan, Eric P. Xing
2018
@article{hu2018texar,
title={Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation},
author={Hu, Zhiting and Shi, Haoran and Yang, Zichao and Tan, Bowen and Zhao, Tiancheng and He, Junxian and Wang, Wentao and Yu, Xingjiang and Qin, Lianhui and Wang, Di and Ma, Xuezhe and Liu, Hector and Liang, Xiaodan and Zhu, Wanrong and Sachan, Devendra Singh and Xing, Eric},
year={2018}
}
```
### License
[Apache License 2.0](https://github.com/asyml/texar/blob/master/LICENSE)
================================================
FILE: texar_repo/docs/index.rst
================================================
.. texar documentation master file, created by
sphinx-quickstart on Mon Sep 4 21:15:05 2017.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to Texar's documentation!
*********************************
Texar is a modularized, versatile, and extensible toolkit for text generation tasks and beyond.
.. toctree::
:maxdepth: 1
get_started.md
.. toctree::
:maxdepth: 2
examples.md
API
====
.. toctree::
:maxdepth: 2
code/hyperparams.rst
code/data.rst
code/core.rst
code/modules.rst
code/agents.rst
code/losses.rst
code/evals.rst
code/models.rst
code/run.rst
code/context.rst
code/utils.rst
================================================
FILE: texar_repo/docs/make.bat
================================================
@ECHO OFF
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set BUILDDIR=_build
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% .
set I18NSPHINXOPTS=%SPHINXOPTS% .
if NOT "%PAPER%" == "" (
set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
)
if "%1" == "" goto help
if "%1" == "help" (
:help
echo.Please use `make ^<target^>` where ^<target^> is one of
echo. html to make standalone HTML files
echo. dirhtml to make HTML files named index.html in directories
echo. singlehtml to make a single large HTML file
echo. pickle to make pickle files
echo. json to make JSON files
echo. htmlhelp to make HTML files and a HTML help project
echo. qthelp to make HTML files and a qthelp project
echo. devhelp to make HTML files and a Devhelp project
echo. epub to make an epub
echo. epub3 to make an epub3
echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
echo. text to make text files
echo. man to make manual pages
echo. texinfo to make Texinfo files
echo. gettext to make PO message catalogs
echo. changes to make an overview over all changed/added/deprecated items
echo. xml to make Docutils-native XML files
echo. pseudoxml to make pseudoxml-XML files for display purposes
echo. linkcheck to check all external links for integrity
echo. doctest to run all doctests embedded in the documentation if enabled
echo. coverage to run coverage check of the documentation if enabled
echo. dummy to check syntax errors of document sources
goto end
)
if "%1" == "clean" (
for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
del /q /s %BUILDDIR%\*
goto end
)
REM Check if sphinx-build is available and fallback to Python version if any
%SPHINXBUILD% 1>NUL 2>NUL
if errorlevel 9009 goto sphinx_python
goto sphinx_ok
:sphinx_python
set SPHINXBUILD=python -m sphinx.__init__
%SPHINXBUILD% 2> nul
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
:sphinx_ok
if "%1" == "html" (
%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/html.
goto end
)
if "%1" == "dirhtml" (
%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
goto end
)
if "%1" == "singlehtml" (
%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
goto end
)
if "%1" == "pickle" (
%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the pickle files.
goto end
)
if "%1" == "json" (
%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the JSON files.
goto end
)
if "%1" == "htmlhelp" (
%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
goto end
)
if "%1" == "qthelp" (
%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
echo.^> qcollectiongenerator %BUILDDIR%\qthelp\texar.qhcp
echo.To view the help file:
echo.^> assistant -collectionFile %BUILDDIR%\qthelp\texar.ghc
goto end
)
if "%1" == "devhelp" (
%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished.
goto end
)
if "%1" == "epub" (
%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The epub file is in %BUILDDIR%/epub.
goto end
)
if "%1" == "epub3" (
%SPHINXBUILD% -b epub3 %ALLSPHINXOPTS% %BUILDDIR%/epub3
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The epub3 file is in %BUILDDIR%/epub3.
goto end
)
if "%1" == "latex" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
if errorlevel 1 exit /b 1
echo.
echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdf" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf
cd %~dp0
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdfja" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf-ja
cd %~dp0
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "text" (
%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The text files are in %BUILDDIR%/text.
goto end
)
if "%1" == "man" (
%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The manual pages are in %BUILDDIR%/man.
goto end
)
if "%1" == "texinfo" (
%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
goto end
)
if "%1" == "gettext" (
%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
goto end
)
if "%1" == "changes" (
%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
if errorlevel 1 exit /b 1
echo.
echo.The overview file is in %BUILDDIR%/changes.
goto end
)
if "%1" == "linkcheck" (
%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
if errorlevel 1 exit /b 1
echo.
echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
goto end
)
if "%1" == "doctest" (
%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
if errorlevel 1 exit /b 1
echo.
echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
goto end
)
if "%1" == "coverage" (
%SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage
if errorlevel 1 exit /b 1
echo.
echo.Testing of coverage in the sources finished, look at the ^
results in %BUILDDIR%/coverage/python.txt.
goto end
)
if "%1" == "xml" (
%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The XML files are in %BUILDDIR%/xml.
goto end
)
if "%1" == "pseudoxml" (
%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.
goto end
)
if "%1" == "dummy" (
%SPHINXBUILD% -b dummy %ALLSPHINXOPTS% %BUILDDIR%/dummy
if errorlevel 1 exit /b 1
echo.
echo.Build finished. Dummy builder generates no files.
goto end
)
:end
================================================
FILE: texar_repo/docs/requirements.txt
================================================
sphinx
sphinx-rtd-theme >= 0.2.4
sphinxcontrib-napoleon >= 0.6.1
Pygments >= 2.1.1
tensorflow >= 1.7.0
pyyaml
funcsigs
================================================
FILE: texar_repo/docs/tutorials/tutorial.rst
================================================
Getting Started
===============
Write an awesome tutorial here.
================================================
FILE: texar_repo/examples/README.md
================================================
# Examples #
Rich examples are included to demonstrate the use of Texar. The implementations of cutting-edge models/algorithms also provide references for reproducibility and comparisons.
More examples are continuously added...
## Examples by Models/Algorithms ##
### RNN / Seq2seq ###
* [language_model_ptb](./language_model_ptb): Basic RNN language model
* [distributed_gpu](./distributed_gpu): Basic RNN language model with distributed training
* [seq2seq_attn](./seq2seq_attn): Attentional seq2seq
* [seq2seq_configs](./seq2seq_configs): Seq2seq implemented with Texar model template
* [seq2seq_rl](./seq2seq_rl): Attentional seq2seq trained with policy gradient
* [seq2seq_exposure_bias](./seq2seq_exposure_bias): Various algorithms tackling exposure bias in sequence generation
* [hierarchical_dialog](./hierarchical_dialog): Hierarchical recurrent encoder-decoder model for conversation response generation
* [torchtext](./torchtext): Use of torchtext data loader
### Transformer (Self-attention) ###
* [transformer](./transformer): Transformer for machine translation
* [bert](./bert): Pre-trained BERT model for text representation
* [vae_text](./vae_text): VAE with a transformer decoder for improved language modeling
### Variational Autoencoder (VAE) ###
* [vae_text](./vae_text): VAE language model
### GANs / Discriminator-supervision ###
* [seqGAN](./seqgan): GANs for text generation
* [text_style_transfer](./text_style_transfer): Discriminator supervision for controlled text generation
### Reinforcement Learning ###
* [seq2seq_rl](./seq2seq_rl): Attentional seq2seq trained with policy gradient.
* [seqGAN](./seqgan): Policy gradient for sequence generation
* [rl_gym](./rl_gym): Various RL algorithms for games on OpenAI Gym
### Memory Network ###
* [memory_network_lm](./memory_network_lm): End-to-end memory network for language modeling
### Classifier / Sequence Prediction ###
* [bert](./bert): Pre-trained BERT model for text representation
* [sentence_classifier](./sentence_classifier): Basic CNN-based sentence classifier
* [sequence_tagging](./sequence_tagging): BiLSTM-CNN model for Named Entity Recognition (NER)
### Reward Augmented Maximum Likelihood (RAML) ###
* [seq2seq_exposure_bias](./seq2seq_exposure_bias): RAML and other learning algorithms for sequence generation
---
## Examples by Tasks
### Language Modeling ###
* [language_model_ptb](./language_model_ptb): Basic RNN language model
* [vae_text](./vae_text): VAE language model
* [seqGAN](./seqgan): GAN + policy gradient
* [memory_network_lm](./memory_network_lm): End-to-end memory network for language modeling
### Machine Translation ###
* [seq2seq_attn](./seq2seq_attn): Attentional seq2seq
* [seq2seq_configs](./seq2seq_configs): Seq2seq implemented with Texar model template.
* [seq2seq_rl](./seq2seq_rl): Attentional seq2seq trained with policy gradient.
* [seq2seq_exposure_bias](./seq2seq_exposure_bias): Various algorithms tackling exposure bias in sequence generation (MT and summarization as examples).
* [transformer](./transformer): Transformer for machine translation
### Dialog ###
* [hierarchical_dialog](./hierarchical_dialog): Hierarchical recurrent encoder-decoder model for conversation response generation.
### Text Summarization ###
* [seq2seq_exposure_bias](./seq2seq_exposure_bias): Various algorithms tackling exposure bias in sequence generation (MT and summarization as examples).
### Text Style Transfer ###
* [text_style_transfer](./text_style_transfer): Discriminator supervision for controlled text generation
### Classification ###
* [bert](./bert): Pre-trained BERT model for text representation
* [sentence_classifier](./sentence_classifier): Basic CNN-based sentence classifier
### Sequence Tagging ###
* [sequence_tagging](./sequence_tagging): BiLSTM-CNN model for Named Entity Recognition (NER)
### Games ###
* [rl_gym](./rl_gym): Various RL algorithms for games on OpenAI Gym
---
## MISC ##
### Distributed training ###
* [distributed_gpu](./distributed_gpu): Basic example of distributed training.
* [bert](./bert): Distributed training of BERT.
================================================
FILE: texar_repo/examples/bert/README.md
================================================
# BERT: Pre-trained models and downstream applications
This is a Texar implementation of Google's BERT model, which allows loading pre-trained model parameters downloaded from the [official release](https://github.com/google-research/bert) and building/fine-tuning arbitrary downstream applications with **distributed training** (this example showcases BERT for sentence classification).
With Texar, building the BERT model is as simple as creating a [`TransformerEncoder`](https://texar.readthedocs.io/en/latest/code/modules.html#transformerencoder) instance. We can initialize the parameters of the TransformerEncoder using a pre-trained BERT checkpoint by calling `init_bert_checkpoint(path_to_bert_checkpoint)`.
In sum, this example showcases:
* Use of pre-trained Google BERT models in Texar
* Building and fine-tuning on downstream tasks
* Distributed training of the models
## Quick Start
### Download Dataset
We explain the use of the example code based on the Microsoft Research Paraphrase Corpus (MRPC) corpus for sentence classification.
Download the data with the following command:
```
python data/download_glue_data.py --tasks=MRPC
```
By default, it will download the MRPC dataset into the `data` directory. The MRPC dataset is part of the [GLUE](https://gluebenchmark.com/tasks) dataset collection.
### Download BERT Pre-train Model
```
sh bert_pretrained_models/download_model.sh
```
By default, it will download a pretrained model (BERT-Base Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters) named `uncased_L-12_H-768_A-12` to `bert_pretrained_models/`.
Under `bert_pretrained_models/uncased_L-12_H-768_A-12`, you can find 5 files, where
- `bert_config.json` is the model configuration of the BERT model. For the particular model we just downloaded, it is an uncased-vocabulary, 12-layer, 768-hidden, 12-heads Transformer model.
### Train and Evaluate
For **single-GPU** training (and evaluation), run the following command. The training updates the classification layer and fine-tunes the pre-trained BERT parameters.
```
python bert_classifier_main.py --do_train --do_eval
[--task=mrpc]
[--config_bert_pretrain=uncased_L-12_H-768_A-12]
[--config_downstream=config_classifier]
[--config_data=config_data_mrpc]
[--output_dir=output]
```
Here:
- `task`: Specifies which dataset to experiment on.
- `config_bert_pretrain`: Specifies the architecture of pre-trained BERT model to use.
- `config_downstream`: Configuration of the downstream part. In this example, [`config_classifier.py`](https://github.com/asyml/texar/blob/master/examples/bert/config_classifier.py) configures the classification layer and the optimization method.
- `config_data`: The data configuration.
- `output_dir`: The output path where checkpoints and summaries for tensorboard visualization are saved.
For **multi-GPU training** on one or multiple machines, you may first install the prerequisite OpenMPI and Horovod packages, as detailed in the [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example.
Then run the following command for training and evaluation. It trains the model locally with 2 GPUs; evaluation is performed on the single rank-0 GPU.
```
mpirun -np 2 \
-H localhost:2\
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl tcp,self \
-mca btl_tcp_if_include ens3 \
python bert_classifier_main.py --do_train --do_eval --distributed
[--task=mrpc]
[--config_bert_pretrain=uncased_L-12_H-768_A-12]
[--config_downstream=config_classifier]
[--config_data=config_data_mrpc]
[--output_dir=output]
```
The key configurations of multi-GPU training:
* `-np`: total number of processes
* `-H`: IP addresses of different servers and the number of processes used in each server. For example, `-H 192.168.11.22:1,192.168.33.44:1`
Please refer to [distributed_gpu](https://github.com/asyml/texar/tree/master/examples/distributed_gpu) example for more details of the other multi-gpu configurations.
Note that we also specified the `--distributed` flag for multi-GPU training.
After convergence, the evaluation performance is around the following. Due to certain randomness (e.g., random initialization of the classification layer), the evaluation accuracy is reasonable as long as it's `>0.84`.
```
INFO:tensorflow:dev accu: 0.8676470588235294
```
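The dev accuracy above is a sample-weighted average over the evaluation batches: each batch's accuracy is weighted by its size, so a smaller final batch does not skew the result (this mirrors the `cum_acc`/`nsamples` accumulation in `_run` of `bert_classifier_main.py`). A minimal sketch:

```python
def weighted_accuracy(batches):
    """Sample-weighted accuracy over (batch_accuracy, batch_size) pairs."""
    cum_acc, nsamples = 0.0, 0
    for accu, batch_size in batches:
        cum_acc += accu * batch_size  # weight each batch by its size
        nsamples += batch_size
    return cum_acc / nsamples

# A full batch of 32 at 0.875 accuracy plus a final partial batch of 8 at 1.0.
print(weighted_accuracy([(0.875, 32), (1.0, 8)]))  # → 0.9
```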
### Restore and Test
```
python bert_classifier_main.py --do_test --checkpoint=output/model.ckpt
```
The output is by default saved in `output/test_results.tsv`, where each line contains the predicted label for each sample.
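Each line of `test_results.tsv` holds one predicted label index, written with `'\n'.join(...)` as in `bert_classifier_main.py`. A small sketch of writing that format and reading it back (the reader is a hypothetical helper, not part of the example code):

```python
def format_predictions(preds):
    """Serialize predicted label ids, one per line, as in test_results.tsv."""
    return '\n'.join(str(p) for p in preds)

def parse_predictions(text):
    """Read the per-line label ids back as ints."""
    return [int(line) for line in text.splitlines() if line]

content = format_predictions([0, 1, 1, 0])
print(parse_predictions(content))  # → [0, 1, 1, 0]
```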
## Use other datasets/tasks
`bert_classifier_main.py` also supports other datasets/tasks. To do so, specify a different value for the `--task` flag, and use a corresponding data configuration file.
For example, use the following commands to download the SST (Stanford Sentiment Treebank) dataset and run for sentence classification.
```
python data/download_glue_data.py --tasks=SST
python bert_classifier_main.py --do_train --do_eval --task=sst --config_data=config_data_sst
```
================================================
FILE: texar_repo/examples/bert/bert_classifier_main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example of building a sentence classifier based on pre-trained BERT
model.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import importlib
import tensorflow as tf
import texar as tx
from utils import data_utils, model_utils, tokenization
# pylint: disable=invalid-name, too-many-locals, too-many-statements
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string(
"task", "mrpc",
"The task to run experiment on. One of "
"{'cola', 'mnli', 'mrpc', 'xnli', 'sst'}.")
flags.DEFINE_string(
"config_bert_pretrain", 'uncased_L-12_H-768_A-12',
"The architecture of pre-trained BERT model to use.")
flags.DEFINE_string(
"config_format_bert", "json",
"The configuration format. Set to 'json' if the BERT config file is in "
"the same format of the official BERT config file. Set to 'texar' if the "
"BERT config file is in Texar format.")
flags.DEFINE_string(
"config_downstream", "config_classifier",
"Configuration of the downstream part of the model and optmization.")
flags.DEFINE_string(
"config_data", "config_data_mrpc",
"The dataset config.")
flags.DEFINE_string(
"checkpoint", None,
"Path to a model checkpoint (including bert modules) to restore from.")
flags.DEFINE_string(
"output_dir", "output/",
"The output directory where the model checkpoints will be written.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool("do_test", False, "Whether to run test on the test set.")
flags.DEFINE_bool("distributed", False, "Whether to run in distributed mode.")
config_data = importlib.import_module(FLAGS.config_data)
config_downstream = importlib.import_module(FLAGS.config_downstream)
def main(_):
"""
Builds the model and runs.
"""
if FLAGS.distributed:
import horovod.tensorflow as hvd
hvd.init()
tf.logging.set_verbosity(tf.logging.INFO)
tx.utils.maybe_create_dir(FLAGS.output_dir)
bert_pretrain_dir = 'bert_pretrained_models/%s' % FLAGS.config_bert_pretrain
# Loads BERT model configuration
if FLAGS.config_format_bert == "json":
bert_config = model_utils.transform_bert_to_texar_config(
os.path.join(bert_pretrain_dir, 'bert_config.json'))
elif FLAGS.config_format_bert == 'texar':
bert_config = importlib.import_module(
'bert_config_lib.config_model_%s' % FLAGS.config_bert_pretrain)
else:
raise ValueError('Unknown config_format_bert.')
# Loads data
processors = {
"cola": data_utils.ColaProcessor,
"mnli": data_utils.MnliProcessor,
"mrpc": data_utils.MrpcProcessor,
"xnli": data_utils.XnliProcessor,
'sst': data_utils.SSTProcessor
}
processor = processors[FLAGS.task.lower()]()
num_classes = len(processor.get_labels())
num_train_data = len(processor.get_train_examples(config_data.data_dir))
tokenizer = tokenization.FullTokenizer(
vocab_file=os.path.join(bert_pretrain_dir, 'vocab.txt'),
do_lower_case=FLAGS.do_lower_case)
train_dataset = data_utils.get_dataset(
processor, tokenizer, config_data.data_dir, config_data.max_seq_length,
config_data.train_batch_size, mode='train', output_dir=FLAGS.output_dir,
is_distributed=FLAGS.distributed)
eval_dataset = data_utils.get_dataset(
processor, tokenizer, config_data.data_dir, config_data.max_seq_length,
config_data.eval_batch_size, mode='eval', output_dir=FLAGS.output_dir)
test_dataset = data_utils.get_dataset(
processor, tokenizer, config_data.data_dir, config_data.max_seq_length,
config_data.test_batch_size, mode='test', output_dir=FLAGS.output_dir)
iterator = tx.data.FeedableDataIterator({
'train': train_dataset, 'eval': eval_dataset, 'test': test_dataset})
batch = iterator.get_next()
input_ids = batch["input_ids"]
segment_ids = batch["segment_ids"]
batch_size = tf.shape(input_ids)[0]
input_length = tf.reduce_sum(1 - tf.to_int32(tf.equal(input_ids, 0)),
axis=1)
# Builds BERT
with tf.variable_scope('bert'):
embedder = tx.modules.WordEmbedder(
vocab_size=bert_config.vocab_size,
hparams=bert_config.embed)
word_embeds = embedder(input_ids)
# Creates segment embeddings for each type of tokens.
segment_embedder = tx.modules.WordEmbedder(
vocab_size=bert_config.type_vocab_size,
hparams=bert_config.segment_embed)
segment_embeds = segment_embedder(segment_ids)
input_embeds = word_embeds + segment_embeds
# The BERT model (a TransformerEncoder)
encoder = tx.modules.TransformerEncoder(hparams=bert_config.encoder)
output = encoder(input_embeds, input_length)
# Builds layers for downstream classification, which is also initialized
# with BERT pre-trained checkpoint.
with tf.variable_scope("pooler"):
# Uses the projection of the 1st-step hidden vector of BERT output
# as the representation of the sentence
bert_sent_hidden = tf.squeeze(output[:, 0:1, :], axis=1)
bert_sent_output = tf.layers.dense(
bert_sent_hidden, config_downstream.hidden_dim,
activation=tf.tanh)
output = tf.layers.dropout(
bert_sent_output, rate=0.1, training=tx.global_mode_train())
# Adds the final classification layer
logits = tf.layers.dense(
output, num_classes,
kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
preds = tf.argmax(logits, axis=-1, output_type=tf.int32)
accu = tx.evals.accuracy(batch['label_ids'], preds)
# Optimization
loss = tf.losses.sparse_softmax_cross_entropy(
labels=batch["label_ids"], logits=logits)
global_step = tf.Variable(0, trainable=False)
# Builds learning rate decay scheduler
static_lr = config_downstream.lr['static_lr']
num_train_steps = int(num_train_data / config_data.train_batch_size
* config_data.max_train_epoch)
num_warmup_steps = int(num_train_steps * config_data.warmup_proportion)
lr = model_utils.get_lr(global_step, num_train_steps, # lr is a Tensor
num_warmup_steps, static_lr)
opt = tx.core.get_optimizer(
global_step=global_step,
learning_rate=lr,
hparams=config_downstream.opt
)
if FLAGS.distributed:
opt = hvd.DistributedOptimizer(opt)
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
global_step=global_step,
learning_rate=None,
optimizer=opt)
# Train/eval/test routine
def _run(sess, mode):
fetches = {
'accu': accu,
'batch_size': batch_size,
'step': global_step,
'loss': loss,
'input_ids': input_ids,
}
if mode == 'train':
fetches['train_op'] = train_op
while True:
try:
feed_dict = {
iterator.handle: iterator.get_handle(sess, 'train'),
tx.global_mode(): tf.estimator.ModeKeys.TRAIN,
}
rets = sess.run(fetches, feed_dict)
if rets['step'] % 50 == 0:
tf.logging.info(
'step:%d loss:%f' % (rets['step'], rets['loss']))
if rets['step'] == num_train_steps:
break
except tf.errors.OutOfRangeError:
break
if mode == 'eval':
cum_acc = 0.0
nsamples = 0
while True:
try:
feed_dict = {
iterator.handle: iterator.get_handle(sess, 'eval'),
tx.context.global_mode(): tf.estimator.ModeKeys.EVAL,
}
rets = sess.run(fetches, feed_dict)
cum_acc += rets['accu'] * rets['batch_size']
nsamples += rets['batch_size']
except tf.errors.OutOfRangeError:
break
tf.logging.info('dev accu: {} nsamples: {}'.format(cum_acc / nsamples, nsamples))
if mode == 'test':
_all_preds = []
while True:
try:
feed_dict = {
iterator.handle: iterator.get_handle(sess, 'test'),
tx.context.global_mode(): tf.estimator.ModeKeys.PREDICT,
}
_preds = sess.run(preds, feed_dict=feed_dict)
_all_preds.extend(_preds.tolist())
except tf.errors.OutOfRangeError:
break
output_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
with tf.gfile.GFile(output_file, "w") as writer:
writer.write('\n'.join(str(p) for p in _all_preds))
# Loads pretrained BERT model parameters
init_checkpoint = os.path.join(bert_pretrain_dir, 'bert_model.ckpt')
model_utils.init_bert_checkpoint(init_checkpoint)
# broadcast global variables from rank-0 process
if FLAGS.distributed:
bcast = hvd.broadcast_global_variables(0)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
if FLAGS.distributed:
bcast.run()
# Restores trained model if specified
saver = tf.train.Saver()
if FLAGS.checkpoint:
saver.restore(sess, FLAGS.checkpoint)
iterator.initialize_dataset(sess)
if FLAGS.do_train:
iterator.restart_dataset(sess, 'train')
_run(sess, mode='train')
saver.save(sess, FLAGS.output_dir + '/model.ckpt')
if FLAGS.do_eval:
iterator.restart_dataset(sess, 'eval')
_run(sess, mode='eval')
if FLAGS.do_test:
iterator.restart_dataset(sess, 'test')
_run(sess, mode='test')
if __name__ == "__main__":
tf.app.run()
================================================
FILE: texar_repo/examples/bert/bert_config_lib/README.md
================================================
### Configuration files of BERT models in Texar style.
For example, `config_model_uncased_L-12_H-768_A-12.py` is the Texar configuration file equivalent to `uncased_L-12_H-768_A-12` downloaded from [BERT official release](https://github.com/haoransh/texar_private/tree/master/examples/bert).
================================================
FILE: texar_repo/examples/bert/bert_config_lib/__init__.py
================================================
================================================
FILE: texar_repo/examples/bert/bert_config_lib/config_model_uncased_L-12_H-768_A-12.py
================================================
embed = {
'dim': 768,
'name': 'word_embeddings'
}
vocab_size = 30522
segment_embed = {
'dim': 768,
'name': 'token_type_embeddings'
}
type_vocab_size = 2
encoder = {
'dim': 768,
'embedding_dropout': 0.1,
'multihead_attention': {
'dropout_rate': 0.1,
'name': 'self',
'num_heads': 12,
'num_units': 768,
'output_dim': 768,
'use_bias': True
},
'name': 'encoder',
'num_blocks': 12,
'position_embedder_hparams': {
'dim': 768
},
'position_embedder_type': 'variables',
'position_size': 512,
'poswise_feedforward': {
'layers': [
{ 'kwargs': {
'activation': 'gelu',
'name': 'intermediate',
'units': 3072,
'use_bias': True
},
'type': 'Dense'
},
{ 'kwargs': {'activation': None,
'name': 'output',
'units': 768,
'use_bias': True
},
'type': 'Dense'
}
]
},
'residual_dropout': 0.1,
'use_bert_config': True
}
output_size = 768 # The output dimension of BERT
================================================
FILE: texar_repo/examples/bert/config_classifier.py
================================================
hidden_dim = 768
opt = {
'optimizer': {
'type': 'AdamWeightDecayOptimizer',
'kwargs': {
'weight_decay_rate': 0.01,
'beta_1': 0.9,
'beta_2': 0.999,
'epsilon': 1e-6,
'exclude_from_weight_decay': ['LayerNorm', 'layer_norm', 'bias']
}
},
'gradient_clip': {
'type': 'clip_by_global_norm',
'kwargs': {
'clip_norm': 1.0,
}
}
}
# By default, we use warmup and linear decay for learning rate
lr = {
'static_lr': 2e-5,
}
================================================
FILE: texar_repo/examples/bert/config_data_mrpc.py
================================================
data_dir = 'data/MRPC'
train_batch_size = 32
max_seq_length = 128
eval_batch_size = 8
test_batch_size = 8
max_train_epoch = 3
warmup_proportion = 0.1
================================================
FILE: texar_repo/examples/bert/config_data_sst.py
================================================
data_dir = 'data/SST-2'
train_batch_size = 32
max_seq_length = 128
eval_batch_size = 8
test_batch_size = 8
max_train_epoch = 3
warmup_proportion = 0.1
================================================
FILE: texar_repo/examples/bert/utils/data_utils.py
================================================
"""
This is the Data Loading Pipeline for Sentence Classifier Task from
https://github.com/google-research/bert/blob/master/run_classifier.py
"""
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import csv
import collections
import sys
sys.path.append(os.path.dirname(__file__))
import tokenization
import tensorflow as tf
class InputExample():
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence.
For single sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second
sequence. Only needs to be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
class InputFeatures():
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_id):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
class SSTProcessor(DataProcessor):
"""Processor for the MRPC data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
if set_type == 'train' or set_type == 'dev':
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = tokenization.convert_to_unicode(line[0])
# Single sentence classification, text_b doesn't exist
text_b = None
label = tokenization.convert_to_unicode(line[1])
examples.append(InputExample(guid=guid, text_a=text_a,
text_b=text_b, label=label))
if set_type == 'test':
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = tokenization.convert_to_unicode(line[1])
# Single sentence classification, text_b doesn't exist
text_b = None
label = '0' # arbitrarily set to '0' for the unlabeled test set
examples.append(InputExample(guid=guid, text_a=text_a,
text_b=text_b, label=label))
return examples
class XnliProcessor(DataProcessor):
"""Processor for the XNLI data set."""
def __init__(self):
self.language = "zh"
def get_train_examples(self, data_dir):
"""See base class."""
lines = self._read_tsv(
os.path.join(data_dir, "multinli",
"multinli.train.%s.tsv" % self.language))
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "train-%d" % (i)
text_a = tokenization.convert_to_unicode(line[0])
text_b = tokenization.convert_to_unicode(line[1])
label = tokenization.convert_to_unicode(line[2])
if label == tokenization.convert_to_unicode("contradictory"):
label = tokenization.convert_to_unicode("contradiction")
examples.append(InputExample(guid=guid, text_a=text_a,
text_b=text_b, label=label))
return examples
def get_dev_examples(self, data_dir):
"""See base class."""
lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "dev-%d" % (i)
language = tokenization.convert_to_unicode(line[0])
if language != tokenization.convert_to_unicode(self.language):
continue
text_a = tokenization.convert_to_unicode(line[6])
text_b = tokenization.convert_to_unicode(line[7])
label = tokenization.convert_to_unicode(line[1])
examples.append(InputExample(guid=guid, text_a=text_a,
text_b=text_b, label=label))
return examples
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
class MnliProcessor(DataProcessor):
"""Processor for the MultiNLI data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
"dev_matched")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test_matched.tsv")),
"test")
def get_labels(self):
"""See base class."""
return ["contradiction", "entailment", "neutral"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type,
tokenization.convert_to_unicode(line[0]))
text_a = tokenization.convert_to_unicode(line[8])
text_b = tokenization.convert_to_unicode(line[9])
if set_type == "test":
label = "contradiction"
else:
label = tokenization.convert_to_unicode(line[-1])
examples.append(InputExample(guid=guid, text_a=text_a,
text_b=text_b, label=label))
return examples
class MrpcProcessor(DataProcessor):
"""Processor for the MRPC data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")),
"train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")),
"dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.tsv")),
"test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = "%s-%s" % (set_type, i)
text_a = tokenization.convert_to_unicode(line[3])
text_b = tokenization.convert_to_unicode(line[4])
if set_type == "test":
label = "0"
else:
label = tokenization.convert_to_unicode(line[0])
examples.append(InputExample(guid=guid, text_a=text_a,
text_b=text_b, label=label))
return examples
class ColaProcessor(DataProcessor):
"""Processor for the CoLA data set (GLUE version)."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")),
"train")
def get_dev_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "dev.tsv")),
"dev")
def get_test_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "test.tsv")),
"test")
def get_labels(self):
"""See base class."""
return ["0", "1"]
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
# Only the test set has a header
if set_type == "test" and i == 0:
continue
guid = "%s-%s" % (set_type, i)
if set_type == "test":
text_a = tokenization.convert_to_unicode(line[1])
label = "0"
else:
text_a = tokenization.convert_to_unicode(line[3])
label = tokenization.convert_to_unicode(line[1])
examples.append(InputExample(guid=guid, text_a=text_a,
text_b=None, label=label))
return examples
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention rule is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# segment_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# segment_ids: 0 0 0 0 0 0 0
#
# Where "segment_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
# here we disable the verbose printing of the data
if ex_index < 0:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_ids length: %d" % len(input_ids))
tf.logging.info("input_mask: %s" %\
" ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" %\
" ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id)
return feature
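The packing convention described in the comments above can be sketched as a standalone snippet. `pack_pair` is a hypothetical helper written for illustration, not part of this repo; the real code pads the integer `input_ids` with 0, whereas this sketch pads with a `"[PAD]"` string for readability.

```python
# Minimal sketch of the [CLS]/[SEP] packing used by convert_single_example:
# [CLS] tokens_a [SEP] tokens_b [SEP], segment id 0 for the first segment,
# 1 for the second, then zero-padding up to max_seq_length.
def pack_pair(tokens_a, tokens_b, max_seq_length):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    tokens += tokens_b + ["[SEP]"]
    segment_ids += [1] * (len(tokens_b) + 1)
    # The mask has 1 for real tokens and 0 for padding tokens.
    input_mask = [1] * len(tokens)
    pad = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * pad       # real code pads input_ids with 0 instead
    segment_ids += [0] * pad
    input_mask += [0] * pad
    return tokens, segment_ids, input_mask

tokens, seg, mask = pack_pair(["is", "this"], ["no", "it"], 10)
```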
def file_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file):
"""Convert a set of `InputExample`s to a TFRecord file."""
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
def create_int_feature(values):
return tf.train.Feature(
int64_list=tf.train.Int64List(value=list(values)))
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
tf_example = tf.train.Example(
features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
def file_based_input_fn_builder(input_file, seq_length, is_training,
drop_remainder, is_distributed=False):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([], tf.int64),
}
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
d = tf.data.TFRecordDataset(input_file)
if is_training:
if is_distributed:
import horovod.tensorflow as hvd
tf.logging.info('distributed mode is enabled. '
'size:{} rank:{}'.format(hvd.size(), hvd.rank()))
# https://github.com/uber/horovod/issues/223
d = d.shard(hvd.size(), hvd.rank())
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size//hvd.size(),
drop_remainder=drop_remainder))
else:
tf.logging.info('distributed mode is not enabled.')
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
else:
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
return d
return input_fn
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal
# percent of tokens from each, since if one sequence is very short then
# each token that's truncated likely contains more information than a
# longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
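The longer-sequence-first heuristic above can be demonstrated with a concrete pair. `truncate_pair` is a self-contained re-sketch of the same logic (the repo's `_truncate_seq_pair` is not importable from this snippet):

```python
# Truncate a token pair in place to max_length, always popping from the
# currently longer sequence, mirroring _truncate_seq_pair above.
def truncate_pair(tokens_a, tokens_b, max_length):
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()  # drop the last token of the longer sequence

a = ["a1", "a2", "a3", "a4", "a5"]
b = ["b1", "b2"]
truncate_pair(a, b, 5)  # only the longer list `a` loses tokens
```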
def get_dataset(processor,
tokenizer,
data_dir,
max_seq_length,
batch_size,
mode,
output_dir,
is_distributed=False):
"""
Args:
processor: Data preprocessor; must have get_labels and
get_train/dev/test_examples methods defined.
tokenizer: The Sentence Tokenizer. Generally should be
SentencePiece Model.
data_dir: The input data directory.
max_seq_length: Max sequence length.
batch_size: mini-batch size.
mode: `train`, `eval` or `test`.
output_dir: The directory to save the TFRecords in.
"""
label_list = processor.get_labels()
if mode == 'train':
train_examples = processor.get_train_examples(data_dir)
train_file = os.path.join(output_dir, "train.tf_record")
file_based_convert_examples_to_features(
train_examples, label_list, max_seq_length,
tokenizer, train_file)
dataset = file_based_input_fn_builder(
input_file=train_file,
seq_length=max_seq_length,
is_training=True,
drop_remainder=True,
is_distributed=is_distributed)({'batch_size': batch_size})
elif mode == 'eval':
eval_examples = processor.get_dev_examples(data_dir)
eval_file = os.path.join(output_dir, "eval.tf_record")
file_based_convert_examples_to_features(
eval_examples, label_list, max_seq_length, tokenizer, eval_file)
dataset = file_based_input_fn_builder(
input_file=eval_file,
seq_length=max_seq_length,
is_training=False,
drop_remainder=False)({'batch_size': batch_size})
elif mode == 'test':
test_examples = processor.get_test_examples(data_dir)
test_file = os.path.join(output_dir, "predict.tf_record")
file_based_convert_examples_to_features(
test_examples, label_list, max_seq_length, tokenizer, test_file)
dataset = file_based_input_fn_builder(
input_file=test_file,
seq_length=max_seq_length,
is_training=False,
drop_remainder=False)({'batch_size': batch_size})
return dataset
================================================
FILE: texar_repo/examples/bert/utils/model_utils.py
================================================
"""
Model utility functions
"""
import json
import collections
import re
import random
import tensorflow as tf
import numpy as np
from texar import HParams
"""
Load the Json config file and transform it into Texar style configuration.
"""
def transform_bert_to_texar_config(input_json):
config_ckpt = json.loads(
open(input_json).read())
configs = {}
configs['random_seed'] = 123
configs['hidden_size'] = config_ckpt['hidden_size']
hidden_dim = config_ckpt['hidden_size']
configs['embed'] = {
'name': 'word_embeddings',
'dim': hidden_dim}
configs['vocab_size'] = config_ckpt['vocab_size']
configs['segment_embed'] = {
'name': 'token_type_embeddings',
'dim': hidden_dim}
configs['type_vocab_size'] = config_ckpt['type_vocab_size']
configs['encoder'] = {
'name': 'encoder',
'position_embedder_type': 'variables',
'position_size': config_ckpt['max_position_embeddings'],
'position_embedder_hparams': {
'dim': hidden_dim,
},
'embedding_dropout': config_ckpt['hidden_dropout_prob'],
'num_blocks': config_ckpt['num_hidden_layers'],
'multihead_attention': {
'use_bias': True,
'num_units': hidden_dim,
'num_heads': config_ckpt['num_attention_heads'],
'output_dim': hidden_dim,
'dropout_rate': config_ckpt['attention_probs_dropout_prob'],
'name': 'self'
},
'residual_dropout': config_ckpt['hidden_dropout_prob'],
'dim': hidden_dim,
'use_bert_config': True,
'poswise_feedforward': {
"layers": [
{
'type': 'Dense',
'kwargs': {
'name': 'intermediate',
'units': config_ckpt['intermediate_size'],
'activation': config_ckpt['hidden_act'],
'use_bias': True,
}
},
{
'type': 'Dense',
'kwargs': {
'name': 'output',
'units': hidden_dim,
'activation': None,
'use_bias': True,
}
},
],
},
}
return HParams(configs, default_hparams=None)
def get_lr(global_step, num_train_steps, num_warmup_steps, static_lr):
"""
Calculate the learinng rate given global step and warmup steps.
The learinng rate is following a linear warmup and linear decay.
"""
learning_rate = tf.constant(value=static_lr,
shape=[], dtype=tf.float32)
learning_rate = tf.train.polynomial_decay(
learning_rate,
global_step,
num_train_steps,
end_learning_rate=0.0,
power=1.0,
cycle=False)
if num_warmup_steps:
global_steps_int = tf.cast(global_step, tf.int32)
warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
global_steps_float = tf.cast(global_steps_int, tf.float32)
warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
warmup_percent_done = global_steps_float / warmup_steps_float
warmup_learning_rate = static_lr * warmup_percent_done
is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
learning_rate = ((1.0 - is_warmup) * learning_rate
+ is_warmup * warmup_learning_rate)
return learning_rate
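The schedule `get_lr` builds can be sketched in plain Python to show the shape of the curve: linear warmup from 0 to `static_lr` over the warmup steps, then linear (polynomial power-1) decay to 0 at `num_train_steps`. `lr_at` is a hypothetical illustration of the math, not the repo's TF graph code.

```python
# Pure-Python sketch of the warmup + linear-decay schedule above.
def lr_at(step, num_train_steps, num_warmup_steps, static_lr):
    if step < num_warmup_steps:
        # linear warmup: fraction of warmup completed times static_lr
        return static_lr * step / num_warmup_steps
    # tf.train.polynomial_decay with power=1.0 and end_learning_rate=0.0
    return static_lr * (1.0 - step / num_train_steps)

# e.g. static_lr=2e-5 with 1000 train steps and 100 warmup steps:
# lr rises to 2e-5 by step 100, then decays linearly to 0 at step 1000.
```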
def _get_assignment_map_from_checkpoint(tvars, init_checkpoint):
"""
Compute the union of the current variables and checkpoint variables.
Because the variable scope of the original BERT and Texar implementation,
we need to build a assignment map to match the variables.
"""
assignment_map = {}
initialized_variable_names = {}
name_to_variable = collections.OrderedDict()
for var in tvars:
name = var.name
m = re.match("^(.*):\\d+$", name)
if m is not None:
name = m.group(1)
name_to_variable[name] = var
init_vars = tf.train.list_variables(init_checkpoint)
assignment_map = {
'bert/embeddings/word_embeddings': 'bert/word_embeddings/w',
'bert/embeddings/token_type_embeddings': 'bert/token_type_embeddings/w',
'bert/embeddings/position_embeddings':
'bert/encoder/position_embedder/w',
'bert/embeddings/LayerNorm/beta': 'bert/encoder/LayerNorm/beta',
'bert/embeddings/LayerNorm/gamma': 'bert/encoder/LayerNorm/gamma',
}
for check_name, model_name in assignment_map.items():
initialized_variable_names[model_name] = 1
initialized_variable_names[model_name + ":0"] = 1
for check_name, shape in init_vars:
if check_name.startswith('bert'):
if check_name.startswith('bert/embeddings'):
continue
model_name = re.sub(
r'layer_\d+/output/dense',
lambda x: x.group(0).replace('output/dense', 'ffn/output'),
check_name)
if model_name == check_name:
model_name = re.sub(
r'layer_\d+/output/LayerNorm',
lambda x: x.group(0).replace('output/LayerNorm',
'ffn/LayerNorm'),
check_name)
if model_name == check_name:
model_name = re.sub(
r'layer_\d+/intermediate/dense',
lambda x: x.group(0).replace('intermediate/dense',
'ffn/intermediate'),
check_name)
if model_name == check_name:
model_name = re.sub('attention/output/dense',
'attention/self/output', check_name)
if model_name == check_name:
model_name = check_name.replace('attention/output/LayerNorm',
'output/LayerNorm')
assert model_name in name_to_variable.keys(),\
'model name: {} does not exist!'.format(model_name)
assignment_map[check_name] = model_name
initialized_variable_names[model_name] = 1
initialized_variable_names[model_name + ":0"] = 1
return (assignment_map, initialized_variable_names)
def init_bert_checkpoint(init_checkpoint):
tvars = tf.trainable_variables()
initialized_variable_names = []
if init_checkpoint:
(assignment_map, initialized_variable_names
) = _get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
def set_random_seed(myseed):
tf.set_random_seed(myseed)
np.random.seed(myseed)
random.seed(myseed)
================================================
FILE: texar_repo/examples/bert/utils/tokenization.py
================================================
# coding=utf-8
# Copied from google BERT repo.
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import collections
import unicodedata
import tensorflow as tf
def convert_to_unicode(text):
"""Returns the given argument as a unicode string."""
return tf.compat.as_text(text)
def printable_text(text):
"""Returns text encoded in a way suitable for print or `tf.logging`."""
return tf.compat.as_str_any(text)
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
index = 0
with tf.gfile.GFile(vocab_file, "r") as reader:
while True:
token = tf.compat.as_text(reader.readline())
if not token:
break
token = token.strip()
vocab[token] = index
index += 1
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a peice of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = tf.compat.as_text(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode
# block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean
# characters, despite its name.
# The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to
# write space-separated words, so they are not treated specially and
# handled like the all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens.
This should have already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = tf.compat.as_text(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
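The greedy longest-match-first WordPiece algorithm implemented in `WordpieceTokenizer.tokenize` above can be illustrated with a minimal standalone sketch. The tiny vocabulary here is purely for demonstration, not part of any real BERT vocab:

```python
def wordpiece(token, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single token."""
    chars = list(token)
    start, pieces = 0, []
    while start < len(chars):
        end = len(chars)
        cur = None
        # Try the longest possible substring first, shrinking until a match.
        while start < end:
            sub = "".join(chars[start:end])
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the "##" prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matched: the whole token is unknown
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```

This mirrors the `is_bad` / `sub_tokens` loop above, with the per-word character cap omitted for brevity.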
================================================
FILE: texar_repo/examples/distributed_gpu/README.md
================================================
# Model Training with Multi/Distributed GPUs
This example shows how models built with Texar can be trained with multiple GPUs on single or multiple machines. Multi/Distributed-GPU training is based on the third-party library [Horovod](https://github.com/uber/horovod).
Here we take the language model as an example, adapting the [single-GPU language model example](https://github.com/asyml/texar/tree/master/examples/language_model_ptb) by adding a few lines of Horovod-related code to enable distributed training (more details below).
## Prerequisites
Two third-party packages are required:
* `openmpi >= 3.0.0`
* `horovod`
The following commands install [OpenMPI](https://www.open-mpi.org) 4.0.0 to the path `/usr/local/openmpi`. Run `mpirun --version` to check the version of the installed OpenMPI.
```
# Download and install OpenMPI
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz
tar xvf openmpi-4.0.0.tar.gz
cd openmpi-4.0.0/
./configure --prefix=/usr/local/openmpi
sudo make all install
# Add path of the installed OpenMPI to your system path
export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
```
Then install Horovod with the command:
```
pip install horovod
```
## Adapting Single-GPU Code for Distributed Training
Based on the [single-GPU code](https://github.com/asyml/texar/tree/master/examples/language_model_ptb), we made the following adaptations. Note that one processor is created for each GPU.
- Setting up Horovod in the code (click the links below to see the corresponding actual code in `lm_ptb_distributed.py`):
1. [`hvd.init()`](https://github.com/asyml/texar/blob/master/examples/distributed_gpu/lm_ptb_distributed.py#L76): initialize Horovod
2. [`hvd.DistributedOptimizer`](https://github.com/asyml/texar/blob/master/examples/distributed_gpu/lm_ptb_distributed.py#L131): wrap your optimizer.
3. [`hvd.broadcast_global_variables(0)`](https://github.com/asyml/texar/blob/master/examples/distributed_gpu/lm_ptb_distributed.py#L191): set the operator to broadcast your global variables to different processes from rank-0 process.
4. [set visible GPU list](https://github.com/asyml/texar/blob/master/examples/distributed_gpu/lm_ptb_distributed.py#L194) by `config.gpu_options.visible_device_list = str(hvd.local_rank())`, to make each process see the attached single GPU.
5. [run the broadcast node](https://github.com/asyml/texar/blob/master/examples/distributed_gpu/lm_ptb_distributed.py#L203): run the broadcast operator before training
- Data sharding:
1. To make sure different GPUs (processors) receive different data batches in each iteration, we [shard the training data](https://github.com/asyml/texar/blob/master/examples/distributed_gpu/ptb_reader.py#L52) into `N` parts, where `N` is the number of GPUs (processors).
2. In this example, `batch_size` in the config files denotes the total batch size in each iteration of all processors. That is, in each iteration, each processor receives `batch_size`/`N` data instances. This replicates the gradients in the single-GPU setting, and we use the same `learning_rate` as in single-GPU.
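The sharding scheme described above can be sketched in plain Python, with `size` and `rank` standing in for `hvd.size()` and `hvd.rank()` (a hypothetical standalone helper, not the example's actual code):

```python
def shard_batch(data, size, rank):
    """Give each of `size` workers a contiguous len(data) // size slice."""
    shard_size = len(data) // size
    return data[rank * shard_size:(rank + 1) * shard_size]

batch = list(range(20))  # total batch_size = 20 instances per iteration
shards = [shard_batch(batch, size=2, rank=r) for r in range(2)]  # N = 2 GPUs
print(shards[0])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(shards[1])  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

Each worker thus sees `batch_size`/`N` instances per iteration, so the aggregated gradient matches the single-GPU setting and the same `learning_rate` can be reused.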
## Usage ##
Run the following command to train the model with multiple GPUs on multiple machines:
```
mpirun -np 2 \
-H [IP-address-of-server1]:1,[IP-address-of-server2]:1\
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl tcp,self \
-mca btl_tcp_if_include ens3 \
python lm_ptb_distributed.py --config config_small --data_path ./
```
Here:
* The key configurations for ordinary users:
- `-np`: total number of processes
- `-H`: IP addresses of different servers and the number of processes used in each server. For example, `-H 192.168.11.22:1,192.168.33.44:1`
* Other advanced configurations:
- `--bind-to none`: specifies OpenMPI to not bind a training process to a single CPU core (which would hurt performance).
- `-map-by slot`: allows you to have a mixture of different NUMA configurations because the default behavior is to bind to the socket.
- `-x`: specifies (`-x NCCL_DEBUG=INFO`) or copies (`-x LD_LIBRARY_PATH`) an environment variable to all the workers.
- `-mca`: sets the MPI communication interface. Use the setting specified above to avoid possible multiprocessing and network communication issues.
* Language model configurations:
- `--config`: specifies the config file to use. E.g., the above uses the configuration defined in `config_small.py`
- `--data_path`: specifies the directory containing PTB raw data (e.g., ptb.train.txt). If the data files do not exist, the program will automatically download, extract, and pre-process the data.
The model will begin training on the specified GPUs, and evaluate on the validation data periodically. Evaluation on the test data is performed after the training is done. Note that both validation and test are performed only on the rank-0 GPU (i.e., they are not distributed).
## Results ##
We ran a simple test on two AWS p2.xlarge instances.
Since the language model is small and the communication cost is considerable, as expected, the example here doesn't scale very well in the 2-GPU, 2-machine setting in terms of speedup. The perplexity results of multi-GPU training are the same as those of single-GPU training.
| config | epochs | train | valid | test | time/epoch (2-gpu) | time/epoch (single-gpu) |
| -------| -------| ------| -------| ------| -----| -----|
| small | 13 | 40.81 | 118.99 | 114.72| 207s | 137s |
| medium | 39 | 44.18 | 87.63 | 84.42| 461s | 311s |
| large | 55 | 36.54 | 82.55 | 78.72| 1765s | 931s |
================================================
FILE: texar_repo/examples/distributed_gpu/config_large.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PTB LM large size config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
init_scale = 0.04
num_epochs = 55
hidden_size = 1500
keep_prob = 0.35
batch_size = 20
num_steps = 35
cell = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 2
}
emb = {
"dim": hidden_size
}
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 10.}
},
"learning_rate_decay": {
"type": "exponential_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 1. / 1.15,
"staircase": True
},
"start_decay_step": 14
}
}
================================================
FILE: texar_repo/examples/distributed_gpu/config_medium.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PTB LM medium size config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
init_scale = 0.05
num_epochs = 39
hidden_size = 650
keep_prob = 0.5
batch_size = 20
num_steps = 35
cell = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 2
}
emb = {
"dim": hidden_size
}
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
},
"learning_rate_decay": {
"type": "exponential_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 0.8,
"staircase": True
},
"start_decay_step": 5
}
}
================================================
FILE: texar_repo/examples/distributed_gpu/config_small.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PTB LM small size config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
init_scale = 0.1
num_epochs = 13
hidden_size = 200
keep_prob = 1.0
batch_size = 20
num_steps = 20
cell = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 2
}
emb = {
"dim": hidden_size
}
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
},
"learning_rate_decay": {
"type": "exponential_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 0.5,
"staircase": True
},
"start_decay_step": 3
}
}
================================================
FILE: texar_repo/examples/distributed_gpu/lm_ptb_distributed.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example for building the language model.
This is a reimplementation of the TensorFlow official PTB example in:
tensorflow/models/rnn/ptb
Model and training are described in:
(Zaremba, et. al.) Recurrent Neural Network Regularization
http://arxiv.org/abs/1409.2329
There are 3 provided model configurations:
===========================================
| config | epochs | train | valid | test
===========================================
| small | 13 | 37.99 | 121.39 | 115.91
| medium | 39 | 48.45 | 86.16 | 82.07
| large | 55 | 37.87 | 82.62 | 78.29
The exact results may vary depending on the random initialization.
The data required for this example is in the `data/` dir of the
PTB dataset from Tomas Mikolov's webpage:
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
If data is not provided, the program will download from above automatically.
To run:
$ python lm_ptb.py --data_path=simple-examples/data --config=config_small
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, no-member, too-many-locals
import time
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
import horovod.tensorflow as hvd
from ptb_reader import prepare_data, ptb_iterator
flags = tf.flags
flags.DEFINE_string("data_path", "./",
"Directory containing PTB raw data (e.g., ptb.train.txt). "
"E.g., ./simple-examples/data. If not exists, "
"the directory will be created and PTB raw data will "
"be downloaded.")
flags.DEFINE_string("config", "config_small", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
# Data
tf.logging.set_verbosity(tf.logging.INFO)
## 1. initialize the horovod
hvd.init()
batch_size = config.batch_size
num_steps = config.num_steps
data = prepare_data(FLAGS.data_path)
vocab_size = data["vocab_size"]
inputs = tf.placeholder(tf.int32, [None, num_steps],
name='inputs')
targets = tf.placeholder(tf.int32, [None, num_steps],
name='targets')
# Model architecture
initializer = tf.random_uniform_initializer(
-config.init_scale, config.init_scale)
with tf.variable_scope("model", initializer=initializer):
embedder = tx.modules.WordEmbedder(
vocab_size=vocab_size, hparams=config.emb)
emb_inputs = embedder(inputs)
if config.keep_prob < 1:
emb_inputs = tf.nn.dropout(
emb_inputs, tx.utils.switch_dropout(config.keep_prob))
decoder = tx.modules.BasicRNNDecoder(
vocab_size=vocab_size, hparams={"rnn_cell": config.cell})
# This _batch_size equals batch_size // hvd.size() in distributed
# training, because the mini-batch is distributed across multiple GPUs.
_batch_size = tf.shape(inputs)[0]
initial_state = decoder.zero_state(_batch_size,
tf.float32)
seq_length = tf.broadcast_to([num_steps], (_batch_size, ))
outputs, final_state, seq_lengths = decoder(
decoding_strategy="train_greedy",
impute_finished=True,
inputs=emb_inputs,
sequence_length=seq_length,
initial_state=initial_state)
# Losses & train ops
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=targets,
logits=outputs.logits,
sequence_length=seq_lengths)
# Use global_step to pass epoch, for lr decay
global_step = tf.placeholder(tf.int32)
opt = tx.core.get_optimizer(
global_step=global_step,
hparams=config.opt
)
# 2. wrap the optimizer
opt = hvd.DistributedOptimizer(opt)
train_op = tx.core.get_train_op(
loss=mle_loss,
optimizer=opt,
global_step=global_step,
learning_rate=None,
increment_global_step=False,
hparams=config.opt
)
def _run_epoch(sess, data_iter, epoch, is_train=False, verbose=False):
start_time = time.time()
loss = 0.
iters = 0
fetches = {
"mle_loss": mle_loss,
"final_state": final_state,
}
if is_train:
fetches["train_op"] = train_op
epoch_size = (len(data["train_text_id"]) // batch_size - 1)\
// num_steps
mode = (tf.estimator.ModeKeys.TRAIN
if is_train
else tf.estimator.ModeKeys.EVAL)
for step, (x, y) in enumerate(data_iter):
if step == 0:
state = sess.run(initial_state,
feed_dict={inputs: x})
feed_dict = {
inputs: x, targets: y, global_step: epoch,
tx.global_mode(): mode,
}
for i, (c, h) in enumerate(initial_state):
feed_dict[c] = state[i].c
feed_dict[h] = state[i].h
rets = sess.run(fetches, feed_dict)
loss += rets["mle_loss"]
state = rets["final_state"]
iters += num_steps
ppl = np.exp(loss / iters)
if verbose and is_train and hvd.rank() == 0 \
and (step+1) % (epoch_size // 10) == 0:
tf.logging.info("%.3f perplexity: %.3f speed: %.0f wps" %
((step+1) * 1.0 / epoch_size, ppl,
iters * batch_size / (time.time() - start_time)))
_elapsed_time = time.time() - start_time
tf.logging.info("epoch time elapsed: %f" % (_elapsed_time))
ppl = np.exp(loss / iters)
return ppl, _elapsed_time
# 3. set the operator to broadcast global variables from the rank-0 process
bcast = hvd.broadcast_global_variables(0)
# 4. set visible GPU
session_config = tf.ConfigProto()
session_config.gpu_options.visible_device_list = str(hvd.local_rank())
with tf.Session(config=session_config) as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
# 5. run the broadcast_global_variables node before training
bcast.run()
_times = []
for epoch in range(config.num_epochs):
# Train
train_data_iter = ptb_iterator(
data["train_text_id"], config.batch_size, num_steps,
is_train=True)
train_ppl, train_time = _run_epoch(
sess, train_data_iter, epoch, is_train=True, verbose=True)
_times.append(train_time)
tf.logging.info("Epoch: %d Train Perplexity: %.3f" % (epoch, train_ppl))
# Valid in the main process
if hvd.rank() == 0:
valid_data_iter = ptb_iterator(
data["valid_text_id"], config.batch_size, num_steps)
valid_ppl, _ = _run_epoch(sess, valid_data_iter, epoch)
tf.logging.info("Epoch: %d Valid Perplexity: %.3f"
% (epoch, valid_ppl))
tf.logging.info('train times: %s' % (_times))
tf.logging.info('average train time/epoch %f'
% np.mean(np.array(_times)))
# Test in the main process
if hvd.rank() == 0:
test_data_iter = ptb_iterator(
data["test_text_id"], batch_size, num_steps)
test_ppl, _ = _run_epoch(sess, test_data_iter, 0)
tf.logging.info("Test Perplexity: %.3f" % (test_ppl))
if __name__ == '__main__':
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/distributed_gpu/ptb_reader.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utilities for preprocessing and iterating over the PTB data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, too-many-locals
import os
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd
import texar as tx
def ptb_iterator(data, batch_size, num_steps, is_train=False):
"""Iterates through the ptb data.
"""
data_length = len(data)
batch_length = data_length // batch_size
data = np.asarray(data[:batch_size*batch_length])
data = data.reshape([batch_size, batch_length])
epoch_size = (batch_length - 1) // num_steps
if epoch_size == 0:
raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
def _sharded_data(data):
_batch_size = len(data)
_shard_size = _batch_size // hvd.size()
data = [data[i*_shard_size: (i+1) * _shard_size]
for i in range(hvd.size())]
data = data[hvd.rank()]
return data
if is_train:
# split the dataset into shards to make sure
# different processes are loaded with different training data
data = _sharded_data(data)
for i in range(epoch_size):
x = data[:, i * num_steps : (i+1) * num_steps]
y = data[:, i * num_steps + 1 : (i+1) * num_steps + 1]
yield (x, y)
def prepare_data(data_path):
"""Preprocess PTB data.
"""
train_path = os.path.join(data_path, "ptb.train.txt")
if not tf.gfile.Exists(train_path):
url = 'http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz'
tx.data.maybe_download(url, data_path, extract=True)
data_path = os.path.join(data_path, 'simple-examples', 'data')
train_path = os.path.join(data_path, "ptb.train.txt")
valid_path = os.path.join(data_path, "ptb.valid.txt")
test_path = os.path.join(data_path, "ptb.test.txt")
word_to_id = tx.data.make_vocab(
train_path, newline_token="", return_type="dict")
assert len(word_to_id) == 10000
train_text = tx.data.read_words(
train_path, newline_token="")
train_text_id = [word_to_id[w] for w in train_text if w in word_to_id]
valid_text = tx.data.read_words(
valid_path, newline_token="")
valid_text_id = [word_to_id[w] for w in valid_text if w in word_to_id]
test_text = tx.data.read_words(
test_path, newline_token="")
test_text_id = [word_to_id[w] for w in test_text if w in word_to_id]
data = {
"train_text": train_text,
"valid_text": valid_text,
"test_text": test_text,
"train_text_id": train_text_id,
"valid_text_id": valid_text_id,
"test_text_id": test_text_id,
"vocab": word_to_id,
"vocab_size": len(word_to_id)
}
return data
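The batching and window-shifting logic of `ptb_iterator` above can be illustrated with a standalone NumPy sketch (Horovod sharding omitted), showing how each target window is simply the input window shifted by one token:

```python
import numpy as np

def lm_windows(ids, batch_size, num_steps):
    """Yield (inputs, targets) windows; targets are inputs shifted by one."""
    batch_length = len(ids) // batch_size
    data = np.asarray(ids[:batch_size * batch_length]).reshape(
        [batch_size, batch_length])
    epoch_size = (batch_length - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

ids = list(range(10))  # toy token-id stream
x, y = next(lm_windows(ids, batch_size=2, num_steps=2))
print(x.tolist())  # [[0, 1], [5, 6]]
print(y.tolist())  # [[1, 2], [6, 7]]
```

Note that reshaping into `batch_size` rows makes each row a contiguous span of the corpus, so consecutive windows along a row continue the same text.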
================================================
FILE: texar_repo/examples/hierarchical_dialog/README.md
================================================
# Hierarchical Recurrent Encoder-Decoder (HRED) Dialogue Model
This example builds a HRED dialogue model described in [(Serban et al. 2016) Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models](https://arxiv.org/abs/1507.04808).
The dataset used here is provided by [(Zhao et al. 2017) Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders](https://arxiv.org/abs/1703.10960), which adapts [switchboard-1 Release 2](https://catalog.ldc.upenn.edu/ldc97s62). In particular, for evaluation purposes, multiple reference responses for each dialog context in the test set are collected through manual annotation.
This example demonstrates:
* Use of `MultiAlignedData` to read parallel data with multiple fields, e.g., (source, target, meta, ...)
* Use of the `'variable_utterance'` hyperparameter in TextData to read dialog history data.
* Use of the `'embedding_init'` hyperparameter in TextData to read pre-trained word embedding as initialization.
* Use of `HierarchicalRNNEncoder` to encode dialog history with utterance-level and word-level encoding.
* Use of *beam search decoding* and *random sample decoding* at inference time.
* Addition of speaker meta-data in the encoder-decoder model.
## Usage
### Dataset
Download and preprocess the data with the following command:
```
python sw_loader.py
```
* Train/dev/test sets contain 200K, 5K, 5K examples, respectively.
* Vocab size is 10,000.
* `./data/switchboard/embedding.txt` contains word embeddings extracted from [glove.twitter.27B.200d](https://nlp.stanford.edu/projects/glove). You can also directly use the original glove.twitter.27B.200d file, and the Texar TextData module will automatically extract relevant embeddings for the vocabulary.
### Train the model
To train the model, run
```
python hred.py --config_data config_data --config_model config_model_biminor
```
Evaluation will be performed after each epoch.
Here:
* `--config_data` specifies the data configuration.
* `--config_model` specifies the model configuration. Note: do not include the `.py` suffix. Two configs are provided:
- [biminor.py](./config_model_biminor.py) uses a bi-directional RNN as the word-level (minor-level) encoder
- [uniminor.py](./config_model_uniminor.py) uses a uni-directional RNN as the word-level (minor-level) encoder
Both configs use a uni-directional RNN for the utterance-level (major-level) encoder.
## Results
The table shows perplexity and BLEU results after 10 epochs, compared with the results of [(Zhao et al. 2017)](https://arxiv.org/abs/1703.10960) (see "Baseline" in Table 1 of the paper). Note that:
* We report results of random sample decoding, which performs slightly better than beam search decoding.
* `num_samples` is the number of samples generated for each test instance (for computing precision and recall of BLEU). See Sec. 5.2 of the paper for the definition of the metrics.
* (Zhao et al. 2017) uses more meta data besides the speaker meta-data here.
* Results may vary a bit due to randomness.
| | biminor num_samples=10 | biminor num_samples=5 | Zhao et al. num_samples=5 |
| --------------| ---------------| --------------| --------------|
| Perplexity | 23.79 | 24.26 | 35.4 |
| BLEU-1 recall | 0.478 | 0.386 | 0.405 |
| BLEU-1 prec | 0.379 | 0.395 | 0.336 |
| BLEU-2 recall | 0.391 | 0.319 | 0.300 |
| BLEU-2 prec | 0.310 | 0.324 | 0.281 |
| BLEU-3 recall | 0.330 | 0.270 | 0.272 |
| BLEU-3 prec | 0.259 | 0.272 | 0.254 |
| BLEU-4 recall | 0.262 | 0.216 | 0.226 |
| BLEU-4 prec | 0.204 | 0.215 | 0.215 |
================================================
FILE: texar_repo/examples/hierarchical_dialog/config_data.py
================================================
import os
data_root = './data'
max_utterance_cnt = 9
data_hparams = {
stage: {
"num_epochs": 1,
"shuffle": stage != 'test',
"batch_size": 30,
"datasets": [
{ # source
"variable_utterance": True,
"max_utterance_cnt": max_utterance_cnt,
"files": [
os.path.join(data_root,
'{}-source.txt'.format(stage))],
"vocab_file": os.path.join(data_root, 'vocab.txt'),
"embedding_init": {
"file": os.path.join(data_root, 'embedding.txt'),
"dim": 200,
"read_fn": "load_glove"
},
"data_name": "source"
},
{ # target
"files": [
os.path.join(data_root, '{}-target.txt'.format(stage))],
"vocab_share_with": 0,
"data_name": "target"
},
] + [{ # source speaker token
"files": os.path.join(data_root,
'{}-source-spk-{}.txt'.format(stage, i)),
"data_type": "float",
"data_name": "spk_{}".format(i)
} for i in range(max_utterance_cnt)
] + [{ # target speaker token
"files": os.path.join(data_root,
'{}-target-spk.txt'.format(stage)),
"data_type": "float",
"data_name": "spk_tgt"
}
] + [{ # target refs for BLEU evaluation
"variable_utterance": True,
"max_utterance_cnt": 10,
"files": [os.path.join(data_root,
'{}-target-refs.txt'.format(stage))],
"vocab_share_with": 0,
"data_name": "refs"
}]
}
for stage in ['train', 'val', 'test']
}
================================================
FILE: texar_repo/examples/hierarchical_dialog/config_model_biminor.py
================================================
import tensorflow as tf
num_samples = 10 # Number of samples generated for each test data instance
beam_width = num_samples
encoder_hparams = {
"encoder_minor_type": "BidirectionalRNNEncoder",
"encoder_minor_hparams": {
"rnn_cell_fw": {
"type": "GRUCell",
"kwargs": {
"num_units": 300,
"kernel_initializer": tf.initializers.random_uniform(-0.08, 0.08)
},
"dropout": {
"input_keep_prob": 0.5,
}
},
"rnn_cell_share_config": True
},
"encoder_major_type": "UnidirectionalRNNEncoder",
"encoder_major_hparams": {
"rnn_cell": {
"type": "GRUCell",
"kwargs": {
"num_units": 600,
"kernel_initializer": tf.initializers.random_uniform(-0.08, 0.08)
},
"dropout": {
"output_keep_prob": 0.3
}
}
}
}
decoder_hparams = {
"rnn_cell": {
"type": "GRUCell",
"kwargs": {
"num_units": 400,
"kernel_initializer": tf.initializers.random_uniform(-0.08, 0.08),
},
"dropout": {
"input_keep_prob": 0.3
}
}
}
opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001,
}
},
## (It looks gradient clip does not affect the results a lot)
#"gradient_clip": {
# "type": "clip_by_global_norm",
# "kwargs": {"clip_norm": 5.}
#},
}
================================================
FILE: texar_repo/examples/hierarchical_dialog/config_model_uniminor.py
================================================
import tensorflow as tf
num_samples = 10 # Number of samples generated for each test data instance
beam_width = num_samples
encoder_hparams = {
"encoder_minor_type": "UnidirectionalRNNEncoder",
"encoder_minor_hparams": {
"rnn_cell": {
"type": "GRUCell",
"kwargs": {
"num_units": 300,
"kernel_initializer": tf.initializers.random_uniform(-0.08, 0.08)
},
"dropout": {
"input_keep_prob": 0.5,
}
},
},
"encoder_major_type": "UnidirectionalRNNEncoder",
"encoder_major_hparams": {
"rnn_cell": {
"type": "GRUCell",
"kwargs": {
"num_units": 600,
"kernel_initializer": tf.initializers.random_uniform(-0.08, 0.08)
},
"dropout": {
"input_keep_prob": 0.3,
}
}
}
}
decoder_hparams = {
"rnn_cell": {
"type": "GRUCell",
"kwargs": {
"num_units": 400,
"kernel_initializer": tf.initializers.random_uniform(-0.08, 0.08),
},
"dropout": {
"output_keep_prob": 0.3,
}
}
}
opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001,
}
}
}
================================================
FILE: texar_repo/examples/hierarchical_dialog/hred.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Hierarchical Recurrent Encoder-Decoder (HRED) for dialog response
generation.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, too-many-locals
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
flags = tf.flags
flags.DEFINE_string('config_data', 'config_data', 'The data config')
flags.DEFINE_string('config_model', 'config_model_biminor', 'The model config')
FLAGS = flags.FLAGS
config_data = importlib.import_module(FLAGS.config_data)
config_model = importlib.import_module(FLAGS.config_model)
encoder_hparams = config_model.encoder_hparams
decoder_hparams = config_model.decoder_hparams
opt_hparams = config_model.opt_hparams
def main():
"""Entrypoint.
"""
# Data
train_data = tx.data.MultiAlignedData(config_data.data_hparams['train'])
val_data = tx.data.MultiAlignedData(config_data.data_hparams['val'])
test_data = tx.data.MultiAlignedData(config_data.data_hparams['test'])
iterator = tx.data.TrainTestDataIterator(train=train_data,
val=val_data,
test=test_data)
data_batch = iterator.get_next()
# (speaker's meta info)
spk_src = tf.stack([data_batch['spk_{}'.format(i)]
for i in range(config_data.max_utterance_cnt)], 1)
spk_tgt = data_batch['spk_tgt']
def _add_source_speaker_token(x):
return tf.concat([x, tf.reshape(spk_src, (-1, 1))], 1)
def _add_target_speaker_token(x):
return (x, ) + (tf.reshape(spk_tgt, (-1, 1)), )
# HRED model
embedder = tx.modules.WordEmbedder(
init_value=train_data.embedding_init_value(0).word_vecs)
encoder = tx.modules.HierarchicalRNNEncoder(hparams=encoder_hparams)
decoder = tx.modules.BasicRNNDecoder(
hparams=decoder_hparams, vocab_size=train_data.vocab(0).size)
connector = tx.modules.connectors.MLPTransformConnector(
decoder.cell.state_size)
context_embed = embedder(data_batch['source_text_ids'])
ecdr_states = encoder(
context_embed,
medium=['flatten', _add_source_speaker_token],
sequence_length_minor=data_batch['source_length'],
sequence_length_major=data_batch['source_utterance_cnt'])
ecdr_states = ecdr_states[1]
ecdr_states = _add_target_speaker_token(ecdr_states)
dcdr_states = connector(ecdr_states)
# (decoding for training)
target_embed = embedder(data_batch['target_text_ids'])
outputs, _, lengths = decoder(
initial_state=dcdr_states,
inputs=target_embed,
sequence_length=data_batch['target_length'] - 1)
# Sentence level lld, for training
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=data_batch['target_text_ids'][:, 1:],
logits=outputs.logits,
sequence_length=lengths)
# Token level lld, for perplexity evaluation
avg_mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=data_batch['target_text_ids'][:, 1:],
logits=outputs.logits,
sequence_length=lengths,
sum_over_timesteps=False,
average_across_timesteps=True)
perplexity = tf.exp(avg_mle_loss)
global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = tx.core.get_train_op(
mle_loss, global_step=global_step, hparams=opt_hparams)
# Decoding
target_bos_token_id = train_data.vocab(0).bos_token_id
target_eos_token_id = train_data.vocab(0).eos_token_id
start_tokens = \
tf.ones_like(data_batch['target_length']) * target_bos_token_id
# Random sample decoding
decoding_strategy = 'infer_' + 'sample'
infer_samples, lengths = [], []
for _ in range(config_model.num_samples):
infer_outputs_i, _, lengths_i = decoder(
decoding_strategy=decoding_strategy,
initial_state=dcdr_states,
start_tokens=start_tokens,
end_token=target_eos_token_id,
embedding=embedder,
max_decoding_length=50)
infer_samples.append(
tf.expand_dims(infer_outputs_i.sample_id, axis=2))
lengths.append(tf.expand_dims(lengths_i, axis=1))
infer_samples = tx.utils.pad_and_concat(
infer_samples, axis=2, pad_axis=1)
rand_sample_text = train_data.vocab(0).map_ids_to_tokens(infer_samples)
rand_lengths = tf.concat(lengths, axis=1)
# Beam search decoding
beam_search_samples, beam_states, _ = tx.modules.beam_search_decode(
decoder,
initial_state=dcdr_states,
start_tokens=start_tokens,
end_token=target_eos_token_id,
embedding=embedder,
beam_width=config_model.beam_width,
max_decoding_length=50)
beam_sample_text = train_data.vocab(0).map_ids_to_tokens(
beam_search_samples.predicted_ids)
beam_lengths = beam_states.lengths
# Running procedures
def _train_epoch(sess, epoch, display=1000):
iterator.switch_to_train_data(sess)
while True:
try:
feed = {tx.global_mode(): tf.estimator.ModeKeys.TRAIN}
step, loss, _ = sess.run(
[global_step, mle_loss, train_op], feed_dict=feed)
if step % display == 0:
print('step {} at epoch {}: loss={}'.format(
step, epoch, loss))
except tf.errors.OutOfRangeError:
break
print('epoch {} train: loss={}'.format(epoch, loss))
def _test_epoch_ppl(sess, epoch):
iterator.switch_to_test_data(sess)
pples = []
while True:
try:
feed = {tx.global_mode(): tf.estimator.ModeKeys.EVAL}
ppl = sess.run(perplexity, feed_dict=feed)
pples.append(ppl)
except tf.errors.OutOfRangeError:
avg_ppl = np.mean(pples)
print('epoch {} perplexity={}'.format(epoch, avg_ppl))
break
def _test_epoch_bleu(sess, epoch, sample_text, sample_lengths):
iterator.switch_to_test_data(sess)
bleu_prec = [[] for i in range(1, 5)]
bleu_recall = [[] for i in range(1, 5)]
def _bleus(ref, sample):
res = []
for weight in [[1, 0, 0, 0],
[1, 0, 0, 0],
[1/2., 1/2., 0, 0],
[1/3., 1/3., 1/3., 0],
[1/4., 1/4., 1/4., 1/4.]]:
res.append(sentence_bleu(
[ref],
sample,
smoothing_function=SmoothingFunction().method7,
weights=weight))
return res
while True:
try:
feed = {tx.global_mode(): tf.estimator.ModeKeys.EVAL}
samples_, sample_lengths_, references, refs_cnt = \
sess.run([sample_text,
sample_lengths,
data_batch['refs_text'][:, :, 1:],
data_batch['refs_utterance_cnt']],
feed_dict=feed)
samples_ = np.transpose(samples_, (0, 2, 1))
samples_ = [
[sample[:l] for sample, l in zip(beam, lens)]
for beam, lens in zip(samples_.tolist(), sample_lengths_)
]
references = [
[ref[:ref.index(b'')] for ref in refs[:cnt]]
for refs, cnt in zip(references.tolist(), refs_cnt)
]
for beam, refs in zip(samples_, references):
bleu_scores = [
[_bleus(ref, sample) for ref in refs]
for sample in beam
]
bleu_scores = np.transpose(np.array(bleu_scores), (2, 0, 1))
for i in range(1, 5):
bleu_i = bleu_scores[i]
bleu_i_precision = bleu_i.max(axis=1).mean()
bleu_i_recall = bleu_i.max(axis=0).mean()
bleu_prec[i-1].append(bleu_i_precision)
bleu_recall[i-1].append(bleu_i_recall)
except tf.errors.OutOfRangeError:
break
bleu_prec = [np.mean(x) for x in bleu_prec]
bleu_recall = [np.mean(x) for x in bleu_recall]
print('epoch {}:'.format(epoch))
for i in range(1, 5):
print(' -- bleu-{} prec={}, recall={}'.format(
i, bleu_prec[i-1], bleu_recall[i-1]))
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
num_epochs = 10
for epoch in range(1, num_epochs+1):
_train_epoch(sess, epoch)
_test_epoch_ppl(sess, epoch)
if epoch % 5 == 0:
print('random sample: ')
_test_epoch_bleu(sess, epoch, rand_sample_text, rand_lengths)
print('beam-search: ')
_test_epoch_bleu(sess, epoch, beam_sample_text, beam_lengths)
if num_epochs % 5 != 0:
print('random sample: ')
_test_epoch_bleu(sess, num_epochs, rand_sample_text, rand_lengths)
print('beam-search: ')
_test_epoch_bleu(sess, num_epochs, beam_sample_text, beam_lengths)
if __name__ == "__main__":
main()
================================================
FILE: texar_repo/examples/hierarchical_dialog/sw_loader.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" loader for switch board dataset.
"""
import os
import json
from json_lines import reader
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
import texar as tx
from config_data import data_root
# pylint: disable=invalid-name, too-many-locals
wnd_sz = 10
class Dataset(object):
"""Data preprocessor.
"""
def __init__(self, jsonl_path, mode=None):
self.mode = mode
self.raw = []
self.lst = []
self.refs = []
if mode == 'test':
lst = json.load(open(jsonl_path, 'r'))
for item in lst:
context = item['context']
dialog = []
for utts in context:
p = utts.find(':')
dialog.append((
(utts[p-1] == 'A') * 2 - 1, utts[p + 2:-1], 0))
if dialog[0][1][-1] == '>':
dialog = dialog[1:]
if len(dialog) == 0:
continue
responses = []
for resp in item['responses']:
responses.append(resp)
spk = (item['speaker'] == 'A') * 2 - 1
dialog.append((spk, responses[0], 0))
responses = responses[1:]
responses = [' '.join(WordPunctTokenizer().tokenize(resp))
for resp in responses]
if len(responses) == 0:
continue
self.raw.append(dialog)
self.lst.append((len(self.raw) - 1, 0, len(dialog)))
self.refs.append(responses)
return
from collections import Counter
self.ct = Counter()
self.topics = []
with open(jsonl_path, 'r') as f:
for idx, item in enumerate(reader(f)):
utts = item['utts']
self.topics.append(item['topic'])
self.raw.append([(int(speaker == 'A') * 2 - 1, sentence, _)
for speaker, sentence, _ in utts])
lst = [(idx, start, start + wnd_sz)
for start in range(0, len(utts)-wnd_sz)] + \
[(idx, 0, end)
for end in range(2, min(wnd_sz+1, len(utts)))]
self.lst += lst
self.refs = [['none']] * len(self.lst)
def __len__(self):
return len(self.lst)
def __getitem__(self, idx):
idx, start, end = self.lst[idx]
dialog = self.raw[idx][start:end]
source, target = dialog[:-1], dialog[-1]
spks, utts = list(zip(*[(speaker, WordPunctTokenizer().tokenize(uttr)) for speaker, uttr, _ in source]))
spks = list(spks)
while len(spks) < 10:
spks.append(0)
source = '|||'.join([' '.join(uttr) for uttr in utts])
target_test = ' '.join(WordPunctTokenizer().tokenize(target[1]))
return spks, source, target_test, target[0]
def get(self, idx):
idx, start, end = self.lst[idx]
source = self.raw[idx][start:end-1]
target = self.raw[idx][end-1]
source = ' '.join([b for a, b, c in source])
cct = self.raw[idx][end-2][0] == self.raw[idx][end-1][0]
return self.topics[idx], cct, source, target
def sw1c2r(data_root):
dts_train = Dataset(os.path.join(data_root, 'train.jsonl'))
dts_valid = Dataset(os.path.join(data_root, 'valid.jsonl'))
dts_test = Dataset(os.path.join(data_root, 'test_multi_ref.json'), 'test')
datasets = {
'train': dts_train,
'val': dts_valid,
'test': dts_test
}
return datasets
def generate_reference_for_test_dialog(dataset, data_root):
vocab = {}
with open(os.path.join(data_root, 'vocab.txt'), 'r') as f:
p = f.read().splitlines()
for i, x in enumerate(p):
vocab[x] = i
dts_train = dataset['train']
dts_val = dataset['val']
dts_test = dataset['test']
vectorizer = TfidfVectorizer(tokenizer=WordPunctTokenizer().tokenize,
vocabulary=vocab)
saved = []
meta = []
data = []
tidx = {}
for i in range(len(dts_test)):
topic, cct, source, target = dts_test.get(i)
meta.append((topic, cct, target))
data.append(source)
for i in range(len(dts_train)):
topic, cct, source, target = dts_train.get(i)
saved.append((topic, cct, target))
data.append(source)
if topic not in tidx:
tidx[topic] = []
tidx[topic].append(i)
result = vectorizer.fit_transform(data)
x = result[:len(dts_test)]
y = result[len(dts_test):]
from tqdm import tqdm
from sklearn.preprocessing import normalize
y = normalize(y)
x = normalize(x)
dts_test.refs = []
for i in tqdm(range(len(dts_test))):
c = tidx[meta[i][0]]
p = (y * x[i].T).toarray().reshape(-1)[c]
d = p.argsort()
cnt = 0
refs = []
for a in d[::-1]:
if saved[a][1] == meta[i][1]:
refs.append(' '.join(
WordPunctTokenizer().tokenize(saved[a][2][1])))
cnt += 1
if cnt == 10:
break
dts_test.refs.append(refs)
def download_and_process(data_root):
if not os.path.isdir(data_root):
os.makedirs(data_root)
os.makedirs(os.path.join(data_root, 'raw'))
tx.data.maybe_download(
urls='https://drive.google.com/file/d/1Gytd-SSetUkIY6aVVKNrBOxkHjAlSGeU/view?usp=sharing',
path='./',
filenames=os.path.join(data_root, 'sw1c2r.tar.gz'),
extract=True)
os.system('mv {} {}'.format(os.path.join(data_root, 'sw1c2r.tar.gz'),
os.path.join(data_root, 'raw/sw1c2r.tar.gz')))
os.system('mv {}/* {}'.format(
os.path.join(data_root, 'switchboard'), data_root))
datasets = sw1c2r(os.path.join(data_root, 'json_data'))
for stage in ['train', 'val', 'test']:
dts = datasets[stage]
spk, src, tgt, meta = list(zip(*[dts[i] for i in range(len(dts))]))
src_txt = '\n'.join(src)
tgt_txt = '\n'.join(tgt)
spk = list(zip(*spk))
for i in range(len(spk)):
with open(os.path.join(data_root, '{}-source-spk-{}.txt'.format(stage, i)), 'w') as f:
f.write('\n'.join([str(a) for a in spk[i]]))
spk_tgt = meta
with open(os.path.join(data_root, '{}-target-spk.txt'.format(stage)), 'w') as f:
f.write('\n'.join([str(a) for a in spk_tgt]))
with open(os.path.join(data_root, '{}-source.txt'.format(stage)), 'w') as f:
f.write(src_txt)
with open(os.path.join(data_root, '{}-target.txt'.format(stage)), 'w') as f:
f.write(tgt_txt)
with open(os.path.join(data_root, '{}-target-refs.txt'.format(stage)), 'w') as f:
f.write('\n'.join(['|||'.join(v) for v in dts.refs]))
if __name__ == '__main__':
download_and_process(data_root)
================================================
FILE: texar_repo/examples/language_model_ptb/README.md
================================================
# Language Model on PTB #
This example builds an LSTM language model, and trains on PTB data. Model and training are described in
[(Zaremba, et al.) Recurrent Neural Network Regularization](https://arxiv.org/pdf/1409.2329.pdf). This is a reimplementation of the TensorFlow official PTB example in [tensorflow/models/rnn/ptb](https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb).
The example shows:
* Construction of a simple model, involving the `Embedder` and `RNN Decoder`.
* Use of Texar with external Python data pipeline ([ptb_reader.py](./ptb_reader.py)).
* Specification of various features of train op, like *gradient clipping* and *lr decay*.
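As a minimal sketch of the external data pipeline, the batching logic of [ptb_reader.py](./ptb_reader.py) can be restated with NumPy only: the flat token-id stream is reshaped into `batch_size` rows, and each step yields an input window plus a target window shifted right by one token.

```python
import numpy as np

def ptb_batches(data, batch_size, num_steps):
    """Yield (input, target) batches from a flat list of token ids.

    Simplified restatement of the batching in ptb_reader.py:
    targets are the inputs shifted right by one token.
    """
    batch_length = len(data) // batch_size
    data = np.asarray(data[:batch_size * batch_length])
    data = data.reshape(batch_size, batch_length)
    # The -1 leaves room for the one-token target shift.
    epoch_size = (batch_length - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y
```

With 100 tokens, `batch_size=4`, and `num_steps=5`, this yields 4 batches of shape `(4, 5)`.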
## Usage ##
The following cmd trains a small-size model:
```
python lm_ptb.py [--config config_small] [--data_path ./]
```
Here:
* `--config` specifies the config file to use. E.g., the above uses the configuration defined in [config_small.py](./config_small.py).
* `--data_path` specifies the directory containing PTB raw data (e.g., `ptb.train.txt`). If the data files do not exist, the program will automatically download, extract, and pre-process the data.
The model will train, evaluating on the validation data periodically and on the test data after training is done.
## Results ##
As per the TensorFlow official PTB example, the perplexity of different configs is:
| config | epochs | train | valid | test |
| -------| -------| ------| -------| ------|
| small | 13 | 37.99 | 121.39 | 115.91|
| medium | 39 | 48.45 | 86.16 | 82.07|
| large | 55 | 37.87 | 82.62 | 78.29|
================================================
FILE: texar_repo/examples/language_model_ptb/config_large.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PTB LM large size config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
init_scale = 0.04
num_epochs = 55
hidden_size = 1500
keep_prob = 0.35
batch_size = 20
num_steps = 35
cell = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 2
}
emb = {
"dim": hidden_size
}
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 10.}
},
"learning_rate_decay": {
"type": "exponential_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 1. / 1.15,
"staircase": True
},
"start_decay_step": 14
}
}
================================================
FILE: texar_repo/examples/language_model_ptb/config_medium.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PTB LM medium size config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
init_scale = 0.05
num_epochs = 39
hidden_size = 650
keep_prob = 0.5
batch_size = 20
num_steps = 35
cell = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 2
}
emb = {
"dim": hidden_size
}
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
},
"learning_rate_decay": {
"type": "exponential_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 0.8,
"staircase": True
},
"start_decay_step": 5
}
}
================================================
FILE: texar_repo/examples/language_model_ptb/config_small.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PTB LM small size config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
init_scale = 0.1
num_epochs = 13
hidden_size = 200
keep_prob = 1.0
batch_size = 20
num_steps = 20
cell = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 2
}
emb = {
"dim": hidden_size
}
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
},
"learning_rate_decay": {
"type": "exponential_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 0.5,
"staircase": True
},
"start_decay_step": 3
}
}
================================================
FILE: texar_repo/examples/language_model_ptb/lm_ptb.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example for building the language model.
This is a reimplementation of the TensorFlow official PTB example in:
tensorflow/models/rnn/ptb
Model and training are described in:
(Zaremba, et. al.) Recurrent Neural Network Regularization
http://arxiv.org/abs/1409.2329
There are 3 provided model configurations:
===========================================
| config | epochs | train | valid | test
===========================================
| small | 13 | 37.99 | 121.39 | 115.91
| medium | 39 | 48.45 | 86.16 | 82.07
| large | 55 | 37.87 | 82.62 | 78.29
The exact results may vary depending on the random initialization.
The data required for this example is in the `data/` dir of the
PTB dataset from Tomas Mikolov's webpage:
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
If data is not provided, the program will download from above automatically.
To run:
$ python lm_ptb.py --data_path=simple-examples/data --config=config_small
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, no-member, too-many-locals
import time
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
from ptb_reader import prepare_data, ptb_iterator
flags = tf.flags
flags.DEFINE_string("data_path", "./",
"Directory containing PTB raw data (e.g., ptb.train.txt). "
"E.g., ./simple-examples/data. If not exists, "
"the directory will be created and PTB raw data will "
"be downloaded.")
flags.DEFINE_string("config", "config_small", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
# Data
batch_size = config.batch_size
num_steps = config.num_steps
data = prepare_data(FLAGS.data_path)
vocab_size = data["vocab_size"]
inputs = tf.placeholder(tf.int32, [batch_size, num_steps])
targets = tf.placeholder(tf.int32, [batch_size, num_steps])
# Model architecture
initializer = tf.random_uniform_initializer(
-config.init_scale, config.init_scale)
with tf.variable_scope("model", initializer=initializer):
embedder = tx.modules.WordEmbedder(
vocab_size=vocab_size, hparams=config.emb)
emb_inputs = embedder(inputs)
if config.keep_prob < 1:
emb_inputs = tf.nn.dropout(
emb_inputs, tx.utils.switch_dropout(config.keep_prob))
decoder = tx.modules.BasicRNNDecoder(
vocab_size=vocab_size, hparams={"rnn_cell": config.cell})
initial_state = decoder.zero_state(batch_size, tf.float32)
outputs, final_state, seq_lengths = decoder(
decoding_strategy="train_greedy",
impute_finished=True,
inputs=emb_inputs,
sequence_length=[num_steps]*batch_size,
initial_state=initial_state)
# Losses & train ops
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=targets,
logits=outputs.logits,
sequence_length=seq_lengths)
# Use global_step to pass epoch, for lr decay
global_step = tf.placeholder(tf.int32)
train_op = tx.core.get_train_op(
mle_loss, global_step=global_step, increment_global_step=False,
hparams=config.opt)
def _run_epoch(sess, data_iter, epoch, is_train=False, verbose=False):
start_time = time.time()
loss = 0.
iters = 0
state = sess.run(initial_state)
fetches = {
"mle_loss": mle_loss,
"final_state": final_state,
}
if is_train:
fetches["train_op"] = train_op
epoch_size = (len(data["train_text_id"]) // batch_size - 1)\
// num_steps
mode = (tf.estimator.ModeKeys.TRAIN
if is_train
else tf.estimator.ModeKeys.EVAL)
for step, (x, y) in enumerate(data_iter):
feed_dict = {
inputs: x, targets: y, global_step: epoch,
tx.global_mode(): mode,
}
for i, (c, h) in enumerate(initial_state):
feed_dict[c] = state[i].c
feed_dict[h] = state[i].h
rets = sess.run(fetches, feed_dict)
loss += rets["mle_loss"]
state = rets["final_state"]
iters += num_steps
ppl = np.exp(loss / iters)
if verbose and is_train and step % (epoch_size // 10) == 10:
print("%.3f perplexity: %.3f speed: %.0f wps" %
((step+1) * 1.0 / epoch_size, ppl,
iters * batch_size / (time.time() - start_time)))
ppl = np.exp(loss / iters)
return ppl
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
for epoch in range(config.num_epochs):
# Train
train_data_iter = ptb_iterator(
data["train_text_id"], config.batch_size, num_steps)
train_ppl = _run_epoch(
sess, train_data_iter, epoch, is_train=True, verbose=True)
print("Epoch: %d Train Perplexity: %.3f" % (epoch, train_ppl))
# Valid
valid_data_iter = ptb_iterator(
data["valid_text_id"], config.batch_size, num_steps)
valid_ppl = _run_epoch(sess, valid_data_iter, epoch)
print("Epoch: %d Valid Perplexity: %.3f" % (epoch, valid_ppl))
# Test
test_data_iter = ptb_iterator(
data["test_text_id"], batch_size, num_steps)
test_ppl = _run_epoch(sess, test_data_iter, 0)
print("Test Perplexity: %.3f" % (test_ppl))
if __name__ == '__main__':
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/language_model_ptb/ptb_reader.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utilities for preprocessing and iterating over the PTB data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, too-many-locals
import os
import numpy as np
import tensorflow as tf
import texar as tx
def ptb_iterator(data, batch_size, num_steps):
"""Iterates through the ptb data.
"""
data_length = len(data)
batch_length = data_length // batch_size
data = np.asarray(data[:batch_size*batch_length])
data = data.reshape([batch_size, batch_length])
epoch_size = (batch_length - 1) // num_steps
if epoch_size == 0:
raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
for i in range(epoch_size):
x = data[:, i * num_steps : (i+1) * num_steps]
y = data[:, i * num_steps + 1 : (i+1) * num_steps + 1]
yield (x, y)
def prepare_data(data_path):
"""Preprocess PTB data.
"""
train_path = os.path.join(data_path, "ptb.train.txt")
if not tf.gfile.Exists(train_path):
url = 'http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz'
tx.data.maybe_download(url, data_path, extract=True)
data_path = os.path.join(data_path, 'simple-examples', 'data')
train_path = os.path.join(data_path, "ptb.train.txt")
valid_path = os.path.join(data_path, "ptb.valid.txt")
test_path = os.path.join(data_path, "ptb.test.txt")
word_to_id = tx.data.make_vocab(
train_path, newline_token="", return_type="dict")
assert len(word_to_id) == 10000
train_text = tx.data.read_words(
train_path, newline_token="")
train_text_id = [word_to_id[w] for w in train_text if w in word_to_id]
valid_text = tx.data.read_words(
valid_path, newline_token="")
valid_text_id = [word_to_id[w] for w in valid_text if w in word_to_id]
test_text = tx.data.read_words(
test_path, newline_token="")
test_text_id = [word_to_id[w] for w in test_text if w in word_to_id]
data = {
"train_text": train_text,
"valid_text": valid_text,
"test_text": test_text,
"train_text_id": train_text_id,
"valid_text_id": valid_text_id,
"test_text_id": test_text_id,
"vocab": word_to_id,
"vocab_size": len(word_to_id)
}
return data
================================================
FILE: texar_repo/examples/memory_network_lm/README.md
================================================
# End-to-End Memory Network for Language Modeling #
This example builds a Memory Network language model, and trains on PTB data. Model and training are described in
[(Sukhbaatar, et al.) End-To-End Memory Networks](https://arxiv.org/pdf/1503.08895v4.pdf). Model details are implemented in `texar.modules.memnet`.
Though the example is for language modeling, it is easy to adapt to other tasks such as question answering, as described in the above paper.
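A single memory hop of this model can be sketched in plain NumPy (a simplified illustration, assuming the input and output memory embeddings `A` and `C` are already computed; the full implementation lives in `texar.modules.memnet`):

```python
import numpy as np

def memnet_hop(query, memory_A, memory_C):
    """One end-to-end memory hop (simplified sketch).

    query:    [dim]               current query vector
    memory_A: [memory_size, dim]  input memory embeddings
    memory_C: [memory_size, dim]  output memory embeddings
    """
    scores = memory_A @ query          # match the query against each memory slot
    p = np.exp(scores - scores.max())
    p /= p.sum()                       # softmax attention over memory slots
    o = p @ memory_C                   # attention-weighted sum of output memories
    return query + o                   # residual update feeds the next hop
```

Stacking `n_hops` such updates, with the final query projected to a softmax over the vocabulary, gives the language model described above.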
## Dataset ##
The standard [Penn Treebank (PTB) dataset](http://www.fit.vutbr.cz/~imikolov/rnnlm/) is used.
If data does not exist under `data_path`, the program will automatically download the data.
## Usage ##
The following cmd trains the model:
```bash
python3 lm_ptb_memnet.py --config config --data_path ./
```
Here:
* `--config` specifies the config file to use. E.g., the above uses the configuration defined in [config.py](./config.py).
* `--data_path` specifies the directory containing PTB raw data (e.g., `ptb.train.txt`). If the data files do not exist, the program will automatically download, extract, and pre-process the data.
* `--lr` specifies the initial learning rate. If not specified, the program will use the learning rate in the config file.
The model will train, evaluating on the validation data periodically and on the test data after training is done. Checkpoints are saved every 5 epochs.
## Configurations ##
[config.py](./config.py) is the largest and best configuration, described on the last line of Table 2 in [(Sukhbaatar, et al.) End-To-End Memory Networks](https://arxiv.org/pdf/1503.08895v4.pdf). It sets the number of hops to 7, the hidden dim to 150, and the memory size to 200. This model has 4,582,500 parameters in total.
## Results ##
The perplexity of different configs is:
| config | epochs | train | valid | test |
| ------------- | -------| ------| -------| ------|
| config | 51 | 50.70 | 120.97 | 113.06|
The result for `config.py` is slightly worse than that reported in the paper, since the paper reports the best among 10 runs.
================================================
FILE: texar_repo/examples/memory_network_lm/config.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
n_hops = 7
dim = 150
relu_dim = dim // 2
batch_size = 128
num_epochs = 200
memory_size = 200
initialize_stddev = 0.05
query_constant = 0.1
learning_rate_anneal_factor = 1.5
terminating_learning_rate = 1e-5
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 0.01}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 50.}
},
}
embed = {
"embedding": {
"dim": dim,
},
"temporal_embedding": {
"dim": dim,
}
}
memnet = {
"n_hops": n_hops,
"relu_dim": relu_dim,
"memory_size": memory_size,
"A": embed,
"C": embed,
}
================================================
FILE: texar_repo/examples/memory_network_lm/lm_ptb_memnet.py
================================================
#!/usr/bin/env python3
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example for building the PTB language model with Memory Network.
The Memory Network model is described in https://arxiv.org/abs/1503.08895v4
The data required for this example is in the `data/` dir of the
PTB dataset from Tomas Mikolov's webpage:
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
If the data is not provided, the program will download it from the above URL automatically.
To run:
$ python lm_ptb_memnet.py --data_path=simple-examples/data \
--config=config
This code will automatically save and restore from directory `ckpt/`.
If the directory doesn't exist, it will be created automatically.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, no-member, too-many-locals
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
from ptb_reader import prepare_data
from ptb_reader import ptb_iterator_memnet as ptb_iterator
flags = tf.flags
flags.DEFINE_string("data_path", "./",
"Directory containing PTB raw data (e.g., ptb.train.txt). "
"E.g., ./simple-examples/data. If not exists, "
"the directory will be created and PTB raw data will "
"be downloaded.")
flags.DEFINE_string("config", "config", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
# Data
batch_size = config.batch_size
memory_size = config.memory_size
terminating_learning_rate = config.terminating_learning_rate
data = prepare_data(FLAGS.data_path)
vocab_size = data["vocab_size"]
print('vocab_size = {}'.format(vocab_size))
inputs = tf.placeholder(tf.int32, [None, memory_size], name="inputs")
targets = tf.placeholder(tf.int32, [None], name="targets")
# Model architecture
initializer = tf.random_normal_initializer(
stddev=config.initialize_stddev)
with tf.variable_scope("model", initializer=initializer):
memnet = tx.modules.MemNetRNNLike(raw_memory_dim=vocab_size,
hparams=config.memnet)
queries = tf.fill([tf.shape(inputs)[0], config.dim],
config.query_constant)
logits = memnet(inputs, queries)
# Losses & train ops
mle_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
labels=targets, logits=logits)
mle_loss = tf.reduce_sum(mle_loss)
# Use global_step to pass epoch, for lr decay
lr = config.opt["optimizer"]["kwargs"]["learning_rate"]
learning_rate = tf.placeholder(tf.float32, [], name="learning_rate")
global_step = tf.Variable(0, dtype=tf.int32, name="global_step")
increment_global_step = tf.assign_add(global_step, 1)
train_op = tx.core.get_train_op(
mle_loss,
learning_rate=learning_rate,
global_step=global_step,
increment_global_step=False,
hparams=config.opt)
def _run_epoch(sess, data_iter, epoch, is_train=False):
loss = 0.
iters = 0
fetches = {
"mle_loss": mle_loss
}
if is_train:
fetches["train_op"] = train_op
mode = (tf.estimator.ModeKeys.TRAIN
if is_train
else tf.estimator.ModeKeys.EVAL)
for _, (x, y) in enumerate(data_iter):
batch_size = x.shape[0]
feed_dict = {
inputs: x, targets: y, learning_rate: lr,
tx.global_mode(): mode,
}
rets = sess.run(fetches, feed_dict)
loss += rets["mle_loss"]
iters += batch_size
ppl = np.exp(loss / iters)
return ppl
saver = tf.train.Saver()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
try:
saver.restore(sess, "ckpt/model.ckpt")
print('restored checkpoint.')
except Exception:
print('restore checkpoint failed.')
last_valid_ppl = None
heuristic_lr_decay = (hasattr(config, 'heuristic_lr_decay')
and config.heuristic_lr_decay)
while True:
if lr < terminating_learning_rate:
break
epoch = sess.run(global_step)
if epoch >= config.num_epochs:
print('Too many epochs!')
break
print('epoch: {} learning_rate: {:.6f}'.format(epoch, lr))
# Train
train_data_iter = ptb_iterator(
data["train_text_id"], batch_size, memory_size)
train_ppl = _run_epoch(
sess, train_data_iter, epoch, is_train=True)
print("Train Perplexity: {:.3f}".format(train_ppl))
sess.run(increment_global_step)
# checkpoint
if epoch % 5 == 0:
try:
saver.save(sess, "ckpt/model.ckpt")
print("saved checkpoint.")
except Exception:
print("save checkpoint failed.")
# Valid
valid_data_iter = ptb_iterator(
data["valid_text_id"], batch_size, memory_size)
valid_ppl = _run_epoch(sess, valid_data_iter, epoch)
print("Valid Perplexity: {:.3f}".format(valid_ppl))
# Learning rate decay
if last_valid_ppl:
if heuristic_lr_decay:
if valid_ppl > last_valid_ppl * config.heuristic_threshold:
lr /= 1. + (valid_ppl / last_valid_ppl \
- config.heuristic_threshold) \
* config.heuristic_rate
last_valid_ppl = last_valid_ppl \
* (1 - config.heuristic_smooth_rate) \
+ valid_ppl * config.heuristic_smooth_rate
else:
if valid_ppl > last_valid_ppl:
lr /= config.learning_rate_anneal_factor
last_valid_ppl = valid_ppl
else:
last_valid_ppl = valid_ppl
print("last_valid_ppl: {:.6f}".format(last_valid_ppl))
epoch = sess.run(global_step)
print('Terminate after epoch ', epoch)
# Test
test_data_iter = ptb_iterator(data["test_text_id"], 1, memory_size)
test_ppl = _run_epoch(sess, test_data_iter, 0)
print("Test Perplexity: {:.3f}".format(test_ppl))
if __name__ == '__main__':
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/memory_network_lm/ptb_reader.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utilities for preprocessing and iterating over the PTB data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, too-many-locals
import os
import numpy as np
import tensorflow as tf
import texar as tx
def ptb_iterator(data, batch_size, num_steps):
"""Iterates through the ptb data.
"""
data_length = len(data)
batch_length = data_length // batch_size
data = np.asarray(data[:batch_size*batch_length])
data = data.reshape([batch_size, batch_length])
epoch_size = (batch_length - 1) // num_steps
if epoch_size == 0:
raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
for i in range(epoch_size):
x = data[:, i * num_steps : (i+1) * num_steps]
y = data[:, i * num_steps + 1 : (i+1) * num_steps + 1]
yield (x, y)
def ptb_iterator_memnet(data, batch_size, memory_size):
"""Iterates through the ptb data.
"""
data_length = len(data)
length = data_length - memory_size
order = list(range(length))
np.random.shuffle(order)
data = np.asarray(data)
for i in range(0, length, batch_size):
x, y = [], []
for j in range(i, min(i + batch_size, length)):
idx = order[j]
x.append(data[idx : idx + memory_size])
y.append(data[idx + memory_size])
x, y = np.asarray(x), np.asarray(y)
yield (x, y)
def prepare_data(data_path):
"""Preprocess PTB data.
"""
train_path = os.path.join(data_path, "ptb.train.txt")
if not tf.gfile.Exists(train_path):
url = 'http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz'
tx.data.maybe_download(url, data_path, extract=True)
data_path = os.path.join(data_path, 'simple-examples', 'data')
train_path = os.path.join(data_path, "ptb.train.txt")
valid_path = os.path.join(data_path, "ptb.valid.txt")
test_path = os.path.join(data_path, "ptb.test.txt")
word_to_id = tx.data.make_vocab(
train_path, newline_token="", return_type="dict")
assert len(word_to_id) == 10000
train_text = tx.data.read_words(
train_path, newline_token="")
train_text_id = [word_to_id[w] for w in train_text if w in word_to_id]
valid_text = tx.data.read_words(
valid_path, newline_token="")
valid_text_id = [word_to_id[w] for w in valid_text if w in word_to_id]
test_text = tx.data.read_words(
test_path, newline_token="")
test_text_id = [word_to_id[w] for w in test_text if w in word_to_id]
data = {
"train_text": train_text,
"valid_text": valid_text,
"test_text": test_text,
"train_text_id": train_text_id,
"valid_text_id": valid_text_id,
"test_text_id": test_text_id,
"vocab": word_to_id,
"vocab_size": len(word_to_id)
}
return data
================================================
FILE: texar_repo/examples/rl_gym/README.md
================================================
# Reinforcement Learning for Games #
This example implements three RL algorithms for the Cartpole game based on the OpenAI Gym environment:
* [pg_cartpole.py](./pg_cartpole.py) uses Policy Gradient
* [dqn_cartpole.py](./dqn_cartpole.py) uses Deep-Q
* [ac_cartpole.py](./ac_cartpole.py) uses Actor-critic
The example is for demonstrating the Texar RL APIs (for games), and only implements the most basic versions of the respective algorithms.
## Usage ##
Run the following cmd to start training:
```
python pg_cartpole.py --config config
python dqn_cartpole.py --config config
python ac_cartpole.py --config config
```
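The three scripts share the same reset/act/observe episode loop. A dependency-free sketch of that loop (the `ToyEnv` and `ToyAgent` stubs are illustrative stand-ins, not Gym or Texar APIs):

```python
import random

class ToyEnv:
    """Toy episodic environment: terminates after 10 steps, reward 1 per step."""
    def reset(self):
        self.t = 0
        return 0.0                      # initial observation
    def step(self, action):
        self.t += 1
        terminal = self.t >= 10
        return 0.0, 1.0, terminal, {}   # next_observ, reward, terminal, info

class ToyAgent:
    def reset(self):
        pass                            # clear per-episode state
    def get_action(self, observ):
        return random.choice([0, 1])    # random policy
    def observe(self, reward, terminal):
        pass                            # a real agent would learn here

env, agent = ToyEnv(), ToyAgent()
reward_sum = 0.0
observ = env.reset()
agent.reset()
while True:                             # one episode
    action = agent.get_action(observ)
    observ, reward, terminal, _ = env.step(action)
    agent.observe(reward, terminal)
    reward_sum += reward
    if terminal:
        break
print(reward_sum)  # 10.0
```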
================================================
FILE: texar_repo/examples/rl_gym/ac_cartpole.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Actor-critic for the CartPole game in OpenAI gym.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name
import importlib
import gym
import tensorflow as tf
import texar as tx
flags = tf.flags
flags.DEFINE_string("config", "config", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
if __name__ == '__main__':
env = gym.make('CartPole-v0')
env = env.unwrapped
env_config = tx.agents.get_gym_env_config(env)
agent = tx.agents.ActorCriticAgent(env_config=env_config)
with tf.Session() as sess:
agent.sess = sess
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
feed_dict = {tx.global_mode(): tf.estimator.ModeKeys.TRAIN}
for e in range(5000):
reward_sum = 0.
observ = env.reset()
agent.reset()
while True:
action = agent.get_action(observ, feed_dict=feed_dict)
next_observ, reward, terminal, _ = env.step(action=action)
agent.observe(reward, terminal, feed_dict=feed_dict)
observ = next_observ
reward_sum += reward
if terminal:
break
if (e + 1) % 10 == 0:
print('episode {}: {}'.format(e + 1, reward_sum))
================================================
FILE: texar_repo/examples/rl_gym/config.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Cartpole config.
"""
# pylint: disable=invalid-name
policy_hparams = None # Use default hyperparameters
pg_agent_hparams = {
"policy_hparams": policy_hparams,
"normalize_reward": True
}
================================================
FILE: texar_repo/examples/rl_gym/dqn_cartpole.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Deep Q-learning for the CartPole game in OpenAI gym.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name
import importlib
import gym
import tensorflow as tf
import texar as tx
flags = tf.flags
flags.DEFINE_string("config", "config", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
if __name__ == '__main__':
env = gym.make('CartPole-v0')
env = env.unwrapped
env_config = tx.agents.get_gym_env_config(env)
with tf.Session() as sess:
agent = tx.agents.DQNAgent(sess=sess, env_config=env_config)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
feed_dict = {tx.global_mode(): tf.estimator.ModeKeys.TRAIN}
for e in range(500):
reward_sum = 0.
observ = env.reset()
agent.reset()
while True:
action = agent.get_action(observ, feed_dict=feed_dict)
next_observ, reward, terminal, _ = env.step(action=action)
agent.observe(reward, terminal, feed_dict=feed_dict)
observ = next_observ
reward_sum += reward
if terminal:
break
if (e + 1) % 10 == 0:
print('episode {}: {}'.format(e + 1, reward_sum))
================================================
FILE: texar_repo/examples/rl_gym/pg_cartpole.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Policy gradient for the CartPole game in OpenAI gym.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name
import importlib
import gym
import tensorflow as tf
import texar as tx
from texar.agents import PGAgent
flags = tf.flags
flags.DEFINE_string("config", "config", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
env = gym.make('CartPole-v0')
env = env.unwrapped
env_config = tx.agents.get_gym_env_config(env)
agent = PGAgent(
env_config,
policy_kwargs={'action_space': env_config.action_space},
hparams=config.pg_agent_hparams)
sess = tf.Session()
agent.sess = sess
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
feed_dict = {tx.global_mode(): tf.estimator.ModeKeys.TRAIN}
for e in range(300):
reward_sum = 0.
observ = env.reset()
agent.reset()
while True:
action = agent.get_action(observ, feed_dict=feed_dict)
next_observ, reward, terminal, _ = env.step(action=action)
if terminal:
reward = 0.
agent.observe(reward, terminal, feed_dict=feed_dict)
observ = next_observ
reward_sum += reward
if terminal:
break
if (e + 1) % 10 == 0:
print('episode {}: {}'.format(e + 1, reward_sum))
sess.close()
if __name__ == '__main__':
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/sentence_classifier/README.md
================================================
# Sentence Sentiment Classifier #
This example builds a sentence convolutional classifier and trains it on [SST data](https://nlp.stanford.edu/sentiment/index.html). The example config [config_kim.py](./config_kim.py) corresponds to the paper
[(Kim) Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf).
The example shows:
* Construction of a simple model, involving the `Embedder` and `Conv1DClassifier`.
* Use of Texar `MultiAlignedData` to read parallel text and label data.
## Usage ##
Use the following cmd to download and prepare the SST binary data:
```
python sst_data_preprocessor.py [--data_path ./data]
```
Here
* `--data_path` specifies the directory to store the SST data. If the data files do not exist, the program will automatically download, extract, and pre-process the data.
The following cmd trains the model with Kim's config:
```
python clas_main.py --config config_kim
```
Here:
* `--config` specifies the config file to use. E.g., the above uses the configuration defined in [config_kim.py](./config_kim.py)
The model trains and evaluates on the validation data after every epoch, and evaluates on the test data whenever a new best validation accuracy is obtained.
## Results ##
The model achieves around `83%` test set accuracy.
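For intuition, the feature extraction in [config_kim.py](./config_kim.py) — convolutions with kernel sizes 3/4/5, 100 filters each, followed by max-over-time pooling — can be sketched shape-wise in NumPy. This illustrates only the tensor shapes, not the Texar `Conv1DClassifier` internals:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim = 56, 300               # max_seq_length and "dim" from config_kim.py
x = rng.normal(size=(seq_len, emb_dim))  # one embedded sentence

features = []
for k in (3, 4, 5):                          # the three kernel sizes
    W = rng.normal(size=(100, k, emb_dim))   # 100 filters of width k
    # Valid 1-D convolution over time, then ReLU
    conv = np.stack([
        np.maximum(0, np.einsum('fkd,kd->f', W, x[t:t + k]))
        for t in range(seq_len - k + 1)
    ])                                   # (seq_len - k + 1, 100)
    features.append(conv.max(axis=0))    # max-over-time pooling -> (100,)

feature_vec = np.concatenate(features)   # (300,) fed to the softmax layer
print(feature_vec.shape)  # (300,)
```

Concatenating 100 pooled features per kernel size yields the 300-dim sentence representation that the (dropout + dense softmax) classification head consumes.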
================================================
FILE: texar_repo/examples/sentence_classifier/clas_main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example for building a sentence convolutional classifier.
Use `./sst_data_preprocessor.py` to download and clean the SST binary data.
To run:
$ python clas_main.py --config=config_kim
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import importlib
import tensorflow as tf
import texar as tx
# pylint: disable=invalid-name, too-many-locals
flags = tf.flags
flags.DEFINE_string("config", "config_kim", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
# Data
train_data = tx.data.MultiAlignedData(config.train_data)
val_data = tx.data.MultiAlignedData(config.val_data)
test_data = tx.data.MultiAlignedData(config.test_data)
iterator = tx.data.TrainTestDataIterator(train_data, val_data, test_data)
batch = iterator.get_next()
# Model architecture
embedder = tx.modules.WordEmbedder(
vocab_size=train_data.vocab('x').size, hparams=config.emb)
classifier = tx.modules.Conv1DClassifier(config.clas)
logits, pred = classifier(embedder(batch['x_text_ids']))
# Losses & train ops
loss = tf.losses.sparse_softmax_cross_entropy(
labels=batch['y'], logits=logits)
accu = tx.evals.accuracy(batch['y'], pred)
train_op = tx.core.get_train_op(loss, hparams=config.opt)
def _run_epoch(sess, mode, epoch=0, verbose=False):
is_train = tx.utils.is_train_mode_py(mode)
fetches = {
"accu": accu,
"batch_size": tx.utils.get_batch_size(batch['y'])
}
if is_train:
fetches["train_op"] = train_op
feed_dict = {tx.context.global_mode(): mode}
cum_accu = 0.
nsamples = 0
step = 0
while True:
try:
rets = sess.run(fetches, feed_dict)
step += 1
accu_ = rets['accu']
cum_accu += accu_ * rets['batch_size']
nsamples += rets['batch_size']
if verbose and (step == 1 or step % 100 == 0):
tf.logging.info(
"epoch: {0:2} step: {1:4} accu: {2:.4f}"
.format(epoch, step, accu_))
except tf.errors.OutOfRangeError:
break
return cum_accu / nsamples
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
best_val_accu = -1.
for epoch in range(config.num_epochs):
# Train
iterator.switch_to_train_data(sess)
train_accu = _run_epoch(sess, tf.estimator.ModeKeys.TRAIN, epoch)
# Val
iterator.switch_to_val_data(sess)
val_accu = _run_epoch(sess, tf.estimator.ModeKeys.EVAL, epoch)
tf.logging.info('epoch: {0:2} train accu: {1:.4f} val accu: {2:.4f}'
.format(epoch+1, train_accu, val_accu))
# Test
if val_accu > best_val_accu:
best_val_accu = val_accu
iterator.switch_to_test_data(sess)
test_accu = _run_epoch(sess, tf.estimator.ModeKeys.EVAL)
tf.logging.info('test accu: {0:.4f}'.format(test_accu))
if __name__ == '__main__':
tf.logging.set_verbosity(tf.logging.INFO)
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/sentence_classifier/config_kim.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Sentence convolutional classifier config.
This is (approximately) the config of the paper:
(Kim) Convolutional Neural Networks for Sentence Classification
https://arxiv.org/pdf/1408.5882.pdf
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
import copy
num_epochs = 15
train_data = {
"batch_size": 50,
"datasets": [
{
"files": "./data/sst2.train.sentences.txt",
"vocab_file": "./data/sst2.vocab",
# Discards samples with length > 56
"max_seq_length": 56,
"length_filter_mode": "discard",
# Do not append BOS/EOS tokens to the sentences
"bos_token": "",
"eos_token": "",
"data_name": "x"
},
{
"files": "./data/sst2.train.labels.txt",
"data_type": "int",
"data_name": "y"
}
]
}
# The val and test data have the same config with the train data, except
# for the file names
val_data = copy.deepcopy(train_data)
val_data["datasets"][0]["files"] = "./data/sst2.dev.sentences.txt"
val_data["datasets"][1]["files"] = "./data/sst2.dev.labels.txt"
test_data = copy.deepcopy(train_data)
test_data["datasets"][0]["files"] = "./data/sst2.test.sentences.txt"
test_data["datasets"][1]["files"] = "./data/sst2.test.labels.txt"
# Word embedding
emb = {
"dim": 300
}
# Classifier
clas = {
"num_conv_layers": 1,
"filters": 100,
"kernel_size": [3, 4, 5],
"conv_activation": "relu",
"pooling": "MaxPooling1D",
"num_dense_layers": 0,
"dropout_conv": [1],
"dropout_rate": 0.5,
"num_classes": 2
}
# Optimization
# Just use the default config, e.g., Adam Optimizer
opt = {}
================================================
FILE: texar_repo/examples/sentence_classifier/sst_data_preprocessor.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Preparing the SST2 dataset.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import re
from io import open # pylint: disable=redefined-builtin
import tensorflow as tf
import texar as tx
# pylint: disable=invalid-name, too-many-locals
flags = tf.flags
flags.DEFINE_string("data_path", "./data",
"Directory containing SST data. "
"E.g., ./data/sst2.train.sentences.txt. If not exists, "
"the directory will be created and SST raw data will "
"be downloaded.")
FLAGS = flags.FLAGS
def clean_sst_text(text):
"""Cleans tokens in the SST data, which has already been tokenized.
"""
text = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", text)
text = re.sub(r"\s{2,}", " ", text)
return text.strip().lower()
def transform_raw_sst(data_path, raw_fn, new_fn):
"""Transforms the raw data format to a new format.
"""
fout_x_name = os.path.join(data_path, new_fn + '.sentences.txt')
fout_x = open(fout_x_name, 'w', encoding='utf-8')
fout_y_name = os.path.join(data_path, new_fn + '.labels.txt')
fout_y = open(fout_y_name, 'w', encoding='utf-8')
fin_name = os.path.join(data_path, raw_fn)
with open(fin_name, 'r', encoding='utf-8') as fin:
for line in fin:
parts = line.strip().split()
label = parts[0]
sent = ' '.join(parts[1:])
sent = clean_sst_text(sent)
fout_x.write(sent + '\n')
fout_y.write(label + '\n')
return fout_x_name, fout_y_name
def prepare_data(data_path):
"""Preprocesses SST2 data.
"""
train_path = os.path.join(data_path, "sst.train.sentences.txt")
if not tf.gfile.Exists(train_path):
url = ('https://raw.githubusercontent.com/ZhitingHu/'
'logicnn/master/data/raw/')
files = ['stsa.binary.phrases.train', 'stsa.binary.dev',
'stsa.binary.test']
for fn in files:
tx.data.maybe_download(url + fn, data_path, extract=True)
fn_train, _ = transform_raw_sst(
data_path, 'stsa.binary.phrases.train', 'sst2.train')
transform_raw_sst(data_path, 'stsa.binary.dev', 'sst2.dev')
transform_raw_sst(data_path, 'stsa.binary.test', 'sst2.test')
vocab = tx.data.make_vocab(fn_train)
fn_vocab = os.path.join(data_path, 'sst2.vocab')
with open(fn_vocab, 'w', encoding='utf-8') as f_vocab:
for v in vocab:
f_vocab.write(v + '\n')
tf.logging.info('Preprocessing done: {}'.format(data_path))
def _main(_):
prepare_data(FLAGS.data_path)
if __name__ == '__main__':
tf.logging.set_verbosity(tf.logging.INFO)
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/seq2seq_attn/README.md
================================================
# Seq2seq Model #
This example builds an attentional seq2seq model for machine translation.
## Usage ##
### Dataset ###
Two example datasets are provided:
* toy_copy: A small toy autoencoding dataset from [TF Seq2seq toolkit](https://github.com/google/seq2seq/tree/2500c26add91b079ca00cf1f091db5a99ddab9ae).
* iwslt14: The benchmark [IWSLT2014](https://sites.google.com/site/iwsltevaluation2014/home) (de-en) machine translation dataset, following [(Ranzato et al., 2015)](https://arxiv.org/pdf/1511.06732.pdf) for data pre-processing.
Download the data with the following cmds:
```
python prepare_data.py --data toy_copy
python prepare_data.py --data iwslt14
```
### Train the model ###
Train the model with the following cmd:
```
python seq2seq_attn.py --config_model config_model --config_data config_toy_copy
```
Here:
* `--config_model` specifies the model config. Note that the `.py` suffix should not be included.
* `--config_data` specifies the data config.
[config_model.py](./config_model.py) specifies a single-layer seq2seq model with Luong attention and bi-directional RNN encoder. Hyperparameters taking default values can be omitted from the config file.
For demonstration purposes, [config_model_full.py](./config_model_full.py) gives all possible hyperparameters for the model. The two config files lead to the same model.
## Results ##
On the IWSLT14 dataset, using the original target texts as reference (no `<UNK>` tokens in the reference), the model achieves `BLEU = 26.44 ± 0.18`.
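For intuition, the Luong attention selected in [config_model.py](./config_model.py) scores each encoder state against the current decoder state and mixes the encoder states into a context vector. A minimal NumPy sketch of the "general" (bilinear) Luong score, with `num_units` matching the config and `src_len` made up for illustration:

```python
import numpy as np

def luong_attention(decoder_state, encoder_states, W):
    """Luong 'general' score: s_i = h_dec^T W h_enc_i, softmax over i,
    then a context vector as the attention-weighted sum of encoder states."""
    scores = encoder_states @ (W @ decoder_state)   # (src_len,)
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                           # attention distribution
    context = weights @ encoder_states              # (num_units,)
    return context, weights

num_units, src_len = 256, 12                        # num_units as in the config
rng = np.random.default_rng(0)
h_dec = rng.normal(size=num_units)                  # current decoder state
h_enc = rng.normal(size=(src_len, num_units))       # encoder outputs
W = rng.normal(size=(num_units, num_units)) / np.sqrt(num_units)
context, weights = luong_attention(h_dec, h_enc, W)
print(weights.sum())  # softmax normalizes to 1
```

In the full model, the context vector is combined with the decoder state through the `attention_layer_size` projection before predicting the next token.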
================================================
FILE: texar_repo/examples/seq2seq_attn/config_iwslt14.py
================================================
num_epochs = 15
display = 500
source_vocab_file = './data/iwslt14/vocab.de'
target_vocab_file = './data/iwslt14/vocab.en'
train = {
'batch_size': 32,
'allow_smaller_final_batch': False,
'source_dataset': {
"files": 'data/iwslt14/train.de',
'vocab_file': source_vocab_file,
'max_seq_length': 50
},
'target_dataset': {
'files': 'data/iwslt14/train.en',
'vocab_file': target_vocab_file,
'max_seq_length': 50
}
}
val = {
'batch_size': 32,
'shuffle': False,
'source_dataset': {
"files": 'data/iwslt14/valid.de',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/iwslt14/valid.en',
'vocab_file': target_vocab_file,
}
}
test = {
'batch_size': 32,
'shuffle': False,
'source_dataset': {
"files": 'data/iwslt14/test.de',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/iwslt14/test.en',
'vocab_file': target_vocab_file,
}
}
================================================
FILE: texar_repo/examples/seq2seq_attn/config_model.py
================================================
# Attentional Seq2seq model.
# Hyperparameters not specified here will take the default values.
num_units = 256
beam_width = 10
embedder = {
'dim': num_units
}
encoder = {
'rnn_cell_fw': {
'kwargs': {
'num_units': num_units
}
}
}
decoder = {
'rnn_cell': {
'kwargs': {
'num_units': num_units
},
},
'attention': {
'kwargs': {
'num_units': num_units,
},
'attention_layer_size': num_units
}
}
opt = {
'optimizer': {
'type': 'AdamOptimizer',
'kwargs': {
'learning_rate': 0.001,
},
},
}
================================================
FILE: texar_repo/examples/seq2seq_attn/config_model_full.py
================================================
# The full possible hyperparameters for the attentional seq2seq model.
# Most of the hyperparameters take the default values and are not necessary to
# specify explicitly. The config here results in the same model as
# `config_model.py`.
num_units = 256
beam_width = 10
# --------------------- Embedder --------------------- #
embedder = {
'dim': num_units,
'initializer': {
'type': 'random_uniform_initializer',
'kwargs': {
'minval': -0.1,
'maxval': 0.1,
'seed': None
},
},
'regularizer': {
'type': 'L1L2',
'kwargs': {
'l1': 0,
'l2': 0
}
},
'dropout_rate': 0,
'dropout_strategy': 'element',
'trainable': True,
'name': 'word_embedder'
}
# --------------------- Encoder --------------------- #
encoder = {
'rnn_cell_fw': {
'type': 'LSTMCell',
'kwargs': {
'num_units': num_units,
'forget_bias': 1.0,
'activation': None,
# Other arguments go here for tf.nn.rnn_cell.LSTMCell
# ...
},
'num_layers': 1,
'dropout': {
'input_keep_prob': 1.0,
'output_keep_prob': 1.0,
'state_keep_prob': 1.0,
'variational_recurrent': False,
'input_size': [],
},
'residual': False,
'highway': False,
},
'rnn_cell_bw': {
# The same possible hyperparameters as with 'rnn_cell_fw'
# ...
},
'rnn_cell_share_config': True,
'output_layer_fw': {
'num_layers': 0,
'layer_size': 128,
'activation': 'identity',
'final_layer_activation': None,
'other_dense_kwargs': None,
'dropout_layer_ids': [],
'dropout_rate': 0.5,
'variational_dropout': False
},
'output_layer_bw': {
# The same possible hyperparameters as with 'output_layer_fw'
# ...
},
'output_layer_share_config': True,
'name': 'bidirectional_rnn_encoder'
}
# --------------------- Decoder --------------------- #
decoder = {
'rnn_cell': {
'type': 'LSTMCell',
'kwargs': {
'num_units': num_units,
'forget_bias': 1.0,
'activation': None,
# Other arguments go here for tf.nn.rnn_cell.LSTMCell
# ...
},
'num_layers': 1,
'dropout': {
'input_keep_prob': 1.0,
'output_keep_prob': 1.0,
'state_keep_prob': 1.0,
'variational_recurrent': False,
'input_size': [],
},
'residual': False,
'highway': False,
},
'attention': {
'type': 'LuongAttention',
'kwargs': {
'num_units': num_units,
'scale': False,
'probability_fn': None,
'score_mask_value': None,
# Other arguments go here for tf.contrib.seq2seq.LuongAttention
# ...
},
'attention_layer_size': num_units,
'alignment_history': False,
'output_attention': True,
},
'helper_train': {
'type': 'TrainingHelper',
'kwargs': {
# Arguments go here for tf.contrib.seq2seq.TrainingHelper
}
},
'helper_infer': {
# The same possible hyperparameters as with 'helper_train'
# ...
},
'max_decoding_length_train': None,
'max_decoding_length_infer': None,
'name': 'attention_rnn_decoder'
}
# --------------------- Optimization --------------------- #
opt = {
'optimizer': {
'type': 'AdamOptimizer',
'kwargs': {
'learning_rate': 0.001,
# Other keyword arguments for the optimizer class
},
},
'learning_rate_decay': {
# Hyperparameters of learning rate decay
},
'gradient_clip': {
# Hyperparameters of gradient clipping
},
'gradient_noise_scale': None,
'name': None
}
================================================
FILE: texar_repo/examples/seq2seq_attn/config_toy_copy.py
================================================
num_epochs = 4
display = 50
source_vocab_file = './data/toy_copy/train/vocab.sources.txt'
target_vocab_file = './data/toy_copy/train/vocab.targets.txt'
train = {
'batch_size': 32,
'source_dataset': {
"files": './data/toy_copy/train/sources.txt',
'vocab_file': source_vocab_file
},
'target_dataset': {
'files': './data/toy_copy/train/targets.txt',
'vocab_file': target_vocab_file
}
}
val = {
'batch_size': 32,
'source_dataset': {
"files": './data/toy_copy/dev/sources.txt',
'vocab_file': source_vocab_file
},
'target_dataset': {
"files": './data/toy_copy/dev/targets.txt',
'vocab_file': target_vocab_file
}
}
test = {
'batch_size': 32,
'source_dataset': {
"files": './data/toy_copy/test/sources.txt',
'vocab_file': source_vocab_file
},
'target_dataset': {
"files": './data/toy_copy/test/targets.txt',
'vocab_file': target_vocab_file
}
}
================================================
FILE: texar_repo/examples/seq2seq_attn/prepare_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Downloads data.
"""
import tensorflow as tf
import texar as tx
# pylint: disable=invalid-name
flags = tf.flags
flags.DEFINE_string("data", "iwslt14", "Data to download [iwslt14|toy_copy]")
FLAGS = flags.FLAGS
def prepare_data():
"""Downloads data.
"""
if FLAGS.data == 'iwslt14':
tx.data.maybe_download(
urls='https://drive.google.com/file/d/'
'1y4mUWXRS2KstgHopCS9koZ42ENOh6Yb9/view?usp=sharing',
path='./',
filenames='iwslt14.zip',
extract=True)
elif FLAGS.data == 'toy_copy':
tx.data.maybe_download(
urls='https://drive.google.com/file/d/'
'1fENE2rakm8vJ8d3voWBgW4hGlS6-KORW/view?usp=sharing',
path='./',
filenames='toy_copy.zip',
extract=True)
else:
raise ValueError('Unknown data: {}'.format(FLAGS.data))
def main():
"""Entrypoint.
"""
prepare_data()
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/seq2seq_attn/seq2seq_attn.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Attentional Seq2seq.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
#pylint: disable=invalid-name, too-many-arguments, too-many-locals
import importlib
import tensorflow as tf
import texar as tx
flags = tf.flags
flags.DEFINE_string("config_model", "config_model", "The model config.")
flags.DEFINE_string("config_data", "config_iwslt14", "The dataset config.")
FLAGS = flags.FLAGS
config_model = importlib.import_module(FLAGS.config_model)
config_data = importlib.import_module(FLAGS.config_data)
def build_model(batch, train_data):
"""Assembles the seq2seq model.
"""
source_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.source_vocab.size, hparams=config_model.embedder)
encoder = tx.modules.BidirectionalRNNEncoder(
hparams=config_model.encoder)
enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))
target_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.target_vocab.size, hparams=config_model.embedder)
decoder = tx.modules.AttentionRNNDecoder(
memory=tf.concat(enc_outputs, axis=2),
memory_sequence_length=batch['source_length'],
vocab_size=train_data.target_vocab.size,
hparams=config_model.decoder)
training_outputs, _, _ = decoder(
decoding_strategy='train_greedy',
inputs=target_embedder(batch['target_text_ids'][:, :-1]),
sequence_length=batch['target_length'] - 1)
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=batch['target_text_ids'][:, 1:],
logits=training_outputs.logits,
sequence_length=batch['target_length'] - 1)
train_op = tx.core.get_train_op(mle_loss, hparams=config_model.opt)
start_tokens = tf.ones_like(batch['target_length']) * \
train_data.target_vocab.bos_token_id
beam_search_outputs, _, _ = \
tx.modules.beam_search_decode(
decoder_or_cell=decoder,
embedding=target_embedder,
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
beam_width=config_model.beam_width,
max_decoding_length=60)
return train_op, beam_search_outputs
def main():
"""Entrypoint.
"""
train_data = tx.data.PairedTextData(hparams=config_data.train)
val_data = tx.data.PairedTextData(hparams=config_data.val)
test_data = tx.data.PairedTextData(hparams=config_data.test)
data_iterator = tx.data.TrainTestDataIterator(
train=train_data, val=val_data, test=test_data)
batch = data_iterator.get_next()
train_op, infer_outputs = build_model(batch, train_data)
def _train_epoch(sess):
data_iterator.switch_to_train_data(sess)
step = 0
while True:
try:
loss = sess.run(train_op)
if step % config_data.display == 0:
print("step={}, loss={:.4f}".format(step, loss))
step += 1
except tf.errors.OutOfRangeError:
break
def _eval_epoch(sess, mode):
if mode == 'val':
data_iterator.switch_to_val_data(sess)
else:
data_iterator.switch_to_test_data(sess)
refs, hypos = [], []
while True:
try:
fetches = [
batch['target_text'][:, 1:],
infer_outputs.predicted_ids[:, :, 0]
]
feed_dict = {
tx.global_mode(): tf.estimator.ModeKeys.EVAL
}
target_texts_ori, output_ids = \
sess.run(fetches, feed_dict=feed_dict)
target_texts = tx.utils.strip_special_tokens(
target_texts_ori, is_token_list=True)
output_texts = tx.utils.map_ids_to_strs(
ids=output_ids, vocab=val_data.target_vocab)
for hypo, ref in zip(output_texts, target_texts):
hypos.append(hypo)
refs.append([ref])
except tf.errors.OutOfRangeError:
break
return tx.evals.corpus_bleu_moses(list_of_references=refs,
hypotheses=hypos)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
best_val_bleu = -1.
for i in range(config_data.num_epochs):
_train_epoch(sess)
val_bleu = _eval_epoch(sess, 'val')
best_val_bleu = max(best_val_bleu, val_bleu)
print('val epoch={}, BLEU={:.4f}; best-ever={:.4f}'.format(
i, val_bleu, best_val_bleu))
test_bleu = _eval_epoch(sess, 'test')
print('test epoch={}, BLEU={:.4f}'.format(i, test_bleu))
print('=' * 50)
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/seq2seq_configs/README.md
================================================
# Seq2seq Model #
This example builds a (plain) seq2seq model with Texar's model template and TensorFlow Estimator.
## Usage ##
### Dataset ###
Download the example dataset:
* toy_copy: A small toy autoencoding dataset from [TF Seq2seq toolkit](https://github.com/google/seq2seq/tree/2500c26add91b079ca00cf1f091db5a99ddab9ae).
```
python [PATH_TEXAR]/examples/seq2seq_attn/prepare_data.py --data toy_copy
```
### Train the model ###
Train the model with the following cmd:
```
python [PATH_TEXAR]/bin/train.py --config_paths config_model_small.yml,config_data_toy_copy.yml
```
See [train.py](../../bin/train.py) for other available configurations.
[config_model_small.yml](./config_model_small.yml) specifies a small-size model with a single-layer RNN encoder/decoder. [config_model_medium.yml](./config_model_medium.yml) specifies a medium-size one with a 2-layer RNN encoder/decoder.
The model will be trained/evaluated/checkpointed within the [Tensorflow Estimator](https://www.tensorflow.org/guide/estimators).
================================================
FILE: texar_repo/examples/seq2seq_configs/config_data_toy_copy.yml
================================================
# NMT data config. See `texar.data.PairedTextData.default_hparams()` for
# hyperparameters of train/eval data. Hyperparameters not specified here will
# take the default values.
data_hparams_train:
num_epochs: 10
batch_size: 32
source_dataset:
files: ./data/toy_copy/train/sources.txt
vocab_file: ./data/toy_copy/train/vocab.sources.txt
max_seq_length: 30
target_dataset:
files: ./data/toy_copy/train/targets.txt
vocab_file: ./data/toy_copy/train/vocab.targets.txt
max_seq_length: 30
data_hparams_eval:
batch_size: 32
shuffle: False
source_dataset:
files: ./data/toy_copy/dev/sources.txt
vocab_file: ./data/toy_copy/train/vocab.sources.txt
max_seq_length: 50
target_dataset:
files: ./data/toy_copy/dev/targets.txt
vocab_file: ./data/toy_copy/train/vocab.targets.txt
max_seq_length: 50
================================================
FILE: texar_repo/examples/seq2seq_configs/config_model_medium.yml
================================================
# Basic Seq2seq model of medium size. See
# `texar.models.BasicSeq2seq.default_hparams()` for possible hyperparameters
# default values. Hyperparameters not specified here will take the default
# values.
model: BasicSeq2seq
model_hparams:
source_embedder_hparams:
dim: 256
encoder_hparams:
rnn_cell:
type: GRUCell
kwargs:
num_units: 256
num_layers: 2
dropout:
input_keep_prob: 0.8
decoder_hparams:
rnn_cell:
type: GRUCell
kwargs:
num_units: 256
num_layers: 2
dropout:
input_keep_prob: 0.8
optimization:
optimizer:
type: AdamOptimizer
kwargs:
learning_rate: 0.0001
================================================
FILE: texar_repo/examples/seq2seq_configs/config_model_small.yml
================================================
# Basic Seq2seq model of small size. See
# `texar.models.BasicSeq2seq.default_hparams()` for possible hyperparameters
# default values. Hyperparameters not specified here will take the default
# values.
model: BasicSeq2seq
model_hparams:
source_embedder_hparams:
dim: 128
encoder_hparams:
rnn_cell:
type: GRUCell
kwargs:
num_units: 128
dropout:
input_keep_prob: 0.8
decoder_hparams:
rnn_cell:
type: GRUCell
kwargs:
num_units: 128
dropout:
input_keep_prob: 0.8
optimization:
optimizer:
type: AdamOptimizer
kwargs:
learning_rate: 0.0001
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/README.md
================================================
# Sequence Generation Algorithms Tackling Exposure Bias #
Despite its computational simplicity and efficiency, maximum likelihood training of sequence generation models (e.g., RNNs) suffers from exposure bias [(Ranzato et al., 2015)](https://arxiv.org/pdf/1511.06732.pdf). That is, the model is trained to predict the next token given the previous ground-truth tokens; at test time, since the resulting model does not have access to the ground truth, tokens generated by the model itself are instead used to make the next prediction. This discrepancy between training and test means that mistakes in prediction can quickly accumulate.
This example provides implementations of some classic and advanced training algorithms that tackle exposure bias. The base model is an attentional seq2seq.
* **Maximum Likelihood (MLE)**: attentional seq2seq model with maximum likelihood training.
* **Reward Augmented Maximum Likelihood (RAML)**: Described in [(Norouzi et al., 2016)](https://arxiv.org/pdf/1609.00150.pdf) and we use the sampling approach (n-gram replacement) by [(Ma et al., 2017)](https://arxiv.org/abs/1705.07136).
* **Scheduled Sampling**: Described in [(Bengio et al., 2015)](https://arxiv.org/abs/1506.03099)
* **Interpolation Algorithm**: Described in [(Tan et al., 2018) Connecting the Dots Between MLE and RL for Sequence Generation](https://arxiv.org/abs/1811.09740)
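The training/test discrepancy above can be sketched with a toy decoder (`toy_model` below is a hypothetical stand-in, not part of this example):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(prev_token, vocab=5):
    """Hypothetical next-token model: a noisy distribution that mostly
    favors (prev_token + 1) % vocab."""
    logits = rng.normal(scale=0.5, size=vocab)
    logits[(prev_token + 1) % vocab] += 2.0
    return int(np.argmax(logits))

gold = [1, 2, 3, 4]

# Teacher forcing (training): every step conditions on the gold prefix.
teacher_preds = [toy_model(t) for t in [0] + gold[:-1]]

# Free running (test): every step conditions on the model's own previous
# output, so one early mistake shifts all later conditioning contexts.
tok, free_preds = 0, []
for _ in gold:
    tok = toy_model(tok)
    free_preds.append(tok)

print(teacher_preds, free_preds)
```

Under teacher forcing each step is anchored to the gold prefix, while in free running an early error propagates into every later conditioning context; the algorithms listed above differ in how they close this gap.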
## Usage ##
### Dataset ###
Two example datasets are provided:
* iwslt14: The benchmark [IWSLT2014](https://sites.google.com/site/iwsltevaluation2014/home) (de-en) machine translation dataset, following [(Ranzato et al., 2015)](https://arxiv.org/pdf/1511.06732.pdf) for data pre-processing.
* gigaword: The benchmark [GIGAWORD](https://catalog.ldc.upenn.edu/LDC2003T05) text summarization dataset. We sampled 200K out of the 3.8M pre-processed training examples provided by [(Rush et al., 2015)](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) for the sake of training efficiency. We used the refined validation and test sets provided by [(Zhou et al., 2017)](https://arxiv.org/pdf/1704.07073.pdf).
Download the data with the following commands:
```
python utils/prepare_data.py --data iwslt14
python utils/prepare_data.py --data giga
```
### Train the models ###
#### Baseline Attentional Seq2seq
```
python baseline_seq2seq_attn_main.py \
--config_model configs.config_model \
--config_data configs.config_iwslt14
```
Here:
* `--config_model` specifies the model config. Note: do not include the `.py` suffix.
* `--config_data` specifies the data config.
[configs.config_model.py](./configs/config_model.py) specifies a single-layer seq2seq model with Luong attention and bi-directional RNN encoder. Hyperparameters taking default values can be omitted from the config file.
For demonstration purposes, [configs.config_model_full.py](./configs/config_model_full.py) gives all possible hyperparameters for the model. The two config files lead to the same model.
#### Reward Augmented Maximum Likelihood (RAML)
```
python raml_main.py \
--config_model configs.config_model \
--config_data configs.config_iwslt14 \
--raml_file data/iwslt14/samples_iwslt14.txt \
--n_samples 10
```
Here:
* `--raml_file` specifies the file containing the augmented samples and rewards.
* `--n_samples` specifies the number of augmented samples for every target sentence.
* `--tau` specifies the temperature of the exponentiated payoff distribution in RAML.
In the downloaded datasets, we have provided example files for `--raml_file`, which include augmented samples for ```iwslt14``` and ```gigaword``` respectively. We also provide scripts for generating augmented samples yourself. Please refer to [utils/raml_samples_generation](utils/raml_samples_generation).
#### Scheduled Sampling
```
python scheduled_sampling_main.py \
--config_model configs.config_model \
--config_data configs.config_iwslt14 \
--decay_factor 500.
```
Here:
* `--decay_factor` specifies the hyperparameter controlling how quickly the probability of sampling from the model increases.
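As a rough guide to what `--decay_factor` controls, the inverse-sigmoid schedule from Bengio et al. (2015) can be sketched as follows (the exact schedule used by `scheduled_sampling_main.py` may differ):

```python
import math

def teacher_forcing_prob(step, decay_factor=500.):
    """Inverse-sigmoid decay from Bengio et al. (2015): the probability of
    feeding the ground-truth token starts near 1 and decays toward 0; a
    larger decay_factor gives a slower decay."""
    return decay_factor / (decay_factor + math.exp(step / decay_factor))

# The decay is slow at first, then drops off sharply:
for s in (0, 1000, 3000, 5000):
    print(s, round(teacher_forcing_prob(s), 3))
```

Early in training the decoder is almost always fed the ground truth; as the schedule decays, it increasingly conditions on its own samples, matching the test-time setting.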
#### Interpolation Algorithm
```
python interpolation_main.py \
--config_model configs.config_model \
--config_data configs.config_iwslt14 \
--lambdas_init [0.04,0.96,0.0] \
--delta_lambda_self 0.06 \
--delta_lambda_reward 0.06 \
--lambda_reward_steps 4
```
Here:
* `--lambdas_init` specifies the initial value of lambdas.
* `--delta_lambda_reward` specifies the increment of lambda_reward every annealing step.
* `--delta_lambda_self` specifies the decrement of lambda_self every annealing step.
* `--lambda_reward_steps` specifies how many times lambda_reward is increased after each increase of lambda_self.
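A minimal sketch of what the lambdas mean (source names here are illustrative; see `interpolation_helper.py` for the actual sampling logic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Initial lambdas, matching --lambdas_init above: the probabilities of
# drawing the next token from the ground truth, the model's own sample,
# or a reward-weighted sample.
lambdas = [0.04, 0.96, 0.0]
sources = ['ground_truth', 'model', 'reward']

counts = {s: 0 for s in sources}
for _ in range(1000):
    counts[sources[rng.choice(3, p=lambdas)]] += 1
print(counts)  # the 'model' source dominates under these lambdas
```

Annealing then shifts probability mass between the three sources over training, per the `--delta_lambda_*` increments above.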
## Results ##
### Machine Translation
| Model | BLEU Score |
| -----------| -------|
| MLE | 26.44 ± 0.18 |
| Scheduled Sampling | 26.76 ± 0.17 |
| RAML | 27.22 ± 0.14 |
| Interpolation | 27.82 ± 0.11 |
### Text Summarization
| Model | Rouge-1 | Rouge-2 | Rouge-L |
| -----------| -------|-------|-------|
| MLE | 36.11 ± 0.21 | 16.39 ± 0.16 | 32.32 ± 0.19 |
| Scheduled Sampling | 36.59 ± 0.12 |16.79 ± 0.22|32.77 ± 0.17|
| RAML | 36.30 ± 0.24 | 16.69 ± 0.20 | 32.49 ± 0.17 |
| Interpolation | 36.72 ± 0.29 |16.99 ± 0.17 | 32.95 ± 0.33|
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/baseline_seq2seq_attn_main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Attentional Seq2seq.
Same as examples/seq2seq_attn, except that Rouge is also supported here.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
# pylint: disable=invalid-name, too-many-arguments, too-many-locals
from io import open
import importlib
import tensorflow as tf
import texar as tx
from rouge import Rouge
flags = tf.flags
flags.DEFINE_string("config_model", "configs.config_model", "The model config.")
flags.DEFINE_string("config_data", "configs.config_iwslt14",
"The dataset config.")
flags.DEFINE_string('output_dir', '.', 'where to keep training logs')
FLAGS = flags.FLAGS
config_model = importlib.import_module(FLAGS.config_model)
config_data = importlib.import_module(FLAGS.config_data)
if not FLAGS.output_dir.endswith('/'):
FLAGS.output_dir += '/'
log_dir = FLAGS.output_dir + 'training_log_baseline/'
tx.utils.maybe_create_dir(log_dir)
def build_model(batch, train_data):
"""Assembles the seq2seq model.
"""
source_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.source_vocab.size, hparams=config_model.embedder)
encoder = tx.modules.BidirectionalRNNEncoder(
hparams=config_model.encoder)
enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))
target_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.target_vocab.size, hparams=config_model.embedder)
decoder = tx.modules.AttentionRNNDecoder(
memory=tf.concat(enc_outputs, axis=2),
memory_sequence_length=batch['source_length'],
vocab_size=train_data.target_vocab.size,
hparams=config_model.decoder)
training_outputs, _, _ = decoder(
decoding_strategy='train_greedy',
inputs=target_embedder(batch['target_text_ids'][:, :-1]),
sequence_length=batch['target_length'] - 1)
train_op = tx.core.get_train_op(
tx.losses.sequence_sparse_softmax_cross_entropy(
labels=batch['target_text_ids'][:, 1:],
logits=training_outputs.logits,
sequence_length=batch['target_length'] - 1),
hparams=config_model.opt)
start_tokens = tf.ones_like(batch['target_length']) *\
train_data.target_vocab.bos_token_id
beam_search_outputs, _, _ = \
tx.modules.beam_search_decode(
decoder_or_cell=decoder,
embedding=target_embedder,
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
beam_width=config_model.beam_width,
max_decoding_length=60)
return train_op, beam_search_outputs
def print_stdout_and_file(content, file):
print(content)
print(content, file=file)
def main():
"""Entrypoint.
"""
train_data = tx.data.PairedTextData(hparams=config_data.train)
val_data = tx.data.PairedTextData(hparams=config_data.val)
test_data = tx.data.PairedTextData(hparams=config_data.test)
data_iterator = tx.data.TrainTestDataIterator(
train=train_data, val=val_data, test=test_data)
batch = data_iterator.get_next()
train_op, infer_outputs = build_model(batch, train_data)
def _train_epoch(sess, epoch_no):
data_iterator.switch_to_train_data(sess)
training_log_file = \
open(log_dir + 'training_log' + str(epoch_no) + '.txt', 'w',
encoding='utf-8')
step = 0
while True:
try:
loss = sess.run(train_op)
print("step={}, loss={:.4f}".format(step, loss),
file=training_log_file)
if step % config_data.observe_steps == 0:
print("step={}, loss={:.4f}".format(step, loss))
training_log_file.flush()
step += 1
except tf.errors.OutOfRangeError:
break
def _eval_epoch(sess, mode, epoch_no):
if mode == 'val':
data_iterator.switch_to_val_data(sess)
else:
data_iterator.switch_to_test_data(sess)
refs, hypos = [], []
while True:
try:
fetches = [
batch['target_text'][:, 1:],
infer_outputs.predicted_ids[:, :, 0]
]
feed_dict = {
tx.global_mode(): tf.estimator.ModeKeys.EVAL
}
target_texts_ori, output_ids = \
sess.run(fetches, feed_dict=feed_dict)
target_texts = tx.utils.strip_special_tokens(
target_texts_ori.tolist(), is_token_list=True)
target_texts = tx.utils.str_join(target_texts)
output_texts = tx.utils.map_ids_to_strs(
ids=output_ids, vocab=val_data.target_vocab)
tx.utils.write_paired_text(
target_texts, output_texts,
log_dir + mode + '_results' + str(epoch_no) + '.txt',
append=True, mode='h', sep=' ||| ')
for hypo, ref in zip(output_texts, target_texts):
if config_data.eval_metric == 'bleu':
hypos.append(hypo)
refs.append([ref])
elif config_data.eval_metric == 'rouge':
hypos.append(tx.utils.compat_as_text(hypo))
refs.append(tx.utils.compat_as_text(ref))
except tf.errors.OutOfRangeError:
break
if config_data.eval_metric == 'bleu':
return tx.evals.corpus_bleu_moses(
list_of_references=refs, hypotheses=hypos)
elif config_data.eval_metric == 'rouge':
rouge = Rouge()
return rouge.get_scores(hyps=hypos, refs=refs, avg=True)
def _calc_reward(score):
"""
Return the bleu score or the sum of (Rouge-1, Rouge-2, Rouge-L).
"""
if config_data.eval_metric == 'bleu':
return score
elif config_data.eval_metric == 'rouge':
return sum([value['f'] for key, value in score.items()])
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
best_val_score = -1.
scores_file = open(log_dir + 'scores.txt', 'w', encoding='utf-8')
for i in range(config_data.num_epochs):
_train_epoch(sess, i)
val_score = _eval_epoch(sess, 'val', i)
test_score = _eval_epoch(sess, 'test', i)
best_val_score = max(best_val_score, _calc_reward(val_score))
if config_data.eval_metric == 'bleu':
print_stdout_and_file(
'val epoch={}, BLEU={:.4f}; best-ever={:.4f}'.format(
i, val_score, best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch={}, BLEU={:.4f}'.format(i, test_score),
file=scores_file)
print_stdout_and_file('=' * 50, file=scores_file)
elif config_data.eval_metric == 'rouge':
print_stdout_and_file(
'valid epoch {}:'.format(i), file=scores_file)
for key, value in val_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('fsum: {}; best_val_fsum: {}'.format(
_calc_reward(val_score), best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch {}:'.format(i), file=scores_file)
for key, value in test_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('=' * 110, file=scores_file)
scores_file.flush()
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/configs/__init__.py
================================================
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/configs/config_giga.py
================================================
num_epochs = 30
observe_steps = 500
eval_metric = 'rouge'
batch_size = 64
source_vocab_file = './data/giga/vocab.article'
target_vocab_file = './data/giga/vocab.title'
train = {
'batch_size': batch_size,
'allow_smaller_final_batch': False,
'source_dataset': {
"files": 'data/giga/train.article',
'vocab_file': source_vocab_file
},
'target_dataset': {
'files': 'data/giga/train.title',
'vocab_file': target_vocab_file
}
}
val = {
'batch_size': batch_size,
'shuffle': False,
'allow_smaller_final_batch': True,
'source_dataset': {
"files": 'data/giga/valid.article',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/giga/valid.title',
'vocab_file': target_vocab_file,
}
}
test = {
'batch_size': batch_size,
'shuffle': False,
'allow_smaller_final_batch': True,
'source_dataset': {
"files": 'data/giga/test.article',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/giga/test.title',
'vocab_file': target_vocab_file,
}
}
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/configs/config_iwslt14.py
================================================
num_epochs = 50 # the best epoch occurs within 10 epochs in most cases
observe_steps = 500
eval_metric = 'bleu'
batch_size = 64
source_vocab_file = './data/iwslt14/vocab.de'
target_vocab_file = './data/iwslt14/vocab.en'
train = {
'batch_size': batch_size,
'shuffle': True,
'allow_smaller_final_batch': False,
'source_dataset': {
"files": 'data/iwslt14/train.de',
'vocab_file': source_vocab_file,
'max_seq_length': 50
},
'target_dataset': {
'files': 'data/iwslt14/train.en',
'vocab_file': target_vocab_file,
'max_seq_length': 50
}
}
val = {
'batch_size': batch_size,
'shuffle': False,
'allow_smaller_final_batch': True,
'source_dataset': {
"files": 'data/iwslt14/valid.de',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/iwslt14/valid.en',
'vocab_file': target_vocab_file,
}
}
test = {
'batch_size': batch_size,
'shuffle': False,
'allow_smaller_final_batch': True,
'source_dataset': {
"files": 'data/iwslt14/test.de',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/iwslt14/test.en',
'vocab_file': target_vocab_file,
}
}
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/configs/config_model.py
================================================
num_units = 256
beam_width = 5
decoder_layers = 1
dropout = 0.2
embedder = {
'dim': num_units
}
encoder = {
'rnn_cell_fw': {
'kwargs': {
'num_units': num_units
},
'dropout': {
'input_keep_prob': 1. - dropout
}
}
}
decoder = {
'rnn_cell': {
'kwargs': {
'num_units': num_units
},
'dropout': {
'input_keep_prob': 1. - dropout
},
'num_layers': decoder_layers
},
'attention': {
'kwargs': {
'num_units': num_units,
},
'attention_layer_size': num_units
}
}
opt = {
'optimizer': {
'type': 'AdamOptimizer',
'kwargs': {
'learning_rate': 0.001,
},
},
}
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/interpolation_decoder.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Interpolation Decoder is used for the interpolation algorithm,
which stores one more variable in 'state' recording the
decoded ids (state: [decoded_ids, rnn_state]).
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=no-name-in-module, too-many-arguments, too-many-locals
# pylint: disable=not-context-manager, protected-access, invalid-name
import tensorflow as tf
from texar.modules.decoders.rnn_decoders import \
AttentionRNNDecoder, AttentionRNNDecoderOutput
class InterpolationDecoder(AttentionRNNDecoder):
"""
Basically the same as AttentionRNNDecoder, except that 'state'
holds one more variable besides rnn_state, recording the
decoded ids (state: [decoded_ids, rnn_state]).
Args:
memory: The memory to query, e.g., the output of an RNN encoder. This
tensor should be shaped `[batch_size, max_time, dim]`.
memory_sequence_length (optional): A tensor of shape `[batch_size]`
containing the sequence lengths for the batch
entries in memory. If provided, the memory tensor rows are masked
with zeros for values past the respective sequence lengths.
cell (RNNCell, optional): An instance of `RNNCell`. If `None`, a cell
is created as specified in :attr:`hparams`.
cell_dropout_mode (optional): A Tensor taking value of
:tf_main:`tf.estimator.ModeKeys `, which
toggles dropout in the RNN cell (e.g., activates dropout in
TRAIN mode). If `None`, :func:`~texar.global_mode` is used.
Ignored if :attr:`cell` is given.
vocab_size (int, optional): Vocabulary size. Required if
:attr:`output_layer` is `None`.
output_layer (optional): An instance of
:tf_main:`tf.layers.Layer `, or
:tf_main:`tf.identity `. Apply to the RNN cell
output to get logits. If `None`, a dense layer
is used with output dimension set to :attr:`vocab_size`.
Set `output_layer=tf.identity` if you do not want to have an
output layer after the RNN cell outputs.
cell_input_fn (callable, optional): A callable that produces RNN cell
inputs. If `None` (default), the default is used:
`lambda inputs, attention: tf.concat([inputs, attention], -1)`,
which concatenates regular RNN cell inputs with attentions.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
"""
def __init__(self,
memory,
memory_sequence_length=None,
cell=None,
cell_dropout_mode=None,
vocab_size=None,
output_layer=None,
cell_input_fn=None,
hparams=None):
AttentionRNNDecoder.__init__(
self, memory, memory_sequence_length, cell, cell_dropout_mode,
vocab_size, output_layer, cell_input_fn, hparams)
def initialize(self, name=None):
init = AttentionRNNDecoder.initialize(self, name)
batch_size = tf.shape(init[0])[0]
# decoded_ids can be initialized as any arbitrary value
# because it will be assigned later in decoding
initial_decoded_ids = tf.ones((batch_size, 60), dtype=tf.int32)
initial_rnn_state = init[2]
initial_state = [initial_decoded_ids, initial_rnn_state]
init[2] = initial_state
return init
def step(self, time, inputs, state, name=None):
# Basically the same as in AttentionRNNDecoder, except that it
# handles the different form of 'state' ([decoded_ids, rnn_state])
wrapper_outputs, wrapper_state = self._cell(inputs, state[1])
decoded_ids = state[0]
logits = self._output_layer(wrapper_outputs)
sample_ids = self._helper.sample(
time=time, outputs=logits, state=[decoded_ids, wrapper_state])
(finished, next_inputs, next_state) = self._helper.next_inputs(
time=time,
outputs=logits,
state=[decoded_ids, wrapper_state],
sample_ids=sample_ids)
attention_scores = wrapper_state.alignments
attention_context = wrapper_state.attention
outputs = AttentionRNNDecoderOutput(
logits, sample_ids, wrapper_outputs,
attention_scores, attention_context)
return (outputs, next_state, next_inputs, finished)
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/interpolation_helper.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Helper for the interpolation algorithm.
Each new token is sampled from the model, the ground truth, or the reward
distribution, according to the lambdas.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
import numpy as np
from tensorflow.contrib.seq2seq import SampleEmbeddingHelper
from texar.evals.bleu import sentence_bleu
from rouge import Rouge
rouge = Rouge()
def calc_reward(refs, hypo, unk_id, metric):
"""
Calculate the reward for hypo given refs. Returns the BLEU score
if metric is 'bleu', or the sum of the (Rouge-1, Rouge-2, Rouge-L)
F-scores if metric is 'rouge'.
"""
if len(hypo) == 0 or len(refs[0]) == 0:
return 0.
for i in range(len(hypo)):
assert isinstance(hypo[i], int)
if hypo[i] == unk_id:
hypo[i] = -1
if metric == 'bleu':
return 0.01 * sentence_bleu(
references=refs, hypothesis=hypo, smooth=True)
else:
ref_str = ' '.join([str(word) for word in refs[0]])
hypo_str = ' '.join([str(word) for word in hypo])
rouge_scores = \
rouge.get_scores(hyps=[hypo_str], refs=[ref_str], avg=True)
return sum([value['f'] for key, value in rouge_scores.items()])
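One detail of calc_reward() worth highlighting: before scoring, it rewrites UNK tokens in the hypothesis to -1 so they can never match a reference token. A minimal pure-Python sketch of that behavior, with a toy unigram-precision metric standing in for BLEU/ROUGE (toy_reward is illustrative, not part of this repo):

```python
def toy_reward(refs, hypo, unk_id):
    # Empty hypothesis or empty first reference scores zero, as in calc_reward().
    if len(hypo) == 0 or len(refs[0]) == 0:
        return 0.0
    # UNK tokens are rewritten to -1 so they can never match a reference token.
    hypo = [-1 if tok == unk_id else tok for tok in hypo]
    # Toy stand-in metric: unigram precision against the first reference.
    ref_tokens = set(refs[0])
    return sum(1 for tok in hypo if tok in ref_tokens) / len(hypo)
```

With this masking, a hypothesis full of UNKs earns no credit even when the reference also contains UNK, which keeps the reward from favoring degenerate outputs.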
class InterpolationHelper(SampleEmbeddingHelper):
"""
Helper for the interpolation algorithm.
Each new token is sampled from the model, the ground truth, or the reward
distribution, according to the lambdas.
Args:
embedding: A callable that takes a vector tensor of `ids` (argmax ids),
or the `params` argument for `embedding_lookup`. The returned tensor
will be passed to the decoder input.
start_tokens: `int32` vector shaped `[batch_size]`, the start tokens.
end_token: `int32` scalar, the token that marks end of decoding.
vocab: texar.Vocab, the vocabularies of training set
reward_metric: 'bleu' or 'rouge', the metric of reward
ground_truth: the ground truth in training set
ground_truth_length: the length of ground truth sentences
lambdas: 'float32' vector shaped [3], which decides how the
next token is generated during training
"""
def __init__(self,
embedding,
start_tokens,
end_token,
vocab,
reward_metric,
ground_truth,
ground_truth_length,
lambdas):
SampleEmbeddingHelper.__init__(self, embedding, start_tokens, end_token)
self._vocab = vocab
self._ground_truth = ground_truth
self._lambdas = lambdas
self._ground_truth_length = ground_truth_length
self._metric = reward_metric
def sample(self, time, outputs, state, name=None):
"""
Sample tokens for the next step; note the special form
of 'state' ([decoded_ids, rnn_state])
"""
sample_method_sampler = \
tf.distributions.Categorical(probs=self._lambdas)
sample_method_id = sample_method_sampler.sample()
truth_feeding = lambda: tf.cond(
tf.less(time, tf.shape(self._ground_truth)[1]),
lambda: tf.to_int32(self._ground_truth[:, time]),
lambda: tf.ones_like(self._ground_truth[:, 0],
dtype=tf.int32) * self._vocab.eos_token_id)
self_feeding = lambda : SampleEmbeddingHelper.sample(
self, time, outputs, state, name)
reward_feeding = lambda : self._sample_by_reward(time, state)
sample_ids = tf.cond(
tf.logical_or(tf.equal(time, 0), tf.equal(sample_method_id, 1)),
truth_feeding,
lambda: tf.cond(
tf.equal(sample_method_id, 2),
reward_feeding,
self_feeding))
return sample_ids
def next_inputs(self, time, outputs, state, sample_ids, name=None):
"""
Note the special form of 'state' ([decoded_ids, rnn_state])
"""
finished, next_inputs, next_state = SampleEmbeddingHelper.next_inputs(
self, time, outputs, state[1], sample_ids, name)
next_state = [tf.concat(
[state[0][:, :time], tf.expand_dims(sample_ids, 1),
state[0][:, time + 1:]], axis=1), next_state]
next_state[0] = tf.reshape(next_state[0], (tf.shape(sample_ids)[0], 60))
return finished, next_inputs, next_state
def _sample_by_reward(self, time, state):
def _get_rewards(time, prefix_ids, target_ids, ground_truth_length):
batch_size = np.shape(target_ids)[0]
words_in_target = \
[np.unique(target_ids[i]) for i in range(batch_size)]
unk_id = self._vocab.unk_token_id
eos_id = self._vocab.eos_token_id
# before append
baseline_scores = []
baseline_ids = prefix_ids[:, :time]
for i in range(batch_size):
ref = target_ids[i].tolist()
if self._vocab.eos_token_id in ref:
ref = ref[:ref.index(self._vocab.eos_token_id)]
hypo = baseline_ids[i].tolist()
if self._vocab.eos_token_id in hypo:
hypo = hypo[:hypo.index(self._vocab.eos_token_id)]
baseline_scores.append(calc_reward(
refs=[ref], hypo=hypo, unk_id=unk_id,
metric=self._metric))
# append UNK
syn_ids = np.concatenate([
prefix_ids[:, :time],
np.ones((batch_size, 1), dtype=np.int32) * unk_id], axis=1)
reward_unk = []
for i in range(batch_size):
ref = target_ids[i].tolist()
if self._vocab.eos_token_id in ref:
ref = ref[:ref.index(self._vocab.eos_token_id)]
hypo = syn_ids[i].tolist()
if self._vocab.eos_token_id in hypo:
hypo = hypo[:hypo.index(self._vocab.eos_token_id)]
reward = calc_reward(refs=[ref], hypo=hypo, unk_id=unk_id,
metric=self._metric)
reward_unk.append(
np.ones((1, self._vocab.size), dtype=np.float32) *
reward - baseline_scores[i])
result = np.concatenate(reward_unk, axis=0)
# append tokens
for i in range(batch_size):
for id in words_in_target[i]:
if id == unk_id:
continue
syn_id = np.concatenate(
[prefix_ids[i:i + 1, :time], np.array([[id, ]])],
axis=1)
hypo = syn_id[0].tolist()
if self._vocab.eos_token_id in hypo:
hypo = hypo[:hypo.index(self._vocab.eos_token_id)]
ref = target_ids[i].tolist()
if self._vocab.eos_token_id in ref:
ref = ref[:ref.index(self._vocab.eos_token_id)]
dup = 1. if prefix_ids[i][time] == id and \
id != unk_id else 0.
eos = 1. if time < ground_truth_length[i] - 1 and \
id == eos_id else 0.
reward = calc_reward(
refs=[ref], hypo=hypo, unk_id=unk_id,
metric=self._metric)
result[i][id] = reward - baseline_scores[i] - dup - eos
return result
sampler = tf.distributions.Categorical(
logits=tf.py_func(_get_rewards, [
time, state[0], self._ground_truth,
self._ground_truth_length], tf.float32))
return tf.reshape(
sampler.sample(), (tf.shape(self._ground_truth)[0],))
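For intuition, the three-way choice in InterpolationHelper.sample() can be sketched without TensorFlow: a categorical draw over the lambdas picks the feeding method, and the ground truth is forced at the first step. A stdlib-only sketch under those assumptions (choose_feed is a hypothetical helper, not repo code):

```python
import random

def choose_feed(time, lambdas, rng=random):
    # lambdas = [p_self, p_truth, p_reward], matching the
    # Categorical(probs=lambdas) sampler in InterpolationHelper.sample(),
    # where index 1 selects ground-truth feeding.
    if time == 0:
        return 'truth'  # the first token is always fed from the ground truth
    return rng.choices(['self', 'truth', 'reward'], weights=lambdas)[0]
```

With the paper's initial lambdas of [0.04, 0.96, 0.0], almost every step feeds the ground truth early in training; annealing later moves mass to self and reward feeding.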
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/interpolation_main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Interpolation Algorithm.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
import importlib
from io import open
import tensorflow as tf
import texar as tx
import numpy as np
from interpolation_decoder import InterpolationDecoder
from interpolation_helper import InterpolationHelper
from rouge import Rouge
flags = tf.flags
flags.DEFINE_string("config_model", "configs.config_model", "The model config.")
flags.DEFINE_string("config_data", "configs.config_iwslt14",
"The dataset config.")
flags.DEFINE_string('lambdas_init', '[0.04,0.96,0.0]',
'initial value of lambdas')
flags.DEFINE_float('delta_lambda_reward', 0.06,
'increment of lambda_reward every annealing')
flags.DEFINE_float('delta_lambda_self', 0.06,
'decrement of lambda_self every annealing')
flags.DEFINE_integer('lambda_reward_steps', 4,
'times of increasing lambda_reward '
'after increasing lambda_self once')
flags.DEFINE_string('output_dir', '.', 'where to keep training logs')
FLAGS = flags.FLAGS
config_model = importlib.import_module(FLAGS.config_model)
config_data = importlib.import_module(FLAGS.config_data)
FLAGS.lambdas_init = eval(FLAGS.lambdas_init)
if not FLAGS.output_dir.endswith('/'):
FLAGS.output_dir += '/'
log_dir = FLAGS.output_dir + 'training_log_interpolation' +\
'_init' + '_' + str(FLAGS.lambdas_init[0]) +\
'_' + str(FLAGS.lambdas_init[1]) +\
'_' + str(FLAGS.lambdas_init[2]) +\
'_dr' + str(FLAGS.delta_lambda_reward) +\
'_ds' + str(FLAGS.delta_lambda_self) +\
'_rstep' + str(FLAGS.lambda_reward_steps) + '/'
tx.utils.maybe_create_dir(log_dir)
def build_model(batch, train_data, lambdas):
"""
This function is basically the same as build_model() in
baseline_seq2seq_attn_main.py, except that it uses
InterpolationDecoder and InterpolationHelper.
"""
batch_size = tf.shape(batch['target_length'])[0]
source_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.source_vocab.size, hparams=config_model.embedder)
encoder = tx.modules.BidirectionalRNNEncoder(
hparams=config_model.encoder)
enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))
target_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.target_vocab.size, hparams=config_model.embedder)
decoder = InterpolationDecoder(
memory=tf.concat(enc_outputs, axis=2),
memory_sequence_length=batch['source_length'],
vocab_size=train_data.target_vocab.size,
hparams=config_model.decoder)
start_tokens = tf.ones_like(
batch['target_length']) * train_data.target_vocab.bos_token_id
helper = InterpolationHelper(
embedding=target_embedder,
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
reward_metric=config_data.eval_metric,
vocab=train_data.target_vocab,
ground_truth=batch['target_text_ids'][:, 1:],
ground_truth_length=batch['target_length'] - 1,
lambdas=lambdas,)
training_outputs, _, training_length = decoder(
helper=helper,
initial_state=decoder.zero_state(
batch_size=batch_size, dtype=tf.float32),
max_decoding_length=60)
train_op = tx.core.get_train_op(
tx.losses.sequence_sparse_softmax_cross_entropy(
labels=training_outputs.sample_id,
logits=training_outputs.logits,
sequence_length=training_length),
hparams=config_model.opt)
beam_search_outputs, _, _ = \
tx.modules.beam_search_decode(
decoder_or_cell=decoder,
embedding=target_embedder,
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
beam_width=config_model.beam_width,
max_decoding_length=60)
return train_op, beam_search_outputs
def print_stdout_and_file(content, file):
print(content)
print(content, file=file)
def main():
"""Entrypoint.
"""
training_data = tx.data.PairedTextData(hparams=config_data.train)
val_data = tx.data.PairedTextData(hparams=config_data.val)
test_data = tx.data.PairedTextData(hparams=config_data.test)
data_iterator = tx.data.TrainTestDataIterator(
train=training_data, val=val_data, test=test_data)
batch = data_iterator.get_next()
lambdas_ts = tf.placeholder(shape=[3], dtype=tf.float32)
train_op, infer_outputs = build_model(batch, training_data, lambdas_ts)
def _train_epoch(sess, epoch, lambdas):
data_iterator.switch_to_train_data(sess)
log_file = open(log_dir + 'training_log' + str(epoch) + '.txt', 'w',
encoding='utf-8')
step = 0
while True:
try:
loss = sess.run(train_op, feed_dict={
lambdas_ts: np.array(lambdas)})
print("step={}, loss={:.4f}, lambdas={}".format(
step, loss, lambdas), file=log_file)
if step % config_data.observe_steps == 0:
print("step={}, loss={:.4f}, lambdas={}".format(
step, loss, lambdas))
log_file.flush()
step += 1
except tf.errors.OutOfRangeError:
break
def _eval_epoch(sess, mode, epoch_no):
"""
This function is the same as _eval_epoch() in
baseline_seq2seq_attn_main.py.
"""
if mode == 'val':
data_iterator.switch_to_val_data(sess)
else:
data_iterator.switch_to_test_data(sess)
refs, hypos = [], []
while True:
try:
fetches = [
batch['target_text'][:, 1:],
infer_outputs.predicted_ids[:, :, 0]
]
feed_dict = {
tx.global_mode(): tf.estimator.ModeKeys.EVAL
}
target_texts_ori, output_ids = \
sess.run(fetches, feed_dict=feed_dict)
target_texts = tx.utils.strip_special_tokens(
target_texts_ori.tolist(), is_token_list=True)
target_texts = tx.utils.str_join(target_texts)
output_texts = tx.utils.map_ids_to_strs(
ids=output_ids, vocab=val_data.target_vocab)
tx.utils.write_paired_text(
target_texts, output_texts,
log_dir + mode + '_results' + str(epoch_no) + '.txt',
append=True, mode='h', sep=' ||| ')
for hypo, ref in zip(output_texts, target_texts):
if config_data.eval_metric == 'bleu':
hypos.append(hypo)
refs.append([ref])
elif config_data.eval_metric == 'rouge':
hypos.append(tx.utils.compat_as_text(hypo))
refs.append(tx.utils.compat_as_text(ref))
except tf.errors.OutOfRangeError:
break
if config_data.eval_metric == 'bleu':
return tx.evals.corpus_bleu_moses(
list_of_references=refs, hypotheses=hypos)
elif config_data.eval_metric == 'rouge':
rouge = Rouge()
return rouge.get_scores(hyps=hypos, refs=refs, avg=True)
def _calc_reward(score):
"""
Return the bleu score or the sum of (Rouge-1, Rouge-2, Rouge-L).
"""
if config_data.eval_metric == 'bleu':
return score
elif config_data.eval_metric == 'rouge':
return sum([value['f'] for key, value in score.items()])
def _anneal():
"""
Adjust the lambdas when the reward on the validation set decreases.
"""
def _update_self():
"""
Decrease lambda_truth and increase lambda_self.
"""
lambdas[1] -= FLAGS.delta_lambda_self
lambdas[0] += FLAGS.delta_lambda_self
updates.append('self')
def _update_rew():
"""
Decrease lambda_truth and increase lambda_reward.
"""
lambdas[1] -= FLAGS.delta_lambda_reward
lambdas[2] += FLAGS.delta_lambda_reward
updates.append('rew')
if updates[-FLAGS.lambda_reward_steps:] == \
['rew'] * FLAGS.lambda_reward_steps:
_update_self()
else:
_update_rew()
saver = tf.train.Saver(max_to_keep=2)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
lambdas = FLAGS.lambdas_init
updates = ['rew'] * FLAGS.lambda_reward_steps
best_val_score, best_val_score_current_lambdas = -1., -1.
scores_file = open(log_dir + 'scores.txt', 'w', encoding='utf-8')
for i in range(config_data.num_epochs):
print_stdout_and_file(
'training epoch={}, lambdas={}'.format(i, lambdas),
file=scores_file)
_train_epoch(sess, i, lambdas)
saver.save(sess, log_dir + 'models/model{}.ckpt'.format(i))
val_score = _eval_epoch(sess, 'val', i)
test_score = _eval_epoch(sess, 'test', i)
if _calc_reward(val_score) < best_val_score_current_lambdas:
_anneal()
best_val_score_current_lambdas = -1.
saver.restore(
sess, log_dir + 'models/model{}.ckpt'.format(i - 1))
else:
best_val_score_current_lambdas = _calc_reward(val_score)
best_val_score = max(best_val_score, _calc_reward(val_score))
if config_data.eval_metric == 'bleu':
print_stdout_and_file(
'val epoch={}, BLEU={:.4f}; best-ever={:.4f}'.format(
i, val_score, best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch={}, BLEU={:.4f}'.format(i, test_score),
file=scores_file)
print_stdout_and_file('=' * 50, file=scores_file)
elif config_data.eval_metric == 'rouge':
print_stdout_and_file(
'valid epoch {}:'.format(i), file=scores_file)
for key, value in val_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('fsum: {}; best_val_fsum: {}'.format(
_calc_reward(val_score), best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch {}:'.format(i), file=scores_file)
for key, value in test_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('=' * 110, file=scores_file)
scores_file.flush()
if __name__ == '__main__':
main()
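The annealing logic in _anneal() above can be mirrored in a few lines of plain Python to see how probability mass migrates out of lambda_truth: after lambda_reward has been raised lambda_reward_steps times in a row, the next annealing step raises lambda_self instead. A sketch under those assumptions (anneal is a hypothetical standalone version of _anneal):

```python
def anneal(lambdas, updates, d_self=0.06, d_reward=0.06, reward_steps=4):
    # lambdas = [lambda_self, lambda_truth, lambda_reward]; `updates` records
    # past annealing moves and is seeded with ['rew'] * reward_steps in main().
    lambdas = list(lambdas)
    if updates[-reward_steps:] == ['rew'] * reward_steps:
        lambdas[1] -= d_self    # shift mass from ground-truth feeding...
        lambdas[0] += d_self    # ...to self feeding
        return lambdas, updates + ['self']
    lambdas[1] -= d_reward      # otherwise shift mass to reward feeding
    lambdas[2] += d_reward
    return lambdas, updates + ['rew']
```

Because `updates` starts as ['rew'] * reward_steps, the very first annealing call performs a 'self' update, and each subsequent 'self' update only happens after reward_steps consecutive 'rew' updates.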
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/raml_main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Attentional Seq2seq with RAML algorithm.
Read a pre-processed file containing the augmented samples and
corresponding rewards for every target sentence.
RAML Algorithm is described in https://arxiv.org/pdf/1705.07136.pdf
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
from io import open
import importlib
import tensorflow as tf
import texar as tx
import numpy as np
import random
from rouge import Rouge
flags = tf.flags
flags.DEFINE_string("config_model", "configs.config_model", "The model config.")
flags.DEFINE_string("config_data", "configs.config_iwslt14",
"The dataset config.")
flags.DEFINE_string('raml_file', 'data/iwslt14/samples_iwslt14.txt',
'the samples and rewards described in RAML')
flags.DEFINE_integer('n_samples', 10,
'number of samples for every target sentence')
flags.DEFINE_float('tau', 0.4, 'the temperature in RAML algorithm')
flags.DEFINE_string('output_dir', '.', 'where to keep training logs')
FLAGS = flags.FLAGS
config_model = importlib.import_module(FLAGS.config_model)
config_data = importlib.import_module(FLAGS.config_data)
if not FLAGS.output_dir.endswith('/'):
FLAGS.output_dir += '/'
log_dir = FLAGS.output_dir + 'training_log_raml' +\
'_' + str(FLAGS.n_samples) + 'samples' +\
'_tau' + str(FLAGS.tau) + '/'
tx.utils.maybe_create_dir(log_dir)
def read_raml_sample_file():
raml_file = open(FLAGS.raml_file, encoding='utf-8')
train_data = []
sample_num = -1
for line in raml_file.readlines():
line = line[:-1]
if line.startswith('***'):
continue
elif line.endswith('samples'):
sample_num = eval(line.split()[0])
assert sample_num == 1 or sample_num == FLAGS.n_samples
elif line.startswith('source:'):
train_data.append({'source': line[7:], 'targets': []})
else:
train_data[-1]['targets'].append(line.split('|||'))
if sample_num == 1:
for i in range(FLAGS.n_samples - 1):
train_data[-1]['targets'].append(line.split('|||'))
return train_data
def raml_loss(batch, output, training_rewards):
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=batch['target_text_ids'][:, 1:],
logits=output.logits,
sequence_length=batch['target_length'] - 1,
average_across_batch=False)
return tf.reduce_sum(mle_loss * training_rewards) /\
tf.reduce_sum(training_rewards)
def build_model(batch, train_data, rewards):
"""
Assembles the seq2seq model.
Code in this function is basically the same as build_model() in
baseline_seq2seq_attn_main.py, except for the normalization in loss_fn.
"""
source_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.source_vocab.size, hparams=config_model.embedder)
encoder = tx.modules.BidirectionalRNNEncoder(
hparams=config_model.encoder)
enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))
target_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.target_vocab.size, hparams=config_model.embedder)
decoder = tx.modules.AttentionRNNDecoder(
memory=tf.concat(enc_outputs, axis=2),
memory_sequence_length=batch['source_length'],
vocab_size=train_data.target_vocab.size,
hparams=config_model.decoder)
training_outputs, _, _ = decoder(
decoding_strategy='train_greedy',
inputs=target_embedder(batch['target_text_ids'][:, :-1]),
sequence_length=batch['target_length'] - 1)
train_op = tx.core.get_train_op(
raml_loss(batch, training_outputs, rewards),
hparams=config_model.opt)
start_tokens = tf.ones_like(batch['target_length']) *\
train_data.target_vocab.bos_token_id
beam_search_outputs, _, _ = \
tx.modules.beam_search_decode(
decoder_or_cell=decoder,
embedding=target_embedder,
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
beam_width=config_model.beam_width,
max_decoding_length=60)
return train_op, beam_search_outputs
def print_stdout_and_file(content, file):
print(content)
print(content, file=file)
def main():
"""Entrypoint.
"""
config_data.train['batch_size'] *= FLAGS.n_samples
config_data.val['batch_size'] *= FLAGS.n_samples
config_data.test['batch_size'] *= FLAGS.n_samples
train_data = tx.data.PairedTextData(hparams=config_data.train)
val_data = tx.data.PairedTextData(hparams=config_data.val)
test_data = tx.data.PairedTextData(hparams=config_data.test)
data_iterator = tx.data.TrainTestDataIterator(
train=train_data, val=val_data, test=test_data)
batch = data_iterator.get_next()
rewards_ts = tf.placeholder(
dtype=tf.float32, shape=[None, ], name='training_rewards')
train_op, infer_outputs = build_model(batch, train_data, rewards_ts)
raml_train_data = read_raml_sample_file()
def _train_epoch(sess, epoch_no):
data_iterator.switch_to_train_data(sess)
training_log_file = \
open(log_dir + 'training_log' + str(epoch_no) + '.txt', 'w',
encoding='utf-8')
step = 0
source_buffer, target_buffer = [], []
random.shuffle(raml_train_data)
for training_pair in raml_train_data:
for target in training_pair['targets']:
source_buffer.append(training_pair['source'])
target_buffer.append(target)
if len(target_buffer) != train_data.batch_size:
continue
source_ids = []
source_length = []
target_ids = []
target_length = []
scores = []
trunc_len_src = train_data.hparams.source_dataset.max_seq_length
trunc_len_tgt = train_data.hparams.target_dataset.max_seq_length
for sentence in source_buffer:
ids = [train_data.source_vocab.token_to_id_map_py[token]
for token in sentence.split()][:trunc_len_src]
ids = ids + [train_data.source_vocab.eos_token_id]
source_ids.append(ids)
source_length.append(len(ids))
for sentence, score_str in target_buffer:
ids = [train_data.target_vocab.bos_token_id]
ids = ids + [train_data.target_vocab.token_to_id_map_py[token]
for token in sentence.split()][:trunc_len_tgt]
ids = ids + [train_data.target_vocab.eos_token_id]
target_ids.append(ids)
scores.append(eval(score_str))
target_length.append(len(ids))
rewards = []
for i in range(0, train_data.batch_size, FLAGS.n_samples):
tmp = np.array(scores[i:i + FLAGS.n_samples])
tmp = np.exp(tmp / FLAGS.tau) / np.sum(np.exp(tmp / FLAGS.tau))
for j in range(0, FLAGS.n_samples):
rewards.append(tmp[j])
for value in source_ids:
while len(value) < max(source_length):
value.append(0)
for value in target_ids:
while len(value) < max(target_length):
value.append(0)
feed_dict = {
batch['source_text_ids']: np.array(source_ids),
batch['target_text_ids']: np.array(target_ids),
batch['source_length']: np.array(source_length),
batch['target_length']: np.array(target_length),
rewards_ts: np.array(rewards)
}
source_buffer = []
target_buffer = []
loss = sess.run(train_op, feed_dict=feed_dict)
print("step={}, loss={:.4f}".format(step, loss),
file=training_log_file)
if step % config_data.observe_steps == 0:
print("step={}, loss={:.4f}".format(step, loss))
training_log_file.flush()
step += 1
# code below this line is exactly the same as baseline_seq2seq_attn_main.py
def _eval_epoch(sess, mode, epoch_no):
if mode == 'val':
data_iterator.switch_to_val_data(sess)
else:
data_iterator.switch_to_test_data(sess)
refs, hypos = [], []
while True:
try:
fetches = [
batch['target_text'][:, 1:],
infer_outputs.predicted_ids[:, :, 0]
]
feed_dict = {
tx.global_mode(): tf.estimator.ModeKeys.EVAL
}
target_texts_ori, output_ids = \
sess.run(fetches, feed_dict=feed_dict)
target_texts = tx.utils.strip_special_tokens(
target_texts_ori.tolist(), is_token_list=True)
target_texts = tx.utils.str_join(target_texts)
output_texts = tx.utils.map_ids_to_strs(
ids=output_ids, vocab=val_data.target_vocab)
tx.utils.write_paired_text(
target_texts, output_texts,
log_dir + mode + '_results' + str(epoch_no) + '.txt',
append=True, mode='h', sep=' ||| ')
for hypo, ref in zip(output_texts, target_texts):
if config_data.eval_metric == 'bleu':
hypos.append(hypo)
refs.append([ref])
elif config_data.eval_metric == 'rouge':
hypos.append(tx.utils.compat_as_text(hypo))
refs.append(tx.utils.compat_as_text(ref))
except tf.errors.OutOfRangeError:
break
if config_data.eval_metric == 'bleu':
return tx.evals.corpus_bleu_moses(
list_of_references=refs, hypotheses=hypos)
elif config_data.eval_metric == 'rouge':
rouge = Rouge()
return rouge.get_scores(hyps=hypos, refs=refs, avg=True)
def _calc_reward(score):
"""
Return the bleu score or the sum of (Rouge-1, Rouge-2, Rouge-L).
"""
if config_data.eval_metric == 'bleu':
return score
elif config_data.eval_metric == 'rouge':
return sum([value['f'] for key, value in score.items()])
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
best_val_score = -1.
scores_file = open(log_dir + 'scores.txt', 'w', encoding='utf-8')
for i in range(config_data.num_epochs):
_train_epoch(sess, i)
val_score = _eval_epoch(sess, 'val', i)
test_score = _eval_epoch(sess, 'test', i)
best_val_score = max(best_val_score, _calc_reward(val_score))
if config_data.eval_metric == 'bleu':
print_stdout_and_file(
'val epoch={}, BLEU={:.4f}; best-ever={:.4f}'.format(
i, val_score, best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch={}, BLEU={:.4f}'.format(i, test_score),
file=scores_file)
print_stdout_and_file('=' * 50, file=scores_file)
elif config_data.eval_metric == 'rouge':
print_stdout_and_file(
'valid epoch {}:'.format(i), file=scores_file)
for key, value in val_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('fsum: {}; best_val_fsum: {}'.format(
_calc_reward(val_score), best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch {}:'.format(i), file=scores_file)
for key, value in test_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('=' * 110, file=scores_file)
scores_file.flush()
if __name__ == '__main__':
main()
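The reward normalization inside _train_epoch() above is a per-group softmax with temperature tau over each source sentence's n_samples augmented-target rewards. A standalone sketch of that step (raml_weights is illustrative, not part of the repo):

```python
import math

def raml_weights(scores, tau=0.4):
    # w_j = exp(r_j / tau) / sum_k exp(r_k / tau), as computed per group of
    # FLAGS.n_samples augmented targets in _train_epoch().
    exps = [math.exp(s / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower tau sharpens the distribution toward the highest-reward sample; equal rewards yield uniform weights, reducing RAML to ordinary MLE over the samples.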
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/requirements.txt
================================================
rouge==0.2.1
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/scheduled_sampling_main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Attentional Seq2seq using the Scheduled Sampling algorithm.
This code is basically the same as baseline_seq2seq_attn_main.py,
except using ScheduledEmbeddingTrainingHelper.
Scheduled Sampling Algorithm is described in https://arxiv.org/abs/1506.03099
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
# pylint: disable=invalid-name, too-many-arguments, too-many-locals
from io import open
import math
import importlib
import tensorflow as tf
import texar as tx
from rouge import Rouge
flags = tf.flags
flags.DEFINE_string("config_model", "configs.config_model", "The model config.")
flags.DEFINE_string("config_data", "configs.config_iwslt14",
"The dataset config.")
flags.DEFINE_float('decay_factor', 500.,
'The hyperparameter controlling the speed of increasing '
'the probability of sampling from the model')
flags.DEFINE_string('output_dir', '.', 'where to keep training logs')
FLAGS = flags.FLAGS
config_model = importlib.import_module(FLAGS.config_model)
config_data = importlib.import_module(FLAGS.config_data)
if not FLAGS.output_dir.endswith('/'):
FLAGS.output_dir += '/'
log_dir = FLAGS.output_dir + 'training_log_scheduled_sampling' +\
'_decayf' + str(FLAGS.decay_factor) + '/'
tx.utils.maybe_create_dir(log_dir)
def inverse_sigmoid(i):
return FLAGS.decay_factor / (
FLAGS.decay_factor + math.exp(i / FLAGS.decay_factor))
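inverse_sigmoid() above gives the probability of feeding the ground-truth token; the training loop uses 1 - inverse_sigmoid(step) as the model-sampling probability, so sampling from the model becomes increasingly likely as training proceeds. A quick stdlib check of that schedule (self_sampling_probability is a hypothetical wrapper, not repo code):

```python
import math

def self_sampling_probability(step, decay_factor=500.0):
    # Mirrors `1. - inverse_sigmoid(total_step_counter)` in _train_epoch():
    # starts near 1 / (decay_factor + 1) and rises monotonically toward 1.
    return 1.0 - decay_factor / (decay_factor + math.exp(step / decay_factor))
```

A larger decay_factor flattens the curve, keeping the model on ground-truth inputs for more steps before exposure to its own samples dominates.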
def build_model(batch, train_data, self_sampling_proba):
"""
Assembles the seq2seq model.
It is the same as build_model() in baseline_seq2seq_attn_main.py,
except that it uses ScheduledEmbeddingTrainingHelper.
"""
source_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.source_vocab.size, hparams=config_model.embedder)
encoder = tx.modules.BidirectionalRNNEncoder(
hparams=config_model.encoder)
enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))
target_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.target_vocab.size, hparams=config_model.embedder)
decoder = tx.modules.AttentionRNNDecoder(
memory=tf.concat(enc_outputs, axis=2),
memory_sequence_length=batch['source_length'],
vocab_size=train_data.target_vocab.size,
hparams=config_model.decoder)
helper = tx.modules.get_helper(
helper_type='ScheduledEmbeddingTrainingHelper',
inputs=target_embedder(batch['target_text_ids'][:, :-1]),
sequence_length=batch['target_length'] - 1,
embedding=target_embedder,
sampling_probability=self_sampling_proba)
training_outputs, _, _ = decoder(
helper=helper, initial_state=decoder.zero_state(
batch_size=tf.shape(batch['target_length'])[0], dtype=tf.float32))
train_op = tx.core.get_train_op(
tx.losses.sequence_sparse_softmax_cross_entropy(
labels=batch['target_text_ids'][:, 1:],
logits=training_outputs.logits,
sequence_length=batch['target_length'] - 1),
hparams=config_model.opt)
start_tokens = tf.ones_like(batch['target_length']) *\
train_data.target_vocab.bos_token_id
beam_search_outputs, _, _ = \
tx.modules.beam_search_decode(
decoder_or_cell=decoder,
embedding=target_embedder,
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
beam_width=config_model.beam_width,
max_decoding_length=60)
return train_op, beam_search_outputs
def print_stdout_and_file(content, file):
print(content)
print(content, file=file)
def main():
"""Entrypoint.
"""
train_data = tx.data.PairedTextData(hparams=config_data.train)
val_data = tx.data.PairedTextData(hparams=config_data.val)
test_data = tx.data.PairedTextData(hparams=config_data.test)
data_iterator = tx.data.TrainTestDataIterator(
train=train_data, val=val_data, test=test_data)
batch = data_iterator.get_next()
self_sampling_proba = tf.placeholder(shape=[], dtype=tf.float32)
train_op, infer_outputs = \
build_model(batch, train_data, self_sampling_proba)
def _train_epoch(sess, epoch_no, total_step_counter):
data_iterator.switch_to_train_data(sess)
training_log_file = \
open(log_dir + 'training_log' + str(epoch_no) + '.txt', 'w',
encoding='utf-8')
step = 0
while True:
try:
sampling_proba_ = 1. - inverse_sigmoid(total_step_counter)
loss = sess.run(train_op, feed_dict={
self_sampling_proba: sampling_proba_})
print("step={}, loss={:.4f}, self_proba={}".format(
step, loss, sampling_proba_), file=training_log_file)
if step % config_data.observe_steps == 0:
print("step={}, loss={:.4f}, self_proba={}".format(
step, loss, sampling_proba_))
training_log_file.flush()
step += 1
total_step_counter += 1
except tf.errors.OutOfRangeError:
break
# code below this line is exactly the same as baseline_seq2seq_attn_main.py
def _eval_epoch(sess, mode, epoch_no):
if mode == 'val':
data_iterator.switch_to_val_data(sess)
else:
data_iterator.switch_to_test_data(sess)
refs, hypos = [], []
while True:
try:
fetches = [
batch['target_text'][:, 1:],
infer_outputs.predicted_ids[:, :, 0]
]
feed_dict = {
tx.global_mode(): tf.estimator.ModeKeys.EVAL
}
target_texts_ori, output_ids = \
sess.run(fetches, feed_dict=feed_dict)
target_texts = tx.utils.strip_special_tokens(
target_texts_ori.tolist(), is_token_list=True)
target_texts = tx.utils.str_join(target_texts)
output_texts = tx.utils.map_ids_to_strs(
ids=output_ids, vocab=val_data.target_vocab)
tx.utils.write_paired_text(
target_texts, output_texts,
log_dir + mode + '_results' + str(epoch_no) + '.txt',
append=True, mode='h', sep=' ||| ')
for hypo, ref in zip(output_texts, target_texts):
if config_data.eval_metric == 'bleu':
hypos.append(hypo)
refs.append([ref])
elif config_data.eval_metric == 'rouge':
hypos.append(tx.utils.compat_as_text(hypo))
refs.append(tx.utils.compat_as_text(ref))
except tf.errors.OutOfRangeError:
break
if config_data.eval_metric == 'bleu':
return tx.evals.corpus_bleu_moses(
list_of_references=refs, hypotheses=hypos)
elif config_data.eval_metric == 'rouge':
rouge = Rouge()
return rouge.get_scores(hyps=hypos, refs=refs, avg=True)
def _calc_reward(score):
"""
Return the bleu score or the sum of (Rouge-1, Rouge-2, Rouge-L).
"""
if config_data.eval_metric == 'bleu':
return score
elif config_data.eval_metric == 'rouge':
return sum([value['f'] for key, value in score.items()])
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
best_val_score = -1.
total_step_counter = 1
scores_file = open(log_dir + 'scores.txt', 'w', encoding='utf-8')
for i in range(config_data.num_epochs):
_train_epoch(sess, i, total_step_counter)
val_score = _eval_epoch(sess, 'val', i)
test_score = _eval_epoch(sess, 'test', i)
best_val_score = max(best_val_score, _calc_reward(val_score))
if config_data.eval_metric == 'bleu':
print_stdout_and_file(
'val epoch={}, BLEU={:.4f}; best-ever={:.4f}'.format(
i, val_score, best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch={}, BLEU={:.4f}'.format(i, test_score),
file=scores_file)
print_stdout_and_file('=' * 50, file=scores_file)
elif config_data.eval_metric == 'rouge':
print_stdout_and_file(
'valid epoch {}:'.format(i), file=scores_file)
for key, value in val_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('fsum: {}; best_val_fsum: {}'.format(
_calc_reward(val_score), best_val_score), file=scores_file)
print_stdout_and_file(
'test epoch {}:'.format(i), file=scores_file)
for key, value in test_score.items():
print_stdout_and_file(
'{}: {}'.format(key, value), file=scores_file)
print_stdout_and_file('=' * 110, file=scores_file)
scores_file.flush()
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/utils/prepare_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Downloads data.
"""
import tensorflow as tf
import texar as tx
# pylint: disable=invalid-name
flags = tf.flags
flags.DEFINE_string("data", "iwslt14", "Data to download [giga|iwslt14]")
FLAGS = flags.FLAGS
def prepare_data():
"""Downloads data.
"""
if FLAGS.data == 'giga':
tx.data.maybe_download(
urls='https://drive.google.com/file/d/'
'12RZs7QFwjj6dfuYNQ_0Ah-ccH1xFDMD5/view?usp=sharing',
path='./',
filenames='giga.zip',
extract=True)
elif FLAGS.data == 'iwslt14':
tx.data.maybe_download(
urls='https://drive.google.com/file/d/'
'1y4mUWXRS2KstgHopCS9koZ42ENOh6Yb9/view?usp=sharing',
path='./',
filenames='iwslt14.zip',
extract=True)
else:
raise ValueError('Unknown data: {}'.format(FLAGS.data))
def main():
"""Entrypoint.
"""
prepare_data()
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/utils/raml_samples_generation/README.md
================================================
## Augmented Data Generation for RAML Algorithm
The code here is mainly copied from [pcyin's github](https://github.com/pcyin/pytorch_nmt), with slight changes to support ```rouge``` as the reward. Note that generated samples are also provided in the datasets that you can download.
You may tune hyperparameters in ```gen_samples_giga.sh``` or ```gen_samples_iwslt14.sh``` and use commands like ```bash gen_samples_giga.sh``` to start your generation.
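The core augmentation that ```process_samples.py``` performs in ```sample_ngram``` mode, replacing a random n-gram of the target sentence with words drawn from the vocabulary, can be sketched in a few lines of plain Python. The sentence, vocabulary, and seed below are illustrative only:

```python
import random

def sample_ngram_replacement(tgt_sent, vocab, max_ngram_size=4, rng=None):
    """Return a copy of `tgt_sent` with one random n-gram replaced by
    words drawn uniformly from `vocab` (mirrors the sample_ngram mode of
    process_samples.py; special tokens and reward ranking omitted)."""
    rng = rng or random.Random()
    tgt_len = len(tgt_sent)
    # choose n-gram length in [1, min(tgt_len - 1, max_ngram_size)]
    n = rng.randint(1, min(tgt_len - 1, max_ngram_size))
    idx = rng.randrange(tgt_len - n)  # left edge of the n-gram to replace
    new_ngram = [rng.choice(vocab) for _ in range(n)]
    sampled = list(tgt_sent)
    sampled[idx:idx + n] = new_ngram
    return sampled

rng = random.Random(0)
sent = "the cat sat on the mat".split()
vocab = ["dog", "ran", "under", "table", "big"]
sample = sample_ngram_replacement(sent, vocab, rng=rng)
print(sample)  # same length as `sent`, with one n-gram swapped out
```

In the real script each such sample is then scored against the original sentence with BLEU or ROUGE and the samples are written out ranked by reward.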
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/utils/raml_samples_generation/gen_samples_giga.sh
================================================
#!/bin/sh
train_src="../../data/giga/train.article"
train_tgt="../../data/giga/train.title"
python vocab.py \
--src_vocab_size 30424 \
--tgt_vocab_size 23738 \
--train_src ${train_src} \
--train_tgt ${train_tgt} \
--include_singleton \
--output giga_vocab.bin
python process_samples.py \
--mode sample_ngram \
--vocab giga_vocab.bin \
--src ${train_src} \
--tgt ${train_tgt} \
--sample_size 10 \
--reward rouge \
--output samples_giga.txt
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/utils/raml_samples_generation/gen_samples_iwslt14.sh
================================================
#!/bin/sh
train_src="../../data/iwslt14/train.de"
train_tgt="../../data/iwslt14/train.en"
python vocab.py \
--src_vocab_size 32007 \
--tgt_vocab_size 22820 \
--train_src ${train_src} \
--train_tgt ${train_tgt} \
--include_singleton \
--output iwslt14_vocab.bin
python process_samples.py \
--mode sample_ngram \
--vocab iwslt14_vocab.bin \
--src ${train_src} \
--tgt ${train_tgt} \
--sample_size 10 \
--reward bleu \
--output samples_iwslt14.txt
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/utils/raml_samples_generation/process_samples.py
================================================
from __future__ import print_function
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
import sys
import re
import argparse
import torch
from util import read_corpus
import numpy as np
from scipy.special import comb  # scipy.misc.comb was removed in SciPy 1.3
from vocab import Vocab, VocabEntry
import math
from rouge import Rouge
def is_valid_sample(sent):
tokens = sent.split(' ')
return len(tokens) >= 1 and len(tokens) < 50
def sample_from_model(args):
para_data = args.parallel_data
sample_file = args.sample_file
output = args.output
tgt_sent_pattern = re.compile(r'^\[(\d+)\] (.*?)$')
para_data = [l.strip().split(' ||| ') for l in open(para_data)]
f_out = open(output, 'w')
f = open(sample_file)
f.readline()
for src_sent, tgt_sent in para_data:
line = f.readline().strip()
assert line.startswith('****')
line = f.readline().strip()
print(line)
assert line.startswith('target:')
tgt_sent2 = line[len('target:'):]
assert tgt_sent == tgt_sent2
line = f.readline().strip() # samples
tgt_sent = ' '.join(tgt_sent.split(' ')[1:-1])
tgt_samples = set()
for i in range(1, 101):
line = f.readline().rstrip('\n')
m = tgt_sent_pattern.match(line)
assert m, line
assert int(m.group(1)) == i
sampled_tgt_sent = m.group(2).strip()
if is_valid_sample(sampled_tgt_sent):
tgt_samples.add(sampled_tgt_sent)
line = f.readline().strip()
assert line.startswith('****')
tgt_samples.add(tgt_sent)
tgt_samples = list(tgt_samples)
assert len(tgt_samples) > 0
tgt_ref_tokens = tgt_sent.split(' ')
bleu_scores = []
for tgt_sample in tgt_samples:
bleu_score = sentence_bleu([tgt_ref_tokens], tgt_sample.split(' '))
bleu_scores.append(bleu_score)
tgt_ranks = sorted(range(len(tgt_samples)), key=lambda i: bleu_scores[i], reverse=True)
print('%d samples' % len(tgt_samples))
print('*' * 50, file=f_out)
print('source: ' + src_sent, file=f_out)
print('%d samples' % len(tgt_samples), file=f_out)
for i in tgt_ranks:
print('%s ||| %f' % (tgt_samples[i], bleu_scores[i]), file=f_out)
print('*' * 50, file=f_out)
f_out.close()
def get_new_ngram(ngram, n, vocab):
"""
replace ngram `ngram` with a newly sampled ngram of the same length
"""
new_ngram_wids = [np.random.randint(3, len(vocab)) for i in range(n)]
new_ngram = [vocab.id2word[wid] for wid in new_ngram_wids]
return new_ngram
def sample_ngram(args):
src_sents = read_corpus(args.src, 'src')
tgt_sents = read_corpus(args.tgt, 'src')  # do not read in <s> and </s>
f_out = open(args.output, 'w')
vocab = torch.load(args.vocab)
tgt_vocab = vocab.tgt
smooth_bleu = args.smooth_bleu
sm_func = None
if smooth_bleu:
sm_func = SmoothingFunction().method3
for src_sent, tgt_sent in zip(src_sents, tgt_sents):
src_sent = ' '.join(src_sent)
tgt_len = len(tgt_sent)
tgt_samples = []
tgt_samples_distort_rates = [] # how many unigrams are replaced
# generate 100 samples
# append itself
tgt_samples.append(tgt_sent)
tgt_samples_distort_rates.append(0)
for sid in range(args.sample_size - 1):
n = np.random.randint(1, min(tgt_len, args.max_ngram_size + 1)) # we do not replace the last token: it must be a period!
idx = np.random.randint(tgt_len - n)
ngram = tgt_sent[idx: idx+n]
new_ngram = get_new_ngram(ngram, n, tgt_vocab)
sampled_tgt_sent = list(tgt_sent)
sampled_tgt_sent[idx: idx+n] = new_ngram
# compute the probability of this sample
# prob = 1. / args.max_ngram_size * 1. / (tgt_len - 1 + n) * 1 / (len(tgt_vocab) ** n)
tgt_samples.append(sampled_tgt_sent)
tgt_samples_distort_rates.append(n)
# compute bleu scores or edit distances and rank the samples by bleu scores
rewards = []
for tgt_sample, tgt_sample_distort_rate in zip(tgt_samples, tgt_samples_distort_rates):
if args.reward == 'bleu':
reward = sentence_bleu([tgt_sent], tgt_sample, smoothing_function=sm_func)
elif args.reward == 'rouge':
rouge = Rouge()
scores = rouge.get_scores(hyps=[' '.join(tgt_sample).decode('utf-8')], refs=[' '.join(tgt_sent).decode('utf-8')], avg=True)
reward = sum([value['f'] for key, value in scores.items()])
else:
reward = -tgt_sample_distort_rate
rewards.append(reward)
tgt_ranks = sorted(range(len(tgt_samples)), key=lambda i: rewards[i], reverse=True)
# convert list of tokens into a string
tgt_samples = [' '.join(tgt_sample) for tgt_sample in tgt_samples]
print('*' * 50, file=f_out)
print('source: ' + src_sent, file=f_out)
print('%d samples' % len(tgt_samples), file=f_out)
for i in tgt_ranks:
print('%s ||| %f' % (tgt_samples[i], rewards[i]), file=f_out)
print('*' * 50, file=f_out)
f_out.close()
def sample_ngram_adapt(args):
src_sents = read_corpus(args.src, 'src')
tgt_sents = read_corpus(args.tgt, 'src')  # do not read in <s> and </s>
f_out = open(args.output, 'w')
vocab = torch.load(args.vocab)
tgt_vocab = vocab.tgt
max_len = max([len(tgt_sent) for tgt_sent in tgt_sents]) + 1
for src_sent, tgt_sent in zip(src_sents, tgt_sents):
src_sent = ' '.join(src_sent)
tgt_len = len(tgt_sent)
tgt_samples = []
# generate 100 samples
# append itself
tgt_samples.append(tgt_sent)
for sid in range(args.sample_size - 1):
max_n = min(tgt_len - 1, 4)
bias_n = int(max_n * tgt_len / max_len) + 1
assert 1 <= bias_n <= 4, 'bias_n={}, not in [1,4], max_n={}, tgt_len={}, max_len={}'.format(bias_n, max_n, tgt_len, max_len)
p = [1.0/(max_n + 5)] * max_n
p[bias_n - 1] = 1 - p[0] * (max_n - 1)
assert abs(sum(p) - 1) < 1e-10, 'sum(p) != 1'
n = np.random.choice(np.arange(1, int(max_n + 1)), p=p) # we do not replace the last token: it must be a period!
assert n < tgt_len, 'n={}, tgt_len={}'.format(n, tgt_len)
idx = np.random.randint(tgt_len - n)
ngram = tgt_sent[idx: idx+n]
new_ngram = get_new_ngram(ngram, n, tgt_vocab)
sampled_tgt_sent = list(tgt_sent)
sampled_tgt_sent[idx: idx+n] = new_ngram
tgt_samples.append(sampled_tgt_sent)
# compute bleu scores and rank the samples by bleu scores
bleu_scores = []
for tgt_sample in tgt_samples:
bleu_score = sentence_bleu([tgt_sent], tgt_sample)
bleu_scores.append(bleu_score)
tgt_ranks = sorted(range(len(tgt_samples)), key=lambda i: bleu_scores[i], reverse=True)
# convert list of tokens into a string
tgt_samples = [' '.join(tgt_sample) for tgt_sample in tgt_samples]
print('*' * 50, file=f_out)
print('source: ' + src_sent, file=f_out)
print('%d samples' % len(tgt_samples), file=f_out)
for i in tgt_ranks:
print('%s ||| %f' % (tgt_samples[i], bleu_scores[i]), file=f_out)
print('*' * 50, file=f_out)
f_out.close()
def sample_from_hamming_distance_payoff_distribution(args):
src_sents = read_corpus(args.src, 'src')
tgt_sents = read_corpus(args.tgt, 'src')  # do not read in <s> and </s>
f_out = open(args.output, 'w')
vocab = torch.load(args.vocab)
tgt_vocab = vocab.tgt
payoff_prob, Z_qs = generate_hamming_distance_payoff_distribution(max(len(sent) for sent in tgt_sents),
vocab_size=len(vocab.tgt),
tau=args.temp)
for src_sent, tgt_sent in zip(src_sents, tgt_sents):
tgt_samples = [] # make sure the ground truth y* is in the samples
tgt_sent_len = len(tgt_sent) - 3  # remove <s> and </s> and the ending period .
tgt_ref_tokens = tgt_sent[1:-1]
bleu_scores = []
# sample an edit distances
e_samples = np.random.choice(range(tgt_sent_len + 1), p=payoff_prob[tgt_sent_len], size=args.sample_size,
replace=True)
for i, e in enumerate(e_samples):
if e > 0:
# sample a new tgt_sent $y$
old_word_pos = np.random.choice(range(1, tgt_sent_len + 1), size=e, replace=False)
new_words = [vocab.tgt.id2word[wid] for wid in np.random.randint(3, len(vocab.tgt), size=e)]
new_tgt_sent = list(tgt_sent)
for pos, word in zip(old_word_pos, new_words):
new_tgt_sent[pos] = word
bleu_score = sentence_bleu([tgt_ref_tokens], new_tgt_sent[1:-1])
bleu_scores.append(bleu_score)
else:
new_tgt_sent = list(tgt_sent)
bleu_scores.append(1.)
# print('y: %s' % ' '.join(new_tgt_sent))
tgt_samples.append(new_tgt_sent)
def generate_hamming_distance_payoff_distribution(max_sent_len, vocab_size, tau=1.):
"""compute the q distribution for Hamming Distance (substitution only) as in the RAML paper"""
probs = dict()
Z_qs = dict()
for sent_len in range(1, max_sent_len + 1):
counts = [1.] # e = 0, count = 1
for e in range(1, sent_len + 1):
# apply the rescaling trick as in https://gist.github.com/norouzi/8c4d244922fa052fa8ec18d8af52d366
count = comb(sent_len, e) * math.exp(-e / tau) * ((vocab_size - 1) ** (e - e / tau))
counts.append(count)
Z_qs[sent_len] = Z_q = sum(counts)
prob = [count / Z_q for count in counts]
probs[sent_len] = prob
# print('sent_len=%d, %s' % (sent_len, prob))
return probs, Z_qs
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--mode', choices=['sample_from_model', 'sample_ngram_adapt', 'sample_ngram'], required=True)
parser.add_argument('--vocab', type=str)
parser.add_argument('--src', type=str)
parser.add_argument('--tgt', type=str)
parser.add_argument('--parallel_data', type=str)
parser.add_argument('--sample_file', type=str)
parser.add_argument('--output', type=str, required=True)
parser.add_argument('--sample_size', type=int, default=100)
parser.add_argument('--reward', choices=['bleu', 'edit_dist', 'rouge'], default='bleu')
parser.add_argument('--max_ngram_size', type=int, default=4)
parser.add_argument('--temp', type=float, default=0.5)
parser.add_argument('--smooth_bleu', action='store_true', default=False)
args = parser.parse_args()
if args.mode == 'sample_ngram':
sample_ngram(args)
elif args.mode == 'sample_from_model':
sample_from_model(args)
elif args.mode == 'sample_ngram_adapt':
sample_ngram_adapt(args)
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/utils/raml_samples_generation/util.py
================================================
from collections import defaultdict
import numpy as np
def read_corpus(file_path, source):
data = []
for line in open(file_path):
sent = line.strip().split(' ')
# only append <s> and </s> to the target sentence
if source == 'tgt':
sent = ['<s>'] + sent + ['</s>']
data.append(sent)
return data
def batch_slice(data, batch_size, sort=True):
batch_num = int(np.ceil(len(data) / float(batch_size)))
for i in range(batch_num):
cur_batch_size = batch_size if i < batch_num - 1 else len(data) - batch_size * i
src_sents = [data[i * batch_size + b][0] for b in range(cur_batch_size)]
tgt_sents = [data[i * batch_size + b][1] for b in range(cur_batch_size)]
if sort:
src_ids = sorted(range(cur_batch_size), key=lambda src_id: len(src_sents[src_id]), reverse=True)
src_sents = [src_sents[src_id] for src_id in src_ids]
tgt_sents = [tgt_sents[src_id] for src_id in src_ids]
yield src_sents, tgt_sents
def data_iter(data, batch_size, shuffle=True):
"""
randomly permute data, then sort by source length, and partition into batches
ensure that the length of source sentences in each batch is decreasing
"""
buckets = defaultdict(list)
for pair in data:
src_sent = pair[0]
buckets[len(src_sent)].append(pair)
batched_data = []
for src_len in buckets:
tuples = buckets[src_len]
if shuffle: np.random.shuffle(tuples)
batched_data.extend(list(batch_slice(tuples, batch_size)))
if shuffle:
np.random.shuffle(batched_data)
for batch in batched_data:
yield batch
================================================
FILE: texar_repo/examples/seq2seq_exposure_bias/utils/raml_samples_generation/vocab.py
================================================
from __future__ import print_function
import argparse
from collections import Counter
from itertools import chain
import torch
from util import read_corpus
class VocabEntry(object):
def __init__(self):
self.word2id = dict()
self.unk_id = 3
self.word2id['<pad>'] = 0
self.word2id['<s>'] = 1
self.word2id['</s>'] = 2
self.word2id['<unk>'] = 3
self.id2word = {v: k for k, v in self.word2id.items()}
def __getitem__(self, word):
return self.word2id.get(word, self.unk_id)
def __contains__(self, word):
return word in self.word2id
def __setitem__(self, key, value):
raise ValueError('vocabulary is readonly')
def __len__(self):
return len(self.word2id)
def __repr__(self):
return 'Vocabulary[size=%d]' % len(self)
def id2word(self, wid):
return self.id2word[wid]
def add(self, word):
if word not in self:
wid = self.word2id[word] = len(self)
self.id2word[wid] = word
return wid
else:
return self[word]
@staticmethod
def from_corpus(corpus, size, remove_singleton=True):
vocab_entry = VocabEntry()
word_freq = Counter(chain(*corpus))
non_singletons = [w for w in word_freq if word_freq[w] > 1]
print('number of word types: %d, number of word types w/ frequency > 1: %d' % (len(word_freq),
len(non_singletons)))
top_k_words = sorted(word_freq.keys(), reverse=True, key=word_freq.get)[:size]
for word in top_k_words:
if len(vocab_entry) < size:
if not (word_freq[word] == 1 and remove_singleton):
vocab_entry.add(word)
return vocab_entry
class Vocab(object):
def __init__(self, src_sents, tgt_sents, src_vocab_size, tgt_vocab_size, remove_singleton=True):
assert len(src_sents) == len(tgt_sents)
print('initialize source vocabulary ..')
self.src = VocabEntry.from_corpus(src_sents, src_vocab_size, remove_singleton=remove_singleton)
print('initialize target vocabulary ..')
self.tgt = VocabEntry.from_corpus(tgt_sents, tgt_vocab_size, remove_singleton=remove_singleton)
def __repr__(self):
return 'Vocab(source %d words, target %d words)' % (len(self.src), len(self.tgt))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--src_vocab_size', default=50000, type=int, help='source vocabulary size')
parser.add_argument('--tgt_vocab_size', default=50000, type=int, help='target vocabulary size')
parser.add_argument('--include_singleton', action='store_true', default=False, help='whether to include singletons '
'in the vocabulary (default=False)')
parser.add_argument('--train_src', type=str, required=True, help='file of source sentences')
parser.add_argument('--train_tgt', type=str, required=True, help='file of target sentences')
parser.add_argument('--output', default='vocab.bin', type=str, help='output vocabulary file')
args = parser.parse_args()
print('read in source sentences: %s' % args.train_src)
print('read in target sentences: %s' % args.train_tgt)
src_sents = read_corpus(args.train_src, source='src')
tgt_sents = read_corpus(args.train_tgt, source='tgt')
vocab = Vocab(src_sents, tgt_sents, args.src_vocab_size, args.tgt_vocab_size, remove_singleton=not args.include_singleton)
print('generated vocabulary, source %d words, target %d words' % (len(vocab.src), len(vocab.tgt)))
torch.save(vocab, args.output)
print('vocabulary saved to %s' % args.output)
================================================
FILE: texar_repo/examples/seq2seq_rl/README.md
================================================
# Seq2seq Model with Policy Gradient Training #
This example builds an attentional seq2seq model that is trained with policy gradient and a BLEU reward. The example mainly demonstrates the Texar sequence reinforcement-learning APIs. No MLE pre-training is included, so the model collapses very quickly. In practice one would usually pretrain the model with teacher-forcing MLE (e.g., see the example [seq2seq_attn](../seq2seq_attn)) and then fine-tune with policy gradient.
The data and model configs are exactly the same as in the [MLE seq2seq example](../seq2seq_attn). The only difference is that MLE cross-entropy minimization is replaced with policy gradient training.
The example shows:
* Use of `texar.agents.SeqPGAgent` for policy gradient sequence generation.
* Use of the Python-based `texar.evals.sentence/corpus_bleu` for efficient reward computing, and the Moses `texar.evals.sentence/corpus_bleu_moses`
for standard test set evaluation.
* Use of `texar.data.FeedableDataIterator` for data feeding and resuming from breakpoint.
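The per-sample BLEU reward mentioned above can be illustrated with a minimal, self-contained sketch of smoothed sentence-level BLEU. This is a stand-in for `texar.evals.sentence_bleu`, not its actual implementation, and the example sentences are illustrative only:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Smoothed sentence-level BLEU: geometric mean of add-one-smoothed
    n-gram precisions times a brevity penalty. A simplified stand-in for
    `texar.evals.sentence_bleu`."""
    log_precisions = []
    for n in range(1, max_n + 1):
        ref = ngram_counts(reference, n)
        hyp = ngram_counts(hypothesis, n)
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        # add-one smoothing keeps zero-overlap n-gram orders off -inf
        log_precisions.append(math.log((overlap + 1.0) / (total + 1.0)))
    # brevity penalty punishes hypotheses shorter than the reference
    bp = min(1.0, math.exp(1.0 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
print(sentence_bleu(reference, reference))                   # identical -> 1.0
print(sentence_bleu(reference, "the dog sat down".split()))  # partial overlap, < 1.0
```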
## Usage ##
### Dataset ###
Two example datasets are provided:
* toy_copy: A small toy autoencoding dataset from [TF Seq2seq toolkit](https://github.com/google/seq2seq/tree/2500c26add91b079ca00cf1f091db5a99ddab9ae).
* iwslt14: The benchmark [IWSLT2014](https://sites.google.com/site/iwsltevaluation2014/home) (de-en) machine translation dataset.
Download the data with the following commands:
```
python prepare_data.py --data toy_copy
python prepare_data.py --data iwslt14
```
### Train the model ###
Train the model with the following command:
```
python seq2seq_attn_pg.py --config_model config_model --config_data config_toy_copy
```
Here:
* `--config_model` specifies the model config. Note not to include the `.py` suffix.
* `--config_data` specifies the data config.
All configs are (mostly) the same as those in the [seq2seq_attn example](../seq2seq_attn).
## Results ##
The code is for demonstrating the Texar API. With pure policy gradient and without MLE pretraining, the model collapses very quickly.
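Conceptually, the objective behind this training scheme is a reward-weighted log-likelihood of the sampled sequences (REINFORCE with a baseline). A minimal numeric sketch in plain Python, not the Texar `SeqPGAgent` API; the log-probabilities and rewards below are made-up numbers:

```python
def seq_pg_loss(log_probs, rewards, baseline=0.0):
    """REINFORCE-style loss over a batch of sampled sequences: the
    negative advantage-weighted sum of sequence log-probabilities,
    averaged over the batch. Minimizing it raises the probability of
    samples whose reward beats the baseline."""
    assert len(log_probs) == len(rewards)
    total = sum(-(r - baseline) * lp for lp, r in zip(log_probs, rewards))
    return total / len(log_probs)

# two sampled sequences: summed token log-probs and their BLEU rewards
log_probs = [-12.3, -7.8]
rewards = [0.10, 0.45]
loss = seq_pg_loss(log_probs, rewards, baseline=0.20)
print(loss)
```

When every reward equals the baseline the advantage vanishes and the loss is zero, which is why, without MLE pretraining, the near-zero BLEU of random samples gives almost no learning signal.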
================================================
FILE: texar_repo/examples/seq2seq_rl/config_iwslt14.py
================================================
display = 100
display_eval = 5500
source_vocab_file = './data/iwslt14/vocab.de'
target_vocab_file = './data/iwslt14/vocab.en'
train = {
'num_epochs': 10,
'batch_size': 32,
'allow_smaller_final_batch': False,
'source_dataset': {
"files": 'data/iwslt14/train.de',
'vocab_file': source_vocab_file,
'max_seq_length': 50
},
'target_dataset': {
'files': 'data/iwslt14/train.en',
'vocab_file': target_vocab_file,
'max_seq_length': 50
}
}
val = {
'batch_size': 32,
'shuffle': False,
'source_dataset': {
"files": 'data/iwslt14/valid.de',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/iwslt14/valid.en',
'vocab_file': target_vocab_file,
}
}
test = {
'batch_size': 32,
'shuffle': False,
'source_dataset': {
"files": 'data/iwslt14/test.de',
'vocab_file': source_vocab_file,
},
'target_dataset': {
'files': 'data/iwslt14/test.en',
'vocab_file': target_vocab_file,
}
}
================================================
FILE: texar_repo/examples/seq2seq_rl/config_model.py
================================================
# Attentional Seq2seq model.
# Hyperparameters not specified here will take the default values.
num_units = 256
beam_width = 10
embedder = {
'dim': num_units
}
encoder = {
'rnn_cell_fw': {
'kwargs': {
'num_units': num_units
}
}
}
decoder = {
'rnn_cell': {
'kwargs': {
'num_units': num_units
},
},
'attention': {
'kwargs': {
'num_units': num_units,
},
'attention_layer_size': num_units
}
}
agent = {
'discount_factor': 0.,
'entropy_weight': .5
}
================================================
FILE: texar_repo/examples/seq2seq_rl/config_toy_copy.py
================================================
display = 10
display_eval = 300
source_vocab_file = './data/toy_copy/train/vocab.sources.txt'
target_vocab_file = './data/toy_copy/train/vocab.targets.txt'
train = {
'num_epochs': 10,
'batch_size': 32,
'allow_smaller_final_batch': False,
'source_dataset': {
"files": './data/toy_copy/train/sources.txt',
'vocab_file': source_vocab_file
},
'target_dataset': {
'files': './data/toy_copy/train/targets.txt',
'vocab_file': target_vocab_file
}
}
val = {
'batch_size': 32,
'allow_smaller_final_batch': False,
'source_dataset': {
"files": './data/toy_copy/dev/sources.txt',
'vocab_file': source_vocab_file
},
'target_dataset': {
"files": './data/toy_copy/dev/targets.txt',
'vocab_file': target_vocab_file
}
}
test = {
'batch_size': 32,
'allow_smaller_final_batch': False,
'source_dataset': {
"files": './data/toy_copy/test/sources.txt',
'vocab_file': source_vocab_file
},
'target_dataset': {
"files": './data/toy_copy/test/targets.txt',
'vocab_file': target_vocab_file
}
}
================================================
FILE: texar_repo/examples/seq2seq_rl/prepare_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Downloads data.
"""
import tensorflow as tf
import texar as tx
# pylint: disable=invalid-name
flags = tf.flags
flags.DEFINE_string("data", "iwslt14", "Data to download [iwslt14|toy_copy]")
FLAGS = flags.FLAGS
def prepare_data():
"""Downloads data.
"""
if FLAGS.data == 'iwslt14':
tx.data.maybe_download(
urls='https://drive.google.com/file/d/'
'1Vuv3bed10qUxrpldHdYoiWLzPKa4pNXd/view?usp=sharing',
path='./',
filenames='iwslt14.zip',
extract=True)
elif FLAGS.data == 'toy_copy':
tx.data.maybe_download(
urls='https://drive.google.com/file/d/'
'1fENE2rakm8vJ8d3voWBgW4hGlS6-KORW/view?usp=sharing',
path='./',
filenames='toy_copy.zip',
extract=True)
else:
raise ValueError('Unknown data: {}'.format(FLAGS.data))
def main():
"""Entrypoint.
"""
prepare_data()
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/seq2seq_rl/seq2seq_attn_pg.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Attentional Seq2seq trained with policy gradient.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
#pylint: disable=invalid-name, too-many-arguments, too-many-locals
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
flags = tf.flags
flags.DEFINE_string("config_model", "config_model", "The model config.")
flags.DEFINE_string("config_data", "config_iwslt14", "The dataset config.")
FLAGS = flags.FLAGS
config_model = importlib.import_module(FLAGS.config_model)
config_data = importlib.import_module(FLAGS.config_data)
# A caveat of using `texar.agents.SeqPGAgent`:
# The training data iterator should not run to raise `OutOfRangeError`,
# otherwise the iterator cannot be re-initialized and may raise
# `CancelledError`. This is probably because the iterator is used by
# `tf.Session.partial_run` in `SeqPGAgent`.
#
# A simple workaround is to set `'num_epochs'` of training data to a large
# number so that its iterator will never run into `OutOfRangeError`. Use
# `texar.data.FeedableDataIterator` to periodically switch to dev/test data
# for evaluation and switch back to the training data to resume from the
# breakpoint.
def build_model(batch, train_data):
"""Assembles the seq2seq model.
"""
source_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.source_vocab.size, hparams=config_model.embedder)
encoder = tx.modules.BidirectionalRNNEncoder(
hparams=config_model.encoder)
enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))
target_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.target_vocab.size, hparams=config_model.embedder)
decoder = tx.modules.AttentionRNNDecoder(
memory=tf.concat(enc_outputs, axis=2),
memory_sequence_length=batch['source_length'],
vocab_size=train_data.target_vocab.size,
hparams=config_model.decoder)
start_tokens = tf.ones_like(batch['target_length']) * \
train_data.target_vocab.bos_token_id
outputs, _, sequence_length = decoder(
decoding_strategy='infer_sample',
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
embedding=target_embedder,
max_decoding_length=30)
beam_search_outputs, _, _ = \
tx.modules.beam_search_decode(
decoder_or_cell=decoder,
embedding=target_embedder,
start_tokens=start_tokens,
end_token=train_data.target_vocab.eos_token_id,
beam_width=config_model.beam_width,
max_decoding_length=60)
return outputs, sequence_length, beam_search_outputs
def main():
"""Entrypoint.
"""
train_data = tx.data.PairedTextData(hparams=config_data.train)
val_data = tx.data.PairedTextData(hparams=config_data.val)
test_data = tx.data.PairedTextData(hparams=config_data.test)
iterator = tx.data.FeedableDataIterator(
{'train': train_data, 'val': val_data, 'test': test_data})
batch = iterator.get_next()
outputs, sequence_length, infer_outputs = build_model(batch, train_data)
agent = tx.agents.SeqPGAgent(
samples=outputs.sample_id,
logits=outputs.logits,
sequence_length=sequence_length,
hparams=config_model.agent)
def _train_and_eval(sess, agent):
iterator.restart_dataset(sess, 'train')
best_val_bleu = -1.
step = 0
while True:
try:
# Samples
extra_fetches = {
'truth': batch['target_text_ids'],
}
feed_dict = {
iterator.handle: iterator.get_handle(sess, 'train')
}
fetches = agent.get_samples(
extra_fetches=extra_fetches, feed_dict=feed_dict)
sample_text = tx.utils.map_ids_to_strs(
fetches['samples'], train_data.target_vocab,
strip_eos=False, join=False)
truth_text = tx.utils.map_ids_to_strs(
fetches['truth'], train_data.target_vocab,
strip_eos=False, join=False)
# Computes rewards
reward = []
for ref, hyp in zip(truth_text, sample_text):
r = tx.evals.sentence_bleu([ref], hyp, smooth=True)
reward.append(r)
# Updates
loss = agent.observe(reward=reward)
# Displays & evaluates
step += 1
if step == 1 or step % config_data.display == 0:
print("step={}, loss={:.4f}, reward={:.4f}".format(
step, loss, np.mean(reward)))
if step % config_data.display_eval == 0:
val_bleu = _eval_epoch(sess, 'val')
best_val_bleu = max(best_val_bleu, val_bleu)
print('val step={}, BLEU={:.4f}; best-ever={:.4f}'.format(
step, val_bleu, best_val_bleu))
test_bleu = _eval_epoch(sess, 'test')
print('test step={}, BLEU={:.4f}'.format(step, test_bleu))
print('=' * 50)
except tf.errors.OutOfRangeError:
break
def _eval_epoch(sess, mode):
"""`mode` is one of {'val', 'test'}
"""
iterator.restart_dataset(sess, mode)
refs, hypos = [], []
while True:
try:
fetches = [
batch['target_text'][:, 1:],
infer_outputs.predicted_ids[:, :, 0]
]
feed_dict = {
tx.global_mode(): tf.estimator.ModeKeys.PREDICT,
iterator.handle: iterator.get_handle(sess, mode)
}
target_texts, output_ids = \
sess.run(fetches, feed_dict=feed_dict)
target_texts = tx.utils.strip_special_tokens(target_texts)
output_texts = tx.utils.map_ids_to_strs(
ids=output_ids, vocab=val_data.target_vocab)
for hypo, ref in zip(output_texts, target_texts):
hypos.append(hypo)
refs.append([ref])
except tf.errors.OutOfRangeError:
break
return tx.evals.corpus_bleu_moses(list_of_references=refs,
hypotheses=hypos)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
agent.sess = sess
_train_and_eval(sess, agent)
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/seqgan/README.md
================================================
# SeqGAN for Text Generation
This example is an implementation of [(Yu et al.) SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient](https://arxiv.org/pdf/1609.05473.pdf), with a language model as the generator and an RNN-based classifier as the discriminator.
Model architecture and parameter settings are in line with the [official implementation](https://github.com/geek-ai/Texygen) of SeqGAN, except that we replace the MC-Tree rollout strategy with a token-level reward from the RNN discriminator, which is simpler and provides competitive performance.
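As a rough illustration of the token-level reward idea (not the example's actual `tx.losses.discount_reward` implementation), each generated token can receive a discounted return-to-go computed from per-token discriminator scores. The scores and `gamma` below are made up:

```python
# Hypothetical sketch of token-level discounted rewards, standing in for the
# per-token discriminator scores and reward discounting used in the example.

def discount_rewards(token_scores, gamma=0.95):
    """Each position gets its own score plus the discounted sum of all
    future scores (return-to-go), as in policy-gradient training."""
    discounted = [0.0] * len(token_scores)
    running = 0.0
    for t in reversed(range(len(token_scores))):
        running = token_scores[t] + gamma * running
        discounted[t] = running
    return discounted

# made-up per-token discriminator scores for one sampled sentence
print(discount_rewards([0.1, 0.5, 0.2]))
```

Earlier tokens accumulate more future reward, which is what lets a single discriminator pass replace per-token rollouts.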
Experiments are performed on two datasets:
* The standard [PTB dataset](https://corochann.com/penn-tree-bank-ptb-dataset-introduction-1456.html) for language modeling
* The [COCO Captions dataset](http://cocodataset.org/#download): with a 2K vocabulary and an average sentence length of 25. We use the [data](https://github.com/geek-ai/Texygen/tree/master/data) provided in the official implementation, where the train and test sets each contain 10K sentences.
## Usage
### Dataset
Download the datasets with the following commands, respectively:
```shell
python data_utils.py --config config_ptb_small --data_path ./ --dataset ptb
python data_utils.py --config config_coco --data_path ./ --dataset coco
```
Here:
* `--config` specifies config parameters to use. Default is `config_ptb_small`.
* `--data_path` is the directory to store the downloaded dataset. Default is `./`.
* `--dataset` indicates the training dataset. Currently `ptb` (default) and `coco` are supported.
### Train the model
Training on `coco` dataset can be performed with the following command:
```shell
python seqgan_train.py --config config_coco --data_path ./ --dataset coco
```
Here:
`--config`, `--data_path` and `--dataset` should match the flag settings used when downloading the dataset.
The model will start training and will evaluate perplexity and BLEU score every 10 epochs.
## Results
### COCO Caption
We compare the results of SeqGAN and MLE (maximum likelihood training) from our implementation and the official one, using the default official parameter settings. Each cell below presents the BLEU scores on the test set and, in parentheses, on the training set.
We use the standard BLEU function [`texar.evals.sentence_bleu_moses`](https://texar.readthedocs.io/en/latest/code/evals.html#sentence-bleu-moses) to evaluate BLEU scores for both the official and our implementations.
| |Texar - SeqGAN | Official - SeqGAN | Texar - MLE | Official - MLE |
|---------------|-------------|----------------|-------------|----------------|
|BLEU-1 | 0.5670 (0.6850) | 0.6260 (0.7900) | 0.7130 (0.9360) | 0.6620 (0.8770) |
|BLEU-2 | 0.3490 (0.5330) | 0.3570 (0.5880) | 0.4510 (0.7590) | 0.3780 (0.6910) |
|BLEU-3 | 0.1940 (0.3480) | 0.1660 (0.3590) | 0.2490 (0.4990) | 0.1790 (0.4470) |
|BLEU-4 | 0.0940 (0.1890) | 0.0710 (0.1800) | 0.1170 (0.2680) | 0.0790 (0.2390)|
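The BLEU-n scores above are built from modified (clipped) n-gram precisions. A minimal stdlib sketch of that core computation, independent of `texar.evals` and using made-up sentences:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, hyp, n):
    """Clipped n-gram precision: each hypothesis n-gram count is capped by
    its count in the reference before dividing by the hypothesis total."""
    ref_counts = Counter(ngrams(ref, n))
    hyp_counts = Counter(ngrams(hyp, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

ref = "a man riding a wave on a surfboard".split()
hyp = "a man rides a big wave".split()
print(modified_precision(ref, hyp, 1), modified_precision(ref, hyp, 2))
```

Full BLEU additionally combines these precisions with a geometric mean and a brevity penalty; the Moses variant used here also applies its own tokenization.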
### PTB
On the PTB data, we use three hyperparameter configurations that result in models of different sizes.
The perplexities on the training and test sets are listed in the following table.
|config|Texar - train |Official - train |Texar - test | Official - test |
|--- |--- |--- |--- |--- |
|small |28.4790 |53.2289 |58.9798 | 55.7736 |
|medium|16.3243 |9.8919 |37.6558 | 20.8537 |
|large |14.5739 |4.7015 |52.0850 | 39.7949 |
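The perplexities above are computed as in `seqgan_train.py`: the exponential of the mean per-token cross-entropy (`np.exp(loss / steps)`). A stdlib sketch with made-up numbers:

```python
from math import exp, log

def perplexity(total_nll, num_tokens):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return exp(total_nll / num_tokens)

# hypothetical accumulated loss equivalent to log(50) nats per token
print(perplexity(100 * log(50.0), 100))
```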
## Training Log
During training, loss and BLEU score are recorded in the log directory. Here, we provide sample log output when training on the `coco` dataset.
### Training loss
Training loss is recorded in `coco_log/log.txt`.
```text
G pretrain epoch 0, step 1: train_ppl: 1781.854030
G pretrain epoch 1, step 201: train_ppl: 10.483647
G pretrain epoch 2, step 401: train_ppl: 7.335757
...
G pretrain epoch 77, step 12201: train_ppl: 3.372638
G pretrain epoch 78, step 12401: train_ppl: 3.534658
D pretrain epoch 0, step 0: dis_total_loss: 27.025223, r_loss: 13.822192, f_loss: 13.203032
D pretrain epoch 1, step 0: dis_total_loss: 26.331108, r_loss: 13.592842, f_loss: 12.738266
D pretrain epoch 2, step 0: dis_total_loss: 27.042515, r_loss: 13.592712, f_loss: 13.449802
...
D pretrain epoch 77, step 0: dis_total_loss: 25.134272, r_loss: 12.660420, f_loss: 12.473851
D pretrain epoch 78, step 0: dis_total_loss: 23.727032, r_loss: 12.822734, f_loss: 10.904298
D pretrain epoch 79, step 0: dis_total_loss: 24.769077, r_loss: 12.733292, f_loss: 12.035786
G train epoch 80, step 12601: mean_reward: 0.027631, expect_reward_loss:-0.256241, update_loss: -20.670971
D train epoch 80, step 0: dis_total_loss: 25.222481, r_loss: 12.671371, f_loss: 12.551109
D train epoch 81, step 0: dis_total_loss: 25.695383, r_loss: 13.037079, f_loss: 12.658304
...
G train epoch 178, step 22401: mean_reward: 3.409714, expect_reward_loss:-3.474687, update_loss: 733.247009
D train epoch 178, step 0: dis_total_loss: 24.715553, r_loss: 13.181369, f_loss: 11.534184
D train epoch 179, step 0: dis_total_loss: 24.572170, r_loss: 13.176209, f_loss: 11.395961
```
### BLEU
BLEU-1 to BLEU-4 scores are calculated every 10 epochs; the results are written to `log_dir/bleu.txt`.
```text
...
epoch 170 BLEU1~4 on train dataset:
0.726647
0.530675
0.299362
0.133602
epoch 170 BLEU1~4 on test dataset:
0.548151
0.283765
0.118528
0.042177
...
```
================================================
FILE: texar_repo/examples/seqgan/config_coco.py
================================================
generator_pretrain_epoch = 80
discriminator_pretrain_epoch = 80
adversial_epoch = 100
hidden_size = 32
batch_size = 64
max_num_steps = 20
enc_keep_prob_in = 1.0
dec_keep_prob_out = 1.0
log_dir = './coco_log/'
log_file = log_dir + 'log.txt'
bleu_file = log_dir + 'bleu.txt'
ckpt = './checkpoint/ckpt'
dec_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": dec_keep_prob_out},
"num_layers": 1
}
emb_hparams = {
'name': 'lookup_table',
"dim": hidden_size,
'initializer': {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': hidden_size**-0.5,
},
}
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'coco_data/coco.train.txt',
"vocab_file": 'coco_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'coco_data/coco.valid.txt',
"vocab_file": 'coco_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": 'coco_data/coco.test.txt',
"vocab_file": 'coco_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
g_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.01
}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
}
}
d_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0001
}
}
}
update_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0004
}
}
}
================================================
FILE: texar_repo/examples/seqgan/config_ptb_large.py
================================================
generator_pretrain_epoch = 55
discriminator_pretrain_epoch = 15
adversial_epoch = 20
hidden_size = 1500
batch_size = 64
max_num_steps = 35
enc_keep_prob_in = 1.0
dec_keep_prob_out = 0.35
log_dir = './ptb_log.large/'
log_file = log_dir + 'log.txt'
bleu_file = log_dir + 'bleu.txt'
ckpt = './checkpoint/ckpt'
dec_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": dec_keep_prob_out},
"num_layers": 2
}
emb_hparams = {
'name': 'lookup_table',
"dim": hidden_size,
'initializer': {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': hidden_size**-0.5,
},
}
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'ptb_data/ptb.train.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'ptb_data/ptb.valid.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": 'ptb_data/ptb.test.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
g_opt_hparams = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 10.}
}
}
d_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0001
}
}
}
update_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0004
}
}
}
================================================
FILE: texar_repo/examples/seqgan/config_ptb_medium.py
================================================
generator_pretrain_epoch = 39
discriminator_pretrain_epoch = 15
adversial_epoch = 20
hidden_size = 650
batch_size = 64
max_num_steps = 35
enc_keep_prob_in = 1.0
dec_keep_prob_out = 0.5
log_dir = './ptb_log.medium/'
log_file = log_dir + 'log.txt'
bleu_file = log_dir + 'bleu.txt'
ckpt = './checkpoint/ckpt'
dec_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": dec_keep_prob_out},
"num_layers": 2
}
emb_hparams = {
'name': 'lookup_table',
"dim": hidden_size,
'initializer': {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': hidden_size**-0.5,
},
}
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'ptb_data/ptb.train.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'ptb_data/ptb.valid.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": 'ptb_data/ptb.test.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
g_opt_hparams = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
}
}
d_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0001
}
}
}
update_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0004
}
}
}
================================================
FILE: texar_repo/examples/seqgan/config_ptb_small.py
================================================
generator_pretrain_epoch = 13
discriminator_pretrain_epoch = 15
adversial_epoch = 10
hidden_size = 200
batch_size = 64
max_num_steps = 20
enc_keep_prob_in = 1.0
dec_keep_prob_out = 1.0
log_dir = './ptb_log.small/'
log_file = log_dir + 'log.txt'
bleu_file = log_dir + 'bleu.txt'
ckpt = './checkpoint/ckpt'
dec_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": dec_keep_prob_out},
"num_layers": 2
}
emb_hparams = {
'name': 'lookup_table',
"dim": hidden_size,
'initializer': {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': hidden_size**-0.5,
},
}
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'ptb_data/ptb.train.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": 'ptb_data/ptb.valid.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": 'ptb_data/ptb.test.txt',
"vocab_file": 'ptb_data/vocab.txt',
"max_seq_length": max_num_steps
}
}
g_opt_hparams = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
}
}
d_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0001
}
}
}
update_opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.0004
}
}
}
================================================
FILE: texar_repo/examples/seqgan/data_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""SeqGAN for language modeling
"""
import os
import argparse
import importlib
import tensorflow as tf
import texar as tx
parser = argparse.ArgumentParser(description='prepare data')
parser.add_argument('--dataset', type=str, default='ptb',
help='dataset to prepare')
parser.add_argument('--data_path', type=str, default='./',
                    help="Directory containing the data. If it does not "
                         "exist, it will be created and the data "
                         "will be downloaded.")
parser.add_argument('--config', type=str, default='config_ptb_small',
help='The config to use.')
args = parser.parse_args()
config = importlib.import_module(args.config)
def prepare_data(args, config, train_path):
"""Downloads the PTB or COCO dataset
"""
if not os.path.exists(config.log_dir):
os.mkdir(config.log_dir)
ptb_url = 'https://jxhe.github.io/download/ptb_data.tgz'
coco_url = 'https://VegB.github.io/downloads/coco_data.tgz'
data_path = args.data_path
if not tf.gfile.Exists(train_path):
url = ptb_url if args.dataset == 'ptb' else coco_url
tx.data.maybe_download(url, data_path, extract=True)
os.remove('%s_data.tgz' % args.dataset)
if __name__ == '__main__':
prepare_data(args, config, config.train_data_hparams['dataset']['files'])
================================================
FILE: texar_repo/examples/seqgan/seqgan_train.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""SeqGAN for language modeling
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, no-member, too-many-locals
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
flags = tf.flags
flags.DEFINE_string("dataset", "ptb",
"perform training on ptb or coco.")
flags.DEFINE_string("data_path", "./",
"Directory containing coco. If not exists, "
"the directory will be created, and the data "
"will be downloaded.")
flags.DEFINE_string("config", "config_ptb_small", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
log = open(config.log_file, 'w')
bleu_log = open(config.bleu_file, 'w')
# Data
train_data = tx.data.MonoTextData(config.train_data_hparams)
val_data = tx.data.MonoTextData(config.val_data_hparams)
test_data = tx.data.MonoTextData(config.test_data_hparams)
iterator = tx.data.TrainTestDataIterator(train=train_data,
val=val_data,
test=test_data)
data_batch = iterator.get_next()
batch_size = tf.shape(data_batch["text_ids"])[0]
num_steps = tf.shape(data_batch["text_ids"])[1]
vocab_size = train_data.vocab.size
# Model architecture
g_embedder = tx.modules.WordEmbedder(vocab_size=vocab_size,
hparams=config.emb_hparams)
input_embed = g_embedder(data_batch["text_ids"][:, :-1])
if config.enc_keep_prob_in < 1:
input_embed = tf.nn.dropout(
input_embed, tx.utils.switch_dropout(config.enc_keep_prob_in))
decoder = tx.modules.BasicRNNDecoder(
vocab_size=vocab_size,
hparams={"rnn_cell": config.dec_cell_hparams,
"max_decoding_length_infer": config.max_num_steps + 2})
initial_state = decoder.zero_state(batch_size=batch_size,
dtype=tf.float32)
g_variables = tx.utils.collect_trainable_variables([g_embedder, decoder])
# ------------Pretrain Generator---------------
outputs, _, _ = decoder(
initial_state=initial_state,
decoding_strategy="train_greedy",
inputs=input_embed,
sequence_length=data_batch["length"] - 1)
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=data_batch["text_ids"][:, 1:],
logits=outputs.logits,
sequence_length=data_batch["length"] - 1)
global_step = tf.Variable(0, trainable=False)
gen_train_op = tx.core.get_train_op(mle_loss,
variables=g_variables,
global_step=global_step,
increment_global_step=True,
hparams=config.g_opt_hparams)
# -------------Generator Infer-------------------
start_tokens = tf.cast(tf.fill([batch_size],
train_data.vocab.bos_token_id),
dtype=tf.int32)
infer_outputs, _, sequence_length = decoder(
decoding_strategy="infer_sample",
start_tokens=start_tokens,
end_token=train_data.vocab.eos_token_id,
embedding=g_embedder,
initial_state=initial_state,
max_decoding_length=config.max_num_steps)
infer_logits = infer_outputs.logits
infer_sample_ids = infer_outputs.sample_id
# ------------Pretrain Discriminator---------------
discriminator = tx.modules.UnidirectionalRNNClassifier(
hparams={"clas_strategy": "time_wise", "num_classes": 1})
d_embedder = tx.modules.WordEmbedder(vocab_size=vocab_size,
hparams=config.emb_hparams)
d_variables = tx.utils.collect_trainable_variables([discriminator, d_embedder])
r_logits, _ = discriminator(d_embedder(data_batch["text_ids"][:, 1:]),
sequence_length=data_batch["length"] - 1)
f_logits, _ = discriminator(d_embedder(infer_sample_ids), sequence_length=sequence_length)
r_loss = tx.losses.sequence_sigmoid_cross_entropy(
labels=tf.ones_like(data_batch["text_ids"][:, 1:], dtype=tf.float32),
logits=tf.squeeze(r_logits),
sequence_length=data_batch["length"] - 1) # r_preds -> 1.
f_loss = tx.losses.sequence_sigmoid_cross_entropy(
labels=tf.zeros_like(infer_sample_ids, dtype=tf.float32),
logits=tf.squeeze(f_logits),
sequence_length=sequence_length) # infer_logits -> 0.
dis_loss = r_loss + f_loss
dis_loss.set_shape(())
dis_train_op = tx.core.get_train_op(dis_loss,
variables=d_variables,
global_step=global_step,
increment_global_step=False,
hparams=config.d_opt_hparams)
    # ------------Adversarial---------------
infer_logits = tf.clip_by_value(
tf.nn.softmax(infer_logits) *
tf.one_hot(infer_sample_ids, vocab_size), 1e-20, 1)
expected_reward = tf.Variable(tf.zeros((config.max_num_steps,)))
reward = tf.reshape(f_logits, shape=(batch_size, -1)) - \
expected_reward[:tf.shape(f_logits)[1]]
mean_reward = tf.reduce_mean(reward)
exp_reward_loss = -tf.reduce_mean(tf.abs(reward))
exp_reward_loss.set_shape(())
exp_op = tx.core.get_train_op(exp_reward_loss,
variables=[expected_reward],
global_step=global_step,
increment_global_step=False,
hparams=config.update_opt_hparams)
reward = tx.losses.discount_reward(
reward, sequence_length=tf.squeeze(sequence_length), tensor_rank=2)
update_loss = -tf.reduce_mean(tf.log(infer_logits) *
tf.expand_dims(reward, -1))
update_loss.set_shape(())
gen_op = tx.core.get_train_op(update_loss,
variables=g_variables,
global_step=global_step,
increment_global_step=True,
hparams=config.update_opt_hparams)
update_op = tf.group(gen_op, exp_op)
def _g_train_epoch(sess, epoch, mode_string):
iterator.switch_to_train_data(sess)
while True:
try:
if mode_string == 'train':
fetches = {
'mean_rwd': mean_reward,
'exp_rwd_loss': exp_reward_loss,
'update_loss': update_loss,
'update_op': update_op,
'exp_rwd': expected_reward,
'step': global_step
}
elif mode_string == 'pretrain':
fetches = {
'mle_loss': mle_loss,
'num_steps': num_steps,
'train_op': gen_train_op,
'step': global_step
}
else:
raise ValueError(
"Expect mode_string to be one of "
"['pretrain', 'train'], got %s" % mode_string)
rtns = sess.run(fetches)
step = rtns['step']
if step % 200 == 1:
if mode_string == 'pretrain':
ppl = np.exp(rtns['mle_loss'] / rtns["num_steps"])
rst = "G {0:6s} epoch {1:3d}, step {2:3d}:" \
" train_ppl: {3:6f}".format(mode_string,
epoch, step, ppl)
else:
rst = "G {0:6s} epoch {1:3d}, step {2:3d}: " \
"mean_reward: {3:6f}, " \
"expect_reward_loss:{4:6f}, " \
"update_loss: {5:6f}".format(
mode_string, epoch, step, rtns['mean_rwd'],
rtns['exp_rwd_loss'], rtns['update_loss'])
log.write(rst + '\n')
log.flush()
print(rst)
if mode_string == 'train': # a batch per adversarial epoch
break
except tf.errors.OutOfRangeError:
break
return
def _g_test_epoch(sess, epoch, mode_string):
def _id2word_map(id_arrays):
return [' '.join([train_data.vocab.id_to_token_map_py[i]
for i in sent]) for sent in id_arrays]
if mode_string == 'valid':
iterator.switch_to_val_data(sess)
elif mode_string == 'test':
iterator.switch_to_test_data(sess)
else:
raise ValueError("Expect mode_string to be one of "
"['valid', 'test'], got %s" % mode_string)
target_list, inference_list = [], []
loss, steps = 0., 0
while True:
try:
fetches = {
"mle_loss": mle_loss,
"num_steps": num_steps
}
if mode_string == 'test':
fetches['target_sample_id'] = data_batch["text_ids"]
fetches['infer_sample_id'] = infer_sample_ids
feed_dict = {tx.global_mode(): tf.estimator.ModeKeys.EVAL}
rtns = sess.run(fetches, feed_dict)
loss += rtns['mle_loss']
steps += rtns['num_steps']
if mode_string == 'test':
                    targets = _id2word_map(rtns['target_sample_id'][:, 1:].tolist())  # remove <BOS>
                    for t in targets:
                        target_list.extend(t.split('<EOS>')[0].strip().split())
                    inferences = _id2word_map(rtns['infer_sample_id'].tolist())
                    for inf in inferences:
                        inference_list.extend(inf.split('<EOS>')[0].strip().split())
except tf.errors.OutOfRangeError:
break
ppl = np.exp(loss / steps)
rst = "G {0:6s} epoch {1:3d}, step {2:3s}:" \
" {3:5s}_ppl: {4:6f}"\
.format(mode_string, epoch, '-', mode_string, ppl)
log.write(rst + '\n')
log.flush()
print(rst)
if mode_string == 'test':
bleu_test = tx.evals.sentence_bleu_moses(
references=[target_list],
hypothesis=inference_list,
lowercase=True, return_all=True)
            if not isinstance(bleu_test, np.ndarray):  # may return 0.0 if inference_list is empty
bleu_test = [bleu_test] * 5
rst_test = "epoch %d BLEU1~4 on test dataset:\n" \
"%f\n%f\n%f\n%f\n\n" % \
(epoch, bleu_test[1], bleu_test[2],
bleu_test[3], bleu_test[4])
print(rst_test)
bleu_log.write(rst_test)
bleu_log.flush()
return
def _d_run_epoch(sess, epoch, mode_string='pretrain'):
iterator.switch_to_train_data(sess)
step = 0
while True:
try:
fetches = {
"mle_loss": dis_loss,
"r_loss": r_loss,
"f_loss": f_loss,
"train_op": dis_train_op
}
rtns = sess.run(fetches)
if step % 200 == 0:
rst = "D {0:6s} epoch {1:3d}, step {2:3d}: " \
"dis_total_loss: {3:6f}, r_loss: {4:6f}, " \
"f_loss: {5:6f}".format(
mode_string, epoch, step, rtns['mle_loss'],
rtns['r_loss'], rtns['f_loss'])
log.write(rst + '\n')
log.flush()
print(rst)
step += 1
if step == 15 and mode_string == 'train':
break
except tf.errors.OutOfRangeError:
break
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
with tf.Session(config=tf_config) as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
# Generator pre-training
for g_epoch in range(config.generator_pretrain_epoch):
_g_train_epoch(sess, g_epoch, 'pretrain')
if g_epoch % 10 == 0 or \
g_epoch == config.generator_pretrain_epoch - 1:
_g_test_epoch(sess, g_epoch, 'valid')
_g_test_epoch(sess, g_epoch, 'test')
# Discriminator pre-training
for d_epoch in range(config.discriminator_pretrain_epoch):
_d_run_epoch(sess, d_epoch)
# Adversarial training
for update_epoch in range(config.adversial_epoch):
cur_epoch = update_epoch + config.generator_pretrain_epoch
_g_train_epoch(sess, cur_epoch, 'train')
_d_run_epoch(sess, cur_epoch, mode_string='train')
if update_epoch % 10 == 0 or \
update_epoch == config.adversial_epoch - 1:
_g_test_epoch(sess, cur_epoch, 'valid')
_g_test_epoch(sess, cur_epoch, 'test')
log.close()
bleu_log.close()
if __name__ == '__main__':
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/sequence_tagging/README.md
================================================
# Sequence tagging on CoNLL-2003 #
This example builds a bi-directional LSTM-CNN model for the NER task and trains it on the CoNLL-2003 data. The model and training procedure are described in
>[(Ma et al.) End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.cs.cmu.edu/~xuezhem/publications/P16-1101.pdf)
The top CRF layer is not used here.
## Dataset ##
The code uses the [CoNLL-2003 NER dataset](https://www.clips.uantwerpen.be/conll2003/ner/) (English). Please put the data files (e.g., `eng.train.bio.conll`) under the `./data` folder. Pretrained GloVe word embeddings can also be used (set `load_glove=True` in [config.py](./config.py)). The GloVe file should also be placed under `./data`.
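For reference, `conll_reader.py` expects space-separated token lines with the word in column 2 and the NER tag in column 5, and normalizes digits to `0`. A minimal parsing sketch (the sample lines below are hypothetical):

```python
import re

DIGIT_RE = re.compile(r"\d")  # same digit normalization as conll_reader.py

def parse_conll_line(line, normalize_digits=True):
    """Return (word, ner_tag) from one CoNLL-2003 token line."""
    tokens = line.strip().split(' ')
    word = DIGIT_RE.sub("0", tokens[1]) if normalize_digits else tokens[1]
    return word, tokens[4]

print(parse_conll_line("1 EU NNP B-NP B-ORG"))
print(parse_conll_line("4 1996-08-22 CD I-NP O"))
```

Sentences are separated by blank lines, which the reader skips when building vocabularies.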
## Run ##
To train an NER model, run:
`python ner.py`
The model will begin training, evaluate on the validation data periodically, and evaluate on the test data once training is done.
## Results ##
The results on the validation and test data are:
| | prec | recall | F1 |
|-------|----------|----------|----------|
| valid | 91.18 | 92.41 | 91.79 |
| test | 86.13 | 88.31 | 87.21 |
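The precision/recall/F1 columns above follow the usual entity-level definitions. A stdlib sketch with made-up counts (not the script's actual evaluation code):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true-positive / false-positive /
    false-negative entity counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# hypothetical counts for one evaluation run
print(prf1(tp=90, fp=10, fn=15))
```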
================================================
FILE: texar_repo/examples/sequence_tagging/config.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""NER config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
num_epochs = 200
char_dim = 30
embed_dim = 100
hidden_size = 256
tag_space = 128
keep_prob = 0.5
batch_size = 16
encoder = None
load_glove = True
emb = {
"name": "embedding",
"dim": embed_dim,
"dropout_rate": 0.33,
"dropout_strategy": 'item'
}
char_emb = {
"name": "char_embedding",
"dim": char_dim
}
conv = {
"filters": 30,
"kernel_size": [3],
"conv_activation": "tanh",
"num_dense_layers": 0,
"dropout_rate": 0.
}
cell = {
"type": "LSTMCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 1.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 1
}
opt = {
"optimizer": {
"type": "MomentumOptimizer",
"kwargs": {"learning_rate": 0.1,
"momentum": 0.9,
"use_nesterov": True}
},
"learning_rate_decay": {
"type": "inverse_time_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 0.05,
"staircase": True
},
"start_decay_step": 1
}
}
================================================
FILE: texar_repo/examples/sequence_tagging/conll_reader.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utilities for preprocessing and iterating over the CoNLL 2003 data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
from collections import defaultdict
import numpy as np
import tensorflow as tf
# pylint: disable=invalid-name, too-many-locals
MAX_CHAR_LENGTH = 45
NUM_CHAR_PAD = 2
UNK_WORD, UNK_CHAR, UNK_NER = 0, 0, 0
PAD_WORD, PAD_CHAR, PAD_NER = 1, 1, 1
# Regular expressions used to normalize digits.
DIGIT_RE = re.compile(r"\d")
def create_vocabs(train_path, dev_path, test_path, normalize_digits=True, min_occur=1, glove_dict=None):
word_vocab = defaultdict(lambda: len(word_vocab))
word_count = defaultdict(lambda: 0)
char_vocab = defaultdict(lambda: len(char_vocab))
ner_vocab = defaultdict(lambda: len(ner_vocab))
UNK_WORD = word_vocab[""]
PAD_WORD = word_vocab[""]
UNK_CHAR = char_vocab[""]
PAD_CHAR = char_vocab[""]
UNK_NER = ner_vocab[""]
PAD_NER = ner_vocab[""]
print("Creating Vocabularies:")
for file_path in [train_path, dev_path, test_path]:
with open(file_path, 'r') as file:
for line in file:
line = line.strip()
if len(line) == 0:
continue
tokens = line.split(' ')
for char in tokens[1]:
cid = char_vocab[char]
word = DIGIT_RE.sub("0", tokens[1]) if normalize_digits else tokens[1]
ner = tokens[4]
if glove_dict is not None and (word in glove_dict or word.lower() in glove_dict):
word_count[word] += min_occur + 1
elif file_path == train_path:
word_count[word] += 1
nid = ner_vocab[ner]
print("Total Vocabulary Size: %d" % len(word_count))
for word in word_count:
if word_count[word] > min_occur:
wid = word_vocab[word]
print("Word Vocabulary Size: %d" % len(word_vocab))
print("Character Alphabet Size: %d" % len(char_vocab))
print("NER Alphabet Size: %d" % len(ner_vocab))
word_vocab = defaultdict(lambda: UNK_WORD, word_vocab)
char_vocab = defaultdict(lambda: UNK_CHAR, char_vocab)
ner_vocab = defaultdict(lambda: UNK_NER, ner_vocab)
i2w = {v: k for k, v in word_vocab.items()}
i2n = {v: k for k, v in ner_vocab.items()}
return (word_vocab, char_vocab, ner_vocab), (i2w, i2n)
def read_data(source_path, word_vocab, char_vocab, ner_vocab, normalize_digits=True):
data = []
print('Reading data from %s' % source_path)
counter = 0
reader = CoNLLReader(source_path, word_vocab, char_vocab, ner_vocab)
inst = reader.getNext(normalize_digits)
while inst is not None:
counter += 1
sent = inst.sentence
data.append([sent.word_ids, sent.char_id_seqs, inst.ner_ids])
inst = reader.getNext(normalize_digits)
reader.close()
print("Total number of data: %d" % counter)
return data
def iterate_batch(data, batch_size, shuffle=False):
if shuffle:
np.random.shuffle(data)
for start_idx in range(0, len(data), batch_size):
excerpt = slice(start_idx, start_idx + batch_size)
batch = data[excerpt]
batch_length = max([len(batch[i][0]) for i in range(len(batch))])
wid_inputs = np.empty([len(batch), batch_length], dtype=np.int64)
cid_inputs = np.empty([len(batch), batch_length, MAX_CHAR_LENGTH], dtype=np.int64)
nid_inputs = np.empty([len(batch), batch_length], dtype=np.int64)
masks = np.zeros([len(batch), batch_length], dtype=np.float32)
lengths = np.empty(len(batch), dtype=np.int64)
for i, inst in enumerate(batch):
wids, cid_seqs, nids = inst
inst_size = len(wids)
lengths[i] = inst_size
# word ids
wid_inputs[i, :inst_size] = wids
wid_inputs[i, inst_size:] = PAD_WORD
for c, cids in enumerate(cid_seqs):
cid_inputs[i, c, :len(cids)] = cids
cid_inputs[i, c, len(cids):] = PAD_CHAR
cid_inputs[i, inst_size:, :] = PAD_CHAR
nid_inputs[i, :inst_size] = nids
nid_inputs[i, inst_size:] = PAD_NER
masks[i, :inst_size] = 1.0
yield wid_inputs, cid_inputs, nid_inputs, masks, lengths
def load_glove(filename, emb_dim, normalize_digits=True):
"""Loads embeddings in the glove text format in which each line is
''. Dimensions of the embedding vector
are separated with whitespace characters.
Args:
filename (str): Path to the embedding file.
vocab (dict): A dictionary that maps token strings to integer index.
Tokens not in :attr:`vocab` are not read.
word_vecs: A 2D numpy array of shape `[vocab_size, embed_dim]`
which is updated as reading from the file.
Returns:
The updated :attr:`word_vecs`.
"""
glove_dict = dict()
with tf.gfile.Open(filename) as fin:
for line in fin:
vec = line.strip().split()
if len(vec) == 0:
continue
word, vec = vec[0], vec[1:]
word = tf.compat.as_text(word)
word = DIGIT_RE.sub("0", word) if normalize_digits else word
glove_dict[word] = np.array([float(v) for v in vec])
if len(vec) != emb_dim:
raise ValueError("Inconsistent word vector sizes: %d vs %d" %
(len(vec), emb_dim))
return glove_dict
def construct_init_word_vecs(vocab, word_vecs, glove_dict):
for word, index in vocab.items():
if word in glove_dict:
embedding = glove_dict[word]
elif word.lower() in glove_dict:
embedding = glove_dict[word.lower()]
else: embedding = None
if embedding is not None:
word_vecs[index] = embedding
return word_vecs
class CoNLLReader(object):
def __init__(self, file_path, word_vocab, char_vocab, ner_vocab):
self.__source_file = open(file_path, 'r', encoding='utf-8')
self.__word_vocab = word_vocab
self.__char_vocab = char_vocab
self.__ner_vocab = ner_vocab
def close(self):
self.__source_file.close()
def getNext(self, normalize_digits=True):
line = self.__source_file.readline()
# skip multiple blank lines.
while len(line) > 0 and len(line.strip()) == 0:
line = self.__source_file.readline()
if len(line) == 0:
return None
lines = []
while len(line.strip()) > 0:
line = line.strip()
lines.append(line.split(' '))
line = self.__source_file.readline()
length = len(lines)
if length == 0:
return None
words = []
word_ids = []
char_seqs = []
char_id_seqs = []
ner_tags = []
ner_ids = []
for tokens in lines:
chars = []
char_ids = []
for char in tokens[1]:
chars.append(char)
char_ids.append(self.__char_vocab[char])
if len(chars) > MAX_CHAR_LENGTH:
chars = chars[:MAX_CHAR_LENGTH]
char_ids = char_ids[:MAX_CHAR_LENGTH]
char_seqs.append(chars)
char_id_seqs.append(char_ids)
word = DIGIT_RE.sub("0", tokens[1]) if normalize_digits else tokens[1]
ner = tokens[4]
words.append(word)
word_ids.append(self.__word_vocab[word])
ner_tags.append(ner)
ner_ids.append(self.__ner_vocab[ner])
return NERInstance(Sentence(words, word_ids, char_seqs, char_id_seqs), ner_tags, ner_ids)
class NERInstance(object):
def __init__(self, sentence, ner_tags, ner_ids):
self.sentence = sentence
self.ner_tags = ner_tags
self.ner_ids = ner_ids
def length(self):
return self.sentence.length()
class Sentence(object):
def __init__(self, words, word_ids, char_seqs, char_id_seqs):
self.words = words
self.word_ids = word_ids
self.char_seqs = char_seqs
self.char_id_seqs = char_id_seqs
def length(self):
return len(self.words)
================================================
FILE: texar_repo/examples/sequence_tagging/conll_writer.py
================================================
__author__ = 'max'
class CoNLLWriter(object):
def __init__(self, i2w, i2n):
self.__source_file = None
self.__i2w = i2w
self.__i2n = i2n
def start(self, file_path):
self.__source_file = open(file_path, 'w', encoding='utf-8')
def close(self):
self.__source_file.close()
def write(self, word, predictions, targets, lengths):
batch_size, _ = word.shape
for i in range(batch_size):
for j in range(lengths[i]):
w = self.__i2w[word[i, j]]
tgt = self.__i2n[targets[i, j]]
pred = self.__i2n[predictions[i, j]]
self.__source_file.write('%d %s %s %s %s %s\n' % (j + 1, w, "_", "_", tgt, pred))
self.__source_file.write('\n')
================================================
FILE: texar_repo/examples/sequence_tagging/conlleval
================================================
#!/usr/bin/perl -w
# conlleval: evaluate result of processing CoNLL-2000 shared task
# usage: conlleval [-l] [-r] [-d delimiterTag] [-o oTag] < file
# README: http://cnts.uia.ac.be/conll2000/chunking/output.html
# options: l: generate LaTeX output for tables like in
# http://cnts.uia.ac.be/conll2003/ner/example.tex
# r: accept raw result tags (without B- and I- prefix;
# assumes one word per chunk)
# d: alternative delimiter tag (default is single space)
# o: alternative outside tag (default is O)
# note: the file should contain lines with items separated
# by $delimiter characters (default space). The final
# two items should contain the correct tag and the
# guessed tag in that order. Sentences should be
# separated from each other by empty lines or lines
# with $boundary fields (default -X-).
# url: http://lcg-www.uia.ac.be/conll2000/chunking/
# started: 1998-09-25
# version: 2004-01-26
# author: Erik Tjong Kim Sang
use strict;
my $false = 0;
my $true = 42;
my $boundary = "-X-"; # sentence boundary
my $correct; # current corpus chunk tag (I,O,B)
my $correctChunk = 0; # number of correctly identified chunks
my $correctTags = 0; # number of correct chunk tags
my $correctType; # type of current corpus chunk tag (NP,VP,etc.)
my $delimiter = " "; # field delimiter
my $FB1 = 0.0; # FB1 score (Van Rijsbergen 1979)
my $firstItem; # first feature (for sentence boundary checks)
my $foundCorrect = 0; # number of chunks in corpus
my $foundGuessed = 0; # number of identified chunks
my $guessed; # current guessed chunk tag
my $guessedType; # type of current guessed chunk tag
my $i; # miscellaneous counter
my $inCorrect = $false; # currently processed chunk is correct until now
my $lastCorrect = "O"; # previous chunk tag in corpus
my $latex = 0; # generate LaTeX formatted output
my $lastCorrectType = ""; # type of previously identified chunk tag
my $lastGuessed = "O"; # previously identified chunk tag
my $lastGuessedType = ""; # type of previous chunk tag in corpus
my $lastType; # temporary storage for detecting duplicates
my $line; # line
my $nbrOfFeatures = -1; # number of features per line
my $precision = 0.0; # precision score
my $oTag = "O"; # outside tag, default O
my $raw = 0; # raw input: add B to every token
my $recall = 0.0; # recall score
my $tokenCounter = 0; # token counter (ignores sentence breaks)
my %correctChunk = (); # number of correctly identified chunks per type
my %foundCorrect = (); # number of chunks in corpus per type
my %foundGuessed = (); # number of identified chunks per type
my @features; # features on line
my @sortedTypes; # sorted list of chunk type names
# sanity check
while (@ARGV and $ARGV[0] =~ /^-/) {
if ($ARGV[0] eq "-l") { $latex = 1; shift(@ARGV); }
elsif ($ARGV[0] eq "-r") { $raw = 1; shift(@ARGV); }
elsif ($ARGV[0] eq "-d") {
shift(@ARGV);
if (not defined $ARGV[0]) {
die "conlleval: -d requires delimiter character";
}
$delimiter = shift(@ARGV);
} elsif ($ARGV[0] eq "-o") {
shift(@ARGV);
if (not defined $ARGV[0]) {
die "conlleval: -o requires delimiter character";
}
$oTag = shift(@ARGV);
} else { die "conlleval: unknown argument $ARGV[0]\n"; }
}
if (@ARGV) { die "conlleval: unexpected command line argument\n"; }
# process input
while (<STDIN>) {
chomp($line = $_);
@features = split(/$delimiter/,$line);
if ($nbrOfFeatures < 0) { $nbrOfFeatures = $#features; }
elsif ($nbrOfFeatures != $#features and @features != 0) {
printf STDERR "unexpected number of features: %d (%d)\n",
$#features+1,$nbrOfFeatures+1;
exit(1);
}
if (@features == 0 or
$features[0] eq $boundary) { @features = ($boundary,"O","O"); }
if (@features < 2) {
die "conlleval: unexpected number of features in line $line\n";
}
if ($raw) {
if ($features[$#features] eq $oTag) { $features[$#features] = "O"; }
if ($features[$#features-1] eq $oTag) { $features[$#features-1] = "O"; }
if ($features[$#features] ne "O") {
$features[$#features] = "B-$features[$#features]";
}
if ($features[$#features-1] ne "O") {
$features[$#features-1] = "B-$features[$#features-1]";
}
}
# 20040126 ET code which allows hyphens in the types
if ($features[$#features] =~ /^([^-]*)-(.*)$/) {
$guessed = $1;
$guessedType = $2;
} else {
$guessed = $features[$#features];
$guessedType = "";
}
pop(@features);
if ($features[$#features] =~ /^([^-]*)-(.*)$/) {
$correct = $1;
$correctType = $2;
} else {
$correct = $features[$#features];
$correctType = "";
}
pop(@features);
# ($guessed,$guessedType) = split(/-/,pop(@features));
# ($correct,$correctType) = split(/-/,pop(@features));
$guessedType = $guessedType ? $guessedType : "";
$correctType = $correctType ? $correctType : "";
$firstItem = shift(@features);
# 1999-06-26 sentence breaks should always be counted as out of chunk
if ( $firstItem eq $boundary ) { $guessed = "O"; }
if ($inCorrect) {
if ( &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and
&endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and
$lastGuessedType eq $lastCorrectType) {
$inCorrect=$false;
$correctChunk++;
$correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ?
$correctChunk{$lastCorrectType}+1 : 1;
} elsif (
&endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) !=
&endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) or
$guessedType ne $correctType ) {
$inCorrect=$false;
}
}
if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and
&startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and
$guessedType eq $correctType) { $inCorrect = $true; }
if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) ) {
$foundCorrect++;
$foundCorrect{$correctType} = $foundCorrect{$correctType} ?
$foundCorrect{$correctType}+1 : 1;
}
if ( &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) ) {
$foundGuessed++;
$foundGuessed{$guessedType} = $foundGuessed{$guessedType} ?
$foundGuessed{$guessedType}+1 : 1;
}
if ( $firstItem ne $boundary ) {
if ( $correct eq $guessed and $guessedType eq $correctType ) {
$correctTags++;
}
$tokenCounter++;
}
$lastGuessed = $guessed;
$lastCorrect = $correct;
$lastGuessedType = $guessedType;
$lastCorrectType = $correctType;
}
if ($inCorrect) {
$correctChunk++;
$correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ?
$correctChunk{$lastCorrectType}+1 : 1;
}
if (not $latex) {
# compute overall precision, recall and FB1 (default values are 0.0)
$precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0);
$recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0);
$FB1 = 2*$precision*$recall/($precision+$recall)
if ($precision+$recall > 0);
# print overall performance
printf "processed $tokenCounter tokens with $foundCorrect phrases; ";
printf "found: $foundGuessed phrases; correct: $correctChunk.\n";
if ($tokenCounter>0) {
printf "accuracy: %6.2f%%; ",100*$correctTags/$tokenCounter;
printf "precision: %6.2f%%; ",$precision;
printf "recall: %6.2f%%; ",$recall;
printf "FB1: %6.2f\n",$FB1;
}
}
# sort chunk type names
undef($lastType);
@sortedTypes = ();
foreach $i (sort (keys %foundCorrect,keys %foundGuessed)) {
if (not($lastType) or $lastType ne $i) {
push(@sortedTypes,($i));
}
$lastType = $i;
}
# print performance per chunk type
if (not $latex) {
for $i (@sortedTypes) {
$correctChunk{$i} = $correctChunk{$i} ? $correctChunk{$i} : 0;
if (not($foundGuessed{$i})) { $foundGuessed{$i} = 0; $precision = 0.0; }
else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; }
if (not($foundCorrect{$i})) { $recall = 0.0; }
else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; }
if ($precision+$recall == 0.0) { $FB1 = 0.0; }
else { $FB1 = 2*$precision*$recall/($precision+$recall); }
printf "%17s: ",$i;
printf "precision: %6.2f%%; ",$precision;
printf "recall: %6.2f%%; ",$recall;
printf "FB1: %6.2f %d\n",$FB1,$foundGuessed{$i};
}
} else {
print " & Precision & Recall & F\$_{\\beta=1} \\\\\\hline";
for $i (@sortedTypes) {
$correctChunk{$i} = $correctChunk{$i} ? $correctChunk{$i} : 0;
if (not($foundGuessed{$i})) { $precision = 0.0; }
else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; }
if (not($foundCorrect{$i})) { $recall = 0.0; }
else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; }
if ($precision+$recall == 0.0) { $FB1 = 0.0; }
else { $FB1 = 2*$precision*$recall/($precision+$recall); }
printf "\n%-7s & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\",
$i,$precision,$recall,$FB1;
}
print "\\hline\n";
$precision = 0.0;
$recall = 0;
$FB1 = 0.0;
$precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0);
$recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0);
$FB1 = 2*$precision*$recall/($precision+$recall)
if ($precision+$recall > 0);
printf STDOUT "Overall & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\\\hline\n",
$precision,$recall,$FB1;
}
exit 0;
# endOfChunk: checks if a chunk ended between the previous and current word
# arguments: previous and current chunk tags, previous and current types
# note: this code is capable of handling other chunk representations
# than the default CoNLL-2000 ones, see EACL'99 paper of Tjong
# Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006
sub endOfChunk {
my $prevTag = shift(@_);
my $tag = shift(@_);
my $prevType = shift(@_);
my $type = shift(@_);
my $chunkEnd = $false;
if ( $prevTag eq "B" and $tag eq "B" ) { $chunkEnd = $true; }
if ( $prevTag eq "B" and $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" and $tag eq "B" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" and $tag eq "E" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" and $tag eq "I" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" and $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; }
if ($prevTag ne "O" and $prevTag ne "." and $prevType ne $type) {
$chunkEnd = $true;
}
# corrected 1998-12-22: these chunks are assumed to have length 1
if ( $prevTag eq "]" ) { $chunkEnd = $true; }
if ( $prevTag eq "[" ) { $chunkEnd = $true; }
return($chunkEnd);
}
# startOfChunk: checks if a chunk started between the previous and current word
# arguments: previous and current chunk tags, previous and current types
# note: this code is capable of handling other chunk representations
# than the default CoNLL-2000 ones, see EACL'99 paper of Tjong
# Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006
sub startOfChunk {
my $prevTag = shift(@_);
my $tag = shift(@_);
my $prevType = shift(@_);
my $type = shift(@_);
my $chunkStart = $false;
if ( $prevTag eq "B" and $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "I" and $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "O" and $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; }
if ( $prevTag eq "E" and $tag eq "E" ) { $chunkStart = $true; }
if ( $prevTag eq "E" and $tag eq "I" ) { $chunkStart = $true; }
if ( $prevTag eq "O" and $tag eq "E" ) { $chunkStart = $true; }
if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; }
if ($tag ne "O" and $tag ne "." and $prevType ne $type) {
$chunkStart = $true;
}
# corrected 1998-12-22: these chunks are assumed to have length 1
if ( $tag eq "[" ) { $chunkStart = $true; }
if ( $tag eq "]" ) { $chunkStart = $true; }
return($chunkStart);
}
================================================
FILE: texar_repo/examples/sequence_tagging/ner.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Sequence tagging.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
from examples.sequence_tagging.conll_reader import create_vocabs, read_data, iterate_batch, load_glove, construct_init_word_vecs
from examples.sequence_tagging.conll_writer import CoNLLWriter
from examples.sequence_tagging import scores
flags = tf.flags
flags.DEFINE_string("data_path", "./data",
"Directory containing NER data (e.g., eng.train.bio.conll).")
flags.DEFINE_string("train", "eng.train.bio.conll",
"the file name of the training data.")
flags.DEFINE_string("dev", "eng.dev.bio.conll",
"the file name of the dev data.")
flags.DEFINE_string("test", "eng.test.bio.conll",
"the file name of the test data.")
flags.DEFINE_string("embedding", "glove.6B.100d.txt",
"the file name of the GloVe embedding.")
flags.DEFINE_string("config", "config", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
train_path = os.path.join(FLAGS.data_path, FLAGS.train)
dev_path = os.path.join(FLAGS.data_path, FLAGS.dev)
test_path = os.path.join(FLAGS.data_path, FLAGS.test)
embedding_path = os.path.join(FLAGS.data_path, FLAGS.embedding)
EMBEDD_DIM = config.embed_dim
CHAR_DIM = config.char_dim
# Prepares/loads data
if config.load_glove:
print('loading GloVe embedding...')
glove_dict = load_glove(embedding_path, EMBEDD_DIM)
else:
glove_dict = None
(word_vocab, char_vocab, ner_vocab), (i2w, i2n) = create_vocabs(train_path, dev_path, test_path, glove_dict=glove_dict)
data_train = read_data(train_path, word_vocab, char_vocab, ner_vocab)
data_dev = read_data(dev_path, word_vocab, char_vocab, ner_vocab)
data_test = read_data(test_path, word_vocab, char_vocab, ner_vocab)
scale = np.sqrt(3.0 / EMBEDD_DIM)
word_vecs = np.random.uniform(-scale, scale, [len(word_vocab), EMBEDD_DIM]).astype(np.float32)
if config.load_glove:
word_vecs = construct_init_word_vecs(word_vocab, word_vecs, glove_dict)
scale = np.sqrt(3.0 / CHAR_DIM)
char_vecs = np.random.uniform(-scale, scale, [len(char_vocab), CHAR_DIM]).astype(np.float32)
# Builds TF graph
inputs = tf.placeholder(tf.int64, [None, None])
chars = tf.placeholder(tf.int64, [None, None, None])
targets = tf.placeholder(tf.int64, [None, None])
masks = tf.placeholder(tf.float32, [None, None])
seq_lengths = tf.placeholder(tf.int64, [None])
vocab_size = len(word_vecs)
embedder = tx.modules.WordEmbedder(vocab_size=vocab_size, init_value=word_vecs, hparams=config.emb)
emb_inputs = embedder(inputs)
char_size = len(char_vecs)
char_embedder = tx.modules.WordEmbedder(vocab_size=char_size, init_value=char_vecs, hparams=config.char_emb)
emb_chars = char_embedder(chars)
char_shape = tf.shape(emb_chars) # [batch, length, char_length, char_dim]
emb_chars = tf.reshape(emb_chars, (-1, char_shape[2], CHAR_DIM))
char_encoder = tx.modules.Conv1DEncoder(config.conv)
char_outputs = char_encoder(emb_chars)
char_outputs = tf.reshape(char_outputs, (char_shape[0], char_shape[1], config.conv['filters']))
emb_inputs = tf.concat([emb_inputs, char_outputs], axis=2)
emb_inputs = tf.nn.dropout(emb_inputs, keep_prob=0.67)
encoder = tx.modules.BidirectionalRNNEncoder(hparams={"rnn_cell_fw": config.cell, "rnn_cell_bw": config.cell})
outputs, _ = encoder(emb_inputs, sequence_length=seq_lengths)
outputs = tf.concat(outputs, axis=2)
rnn_shape = tf.shape(outputs)
outputs = tf.reshape(outputs, (-1, 2 * config.hidden_size))
outputs = tf.layers.dense(outputs, config.tag_space, activation=tf.nn.elu)
outputs = tf.nn.dropout(outputs, keep_prob=config.keep_prob)
logits = tf.layers.dense(outputs, len(ner_vocab))
logits = tf.reshape(logits, tf.concat([rnn_shape[0:2], [len(ner_vocab)]], axis=0))
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=targets,
logits=logits,
sequence_length=seq_lengths,
average_across_batch=True,
average_across_timesteps=True,
sum_over_timesteps=False)
predicts = tf.argmax(logits, axis=2)
corrects = tf.reduce_sum(tf.cast(tf.equal(targets, predicts), tf.float32) * masks)
global_step = tf.placeholder(tf.int32)
train_op = tx.core.get_train_op(
mle_loss, global_step=global_step, increment_global_step=False,
hparams=config.opt)
# Training/eval processes
def _train_epoch(sess, epoch):
start_time = time.time()
loss = 0.
corr = 0.
num_tokens = 0.
fetches = {
"mle_loss": mle_loss,
"correct": corrects,
}
fetches["train_op"] = train_op
mode = tf.estimator.ModeKeys.TRAIN
num_inst = 0
for batch in iterate_batch(data_train, config.batch_size, shuffle=True):
word, char, ner, mask, length = batch
feed_dict = {
inputs: word, chars: char, targets: ner, masks: mask, seq_lengths: length,
global_step: epoch, tx.global_mode(): mode,
}
rets = sess.run(fetches, feed_dict)
nums = np.sum(length)
num_inst += len(word)
loss += rets["mle_loss"] * nums
corr += rets["correct"]
num_tokens += nums
print("train: %d (%d/%d) loss: %.4f, acc: %.2f%%" % (epoch, num_inst, len(data_train), loss / num_tokens, corr / num_tokens * 100))
print("train: %d loss: %.4f, acc: %.2f%%, time: %.2fs" % (epoch, loss / num_tokens, corr / num_tokens * 100, time.time() - start_time))
def _eval(sess, epoch, data_tag):
fetches = {
"predicts": predicts,
}
mode = tf.estimator.ModeKeys.EVAL
file_name = 'tmp/%s%d' % (data_tag, epoch)
writer = CoNLLWriter(i2w, i2n)
writer.start(file_name)
data = data_dev if data_tag == 'dev' else data_test
for batch in iterate_batch(data, config.batch_size, shuffle=False):
word, char, ner, mask, length = batch
feed_dict = {
inputs: word, chars: char, targets: ner, masks: mask, seq_lengths: length,
global_step: epoch, tx.global_mode(): mode,
}
rets = sess.run(fetches, feed_dict)
predictions = rets['predicts']
writer.write(word, predictions, ner, length)
writer.close()
acc, precision, recall, f1 = scores.scores(file_name)
print('%s acc: %.2f%%, precision: %.2f%%, recall: %.2f%%, F1: %.2f%%' % (data_tag, acc, precision, recall, f1))
return acc, precision, recall, f1
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
dev_f1 = 0.0
dev_acc = 0.0
dev_precision = 0.0
dev_recall = 0.0
best_epoch = 0
test_f1 = 0.0
test_acc = 0.0
test_prec = 0.0
test_recall = 0.0
tx.utils.maybe_create_dir('./tmp')
for epoch in range(config.num_epochs):
_train_epoch(sess, epoch)
acc, precision, recall, f1 = _eval(sess, epoch, 'dev')
if dev_f1 < f1:
dev_f1 = f1
dev_acc = acc
dev_precision = precision
dev_recall = recall
best_epoch = epoch
test_acc, test_prec, test_recall, test_f1 = _eval(sess, epoch, 'test')
print('best acc: %.2f%%, precision: %.2f%%, recall: %.2f%%, F1: %.2f%%, epoch: %d' % (dev_acc, dev_precision, dev_recall, dev_f1, best_epoch))
print('test acc: %.2f%%, precision: %.2f%%, recall: %.2f%%, F1: %.2f%%, epoch: %d' % (test_acc, test_prec, test_recall, test_f1, best_epoch))
print('---------------------------------------------------')
================================================
FILE: texar_repo/examples/sequence_tagging/scores.py
================================================
import subprocess
import sys
def scores(path):
bashCommand = 'perl conlleval'
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE, stdin=open(path))
output, error = process.communicate()
output = output.decode().split('\n')[1].split('%; ')
output = [out.split(' ')[-1] for out in output]
acc, prec, recall, fb1 = tuple(output)
return float(acc), float(prec), float(recall), float(fb1)
================================================
FILE: texar_repo/examples/text_style_transfer/README.md
================================================
# Text Style Transfer #
This example implements a simplified variant of the `ctrl-gen` model from
[Toward Controlled Generation of Text](https://arxiv.org/pdf/1703.00955.pdf)
*Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, Eric Xing; ICML 2017*
The model roughly has an architecture of `Encoder--Decoder--Classifier`. Compared to the paper, the following simplifications are made:
* Replaces the base Variational Autoencoder (VAE) model with an attentional Autoencoder (AE) -- VAE is not necessary in the text style transfer setting since we do not need to interpolate the latent space as in the paper.
* Attribute classifier (i.e., discriminator) is trained with real data only. Samples generated by the decoder are not used.
* Independency constraint is omitted.
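The data flow implied by this `Encoder--Decoder--Classifier` layout can be sketched as follows. This is a hypothetical, framework-free illustration (the function and argument names are made up); the real wiring in `ctrl_gen_model.py` builds it from Texar modules:

```python
# Hypothetical sketch of the Encoder--Decoder--Classifier data flow; the
# actual model in ctrl_gen_model.py builds this with Texar modules.
def transfer_step(sentence_ids, target_label, encoder, decoder, classifier):
    """Encode the content, decode conditioned on the *desired* attribute
    label, then let the attribute classifier score the generated sentence
    (that score drives the classification loss during full-training)."""
    content_z = encoder(sentence_ids)             # attribute-independent content code
    generated = decoder(content_z, target_label)  # re-generate with flipped attribute
    attr_score = classifier(generated)            # how well the attribute transferred
    return generated, attr_score

# Toy stand-ins, just to show the call shape:
gen, score = transfer_step(
    [1, 2, 3], 1,
    encoder=lambda ids: sum(ids),
    decoder=lambda z, label: [z, label],
    classifier=lambda sent: sent[-1])
```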
## Usage ##
### Dataset ###
Download the Yelp sentiment dataset with the following command:
```
python prepare_data.py
```
### Train the model ###
Train the model on the above data to do sentiment transfer.
```
python main.py --config config
```
[config.py](./config.py) contains the data and model configurations.
* The model will first be pre-trained for a few epochs (specified in `config.py`). During pre-training, the `Encoder-Decoder` part is trained as an autoencoder, while the `Classifier` part is trained with the classification labels.
* Full-training is then performed for another few epochs. During full-training, the `Classifier` part is fixed, and the `Encoder-Decoder` part is trained to fit the classifier, along with continuing to minimize the autoencoding loss.
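The two-phase schedule above can be sketched as a small helper. The parameter names mirror `config.py` (`pretrain_nepochs`, `lambda_g`, `gamma_decay`), but the exact anneal formula here is an illustrative assumption; the real schedule lives in `main.py`:

```python
# Illustrative two-phase schedule; pretrain_nepochs / lambda_g / gamma_decay
# mirror config.py, but the anneal formula itself is an assumption.
def phase_params(epoch, pretrain_nepochs=10, lambda_g=0.1, gamma_decay=0.5):
    """Return (lambda_g, gamma) for the given epoch."""
    if epoch < pretrain_nepochs:
        # Pre-training: pure autoencoding, classifier loss switched off.
        return 0.0, 1.0
    # Full-training: enable the classifier loss, anneal the Gumbel-softmax
    # temperature toward (but not below) a small floor.
    gamma = max(gamma_decay ** (epoch - pretrain_nepochs + 1), 0.001)
    return lambda_g, gamma

print(phase_params(0))   # (0.0, 1.0) -- matches "gamma: 1.0, lambda_g: 0.0"
print(phase_params(10))  # (0.1, 0.5) -- first full-training epoch
```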
Training log is printed as below:
```
gamma: 1.0, lambda_g: 0.0
step: 1, loss_d: 0.6903 accu_d: 0.5625
step: 1, loss_g_clas: 0.6991 loss_g: 9.1452 accu_g: 0.2812 loss_g_ae: 9.1452 accu_g_gdy: 0.2969
step: 500, loss_d: 0.0989 accu_d: 0.9688
step: 500, loss_g_clas: 0.2985 loss_g: 3.9696 accu_g: 0.8891 loss_g_ae: 3.9696 accu_g_gdy: 0.7734
...
step: 6500, loss_d: 0.0806 accu_d: 0.9703
step: 6500, loss_g_clas: 5.7137 loss_g: 0.2887 accu_g: 0.0844 loss_g_ae: 0.2887 accu_g_gdy: 0.0625
epoch: 1, loss_d: 0.0876 accu_d: 0.9719
epoch: 1, loss_g_clas: 6.7360 loss_g: 0.2195 accu_g: 0.0627 loss_g_ae: 0.2195 accu_g_gdy: 0.0642
val: accu_g: 0.0445 loss_g_ae: 0.1302 accu_d: 0.9774 bleu: 90.7896 loss_g: 0.1302 loss_d: 0.0666 loss_g_clas: 7.0310 accu_g_gdy: 0.0482
...
```
where:
- `loss_d` and `accu_d` are the classification loss/accuracy of the `Classifier` part.
- `loss_g_clas` is the classification loss of the generated sentences.
- `loss_g_ae` is the autoencoding loss.
- `loss_g` is the joint loss `= loss_g_ae + lambda_g * loss_g_clas`.
- `accu_g` is the classification accuracy of the generated sentences with soft representations (i.e., Gumbel-softmax).
- `accu_g_gdy` is the classification accuracy of the generated sentences with greedy decoding.
- `bleu` is the BLEU score between the generated and input sentences.
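The relation between the logged generator losses can be checked directly. Note that `lambda_g` is `0.0` during pre-training, which is why `loss_g` equals `loss_g_ae` in the log lines above:

```python
# loss_g = loss_g_ae + lambda_g * loss_g_clas, per the definitions above.
def joint_loss(loss_g_ae, loss_g_clas, lambda_g):
    return loss_g_ae + lambda_g * loss_g_clas

# During pre-training lambda_g is 0.0, so loss_g collapses to loss_g_ae,
# consistent with the step-1 log line above.
assert joint_loss(9.1452, 0.6991, 0.0) == 9.1452
```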
## Results ##
Text style transfer has two primary goals:
1. The generated sentence should have the desired attribute (e.g., positive/negative sentiment)
2. The generated sentence should keep the content of the original one
We use automatic metrics to evaluate both:
* For (1), we can use a pre-trained classifier to classify the generated sentences and evaluate the accuracy (the higher the better). This code does not include a stand-alone classifier for evaluation, though adding one would be straightforward; the `Classifier` part of the model gives a reasonably good estimate of this accuracy (i.e., `accu_g_gdy` in the above).
* For (2), we evaluate the BLEU score between the generated sentences and the original sentences, i.e., `bleu` in the above (the higher the better; see [Yang et al., 2018](https://arxiv.org/pdf/1805.11749.pdf) for more details).
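As a rough, dependency-free illustration of goal (2), a unigram-precision proxy for content preservation can be computed on one of the sample pairs from the Samples section below. Real BLEU additionally counts higher-order n-grams and applies a brevity penalty, so this is only a sketch:

```python
# Dependency-free unigram-precision proxy for content preservation; the
# example itself reports real BLEU (which also uses higher-order n-grams).
from collections import Counter

def unigram_precision(reference, candidate):
    ref = Counter(reference.split())
    cand = Counter(candidate.split())
    # Clipped unigram overlap, as in BLEU's unigram term.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    return overlap / max(sum(cand.values()), 1)

orig = "this was the best dining experience i have ever had"
gen = "this was the worst dining experience i have ever had"
print(unigram_precision(orig, gen))  # 0.9 -- one word of ten changed
```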
The implementation here gives the following performance after 10 epochs of pre-training and 2 epochs of full-training:
| Accuracy (by the `Classifier` part) | BLEU (with the original sentence) |
| -------------------------------------| ----------------------------------|
| 0.92 | 54.0 |
Also refer to the following papers that used this code and compared to other text style transfer approaches:
* [Unsupervised Text Style Transfer using Language Models as Discriminators](https://papers.nips.cc/paper/7959-unsupervised-text-style-transfer-using-language-models-as-discriminators.pdf). Zichao Yang, Zhiting Hu, Chris Dyer, Eric Xing, Taylor Berg-Kirkpatrick. NeurIPS 2018
* [Structured Content Preservation for Unsupervised Text Style Transfer](https://arxiv.org/pdf/1810.06526.pdf). Youzhi Tian, Zhiting Hu, Zhou Yu. 2018
### Samples ###
Here are some randomly-picked samples. In each pair, the first sentence is the original and the second is the generated one.
```
go to place for client visits with gorgeous views .
go to place for client visits with lacking views .
there was lots of people but they still managed to provide great service .
there was lots of people but they still managed to provide careless service .
this was the best dining experience i have ever had .
this was the worst dining experience i have ever had .
needless to say , we skipped desert .
gentle to say , we edgy desert .
the first time i was missing an entire sandwich and a side of fries .
the first time i was beautifully an entire sandwich and a side of fries .
her boutique has a fabulous selection of designer brands !
her annoying has a sketchy selection of bland warned !
service is pretty good .
service is trashy rude .
ok nothing new .
exceptional impressed new .
```
================================================
FILE: texar_repo/examples/text_style_transfer/config.py
================================================
"""Config
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name
import copy
max_nepochs = 12 # Total number of training epochs
# (including pre-train and full-train)
pretrain_nepochs = 10 # Number of pre-train epochs (training as autoencoder)
display = 500 # Display the training results every N training steps.
display_eval = 1e10 # Display the dev results every N training steps (set to a
# very large value to disable it).
sample_path = './samples'
checkpoint_path = './checkpoints'
restore = '' # Model snapshot to restore from
lambda_g = 0.1 # Weight of the classification loss
gamma_decay = 0.5 # Gumbel-softmax temperature anneal rate
train_data = {
'batch_size': 64,
#'seed': 123,
'datasets': [
{
'files': './data/yelp/sentiment.train.text',
'vocab_file': './data/yelp/vocab',
'data_name': ''
},
{
'files': './data/yelp/sentiment.train.labels',
'data_type': 'int',
'data_name': 'labels'
}
],
'name': 'train'
}
val_data = copy.deepcopy(train_data)
val_data['datasets'][0]['files'] = './data/yelp/sentiment.dev.text'
val_data['datasets'][1]['files'] = './data/yelp/sentiment.dev.labels'
test_data = copy.deepcopy(train_data)
test_data['datasets'][0]['files'] = './data/yelp/sentiment.test.text'
test_data['datasets'][1]['files'] = './data/yelp/sentiment.test.labels'
model = {
'dim_c': 200,
'dim_z': 500,
'embedder': {
'dim': 100,
},
'encoder': {
'rnn_cell': {
'type': 'GRUCell',
'kwargs': {
'num_units': 700
},
'dropout': {
'input_keep_prob': 0.5
}
}
},
'decoder': {
'rnn_cell': {
'type': 'GRUCell',
'kwargs': {
'num_units': 700,
},
'dropout': {
'input_keep_prob': 0.5,
'output_keep_prob': 0.5
},
},
'attention': {
'type': 'BahdanauAttention',
'kwargs': {
'num_units': 700,
},
'attention_layer_size': 700,
},
'max_decoding_length_train': 21,
'max_decoding_length_infer': 20,
},
'classifier': {
'kernel_size': [3, 4, 5],
'filters': 128,
'other_conv_kwargs': {'padding': 'same'},
'dropout_conv': [1],
'dropout_rate': 0.5,
'num_dense_layers': 0,
'num_classes': 1
},
'opt': {
'optimizer': {
'type': 'AdamOptimizer',
'kwargs': {
'learning_rate': 5e-4,
},
},
},
}
================================================
FILE: texar_repo/examples/text_style_transfer/ctrl_gen_model.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Text style transfer
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, too-many-locals
import tensorflow as tf
import texar as tx
from texar.modules import WordEmbedder, UnidirectionalRNNEncoder, \
MLPTransformConnector, AttentionRNNDecoder, \
GumbelSoftmaxEmbeddingHelper, Conv1DClassifier
from texar.core import get_train_op
from texar.utils import collect_trainable_variables, get_batch_size
class CtrlGenModel(object):
"""Control
"""
def __init__(self, inputs, vocab, gamma, lambda_g, hparams=None):
self._hparams = tx.HParams(hparams, None)
self._build_model(inputs, vocab, gamma, lambda_g)
def _build_model(self, inputs, vocab, gamma, lambda_g):
"""Builds the model.
"""
embedder = WordEmbedder(
vocab_size=vocab.size,
hparams=self._hparams.embedder)
encoder = UnidirectionalRNNEncoder(hparams=self._hparams.encoder)
# text_ids for encoder, with BOS token removed
enc_text_ids = inputs['text_ids'][:, 1:]
enc_outputs, final_state = encoder(embedder(enc_text_ids),
sequence_length=inputs['length']-1)
z = final_state[:, self._hparams.dim_c:]
# Encodes label
label_connector = MLPTransformConnector(self._hparams.dim_c)
# Gets the sentence representation: h = (c, z)
labels = tf.to_float(tf.reshape(inputs['labels'], [-1, 1]))
c = label_connector(labels)
c_ = label_connector(1 - labels)
h = tf.concat([c, z], 1)
h_ = tf.concat([c_, z], 1)
# Teacher-force decoding and the auto-encoding loss for G
decoder = AttentionRNNDecoder(
memory=enc_outputs,
memory_sequence_length=inputs['length']-1,
cell_input_fn=lambda inputs, attention: inputs,
vocab_size=vocab.size,
hparams=self._hparams.decoder)
connector = MLPTransformConnector(decoder.state_size)
g_outputs, _, _ = decoder(
initial_state=connector(h), inputs=inputs['text_ids'],
embedding=embedder, sequence_length=inputs['length']-1)
loss_g_ae = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=inputs['text_ids'][:, 1:],
logits=g_outputs.logits,
sequence_length=inputs['length']-1,
average_across_timesteps=True,
sum_over_timesteps=False)
# Gumbel-softmax decoding, used in training
start_tokens = tf.ones_like(inputs['labels']) * vocab.bos_token_id
end_token = vocab.eos_token_id
gumbel_helper = GumbelSoftmaxEmbeddingHelper(
embedder.embedding, start_tokens, end_token, gamma)
soft_outputs_, _, soft_length_, = decoder(
helper=gumbel_helper, initial_state=connector(h_))
# Greedy decoding, used in eval
outputs_, _, length_ = decoder(
decoding_strategy='infer_greedy', initial_state=connector(h_),
embedding=embedder, start_tokens=start_tokens, end_token=end_token)
# Creates classifier
classifier = Conv1DClassifier(hparams=self._hparams.classifier)
clas_embedder = WordEmbedder(vocab_size=vocab.size,
hparams=self._hparams.embedder)
# Classification loss for the classifier
clas_logits, clas_preds = classifier(
inputs=clas_embedder(ids=inputs['text_ids'][:, 1:]),
sequence_length=inputs['length']-1)
loss_d_clas = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.to_float(inputs['labels']), logits=clas_logits)
loss_d_clas = tf.reduce_mean(loss_d_clas)
accu_d = tx.evals.accuracy(labels=inputs['labels'], preds=clas_preds)
# Classification loss for the generator, based on soft samples
soft_logits, soft_preds = classifier(
inputs=clas_embedder(soft_ids=soft_outputs_.sample_id),
sequence_length=soft_length_)
loss_g_clas = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.to_float(1-inputs['labels']), logits=soft_logits)
loss_g_clas = tf.reduce_mean(loss_g_clas)
# Accuracy on soft samples, for training progress monitoring
accu_g = tx.evals.accuracy(labels=1-inputs['labels'], preds=soft_preds)
# Accuracy on greedy-decoded samples, for training progress monitoring
_, gdy_preds = classifier(
inputs=clas_embedder(ids=outputs_.sample_id),
sequence_length=length_)
accu_g_gdy = tx.evals.accuracy(
labels=1-inputs['labels'], preds=gdy_preds)
# Aggregates losses
loss_g = loss_g_ae + lambda_g * loss_g_clas
loss_d = loss_d_clas
# Creates optimizers
g_vars = collect_trainable_variables(
[embedder, encoder, label_connector, connector, decoder])
d_vars = collect_trainable_variables([clas_embedder, classifier])
train_op_g = get_train_op(
loss_g, g_vars, hparams=self._hparams.opt)
train_op_g_ae = get_train_op(
loss_g_ae, g_vars, hparams=self._hparams.opt)
train_op_d = get_train_op(
loss_d, d_vars, hparams=self._hparams.opt)
# Interface tensors
self.losses = {
"loss_g": loss_g,
"loss_g_ae": loss_g_ae,
"loss_g_clas": loss_g_clas,
"loss_d": loss_d_clas
}
self.metrics = {
"accu_d": accu_d,
"accu_g": accu_g,
"accu_g_gdy": accu_g_gdy,
}
self.train_ops = {
"train_op_g": train_op_g,
"train_op_g_ae": train_op_g_ae,
"train_op_d": train_op_d
}
self.samples = {
"original": inputs['text_ids'][:, 1:],
"transferred": outputs_.sample_id
}
self.fetches_train_g = {
"loss_g": self.train_ops["train_op_g"],
"loss_g_ae": self.losses["loss_g_ae"],
"loss_g_clas": self.losses["loss_g_clas"],
"accu_g": self.metrics["accu_g"],
"accu_g_gdy": self.metrics["accu_g_gdy"],
}
self.fetches_train_d = {
"loss_d": self.train_ops["train_op_d"],
"accu_d": self.metrics["accu_d"]
}
fetches_eval = {"batch_size": get_batch_size(inputs['text_ids'])}
fetches_eval.update(self.losses)
fetches_eval.update(self.metrics)
fetches_eval.update(self.samples)
self.fetches_eval = fetches_eval
================================================
FILE: texar_repo/examples/text_style_transfer/main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Text style transfer
This is a simplified implementation of:
Toward Controlled Generation of Text, ICML2017
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, Eric Xing
Download the data with the cmd:
$ python prepare_data.py
Train the model with the cmd:
$ python main.py --config config
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=invalid-name, too-many-locals, too-many-arguments, no-member
import os
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
from ctrl_gen_model import CtrlGenModel
flags = tf.flags
flags.DEFINE_string('config', 'config', 'The config to use.')
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
# Data
train_data = tx.data.MultiAlignedData(config.train_data)
val_data = tx.data.MultiAlignedData(config.val_data)
test_data = tx.data.MultiAlignedData(config.test_data)
vocab = train_data.vocab(0)
# Each training batch is used twice: once for updating the generator and
# once for updating the discriminator. Feedable data iterator is used for
# such case.
iterator = tx.data.FeedableDataIterator(
{'train_g': train_data, 'train_d': train_data,
'val': val_data, 'test': test_data})
batch = iterator.get_next()
# Model
gamma = tf.placeholder(dtype=tf.float32, shape=[], name='gamma')
lambda_g = tf.placeholder(dtype=tf.float32, shape=[], name='lambda_g')
model = CtrlGenModel(batch, vocab, gamma, lambda_g, config.model)
def _train_epoch(sess, gamma_, lambda_g_, epoch, verbose=True):
avg_meters_d = tx.utils.AverageRecorder(size=10)
avg_meters_g = tx.utils.AverageRecorder(size=10)
step = 0
while True:
try:
step += 1
feed_dict = {
iterator.handle: iterator.get_handle(sess, 'train_d'),
gamma: gamma_,
lambda_g: lambda_g_
}
vals_d = sess.run(model.fetches_train_d, feed_dict=feed_dict)
avg_meters_d.add(vals_d)
feed_dict = {
iterator.handle: iterator.get_handle(sess, 'train_g'),
gamma: gamma_,
lambda_g: lambda_g_
}
vals_g = sess.run(model.fetches_train_g, feed_dict=feed_dict)
avg_meters_g.add(vals_g)
if verbose and (step == 1 or step % config.display == 0):
print('step: {}, {}'.format(step, avg_meters_d.to_str(4)))
print('step: {}, {}'.format(step, avg_meters_g.to_str(4)))
if verbose and step % config.display_eval == 0:
iterator.restart_dataset(sess, 'val')
_eval_epoch(sess, gamma_, lambda_g_, epoch)
except tf.errors.OutOfRangeError:
print('epoch: {}, {}'.format(epoch, avg_meters_d.to_str(4)))
print('epoch: {}, {}'.format(epoch, avg_meters_g.to_str(4)))
break
def _eval_epoch(sess, gamma_, lambda_g_, epoch, val_or_test='val'):
avg_meters = tx.utils.AverageRecorder()
while True:
try:
feed_dict = {
iterator.handle: iterator.get_handle(sess, val_or_test),
gamma: gamma_,
lambda_g: lambda_g_,
tx.context.global_mode(): tf.estimator.ModeKeys.EVAL
}
vals = sess.run(model.fetches_eval, feed_dict=feed_dict)
batch_size = vals.pop('batch_size')
# Computes BLEU
samples = tx.utils.dict_pop(vals, list(model.samples.keys()))
hyps = tx.utils.map_ids_to_strs(samples['transferred'], vocab)
refs = tx.utils.map_ids_to_strs(samples['original'], vocab)
refs = np.expand_dims(refs, axis=1)
bleu = tx.evals.corpus_bleu_moses(refs, hyps)
vals['bleu'] = bleu
avg_meters.add(vals, weight=batch_size)
# Writes samples
tx.utils.write_paired_text(
refs.squeeze(), hyps,
os.path.join(config.sample_path, 'val.%d'%epoch),
append=True, mode='v')
except tf.errors.OutOfRangeError:
print('{}: {}'.format(
val_or_test, avg_meters.to_str(precision=4)))
break
return avg_meters.avg()
tf.gfile.MakeDirs(config.sample_path)
tf.gfile.MakeDirs(config.checkpoint_path)
# Runs the logics
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
saver = tf.train.Saver(max_to_keep=None)
if config.restore:
print('Restore from: {}'.format(config.restore))
saver.restore(sess, config.restore)
iterator.initialize_dataset(sess)
gamma_ = 1.
lambda_g_ = 0.
for epoch in range(1, config.max_nepochs+1):
if epoch > config.pretrain_nepochs:
# Anneals the gumbel-softmax temperature
gamma_ = max(0.001, gamma_ * config.gamma_decay)
lambda_g_ = config.lambda_g
print('gamma: {}, lambda_g: {}'.format(gamma_, lambda_g_))
# Train
iterator.restart_dataset(sess, ['train_g', 'train_d'])
_train_epoch(sess, gamma_, lambda_g_, epoch)
# Val
iterator.restart_dataset(sess, 'val')
_eval_epoch(sess, gamma_, lambda_g_, epoch, 'val')
saver.save(
sess, os.path.join(config.checkpoint_path, 'ckpt'), epoch)
# Test
iterator.restart_dataset(sess, 'test')
_eval_epoch(sess, gamma_, lambda_g_, epoch, 'test')
if __name__ == '__main__':
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/text_style_transfer/prepare_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Downloads data.
"""
import texar as tx
# pylint: disable=invalid-name
def prepare_data():
"""Downloads data.
"""
tx.data.maybe_download(
urls='https://drive.google.com/file/d/'
'1HaUKEYDBEk6GlJGmXwqYteB-4rS9q8Lg/view?usp=sharing',
path='./',
filenames='yelp.zip',
extract=True)
def main():
"""Entrypoint.
"""
prepare_data()
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/torchtext/.gitignore
================================================
.data/
.vector_cache/
================================================
FILE: texar_repo/examples/torchtext/README.md
================================================
# Data loading with torchtext #
This example demonstrates the use of [torchtext](https://github.com/pytorch/text) package as data loader for Texar models.
## Usage ##
The following command trains a small-sized language model on PTB:
```
python lm_torchtext.py --config config_small
```
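The `BatchFirstBPTTIterator` defined in `batchfirst_bptt.py` below yields contiguous `[batch_size, bptt_len]` text windows whose targets are shifted one timestep forward. As a minimal sketch (plain Python, no torchtext; the token-id stream and sizes are toy values, not from the example):

```python
# Sketch of batch-first BPTT windowing: split a token stream into
# batch_size contiguous rows, then slide windows of bptt_len over them,
# with targets offset by one timestep.
stream = list(range(20))          # toy token ids standing in for text
batch_size, bptt_len = 2, 4

per_row = len(stream) // batch_size
rows = [stream[i * per_row:(i + 1) * per_row] for i in range(batch_size)]

batches = []
for i in range(0, per_row - bptt_len, bptt_len):
    text = [row[i:i + bptt_len] for row in rows]            # inputs
    target = [row[i + 1:i + 1 + bptt_len] for row in rows]  # next tokens
    batches.append((text, target))

# batches[0] pairs [0,1,2,3] with targets [1,2,3,4] for the first row.
```

The real iterator additionally pads the stream so it divides evenly into `batch_size` rows; this sketch drops the ragged tail instead.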
================================================
FILE: texar_repo/examples/torchtext/batchfirst_bptt.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from torchtext.data import BPTTIterator, Dataset, Batch
class BatchFirstBPTTIterator(BPTTIterator):
"""Defines an iterator for language modeling tasks that use BPTT.
Provides contiguous streams of examples together with targets that are
one timestep further forward, for language modeling training with
backpropagation through time (BPTT). Expects a Dataset with a single
example and a single field called 'text' and produces Batches with text and
target attributes.
All batches will have sizes [batch_size, bptt_len]
Attributes:
dataset: The Dataset object to load Examples from.
batch_size: Batch size.
bptt_len: Length of sequences for backpropagation through time.
sort_key: A key to use for sorting examples in order to batch together
examples with similar lengths and minimize padding. The sort_key
provided to the Iterator constructor overrides the sort_key
attribute of the Dataset, or defers to it if None.
train: Whether the iterator represents a train set.
repeat: Whether to repeat the iterator for multiple epochs.
shuffle: Whether to shuffle examples between epochs.
sort: Whether to sort examples according to self.sort_key.
Note that repeat, shuffle, and sort default to train, train, and
(not train).
device: Device to create batches on. Use -1 for CPU and None for the
currently active GPU device.
"""
def __len__(self):
return math.floor(
(len(self.dataset[0].text) / self.batch_size - 1) / self.bptt_len)
def __iter__(self):
text = self.dataset[0].text
TEXT = self.dataset.fields['text']
TEXT.eos_token = None
pad_num = int(math.ceil(len(text) / self.batch_size) * self.batch_size \
- len(text))
text = text + ([TEXT.pad_token] * pad_num)
data = TEXT.numericalize([text], device=self.device)
data = data.view(self.batch_size, -1).contiguous()
dataset = Dataset(examples=self.dataset.examples,
fields=[('text', TEXT), ('target', TEXT)])
while True:
for i in range(0, len(self) * self.bptt_len, self.bptt_len):
self.iterations += 1
seq_len = self.bptt_len
yield Batch.fromvars(
dataset, self.batch_size,
text=data[:, i:i + seq_len],
target=data[:, i + 1:i + 1 + seq_len])
if not self.repeat:
return
================================================
FILE: texar_repo/examples/torchtext/config_small.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PTB LM small size config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
init_scale = 0.1
num_epochs = 13
hidden_size = 200
keep_prob = 1.0
batch_size = 20
num_steps = 20
cell = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": keep_prob},
"num_layers": 2
}
emb = {
"dim": hidden_size
}
opt = {
"optimizer": {
"type": "GradientDescentOptimizer",
"kwargs": {"learning_rate": 1.0}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
},
"learning_rate_decay": {
"type": "exponential_decay",
"kwargs": {
"decay_steps": 1,
"decay_rate": 0.5,
"staircase": True
},
"start_decay_step": 3
}
}
================================================
FILE: texar_repo/examples/torchtext/lm_torchtext.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Language Modeling example using torchtext
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import importlib
import numpy as np
import tensorflow as tf
import texar as tx
from torchtext import data
from torchtext import datasets
from batchfirst_bptt import BatchFirstBPTTIterator
# pylint: disable=invalid-name, too-many-locals, no-member
flags = tf.flags
flags.DEFINE_string("data_path", "./",
"Directory containing PTB raw data (e.g., ptb.train.txt). "
"E.g., ./simple-examples/data. If not exists, "
"the directory will be created and PTB raw data will "
"be downloaded.")
flags.DEFINE_string("config", "config_small", "The config to use.")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def _main(_):
# Data
batch_size = config.batch_size
num_steps = config.num_steps
# setup vocabulary and data iterators with torchtext
TEXT = data.Field()
# make splits for data
train, valid, test = datasets.PennTreebank.splits(TEXT)
# build the vocabulary
TEXT.build_vocab(train, vectors=None)
vocab_size = len(TEXT.vocab)
# make iterator for splits
train_iter, valid_iter, test_iter = BatchFirstBPTTIterator.splits(
(train, valid, test), batch_size=batch_size, bptt_len=num_steps,
repeat=False)
inputs = tf.placeholder(tf.int32, [batch_size, num_steps])
targets = tf.placeholder(tf.int32, [batch_size, num_steps])
# Model architecture
initializer = tf.random_uniform_initializer(
-config.init_scale, config.init_scale)
with tf.variable_scope("model", initializer=initializer):
embedder = tx.modules.WordEmbedder(
vocab_size=vocab_size, hparams=config.emb)
emb_inputs = embedder(inputs)
if config.keep_prob < 1:
emb_inputs = tf.nn.dropout(
emb_inputs, tx.utils.switch_dropout(config.keep_prob))
decoder = tx.modules.BasicRNNDecoder(
vocab_size=vocab_size, hparams={"rnn_cell": config.cell})
initial_state = decoder.zero_state(batch_size, tf.float32)
outputs, final_state, seq_lengths = decoder(
decoding_strategy="train_greedy",
impute_finished=True,
inputs=emb_inputs,
sequence_length=[num_steps] * batch_size,
initial_state=initial_state)
# Losses & train ops
mle_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=targets,
logits=outputs.logits,
sequence_length=seq_lengths)
# Use global_step to pass epoch, for lr decay
global_step = tf.placeholder(tf.int32)
train_op = tx.core.get_train_op(
mle_loss, global_step=global_step, increment_global_step=False,
hparams=config.opt)
def _run_epoch(sess, data_iter, epoch, is_train=False, verbose=False):
start_time = time.time()
loss = 0.
iters = 0
state = sess.run(initial_state)
fetches = {
"mle_loss": mle_loss,
"final_state": final_state,
}
if is_train:
fetches["train_op"] = train_op
mode = (tf.estimator.ModeKeys.TRAIN
if is_train
else tf.estimator.ModeKeys.EVAL)
epoch_size = (len(train) // batch_size - 1) // num_steps
for step, data_batch in enumerate(data_iter):
feed_dict = {
inputs: data_batch.text,
targets: data_batch.target,
global_step: epoch,
tx.global_mode(): mode,
}
for i, (c, h) in enumerate(initial_state):
feed_dict[c] = state[i].c
feed_dict[h] = state[i].h
rets = sess.run(fetches, feed_dict)
loss += rets["mle_loss"]
state = rets["final_state"]
iters += num_steps
ppl = np.exp(loss / iters)
if verbose and step % (epoch_size // 10) == 10:
print("%.3f perplexity: %.3f speed: %.0f wps" %
(step * 1.0 / epoch_size, ppl,
iters * batch_size / (time.time() - start_time)))
ppl = np.exp(loss / iters)
return ppl
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
for epoch in range(config.num_epochs):
# Train
train_ppl = _run_epoch(
sess, train_iter, epoch, is_train=True, verbose=True)
print("Epoch: %d Train Perplexity: %.3f" % (epoch, train_ppl))
# Valid
valid_ppl = _run_epoch(sess, valid_iter, epoch)
print("Epoch: %d Valid Perplexity: %.3f" % (epoch, valid_ppl))
# Test
test_ppl = _run_epoch(sess, test_iter, 0)
print("Test Perplexity: %.3f" % (test_ppl))
if __name__ == '__main__':
tf.app.run(main=_main)
================================================
FILE: texar_repo/examples/torchtext/requirements.txt
================================================
# also make sure to install PyTorch 0.4.0 or newer.
torchtext >= 0.2.3
================================================
FILE: texar_repo/examples/transformer/README.md
================================================
# Transformer for Machine Translation #
This is an implementation of the Transformer model described in [Vaswani, Ashish, et al. "Attention is all you need."](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf).
[Quick Start](https://github.com/asyml/texar/tree/master/examples/transformer#quick-start): Prerequisites & use on machine translation datasets
[Run Your Customized Experiments](https://github.com/asyml/texar/tree/master/examples/transformer#run-your-customized-experiments): Hands-on tutorial of data preparation, configuration, and model training/test
## Quick Start ##
### Prerequisites ###
Run the following cmd to install necessary packages for the example:
```
pip install -r requirements.txt
```
### Datasets ###
Two example datasets are provided:
- IWSLT'15 **EN-VI** for English-Vietnamese translation
- WMT'14 **EN-DE** for English-German translation
Download and pre-process the **IWSLT'15 EN-VI** data with the following cmds:
```
sh scripts/iwslt15_en_vi.sh
sh preprocess_data.sh spm en vi
```
By default, the downloaded dataset is in `./data/en_vi`.
As with the [official implementation](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py), `spm` (`sentencepiece`) encoding is used to encode the raw text as data pre-processing. The encoded data is by default in `./temp/run_en_vi_spm`.
For the **WMT'14 EN-DE** data, download and pre-process with:
```
sh scripts/wmt14_en_de.sh
sh preprocess_data.sh bpe en de
```
By default, the downloaded dataset is in `./data/en_de`.
Note that for this dataset, `bpe` (byte pair encoding) is used instead. The encoded data is by default in `./temp/run_en_de_bpe`.
### Train and evaluate the model ###
Train the model with the cmd:
```
python transformer_main.py --run_mode=train_and_evaluate --config_model=config_model --config_data=config_iwslt15
```
* Specify `--model_dir` to dump model checkpoints, training logs, and tensorboard summaries to a desired directory. By default it is set to `./outputs`.
* Specifying `--model_dir` will also restore the latest model checkpoint under the directory, if any checkpoint is there.
* Specify `--config_data=config_wmt14` to train on the WMT'14 data.
### Test a trained model ###
To only evaluate a model checkpoint without training, first load the checkpoint and generate samples:
```
python transformer_main.py --run_mode=test --config_data=config_iwslt15 --model_dir=./outputs
```
The latest checkpoint in `./outputs` is used. Generated samples are in the file `./outputs/test.output.hyp`, and reference sentences are in the file `./outputs/test.output.ref`.
Next, decode the samples with respective decoder, and evaluate with `bleu_tool`:
```
../../bin/utils/spm_decode --infile ./outputs/test.output.hyp --outfile temp/test.output.spm --model temp/run_en_vi_spm/data/spm-codes.32000.model --input_format=piece
python bleu_tool.py --reference=data/en_vi/test.vi --translation=temp/test.output.spm
```
For WMT'14, the corresponding cmds are:
```
# Loads model and generates samples
python transformer_main.py --run_mode=test --config_data=config_wmt14 --log_dir=./outputs
# BPE decoding
cat outputs/test.output.hyp | sed -E 's/(@@ )|(@@ ?$)//g' > temp/test.output.bpe
# Evaluates BLEU
python bleu_tool.py --reference=data/en_de/test.de --translation=temp/test.output.bpe
```
### Results
* On IWSLT'15, the implementation achieves around `BLEU_cased=28.54` and `BLEU_uncased=29.30` (by [bleu_tool.py](./bleu_tool.py)), which are comparable to the base_single_gpu results by the [official implementation](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py) (`28.12` and `28.97`, respectively, as reported [here](https://github.com/tensorflow/tensor2tensor/pull/611)).
* On WMT'14, the implementation achieves around `BLEU_cased=25.12` (setting: base_single_gpu, batch_size=3072).
### Example training log
```
12:02:02,686:INFO:step:500 loss: 7.3735
12:04:20,035:INFO:step:1000 loss:6.1502
12:06:37,550:INFO:step:1500 loss:5.4877
```
Using an Nvidia GTX 1080Ti, the model usually converges within 5 hours (~15 epochs) on IWSLT'15.
---
## Run Your Customized Experiments
Here is a hands-on tutorial on running the Transformer with your own customized dataset.
### 1. Prepare raw data
Create a data directory and put the raw data in the directory. To be compatible with the data preprocessing in the next step, you may follow the convention below:
* The data directory should be named `data/${src}_${tgt}/`. For example, for the data downloaded with `scripts/iwslt15_en_vi.sh`, the data directory is `data/en_vi`.
* The raw data should have 6 files, which contain source and target sentences of training/dev/test sets, respectively. In the `iwslt15_en_vi` example, `data/en_vi/train.en` contains the source sentences of the training set, where each line is a sentence. Other files are `train.vi`, `dev.en`, `dev.vi`, `test.en`, `test.vi`.
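The convention above can be sketched with a hypothetical French-English experiment (`src=fr`, `tgt=en`; the language pair is an illustrative assumption, not one of the provided datasets):

```shell
# Create the expected layout: data/${src}_${tgt}/ containing six parallel
# files, one sentence per line, for the train/dev/test splits.
mkdir -p data/fr_en
for split in train dev test; do
  for lang in fr en; do
    touch "data/fr_en/${split}.${lang}"   # placeholder empty files
  done
done
ls data/fr_en
```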
### 2. Preprocess the data
To obtain the processed dataset, run
```
preprocess_data.sh ${encoder} ${src} ${tgt} ${vocab_size} ${max_seq_length}
```
where
* The `encoder` parameter can be `bpe` (byte pair encoding), `spm` (SentencePiece encoding), or `raw` (no subword encoding).
* `vocab_size` is optional. The default is 32000.
- At this point, this parameter is used only when `encoder` is set to `bpe` or `spm`. For `raw` encoding, you'd have to truncate the vocabulary by yourself.
- For `spm` encoding, the preprocessing may fail (due to the Python sentencepiece module) if `vocab_size` is too large. So you may want to try smaller `vocab_size` if it happens.
* `max_seq_length` is optional. The default is 70.
In the `iwslt15_en_vi` example, the cmd is `sh preprocess_data.sh spm en vi`.
By default, the preprocessed data are dumped under `temp/run_${src}_${tgt}_${encoder}`. In the `iwslt15_en_vi` example, the directory is `temp/run_en_vi_spm`.
If you choose to use `raw` encoding method, notice that:
- By default, the word embedding layer is built with the combination of source vocabulary and target vocabulary. For example, if the source vocabulary is of size 3K and the target vocabulary of size 3K and there is no overlap between the two vocabularies, then the final vocabulary used in the model is of size 6K.
- By default, the final output layer of transformer decoder (hidden_state -> logits) shares the parameters with the word embedding layer.
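The two defaults above (a joint source-target vocabulary and a tied output layer) can be sketched as follows; the toy vocabularies and NumPy stand-in for the decoder are illustrative assumptions, not the example's actual implementation:

```python
# Sketch: with `raw` encoding, source and target vocabularies are merged,
# and the output projection reuses the embedding matrix (weight tying).
import numpy as np

src_vocab = {"the", "cat", "sat"}     # toy 3-word source vocabulary
tgt_vocab = {"le", "chat", "assis"}   # toy 3-word target vocabulary

# Union of the two vocabularies: 6 entries when there is no overlap.
joint_vocab = sorted(src_vocab | tgt_vocab)

dim = 4
embedding = np.random.randn(len(joint_vocab), dim)  # shared embedding table

# Tied output layer: logits come from the same matrix, transposed.
hidden_state = np.random.randn(2, dim)   # batch of 2 decoder hidden states
logits = hidden_state @ embedding.T      # shape (2, len(joint_vocab))
```

Weight tying halves the parameter count of the embedding/output pair and is the default here; untying them would require a separate projection matrix.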
### 3. Specify data and model configuration
Customize the Python configuration files to configure the model and data.
Please refer to the example configuration files `config_model.py` for model configuration and `config_iwslt15.py` for data configuration.
### 4. Train the model
Train the model with the following cmd:
```
python transformer_main.py --run_mode=train_and_evaluate --config_model=custom_config_model --config_data=custom_config_data
```
where the model and data configuration files are `custom_config_model.py` and `custom_config_data.py`, respectively.
Outputs such as model checkpoints are by default under `outputs/`.
### 5. Test the model
Test with the following cmd:
```
python transformer_main.py --run_mode=test --config_data=custom_config_data --model_dir=./outputs
```
Generated samples on the test set are in `outputs/test.output.hyp`, and reference sentences are in `outputs/test.output.ref`. If you've used `bpe` or `spm` encoding in the data preprocessing step, the text in these files is in the respective encoding too. To decode, use the respective cmd:
```
# BPE decoding
cat outputs/test.output.hyp | sed -E 's/(@@ )|(@@ ?$)//g' > temp/test.output.hyp.final
# SPM decoding (take `iwslt15_en_vi` for example)
../../bin/utils/spm_decode --infile ./outputs/test.output.hyp --outfile temp/test.output.hyp.final --model temp/run_en_vi_spm/data/spm-codes.32000.model --input_format=piece
```
Finally, to evaluate the BLEU score against the ground truth on the test set:
```
python bleu_tool.py --reference=your_reference_file --translation=temp/test.output.hyp.final
```
E.g., in the `iwslt15_en_vi` example, with `--reference=data/en_vi/test.vi`
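For intuition, the score reported by `bleu_tool.py` is built from clipped n-gram precisions; a toy sketch (made-up sentences, brevity penalty omitted):

```
import collections
import math

def ngram_counts(tokens, n):
    # Count all n-grams of order n in a token list.
    return collections.Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on the mat".split()

# Clipped matches for orders 1..4, then their geometric mean.
precisions = []
for n in range(1, 5):
    hyp_ngrams = ngram_counts(hypothesis, n)
    clipped = hyp_ngrams & ngram_counts(reference, n)
    precisions.append(sum(clipped.values()) / max(sum(hyp_ngrams.values()), 1))

bleu = (math.exp(sum(math.log(p) for p in precisions) / 4)
        if min(precisions) > 0 else 0.0)
# Identical sentences give a score of 1.0 (100 after the tool's scaling).
```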
================================================
FILE: texar_repo/examples/transformer/bleu_tool.py
================================================
# Copyright 2018 The Tensor2Tensor Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modifications copyright (C) 2018 Texar
# ==============================================================================
"""BLEU metric utililities used for MT eval.
Usage: python bleu_tool.py --translation=my-wmt13.de --reference=wmt13_deen.de
"""
# This also:
# Put compounds in ATAT format (comparable to papers like GNMT, ConvS2S).
# See https://nlp.stanford.edu/projects/nmt/ :
# 'Also, for historical reasons, we split compound words, e.g.,
# "rich-text format" --> rich ##AT##-##AT## text format."'
# BLEU score will be similar to the one obtained using: mteval-v14.pl
# Note: compound splitting is not implemented in this module
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from argparse import ArgumentParser
from io import open
import collections
import math
import re
import sys
import unicodedata
# Dependency imports
import numpy as np
import six
# pylint: disable=redefined-builtin
from six.moves import xrange
from six.moves import zip
# pylint: enable=redefined-builtin
def _get_ngrams(segment, max_order):
"""Extracts all n-grams upto a given maximum order from an input segment.
Args:
segment: text segment from which n-grams will be extracted.
max_order: maximum length in tokens of the n-grams returned by this
methods.
Returns:
The Counter containing all n-grams upto max_order in segment
with a count of how many times each n-gram occurred.
"""
ngram_counts = collections.Counter()
for order in xrange(1, max_order + 1):
for i in xrange(0, len(segment) - order + 1):
ngram = tuple(segment[i:i + order])
ngram_counts[ngram] += 1
return ngram_counts
def compute_bleu(reference_corpus,
translation_corpus,
max_order=4,
use_bp=True):
"""Computes BLEU score of translated segments against references.
Args:
reference_corpus: list of references for each translation. Each
reference should be tokenized into a list of tokens.
translation_corpus: list of translations to score. Each translation
should be tokenized into a list of tokens.
max_order: Maximum n-gram order to use when computing BLEU score.
use_bp: boolean, whether to apply brevity penalty.
Returns:
BLEU score.
"""
reference_length = 0
translation_length = 0
bp = 1.0
geo_mean = 0
matches_by_order = [0] * max_order
possible_matches_by_order = [0] * max_order
precisions = []
for (references, translations) in zip(reference_corpus, translation_corpus):
reference_length += len(references)
translation_length += len(translations)
ref_ngram_counts = _get_ngrams(references, max_order)
translation_ngram_counts = _get_ngrams(translations, max_order)
overlap = dict((ngram,
min(count, translation_ngram_counts[ngram]))
for ngram, count in ref_ngram_counts.items())
for ngram in overlap:
matches_by_order[len(ngram) - 1] += overlap[ngram]
for ngram in translation_ngram_counts:
possible_matches_by_order[len(ngram) - 1] += \
translation_ngram_counts[ngram]
precisions = [0] * max_order
smooth = 1.0
for i in xrange(0, max_order):
if possible_matches_by_order[i] > 0:
if matches_by_order[i] > 0:
precisions[i] = matches_by_order[i] / \
possible_matches_by_order[i]
else:
smooth *= 2
precisions[i] = 1.0 / (smooth * possible_matches_by_order[i])
else:
precisions[i] = 0.0
if max(precisions) > 0:
p_log_sum = sum(math.log(p) for p in precisions if p)
geo_mean = math.exp(p_log_sum / max_order)
if use_bp:
ratio = translation_length / reference_length
if ratio == 0:
bp = 0
else:
bp = math.exp(1 - 1. / ratio) if ratio < 1.0 else 1.0
bleu = geo_mean * bp
return np.float32(bleu)
class UnicodeRegex(object):
"""Ad-hoc hack to recognize all punctuation and symbols."""
# pylint:disable=too-few-public-methods
def __init__(self):
punctuation = self.property_chars("P")
self.nondigit_punct_re = re.compile(r"([^\d])([" + punctuation + r"])")
self.punct_nondigit_re = re.compile(r"([" + punctuation + r"])([^\d])")
self.symbol_re = re.compile("([" + self.property_chars("S") + "])")
def property_chars(self, prefix):
#pylint:disable=no-self-use
return "".join(six.unichr(x) for x in range(sys.maxunicode) \
if unicodedata.category(six.unichr(x)).startswith(prefix))
uregex = UnicodeRegex()
def bleu_tokenize(string):
r"""Tokenize a string following the official BLEU implementation.
See https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L954-L983
In our case, the input string is expected to be just one line
and no HTML entities de-escaping is needed.
So we just tokenize on punctuation and symbols,
except when a punctuation is preceded and followed by a digit
(e.g. a comma/dot as a thousand/decimal separator).
Note that a number (e.g. a year) followed by a dot at the end of sentence
is NOT tokenized,
i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g`
does not match this case (unless we add a space after each sentence).
However, this error is already in the original mteval-v14.pl
and we want to be consistent with it.
Args:
string: the input string
Returns:
a list of tokens
"""
string = uregex.nondigit_punct_re.sub(r"\1 \2 ", string)
string = uregex.punct_nondigit_re.sub(r" \1 \2", string)
string = uregex.symbol_re.sub(r" \1 ", string)
return string.split()
def bleu_wrapper(ref_filename, hyp_filename, case_sensitive=False):
"""Compute BLEU for two files (reference and hypothesis translation)."""
ref_lines = open(ref_filename, encoding='utf-8').read().splitlines()
hyp_lines = open(hyp_filename, encoding='utf-8').read().splitlines()
assert len(ref_lines) == len(hyp_lines)
if not case_sensitive:
ref_lines = [x.lower() for x in ref_lines]
hyp_lines = [x.lower() for x in hyp_lines]
ref_tokens = [bleu_tokenize(x) for x in ref_lines]
hyp_tokens = [bleu_tokenize(x) for x in hyp_lines]
return compute_bleu(ref_tokens, hyp_tokens)
if __name__ == "__main__":
parser = ArgumentParser(description='Compute BLEU score. \
Usage: t2t-bleu --translation=my-wmt13.de --reference=wmt13_deen.de')
parser.add_argument('--translation', type=str)
parser.add_argument('--reference', type=str)
args = parser.parse_args()
bleu = 100 * bleu_wrapper(args.reference,
args.translation,
case_sensitive=False)
print("BLEU_uncased = %6.2f" % bleu)
bleu = 100 * bleu_wrapper(args.reference,
args.translation,
case_sensitive=True)
print("BLEU_cased = %6.2f" % bleu)
================================================
FILE: texar_repo/examples/transformer/config_iwslt15.py
================================================
batch_size = 2048
test_batch_size = 64
max_train_epoch = 20
display_steps = 500
eval_steps = 2000
max_decoding_length = 256
filename_prefix = "processed."
input_dir = 'temp/run_en_vi_spm/data'
vocab_file = input_dir + '/processed.vocab.pickle'
================================================
FILE: texar_repo/examples/transformer/config_model.py
================================================
"""Configurations of Transformer model
"""
import copy
import texar as tx
random_seed = 1234
beam_width = 5
alpha = 0.6
hidden_dim = 512
emb = {
'name': 'lookup_table',
'dim': hidden_dim,
'initializer': {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': hidden_dim**-0.5,
},
}
}
encoder = {
'dim': hidden_dim,
'num_blocks': 6,
'multihead_attention': {
'num_heads': 8,
'output_dim': hidden_dim
# See documentation for more optional hyperparameters
},
'position_embedder_hparams': {
'dim': hidden_dim
},
'initializer': {
'type': 'variance_scaling_initializer',
'kwargs': {
'scale': 1.0,
'mode': 'fan_avg',
'distribution': 'uniform',
},
},
'poswise_feedforward': tx.modules.default_transformer_poswise_net_hparams(
output_dim=hidden_dim)
}
decoder = copy.deepcopy(encoder)
loss_label_confidence = 0.9
opt = {
'optimizer': {
'type': 'AdamOptimizer',
'kwargs': {
'beta1': 0.9,
'beta2': 0.997,
'epsilon': 1e-9
}
}
}
lr = {
'learning_rate_schedule': 'constant.linear_warmup.rsqrt_decay.rsqrt_depth',
'lr_constant': 2 * (hidden_dim ** -0.5),
'static_lr': 1e-3,
'warmup_steps': 16000,
}
================================================
FILE: texar_repo/examples/transformer/config_wmt14.py
================================================
batch_size = 3072
test_batch_size = 64
max_train_epoch = 10
display_steps = 500
eval_steps = 2000
max_decoding_length = 256
filename_prefix = "processed."
input_dir = 'temp/run_en_de_bpe/data'
vocab_file = input_dir + '/processed.vocab.pickle'
================================================
FILE: texar_repo/examples/transformer/preprocess_data.sh
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/usr/bin/env bash
###########################################################################
# This file provides a script to preprocess raw text corpora to generate
# vocabulary with sentence piece encoding or byte pair encoding.
#
# By default, the vocab size is 32000 and maximum sequence length is 70.
###########################################################################
TF=$(pwd)
export PATH=$PATH:$TF/../../bin/utils/
encoder=$1
src_language=$2
tgt_language=$3
vocab_size=${4:-32000}
max_seq_length=${5:-70}
# update these variables
data=${TF}"/data/${src_language}_${tgt_language}"
name="run_${src_language}_${tgt_language}_${encoder}"
out="temp/${name}"
train_src=$data/train.${src_language}
train_tgt=$data/train.${tgt_language}
valid_src=$data/dev.${src_language}
valid_tgt=$data/dev.${tgt_language}
test_src=$data/test.${src_language}
test_tgt=$data/test.${tgt_language}
#====== EXPERIMENT BEGIN ======
echo "Output dir = $out"
[ -d $out ] || mkdir -p $out
[ -d $out/data ] || mkdir -p $out/data
[ -d $out/test ] || mkdir -p $out/test
echo "Step 1a: Preprocess inputs"
case ${encoder} in
'spm')
echo "Learning Word Piece on source and target combined"
spm_train --input=${train_src},${train_tgt} --vocab_size ${vocab_size} --model_prefix=$out/data/spm-codes.${vocab_size}
spm_encode --model $out/data/spm-codes.${vocab_size}.model --output_format=piece --infile $train_src --outfile $out/data/train.${src_language}.spm
spm_encode --model $out/data/spm-codes.${vocab_size}.model --output_format=piece --infile $valid_src --outfile $out/data/valid.${src_language}.spm
spm_encode --model $out/data/spm-codes.${vocab_size}.model --output_format=piece --infile $test_src --outfile $out/data/test.${src_language}.spm
spm_encode --model $out/data/spm-codes.${vocab_size}.model --output_format=piece --infile $train_tgt --outfile $out/data/train.${tgt_language}.spm
spm_encode --model $out/data/spm-codes.${vocab_size}.model --output_format=piece --infile $valid_tgt --outfile $out/data/valid.${tgt_language}.spm
spm_encode --model $out/data/spm-codes.${vocab_size}.model --output_format=piece --infile ${test_tgt} --outfile $out/data/test.${tgt_language}.spm
cp ${test_tgt} ${out}/test/test.${tgt_language} ;;
'bpe')
echo "Learning Byte Pairwise on source and target combined"
cat ${train_src} ${train_tgt} | learn_bpe -s ${vocab_size} > ${out}/data/bpe-codes.${vocab_size}
apply_bpe -c ${out}/data/bpe-codes.${vocab_size} < ${train_src} > $out/data/train.${src_language}.bpe
apply_bpe -c ${out}/data/bpe-codes.${vocab_size} < ${valid_src} > ${out}/data/valid.${src_language}.bpe
apply_bpe -c ${out}/data/bpe-codes.${vocab_size} < ${test_src} > ${out}/data/test.${src_language}.bpe
apply_bpe -c ${out}/data/bpe-codes.${vocab_size} < ${train_tgt} > $out/data/train.${tgt_language}.bpe
apply_bpe -c ${out}/data/bpe-codes.${vocab_size} < ${valid_tgt} > ${out}/data/valid.${tgt_language}.bpe
apply_bpe -c ${out}/data/bpe-codes.${vocab_size} < ${test_tgt} > ${out}/data/test.${tgt_language}.bpe
cp ${test_tgt} ${out}/test/test.${tgt_language} ;;
'raw')
echo "No subword encoding is applied, just copy the corpus files into correct directory"
cp ${train_src} $out/data/train.${src_language}.raw
cp ${valid_src} $out/data/valid.${src_language}.raw
cp ${test_src} $out/data/test.${src_language}.raw
cp ${train_tgt} $out/data/train.${tgt_language}.raw
cp ${valid_tgt} $out/data/valid.${tgt_language}.raw
cp ${test_tgt} $out/data/test.${tgt_language}.raw
esac
# TODO(zhiting): Truncate vocab when encoder==raw
python ${TF}/utils/preprocess.py -i ${out}/data \
--src ${src_language}.${encoder} \
--tgt ${tgt_language}.${encoder} \
--save_data processed. \
--max_seq_length=${max_seq_length} \
--pre_encoding=${encoder}
================================================
FILE: texar_repo/examples/transformer/requirements.txt
================================================
torchtext
torch
sentencepiece
================================================
FILE: texar_repo/examples/transformer/scripts/iwslt15_en_vi.sh
================================================
#!/bin/sh
# Copied from https://github.com/tensorflow/nmt/blob/master/nmt/scripts/download_iwslt15.sh
#
# Download small-scale IWSLT15 Vietnamese to English translation data for NMT
# model training.
#
# Usage:
# ./download_iwslt15.sh path-to-output-dir
#
# If output directory is not specified, "./iwslt15" will be used as the default
# output directory.
OUT_DIR="${1:-data/en_vi}"
SITE_PREFIX="https://nlp.stanford.edu/projects/nmt/data"
mkdir -v -p $OUT_DIR
# Download the small IWSLT15 dataset from the Stanford NMT website.
echo "Download training dataset train.en and train.vi."
curl -o "$OUT_DIR/train.en" "$SITE_PREFIX/iwslt15.en-vi/train.en"
curl -o "$OUT_DIR/train.vi" "$SITE_PREFIX/iwslt15.en-vi/train.vi"
echo "Download dev dataset tst2012.en and tst2012.vi."
curl -o "$OUT_DIR/dev.en" "$SITE_PREFIX/iwslt15.en-vi/tst2012.en"
curl -o "$OUT_DIR/dev.vi" "$SITE_PREFIX/iwslt15.en-vi/tst2012.vi"
echo "Download test dataset tst2013.en and tst2013.vi."
curl -o "$OUT_DIR/test.en" "$SITE_PREFIX/iwslt15.en-vi/tst2013.en"
curl -o "$OUT_DIR/test.vi" "$SITE_PREFIX/iwslt15.en-vi/tst2013.vi"
================================================
FILE: texar_repo/examples/transformer/scripts/wmt14_en_de.sh
================================================
#!/usr/bin/env bash
# This code was adapted from Tensorflow NMT toolkit on 03/24/2018.
# URL: https://raw.githubusercontent.com/tensorflow/nmt/master/nmt/scripts/wmt16_en_de.sh
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e
OUTPUT_DIR="data/en_de/"
DOWNLOADED_DATA_DIR="data/en_de_temp/"
OUTPUT_DIR_CACHE="${DOWNLOADED_DATA_DIR}/cache"
echo "Writing to ${OUTPUT_DIR_CACHE}. To change this, set the OUTPUT_DIR_CACHE environment variable."
mkdir -p $DOWNLOADED_DATA_DIR
mkdir -p ${OUTPUT_DIR}
if [ ! -f ${DOWNLOADED_DATA_DIR}/europarl-v7-de-en.tgz ]; then
echo "Downloading Europarl v7. This may take a while..."
curl -o ${DOWNLOADED_DATA_DIR}/europarl-v7-de-en.tgz \
http://www.statmt.org/europarl/v7/de-en.tgz
else
echo "${DOWNLOADED_DATA_DIR}/europarl-v7-de-en.tgz already exists."
fi
if [ ! -f ${DOWNLOADED_DATA_DIR}/common-crawl.tgz ]; then
echo "Downloading Common Crawl corpus. This may take a while..."
curl -o ${DOWNLOADED_DATA_DIR}/common-crawl.tgz \
http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
else
echo "${DOWNLOADED_DATA_DIR}/common-crawl.tgz already exists."
fi
if [ ! -f ${DOWNLOADED_DATA_DIR}/nc-v11.tgz ]; then
echo "Downloading News Commentary v11. This may take a while..."
curl -o ${DOWNLOADED_DATA_DIR}/nc-v11.tgz \
http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
else
echo "${DOWNLOADED_DATA_DIR}/nc-v11.tgz already exists"
fi
if [ ! -f ${DOWNLOADED_DATA_DIR}/dev.tgz ]; then
echo "Downloading dev/test sets"
curl -o ${DOWNLOADED_DATA_DIR}/dev.tgz \
http://data.statmt.org/wmt16/translation-task/dev.tgz
else
echo "${DOWNLOADED_DATA_DIR}/dev.tgz already exists"
fi
if [ ! -f ${DOWNLOADED_DATA_DIR}/test.tgz ]; then
curl -o ${DOWNLOADED_DATA_DIR}/test.tgz \
http://data.statmt.org/wmt16/translation-task/test.tgz
else
echo "${DOWNLOADED_DATA_DIR}/test.tgz already exists"
fi
# Extract everything
echo "Extracting all files..."
if [ ! -d ${DOWNLOADED_DATA_DIR}/europarl-v7-de-en ]; then
mkdir -p "${DOWNLOADED_DATA_DIR}/europarl-v7-de-en"
tar -xvzf "${DOWNLOADED_DATA_DIR}/europarl-v7-de-en.tgz" -C "${DOWNLOADED_DATA_DIR}/europarl-v7-de-en"
mkdir -p "${DOWNLOADED_DATA_DIR}/common-crawl"
tar -xvzf "${DOWNLOADED_DATA_DIR}/common-crawl.tgz" -C "${DOWNLOADED_DATA_DIR}/common-crawl"
mkdir -p "${DOWNLOADED_DATA_DIR}/nc-v11"
tar -xvzf "${DOWNLOADED_DATA_DIR}/nc-v11.tgz" -C "${DOWNLOADED_DATA_DIR}/nc-v11"
mkdir -p "${DOWNLOADED_DATA_DIR}/dev"
tar -xvzf "${DOWNLOADED_DATA_DIR}/dev.tgz" -C "${DOWNLOADED_DATA_DIR}/dev"
mkdir -p "${DOWNLOADED_DATA_DIR}/test"
tar -xvzf "${DOWNLOADED_DATA_DIR}/test.tgz" -C "${DOWNLOADED_DATA_DIR}/test"
else
echo "the tar files have been unzipped"
fi
# Concatenate Training data
wc -l ${DOWNLOADED_DATA_DIR}/europarl-v7-de-en/europarl-v7.de-en.en
wc -l ${DOWNLOADED_DATA_DIR}/common-crawl/commoncrawl.de-en.en
wc -l ${DOWNLOADED_DATA_DIR}/nc-v11/training-parallel-nc-v11/news-commentary-v11.de-en.en
cat "${DOWNLOADED_DATA_DIR}/europarl-v7-de-en/europarl-v7.de-en.en" \
"${DOWNLOADED_DATA_DIR}/common-crawl/commoncrawl.de-en.en" \
"${DOWNLOADED_DATA_DIR}/nc-v11/training-parallel-nc-v11/news-commentary-v11.de-en.en" \
> "${OUTPUT_DIR_CACHE}/train.en" &&\
wc -l "${OUTPUT_DIR_CACHE}/train.en"
cat "${DOWNLOADED_DATA_DIR}/europarl-v7-de-en/europarl-v7.de-en.de" \
"${DOWNLOADED_DATA_DIR}/common-crawl/commoncrawl.de-en.de" \
"${DOWNLOADED_DATA_DIR}/nc-v11/training-parallel-nc-v11/news-commentary-v11.de-en.de" \
> "${OUTPUT_DIR_CACHE}/train.de" &&\
wc -l "${OUTPUT_DIR_CACHE}/train.de"
# Clone Moses
if [ ! -d "${OUTPUT_DIR_CACHE}/mosesdecoder" ]; then
echo "Cloning moses for data processing"
git clone https://github.com/moses-smt/mosesdecoder.git "${OUTPUT_DIR_CACHE}/mosesdecoder"
fi
${OUTPUT_DIR_CACHE}/mosesdecoder/scripts/ems/support/input-from-sgm.perl \
< ${DOWNLOADED_DATA_DIR}/dev/dev/newstest2014-deen-src.de.sgm \
> ${DOWNLOADED_DATA_DIR}/dev/dev/newstest2014.de
${OUTPUT_DIR_CACHE}/mosesdecoder/scripts/ems/support/input-from-sgm.perl \
< ${DOWNLOADED_DATA_DIR}/dev/dev/newstest2014-deen-ref.en.sgm \
> ${DOWNLOADED_DATA_DIR}/dev/dev/newstest2014.en
# Copy dev/test data to output dir
cp ${DOWNLOADED_DATA_DIR}/dev/dev/newstest20*.de ${OUTPUT_DIR_CACHE}
cp ${DOWNLOADED_DATA_DIR}/dev/dev/newstest20*.en ${OUTPUT_DIR_CACHE}
# Tokenize data
for f in ${OUTPUT_DIR_CACHE}/*.de; do
echo "Tokenizing $f..."
${OUTPUT_DIR_CACHE}/mosesdecoder/scripts/tokenizer/tokenizer.perl -q -l de -threads 8 < $f > ${f%.*}.tok.de
done
for f in ${OUTPUT_DIR_CACHE}/*.en; do
echo "Tokenizing $f..."
${OUTPUT_DIR_CACHE}/mosesdecoder/scripts/tokenizer/tokenizer.perl -q -l en -threads 8 < $f > ${f%.*}.tok.en
done
# Clean train corpora
for f in ${OUTPUT_DIR_CACHE}/train.tok.en; do
fbase=${f%.*}
echo "Cleaning ${fbase}..."
${OUTPUT_DIR_CACHE}/mosesdecoder/scripts/training/clean-corpus-n.perl $fbase de en "${fbase}.clean" 1 80
done
cp ${OUTPUT_DIR_CACHE}/train.tok.clean.en ${OUTPUT_DIR}/train.en
cp ${OUTPUT_DIR_CACHE}/train.tok.clean.de ${OUTPUT_DIR}/train.de
cp ${OUTPUT_DIR_CACHE}/newstest2013.tok.en ${OUTPUT_DIR}/dev.en
cp ${OUTPUT_DIR_CACHE}/newstest2013.tok.de ${OUTPUT_DIR}/dev.de
cp ${OUTPUT_DIR_CACHE}/newstest2014.tok.en ${OUTPUT_DIR}/test.en
cp ${OUTPUT_DIR_CACHE}/newstest2014.tok.de ${OUTPUT_DIR}/test.de
================================================
FILE: texar_repo/examples/transformer/transformer_main.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Transformer model.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import pickle
import random
import os
import importlib
from torchtext import data
import tensorflow as tf
import texar as tx
from texar.modules import TransformerEncoder, TransformerDecoder
from texar.utils import transformer_utils
from utils import data_utils, utils
from utils.preprocess import bos_token_id, eos_token_id
from bleu_tool import bleu_wrapper
# pylint: disable=invalid-name, too-many-locals
flags = tf.flags
flags.DEFINE_string("config_model", "config_model", "The model config.")
flags.DEFINE_string("config_data", "config_iwslt15", "The dataset config.")
flags.DEFINE_string("run_mode", "train_and_evaluate",
"Either train_and_evaluate or test.")
flags.DEFINE_string("model_dir", "./outputs",
"Directory to save the trained model and logs.")
FLAGS = flags.FLAGS
config_model = importlib.import_module(FLAGS.config_model)
config_data = importlib.import_module(FLAGS.config_data)
utils.set_random_seed(config_model.random_seed)
def main():
"""Entrypoint.
"""
# Load data
train_data, dev_data, test_data = data_utils.load_data_numpy(
config_data.input_dir, config_data.filename_prefix)
with open(config_data.vocab_file, 'rb') as f:
id2w = pickle.load(f)
vocab_size = len(id2w)
beam_width = config_model.beam_width
# Create logging
tx.utils.maybe_create_dir(FLAGS.model_dir)
logging_file = os.path.join(FLAGS.model_dir, 'logging.txt')
logger = utils.get_logger(logging_file)
print('logging file is saved in: %s' % logging_file)
# Build model graph
encoder_input = tf.placeholder(tf.int64, shape=(None, None))
decoder_input = tf.placeholder(tf.int64, shape=(None, None))
# (text sequence length excluding padding)
encoder_input_length = tf.reduce_sum(
1 - tf.to_int32(tf.equal(encoder_input, 0)), axis=1)
decoder_input_length = tf.reduce_sum(
1 - tf.to_int32(tf.equal(decoder_input, 0)), axis=1)
labels = tf.placeholder(tf.int64, shape=(None, None))
is_target = tf.to_float(tf.not_equal(labels, 0))
global_step = tf.Variable(0, dtype=tf.int64, trainable=False)
learning_rate = tf.placeholder(tf.float64, shape=(), name='lr')
embedder = tx.modules.WordEmbedder(
vocab_size=vocab_size, hparams=config_model.emb)
encoder = TransformerEncoder(hparams=config_model.encoder)
encoder_output = encoder(inputs=embedder(encoder_input),
sequence_length=encoder_input_length)
# The decoder ties the input word embedding with the output logit layer.
# As the decoder masks out <PAD>'s embedding, which in effect means
# <PAD> has an all-zero embedding, here we explicitly set <PAD>'s
# embedding to all-zero.
tgt_embedding = tf.concat(
[tf.zeros(shape=[1, embedder.dim]), embedder.embedding[1:, :]], axis=0)
decoder = TransformerDecoder(embedding=tgt_embedding,
hparams=config_model.decoder)
# For training
outputs = decoder(
memory=encoder_output,
memory_sequence_length=encoder_input_length,
inputs=embedder(decoder_input),
sequence_length=decoder_input_length,
decoding_strategy='train_greedy',
mode=tf.estimator.ModeKeys.TRAIN
)
mle_loss = transformer_utils.smoothing_cross_entropy(
outputs.logits, labels, vocab_size, config_model.loss_label_confidence)
mle_loss = tf.reduce_sum(mle_loss * is_target) / tf.reduce_sum(is_target)
train_op = tx.core.get_train_op(
mle_loss,
learning_rate=learning_rate,
global_step=global_step,
hparams=config_model.opt)
tf.summary.scalar('lr', learning_rate)
tf.summary.scalar('mle_loss', mle_loss)
summary_merged = tf.summary.merge_all()
# For inference
start_tokens = tf.fill([tx.utils.get_batch_size(encoder_input)],
bos_token_id)
predictions = decoder(
memory=encoder_output,
memory_sequence_length=encoder_input_length,
decoding_strategy='infer_greedy',
beam_width=beam_width,
alpha=config_model.alpha,
start_tokens=start_tokens,
end_token=eos_token_id,
max_decoding_length=config_data.max_decoding_length,
mode=tf.estimator.ModeKeys.PREDICT
)
if beam_width <= 1:
inferred_ids = predictions[0].sample_id
else:
# Uses the best sample by beam search
inferred_ids = predictions['sample_id'][:, :, 0]
saver = tf.train.Saver(max_to_keep=5)
best_results = {'score': 0, 'epoch': -1}
def _eval_epoch(sess, epoch, mode):
if mode == 'eval':
eval_data = dev_data
elif mode == 'test':
eval_data = test_data
else:
raise ValueError('`mode` should be either "eval" or "test".')
references, hypotheses = [], []
bsize = config_data.test_batch_size
for i in range(0, len(eval_data), bsize):
sources, targets = zip(*eval_data[i:i+bsize])
x_block = data_utils.source_pad_concat_convert(sources)
feed_dict = {
encoder_input: x_block,
tx.global_mode(): tf.estimator.ModeKeys.EVAL,
}
fetches = {
'inferred_ids': inferred_ids,
}
fetches_ = sess.run(fetches, feed_dict=feed_dict)
hypotheses.extend(h.tolist() for h in fetches_['inferred_ids'])
references.extend(r.tolist() for r in targets)
hypotheses = utils.list_strip_eos(hypotheses, eos_token_id)
references = utils.list_strip_eos(references, eos_token_id)
if mode == 'eval':
# Writes results to files to evaluate BLEU
# For 'eval' mode, the BLEU is based on token ids (rather than
# text tokens) and serves only as a surrogate metric to monitor
# the training process
fname = os.path.join(FLAGS.model_dir, 'tmp.eval')
hypotheses = tx.utils.str_join(hypotheses)
references = tx.utils.str_join(references)
hyp_fn, ref_fn = tx.utils.write_paired_text(
hypotheses, references, fname, mode='s')
eval_bleu = bleu_wrapper(ref_fn, hyp_fn, case_sensitive=True)
eval_bleu = 100. * eval_bleu
logger.info('epoch: %d, eval_bleu %.4f', epoch, eval_bleu)
print('epoch: %d, eval_bleu %.4f' % (epoch, eval_bleu))
if eval_bleu > best_results['score']:
logger.info('epoch: %d, best bleu: %.4f', epoch, eval_bleu)
best_results['score'] = eval_bleu
best_results['epoch'] = epoch
model_path = os.path.join(FLAGS.model_dir, 'best-model.ckpt')
logger.info('saving model to %s', model_path)
print('saving model to %s' % model_path)
saver.save(sess, model_path)
elif mode == 'test':
# For 'test' mode, together with the cmds in README.md, BLEU
# is evaluated based on text tokens, which is the standard metric.
fname = os.path.join(FLAGS.model_dir, 'test.output')
hwords, rwords = [], []
for hyp, ref in zip(hypotheses, references):
hwords.append([id2w[y] for y in hyp])
rwords.append([id2w[y] for y in ref])
hwords = tx.utils.str_join(hwords)
rwords = tx.utils.str_join(rwords)
hyp_fn, ref_fn = tx.utils.write_paired_text(
hwords, rwords, fname, mode='s',
src_fname_suffix='hyp', tgt_fname_suffix='ref')
logger.info('Test output written to file: %s', hyp_fn)
print('Test output written to file: %s' % hyp_fn)
def _train_epoch(sess, epoch, step, smry_writer):
random.shuffle(train_data)
train_iter = data.iterator.pool(
train_data,
config_data.batch_size,
key=lambda x: (len(x[0]), len(x[1])),
batch_size_fn=utils.batch_size_fn,
random_shuffler=data.iterator.RandomShuffler())
for _, train_batch in enumerate(train_iter):
in_arrays = data_utils.seq2seq_pad_concat_convert(train_batch)
feed_dict = {
encoder_input: in_arrays[0],
decoder_input: in_arrays[1],
labels: in_arrays[2],
learning_rate: utils.get_lr(step, config_model.lr)
}
fetches = {
'step': global_step,
'train_op': train_op,
'smry': summary_merged,
'loss': mle_loss,
}
fetches_ = sess.run(fetches, feed_dict=feed_dict)
step, loss = fetches_['step'], fetches_['loss']
if step and step % config_data.display_steps == 0:
logger.info('step: %d, loss: %.4f', step, loss)
print('step: %d, loss: %.4f' % (step, loss))
smry_writer.add_summary(fetches_['smry'], global_step=step)
if step and step % config_data.eval_steps == 0:
_eval_epoch(sess, epoch, mode='eval')
return step
# Run the graph
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
smry_writer = tf.summary.FileWriter(FLAGS.model_dir, graph=sess.graph)
if FLAGS.run_mode == 'train_and_evaluate':
logger.info('Begin running with train_and_evaluate mode')
if tf.train.latest_checkpoint(FLAGS.model_dir) is not None:
logger.info('Restore latest checkpoint in %s' % FLAGS.model_dir)
saver.restore(sess, tf.train.latest_checkpoint(FLAGS.model_dir))
step = 0
for epoch in range(config_data.max_train_epoch):
step = _train_epoch(sess, epoch, step, smry_writer)
elif FLAGS.run_mode == 'test':
logger.info('Begin running with test mode')
logger.info('Restore latest checkpoint in %s' % FLAGS.model_dir)
saver.restore(sess, tf.train.latest_checkpoint(FLAGS.model_dir))
_eval_epoch(sess, 0, mode='test')
else:
raise ValueError('Unknown mode: {}'.format(FLAGS.run_mode))
if __name__ == '__main__':
main()
================================================
FILE: texar_repo/examples/transformer/utils/__init__.py
================================================
================================================
FILE: texar_repo/examples/transformer/utils/data_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Data read/write utilities for Transformer.
"""
import os
import codecs
import six
import numpy as np
# pylint: disable=no-member
def load_data_numpy(input_dir, prefix):
train_data = np.load(os.path.join(input_dir,\
prefix + 'train.npy'), encoding='latin1').tolist()
dev_data = np.load(os.path.join(input_dir,\
prefix + 'valid.npy'), encoding='latin1').tolist()
test_data = np.load(os.path.join(input_dir,\
prefix + 'test.npy'), encoding='latin1').tolist()
print('train data size:{}'.format(len(train_data)))
return train_data, dev_data, test_data
def seq2seq_pad_concat_convert(xy_batch, eos_id=2, bos_id=1):
"""
Args:
xy_batch (list of tuple of two numpy.ndarray-s or cupy.ndarray-s):
xy_batch[i][0] is an array
of token ids of i-th input sentence in a minibatch.
xy_batch[i][1] is an array
of token ids of i-th target sentence in a minibatch.
The shape of each array is `(sentence length, )`.
eos_id: The index of the end-of-sentence special token in the
dictionary.
bos_id: The index of the begin-of-sentence special token in the
dictionary.
Returns:
Tuple of Converted array.
(input_sent_batch_array, target_sent_batch_input_array,
target_sent_batch_output_array).
The shape of each array is `(batchsize, max_sentence_length)`.
All sentences are padded with 0 to reach max_sentence_length.
"""
x_seqs, y_seqs = zip(*xy_batch)
x_block = _concat_examples(x_seqs, padding=0)
y_block = _concat_examples(y_seqs, padding=0)
# Add EOS
x_block = np.pad(x_block, ((0, 0), (0, 1)), 'constant',
constant_values=0)
for i_batch, seq in enumerate(x_seqs):
x_block[i_batch, len(seq)] = eos_id
y_out_block = np.pad(y_block, ((0, 0), (0, 1)), 'constant',
constant_values=0)
for i_batch, seq in enumerate(y_seqs):
y_out_block[i_batch, len(seq)] = eos_id
# Add BOS in target language
y_in_block = np.pad(y_block, ((0, 0), (1, 0)), 'constant',
constant_values=bos_id)
return x_block, y_in_block, y_out_block
def source_pad_concat_convert(x_seqs, eos_id=2, bos_id=1):
"""
This function is used when testing the model without target input.
"""
x_block = _concat_examples(x_seqs, padding=0)
# add EOS
x_block = np.pad(x_block, ((0, 0), (0, 1)), 'constant', constant_values=0)
for i_batch, seq in enumerate(x_seqs):
x_block[i_batch, len(seq)] = eos_id
return x_block
def _concat_examples(arrays, padding=0):
if len(arrays) == 0:
raise ValueError('batch is empty')
first_elem = arrays[0]
assert isinstance(first_elem, np.ndarray)
shape = np.array(arrays[0].shape, dtype=int)
for array in arrays[1:]:
if np.any(shape != array.shape):
np.maximum(shape, array.shape, shape)
shape = tuple(np.insert(shape, 0, len(arrays)))
result = np.full(shape, padding, dtype=arrays[0].dtype)
for i in six.moves.range(len(arrays)):
src = arrays[i]
slices = tuple(slice(dim) for dim in src.shape)
result[(i,) + slices] = src
return result
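To make the EOS handling concrete, here is a minimal NumPy sketch of the source-side conversion: pad each sequence to the batch maximum (a simplified stand-in for `_concat_examples`), reserve one extra column, and write the EOS id right after each sequence. The helper and ids below are illustrative, not imported from this module.

```python
import numpy as np

def pad_batch(arrays, padding=0):
    # Pad 1-D id arrays to a common length (mirrors _concat_examples).
    max_len = max(len(a) for a in arrays)
    out = np.full((len(arrays), max_len), padding, dtype=arrays[0].dtype)
    for i, a in enumerate(arrays):
        out[i, :len(a)] = a
    return out

# Source side: pad, then write EOS (id 2) right after each sequence.
x_seqs = [np.array([5, 6, 7]), np.array([8, 9])]
x_block = np.pad(pad_batch(x_seqs), ((0, 0), (0, 1)), 'constant')
for i, seq in enumerate(x_seqs):
    x_block[i, len(seq)] = 2
print(x_block.tolist())  # [[5, 6, 7, 2], [8, 9, 2, 0]]
```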
def write_words(words_list, filename):
with codecs.open(filename, 'w+', 'utf-8') as myfile:
for words in words_list:
myfile.write(' '.join(words) + '\n')
================================================
FILE: texar_repo/examples/transformer/utils/preprocess.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess text data: generate the plain-text vocab file,
truncate sequences by length, and produce the preprocessed dataset.
"""
from __future__ import unicode_literals
import collections
import re
import json
import os
import numpy as np
import pickle
import argparse
from io import open
#pylint:disable=invalid-name
split_pattern = re.compile(r'([.,!?"\':;)(])')
digit_pattern = re.compile(r'\d')
# Refer to https://texar.readthedocs.io/en/latest/_modules/texar/data/vocabulary.html#SpecialTokens
# these tokens will by default have token ids 0, 1, 2, 3 respectively
pad_token_id, bos_token_id, eos_token_id, unk_token_id = 0, 1, 2, 3
def split_sentence(s, tok=False):
"""split sentence with some segmentation rules."""
if tok:
s = s.lower()
s = s.replace('\u2019', "'")
s = digit_pattern.sub('0', s)
words = []
for word in s.split():
if tok:
words.extend(split_pattern.split(word))
else:
words.append(word)
words = [w for w in words if w]
return words
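With `tok=True`, the two module-level regexes do the heavy lifting: punctuation characters are split into their own tokens (empty strings are filtered out afterwards), and every digit is collapsed to `0`. A quick standalone illustration:

```python
import re

split_pattern = re.compile(r'([.,!?"\':;)(])')
digit_pattern = re.compile(r'\d')

# Punctuation is captured as separate tokens; empty strings are dropped.
parts = [p for p in split_pattern.split('hello,world!') if p]
print(parts)  # ['hello', ',', 'world', '!']

# All digits are normalized to '0'.
masked = digit_pattern.sub('0', 'room 1408')
print(masked)  # room 0000
```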
def open_file(path):
"""more robust open function"""
return open(path, encoding='utf-8')
def read_file(path, tok=False):
"""a generator yielding the split words of each line of the file."""
with open_file(path) as f:
for line in f.readlines():
words = split_sentence(line.strip(), tok)
yield words
def count_words(path, max_vocab_size=40000, tok=False):
"""count all words in the corpus and return the most frequent words as the vocab"""
counts = collections.Counter()
for words in read_file(path, tok):
for word in words:
counts[word] += 1
vocab = [word for (word, _) in counts.most_common(max_vocab_size)]
return vocab
def make_array(word_id, words):
"""generate id numpy array from plain text words."""
ids = [word_id.get(word, unk_token_id) for word in words]
return np.array(ids, 'i')
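Any word missing from the vocabulary falls back to `unk_token_id`. A small self-contained sketch with a toy vocabulary:

```python
import numpy as np

unk_token_id = 3
word_id = {'the': 4, 'cat': 5}  # toy vocabulary for illustration

def make_array(word_id, words):
    # Out-of-vocabulary words map to the unknown-token id.
    ids = [word_id.get(word, unk_token_id) for word in words]
    return np.array(ids, 'i')

arr = make_array(word_id, ['the', 'dog'])
print(arr.tolist())  # [4, 3] -- 'dog' is out of vocabulary
```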
def make_dataset(path, w2id, tok=False):
"""generate dataset."""
dataset, npy_dataset = [], []
token_count, unknown_count = 0, 0
for words in read_file(path, tok):
array = make_array(w2id, words)
npy_dataset.append(array)
dataset.append(words)
token_count += array.size
unknown_count += (array == unk_token_id).sum()
print('# of tokens:{}'.format(token_count))
print('# of unknown: {} ({:.2f}%)'.format(unknown_count,\
100. * unknown_count / token_count))
return dataset, npy_dataset
def get_preprocess_args():
"""Data preprocessing options."""
class Config(): pass
config = Config()
parser = argparse.ArgumentParser(description='Preprocessing Options')
parser.add_argument('--source_vocab', type=int, default=40000,
help='Vocabulary size of source language')
parser.add_argument('--target_vocab', type=int, default=40000,
help='Vocabulary size of target language')
parser.add_argument('--tok', dest='tok', action='store_true',
help='tokenized and lowercased')
parser.set_defaults(tok=False)
parser.add_argument('--max_seq_length', type=int, default=70)
parser.add_argument('--pre_encoding', type=str, default='spm')
parser.add_argument('--src', type=str, default='en')
parser.add_argument('--tgt', type=str, default='vi')
parser.add_argument('--input_dir', '-i', type=str, \
default='./data/en_vi/data/', help='Input directory')
parser.add_argument('--save_data', type=str, default='preprocess', \
help='Output file for the prepared data')
parser.parse_args(namespace=config)
#keep consistent with original implementation
#pylint:disable=attribute-defined-outside-init
config.input = config.input_dir
config.source_train = 'train.' + config.src
config.target_train = 'train.' + config.tgt
config.source_valid = 'valid.' + config.src
config.target_valid = 'valid.' + config.tgt
config.source_test = 'test.'+ config.src
config.target_test = 'test.' + config.tgt
return config
if __name__ == "__main__":
args = get_preprocess_args()
print(json.dumps(args.__dict__, indent=4))
#pylint:disable=no-member
# Vocab Construction
source_path = os.path.join(args.input_dir, args.source_train)
target_path = os.path.join(args.input_dir, args.target_train)
src_cntr = count_words(source_path, args.source_vocab, args.tok)
trg_cntr = count_words(target_path, args.target_vocab, args.tok)
all_words = sorted(list(set(src_cntr + trg_cntr)))
vocab = ['<pad>', '<bos>', '<eos>', '<unk>'] + all_words
w2id = {word: index for index, word in enumerate(vocab)}
# Train Dataset
source_data, source_npy = make_dataset(source_path, w2id, args.tok)
target_data, target_npy = make_dataset(target_path, w2id, args.tok)
assert len(source_data) == len(target_data)
train_data = [(s, t) for s, t in zip(source_data, target_data)
if s and len(s) < args.max_seq_length
and t and len(t) < args.max_seq_length]
train_npy = [(s, t) for s, t in zip(source_npy, target_npy)
if len(s) > 0 and len(s) < args.max_seq_length
and len(t) > 0 and len(t) < args.max_seq_length]
assert len(train_data) == len(train_npy)
# Display corpus statistics
print("Vocab: {} with special tokens".format(len(vocab)))
print('Original training data size: %d' % len(source_data))
print('Filtered training data size: %d' % len(train_data))
# Valid Dataset
source_path = os.path.join(args.input_dir, args.source_valid)
source_data, source_npy = make_dataset(source_path, w2id, args.tok)
target_path = os.path.join(args.input_dir, args.target_valid)
target_data, target_npy = make_dataset(target_path, w2id, args.tok)
assert len(source_data) == len(target_data)
valid_data = [(s, t) for s, t in zip(source_data, target_data)
if s and t]
valid_npy = [(s, t) for s, t in zip(source_npy, target_npy)
if len(s) > 0 and len(t) > 0]
assert len(valid_data) == len(valid_npy)
print('Original dev data size: %d' % len(source_data))
print('Filtered dev data size: %d' % len(valid_data))
# Test Dataset
source_path = os.path.join(args.input_dir, args.source_test)
source_data, source_npy = make_dataset(source_path, w2id, args.tok)
target_path = os.path.realpath(
os.path.join(args.input_dir, args.target_test))
target_data, target_npy = make_dataset(target_path, w2id, args.tok)
assert len(source_data) == len(target_data)
test_data = [(s, t) for s, t in zip(source_data, target_data)
if s and t]
test_npy = [(s, t) for s, t in zip(source_npy, target_npy)
if len(s)>0 and len(t)>0]
print('Original test data size: %d' % len(source_data))
print('Filtered test data size: %d' % len(test_data))
id2w = {i: w for w, i in w2id.items()}
# Save the dataset to numpy files
train_src_output = os.path.join(args.input_dir, \
args.save_data + 'train.' + args.src+ '.txt')
train_tgt_output = os.path.join(args.input_dir, \
args.save_data + 'train.' + args.tgt + '.txt')
dev_src_output = os.path.join(args.input_dir, \
args.save_data + 'dev.' + args.src+ '.txt')
dev_tgt_output = os.path.join(args.input_dir, \
args.save_data + 'dev.' + args.tgt+ '.txt')
test_src_output = os.path.join(args.input_dir, \
args.save_data + 'test.' + args.src+ '.txt')
test_tgt_output = os.path.join(args.input_dir, \
args.save_data + 'test.' + args.tgt + '.txt')
np.save(os.path.join(args.input, args.save_data + 'train.npy'),
train_npy)
np.save(os.path.join(args.input, args.save_data + 'valid.npy'),
valid_npy)
np.save(os.path.join(args.input, args.save_data + 'test.npy'),
test_npy)
with open(os.path.join(args.input, args.save_data + 'vocab.pickle'), 'wb')\
as f:
pickle.dump(id2w, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(train_src_output, 'w+', encoding='utf-8') as fsrc, \
open(train_tgt_output, 'w+', encoding='utf-8') as ftgt:
for words in train_data:
fsrc.write('{}\n'.format(' '.join(words[0])))
ftgt.write('{}\n'.format(' '.join(words[1])))
with open(dev_src_output, 'w+', encoding='utf-8') as fsrc, \
open(dev_tgt_output, 'w+', encoding='utf-8') as ftgt:
for words in valid_data:
fsrc.write('{}\n'.format(' '.join(words[0])))
ftgt.write('{}\n'.format(' '.join(words[1])))
with open(test_src_output, 'w+', encoding='utf-8') as fsrc, \
open(test_tgt_output, 'w+', encoding='utf-8') as ftgt:
for words in test_data:
fsrc.write('{}\n'.format(' '.join(words[0])))
ftgt.write('{}\n'.format(' '.join(words[1])))
with open(os.path.join(args.input_dir, \
args.save_data + args.pre_encoding + '.vocab.text'), 'w+', encoding='utf-8') as f:
max_size = len(id2w)
for idx in range(4, max_size):
f.write('{}\n'.format(id2w[idx]))
================================================
FILE: texar_repo/examples/transformer/utils/utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Helper functions for model training.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import random
import math
import logging
import numpy as np
import tensorflow as tf
def set_random_seed(myseed):
tf.set_random_seed(myseed)
np.random.seed(myseed)
random.seed(myseed)
def batch_size_fn(new, count, size_so_far):
max_src_in_batch, max_tgt_in_batch = 0, 0
max_src_in_batch = max(max_src_in_batch, len(new[0]) + 1)
max_tgt_in_batch = max(max_tgt_in_batch, len(new[1]) + 1)
src_elements = count * max_src_in_batch
tgt_elements = count * max_tgt_in_batch
return max(src_elements, tgt_elements)
def get_lr(fstep, opt_config):
if opt_config['learning_rate_schedule'] == 'static':
lr = opt_config['static_lr']
else:
lr = opt_config['lr_constant'] \
* min(1.0, (fstep / opt_config['warmup_steps'])) \
* (1 / math.sqrt(max(fstep, opt_config['warmup_steps'])))
return lr
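The non-static branch is the familiar Noam schedule: a linear warm-up followed by inverse-square-root decay, peaking at `warmup_steps`. A self-contained check (the `opt_config` values here are made up for illustration):

```python
import math

opt_config = {'learning_rate_schedule': 'noam', 'lr_constant': 2.0,
              'warmup_steps': 4000, 'static_lr': 1e-3}

def get_lr(fstep, opt_config):
    if opt_config['learning_rate_schedule'] == 'static':
        return opt_config['static_lr']
    # Linear warm-up times inverse-square-root decay.
    return (opt_config['lr_constant']
            * min(1.0, fstep / opt_config['warmup_steps'])
            * (1 / math.sqrt(max(fstep, opt_config['warmup_steps']))))

# The rate rises during warm-up, peaks at warmup_steps, then decays.
assert get_lr(1000, opt_config) < get_lr(4000, opt_config)
assert get_lr(16000, opt_config) < get_lr(4000, opt_config)
print(round(get_lr(4000, opt_config), 6))  # 0.031623
```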
def get_logger(log_path):
"""Returns a logger.
Args:
log_path (str): Path to the log file.
"""
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
fh = logging.FileHandler(log_path)
fh.setLevel(logging.DEBUG)
fh.setFormatter(
logging.Formatter('%(asctime)s:%(levelname)s:%(message)s'))
logger.addHandler(fh)
return logger
def list_strip_eos(list_, eos_token):
"""Strips EOS token from a list of lists of tokens.
"""
list_strip = []
for elem in list_:
if eos_token in elem:
elem = elem[:elem.index(eos_token)]
list_strip.append(elem)
return list_strip
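For example, applied to decoded hypotheses (the helper is copied here so the snippet is self-contained; the `<eos>` marker is illustrative):

```python
def list_strip_eos(list_, eos_token):
    # Truncate each sequence at its first EOS token, if present.
    list_strip = []
    for elem in list_:
        if eos_token in elem:
            elem = elem[:elem.index(eos_token)]
        list_strip.append(elem)
    return list_strip

hyps = [['a', 'b', '<eos>', 'pad'], ['c', 'd']]
print(list_strip_eos(hyps, '<eos>'))  # [['a', 'b'], ['c', 'd']]
```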
================================================
FILE: texar_repo/examples/vae_text/README.md
================================================
# Variational Autoencoder (VAE) for Text Generation
This example builds a VAE for text generation, with an LSTM as encoder and an LSTM or [Transformer](https://arxiv.org/pdf/1706.03762.pdf) as decoder. Training is performed on the official PTB data and Yahoo data, respectively.
The VAE with an LSTM decoder is first described in [(Bowman et al., 2015) Generating Sentences from a Continuous Space](https://arxiv.org/pdf/1511.06349.pdf).
The Yahoo dataset is from [(Yang et al., 2017) Improved Variational Autoencoders for Text Modeling using Dilated Convolutions](https://arxiv.org/pdf/1702.08139.pdf), which is created by sampling 100k documents from the original Yahoo Answer data. The average document length is 78 and the vocab size is 200k.
## Data
The datasets can be downloaded by running:
```shell
python prepare_data.py --data ptb
python prepare_data.py --data yahoo
```
## Training
Train with the following command:
```shell
python vae_train.py --config config_trans_ptb
```
Here:
* `--config` specifies the config file to use, including model hyperparameters and data paths. We provide 4 config files:
- [config_lstm_ptb.py](./config_lstm_ptb.py): LSTM decoder, on the PTB data
- [config_lstm_yahoo.py](./config_lstm_yahoo.py): LSTM decoder, on the Yahoo data
- [config_trans_ptb.py](./config_trans_ptb.py): Transformer decoder, on the PTB data
- [config_trans_yahoo.py](./config_trans_yahoo.py): Transformer decoder, on the Yahoo data
## Generation
Generating sentences with a pre-trained model can be performed with the following command:
```shell
python vae_train.py --config config_file --mode predict --model /path/to/model.ckpt --out /path/to/output
```
Here `--model` specifies the saved model checkpoint, which is saved in `./models/dataset_name/` at training time. For example, the model path is `./models/ptb/ptb_lstmDecoder.ckpt` when generating with an LSTM decoder trained on the PTB dataset. Generated sentences will be written to standard output if `--out` is not specified.
## Results
### Language Modeling
|Dataset |Metric | VAE-LSTM |VAE-Transformer |
|---------------|-------------|----------------|------------------------|
|Yahoo | Test PPL | 68.11 | 59.95 |
|Yahoo | Test NLL | 337.13 | 326.93 |
|PTB | Test PPL | 104.61 | 103.68 |
|PTB | Test NLL | 101.92 | 101.72 |
### Generated Examples
We show examples generated with the Transformer decoder trained on the PTB training data.
|Examples|
|:---------|
|i 'm always looking at a level of \$ N to \$ N billion |
|after four years ago president bush has federal regulators decided to file financing for the waiver |
|the savings & loan association said total asset revenue was about \$ N billion compared with \$ N billion |
|the trend would seem to be effective |
|chicago city 's computer bank of britain posted a N N jump in third-quarter net income |
================================================
FILE: texar_repo/examples/vae_text/config_lstm_ptb.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""VAE config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
dataset = "ptb"
num_epochs = 100
hidden_size = 256
dec_dropout_in = 0.5
dec_dropout_out = 0.5
enc_dropout_in = 0.
enc_dropout_out = 0.
word_keep_prob = 0.5
batch_size = 32
embed_dim = 256
latent_dims = 32
lr_decay_hparams = {
"init_lr": 0.001,
"threshold": 2,
"decay_factor": 0.5,
"max_decay": 5
}
decoder_hparams = {
"type": "lstm"
}
enc_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": 1. - enc_dropout_out},
"num_layers": 1
}
dec_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": 1. - dec_dropout_out},
"num_layers": 1
}
enc_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": enc_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
dec_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": dec_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
# KL annealing
kl_anneal_hparams={
"warm_up": 10,
"start": 0.1
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './simple-examples/data/ptb.train.txt',
"vocab_file": './simple-examples/data/vocab.txt'
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './simple-examples/data/ptb.valid.txt',
"vocab_file": './simple-examples/data/vocab.txt'
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": './simple-examples/data/ptb.test.txt',
"vocab_file": './simple-examples/data/vocab.txt'
}
}
opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001
}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
}
}
================================================
FILE: texar_repo/examples/vae_text/config_lstm_yahoo.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""VAE config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
dataset = "yahoo"
num_epochs = 100
hidden_size = 550
dec_dropout_in = 0.5
dec_dropout_out = 0.5
enc_dropout_in = 0.
enc_dropout_out = 0.
batch_size = 32
embed_dim = 512
latent_dims = 32
lr_decay_hparams = {
"init_lr": 0.001,
"threshold": 2,
"decay_factor": 0.5,
"max_decay": 5
}
relu_dropout = 0.2
embedding_dropout = 0.2
attention_dropout = 0.2
residual_dropout = 0.2
num_blocks = 3
decoder_hparams = {
"type": "lstm"
}
enc_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": 1. - enc_dropout_out},
"num_layers": 1
}
dec_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": 1. - dec_dropout_out},
"num_layers": 1
}
enc_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": enc_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
dec_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": dec_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
# KL annealing
# kl_weight = 1.0 / (1 + np.exp(-k*(step-x0)))
kl_anneal_hparams={
"warm_up": 10,
"start": 0.1
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './data/yahoo/yahoo.train.txt',
"vocab_file": './data/yahoo/vocab.txt'
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './data/yahoo/yahoo.valid.txt',
"vocab_file": './data/yahoo/vocab.txt'
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": './data/yahoo/yahoo.test.txt',
"vocab_file": './data/yahoo/vocab.txt'
}
}
opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001
}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
}
}
================================================
FILE: texar_repo/examples/vae_text/config_trans_ptb.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""VAE config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
dataset = "ptb"
num_epochs = 100
hidden_size = 256
dec_dropout_in = 0.
enc_dropout_in = 0.
enc_dropout_out = 0.
batch_size = 32
embed_dim = 256
latent_dims = 32
lr_decay_hparams = {
"init_lr": 0.001,
"threshold": 2,
"decay_factor": 0.5,
"max_decay": 5
}
relu_dropout = 0.2
embedding_dropout = 0.2
attention_dropout = 0.2
residual_dropout = 0.2
num_blocks = 3
decoder_hparams = {
"type": "transformer"
}
enc_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": 1. - enc_dropout_out},
"num_layers": 1
}
enc_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": enc_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
dec_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": dec_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
# due to the residual connection, the embed_dim should be equal to hidden_size
trans_hparams = {
'output_layer_bias': False,
'embedding_dropout': embedding_dropout,
'residual_dropout': residual_dropout,
'num_blocks': num_blocks,
'dim': hidden_size,
'position_embedder_hparams': {
'dim': hidden_size,
},
'initializer': {
'type': 'variance_scaling_initializer',
'kwargs': {
'scale': 1.0,
'mode': 'fan_avg',
'distribution': 'uniform',
},
},
'multihead_attention': {
'dropout_rate': attention_dropout,
'num_heads': 8,
'num_units': hidden_size,
'output_dim': hidden_size
},
'poswise_feedforward': {
'name': 'fnn',
'layers': [
{
'type': 'Dense',
'kwargs': {
'name': 'conv1',
'units': hidden_size*4,
'activation': 'relu',
'use_bias': True,
},
},
{
'type': 'Dropout',
'kwargs': {
'rate': relu_dropout,
}
},
{
'type': 'Dense',
'kwargs': {
'name': 'conv2',
'units': hidden_size,
'use_bias': True,
}
}
],
}
}
# KL annealing
kl_anneal_hparams = {
"warm_up": 10,
"start": 0.1
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './simple-examples/data/ptb.train.txt',
"vocab_file": './simple-examples/data/vocab.txt'
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './simple-examples/data/ptb.valid.txt',
"vocab_file": './simple-examples/data/vocab.txt'
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": './simple-examples/data/ptb.test.txt',
"vocab_file": './simple-examples/data/vocab.txt'
}
}
opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001
}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
}
}
================================================
FILE: texar_repo/examples/vae_text/config_trans_yahoo.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""VAE config.
"""
# pylint: disable=invalid-name, too-few-public-methods, missing-docstring
dataset = "yahoo"
num_epochs = 100
hidden_size = 512
dec_dropout_in = 0.
enc_dropout_in = 0.
enc_dropout_out = 0.
batch_size = 32
embed_dim = 512
latent_dims = 32
lr_decay_hparams = {
"init_lr": 0.001,
"threshold": 2,
"decay_factor": 0.5,
"max_decay": 5
}
relu_dropout = 0.2
embedding_dropout = 0.2
attention_dropout = 0.2
residual_dropout = 0.2
num_blocks = 3
decoder_hparams = {
"type": "transformer"
}
enc_cell_hparams = {
"type": "LSTMBlockCell",
"kwargs": {
"num_units": hidden_size,
"forget_bias": 0.
},
"dropout": {"output_keep_prob": 1. - enc_dropout_out},
"num_layers": 1
}
enc_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": enc_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
dec_emb_hparams = {
'name': 'lookup_table',
"dim": embed_dim,
"dropout_rate": dec_dropout_in,
'initializer' : {
'type': 'random_normal_initializer',
'kwargs': {
'mean': 0.0,
'stddev': embed_dim**-0.5,
},
}
}
# due to the residual connection, the embed_dim should be equal to hidden_size
trans_hparams = {
'output_layer_bias': False,
'embedding_dropout': embedding_dropout,
'residual_dropout': residual_dropout,
'num_blocks': num_blocks,
'dim': hidden_size,
'initializer': {
'type': 'variance_scaling_initializer',
'kwargs': {
'scale': 1.0,
'mode':'fan_avg',
'distribution':'uniform',
},
},
'multihead_attention': {
'dropout_rate': attention_dropout,
'num_heads': 8,
'num_units': hidden_size,
'output_dim': hidden_size
},
'poswise_feedforward': {
'name':'fnn',
'layers':[
{
'type':'Dense',
'kwargs': {
'name':'conv1',
'units':hidden_size*4,
'activation':'relu',
'use_bias':True,
},
},
{
'type':'Dropout',
'kwargs': {
'rate': relu_dropout,
}
},
{
'type':'Dense',
'kwargs': {
'name':'conv2',
'units':hidden_size,
'use_bias':True,
}
}
],
}
}
# KL annealing
kl_anneal_hparams={
"warm_up": 10,
"start": 0.1
}
train_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './data/yahoo/yahoo.train.txt',
"vocab_file": './data/yahoo/vocab.txt'
}
}
val_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"seed": 123,
"dataset": {
"files": './data/yahoo/yahoo.valid.txt',
"vocab_file": './data/yahoo/vocab.txt'
}
}
test_data_hparams = {
"num_epochs": 1,
"batch_size": batch_size,
"dataset": {
"files": './data/yahoo/yahoo.test.txt',
"vocab_file": './data/yahoo/vocab.txt'
}
}
opt_hparams = {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001
}
},
"gradient_clip": {
"type": "clip_by_global_norm",
"kwargs": {"clip_norm": 5.}
}
}
================================================
FILE: texar_repo/examples/vae_text/prepare_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utilities for downloading and preprocessing the PTB and Yahoo data.
"""
import os
import argparse
import tensorflow as tf
import texar as tx
def prepare_data(data_name):
"""Prepare datasets.
Args:
data_name: the name of the dataset; "ptb" and "yahoo"
are currently supported
"""
if data_name == "ptb":
data_path = "./simple-examples/data"
train_path = os.path.join(data_path, "ptb.train.txt")
if not tf.gfile.Exists(train_path):
url = 'http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz'
tx.data.maybe_download(url, './', extract=True)
train_path = os.path.join(data_path, "ptb.train.txt")
vocab_path = os.path.join(data_path, "vocab.txt")
word_to_id = tx.data.make_vocab(
train_path, return_type="dict")
with open(vocab_path, 'w') as fvocab:
for word in word_to_id:
fvocab.write("%s\n" % word)
elif data_name == "yahoo":
data_path = "./data/yahoo"
train_path = os.path.join(data_path, "yahoo.train.txt")
if not tf.gfile.Exists(train_path):
url = 'https://drive.google.com/file/d/'\
'13IsiffVjcQ-wrrbBGMwiG3sYf-DFxtXH/view?usp=sharing'
tx.data.maybe_download(url, path='./', filenames='yahoo.zip',
extract=True)
else:
raise ValueError('Unknown data: {}'.format(data_name))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='prepare data')
parser.add_argument('--data', type=str, help='dataset to prepare')
args = parser.parse_args()
prepare_data(args.data)
================================================
FILE: texar_repo/examples/vae_text/vae_train.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example for building the Variational Autoencoder.
This is an impmentation of Variational Autoencoder for text generation
To run:
$ python vae_train.py
Hyperparameters and data path may be specified in config_trans.py
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=invalid-name, no-member, too-many-locals
# pylint: disable=too-many-branches, too-many-statements, redefined-variable-type
import os
import sys
import time
import importlib
from io import open
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
import texar as tx
tfd = tfp.distributions
flags = tf.flags
flags.DEFINE_string("config", "config", "The config to use.")
flags.DEFINE_string("mode", "train", "train or predict")
flags.DEFINE_string("model", None, "model path for generating sentences")
flags.DEFINE_string("out", None, "generation output path")
FLAGS = flags.FLAGS
config = importlib.import_module(FLAGS.config)
def kl_dvg(means, logvars):
"""compute the KL divergence between Gaussian distribution
"""
kl_cost = -0.5 * (logvars - tf.square(means) -
tf.exp(logvars) + 1.0)
kl_cost = tf.reduce_mean(kl_cost, 0)
return tf.reduce_sum(kl_cost)
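The quantity above has a closed form that can be checked without TensorFlow. A standalone, dependency-free sketch of `kl_dvg` (illustrative values only; `means`/`logvars` are plain nested lists here rather than tensors):

```python
import math

def kl_diag_gaussian(means, logvars):
    # KL( N(mean, diag(exp(logvar))) || N(0, I) ),
    # averaged over the batch and summed over latent dimensions,
    # mirroring kl_dvg above.
    n = len(means)
    dims = len(means[0])
    total = 0.0
    for d in range(dims):
        acc = 0.0
        for i in range(n):
            m, lv = means[i][d], logvars[i][d]
            acc += -0.5 * (lv - m * m - math.exp(lv) + 1.0)
        total += acc / n
    return total

# Matching distributions give zero KL; a unit mean shift costs 0.5 per dim.
print(kl_diag_gaussian([[0.0]] * 4, [[0.0]] * 4))  # 0.0
print(kl_diag_gaussian([[1.0]] * 4, [[0.0]] * 4))  # 0.5
```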
def _main(_):
# Data
train_data = tx.data.MonoTextData(config.train_data_hparams)
val_data = tx.data.MonoTextData(config.val_data_hparams)
test_data = tx.data.MonoTextData(config.test_data_hparams)
iterator = tx.data.TrainTestDataIterator(train=train_data,
val=val_data,
test=test_data)
data_batch = iterator.get_next()
opt_vars = {
'learning_rate': config.lr_decay_hparams["init_lr"],
'best_valid_nll': 1e100,
'steps_not_improved': 0,
'kl_weight': config.kl_anneal_hparams["start"]
}
decay_cnt = 0
max_decay = config.lr_decay_hparams["max_decay"]
decay_factor = config.lr_decay_hparams["decay_factor"]
decay_ts = config.lr_decay_hparams["threshold"]
save_dir = "./models/%s" % config.dataset
if not os.path.exists(save_dir):
os.makedirs(save_dir)
suffix = "%s_%sDecoder.ckpt" % \
(config.dataset, config.decoder_hparams["type"])
save_path = os.path.join(save_dir, suffix)
# KL term annealing rate
anneal_r = 1.0 / (config.kl_anneal_hparams["warm_up"] * \
(train_data.dataset_size() / config.batch_size))
# Model architecture
encoder_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.vocab.size, hparams=config.enc_emb_hparams)
decoder_embedder = tx.modules.WordEmbedder(
vocab_size=train_data.vocab.size, hparams=config.dec_emb_hparams)
input_embed = encoder_embedder(data_batch["text_ids"])
output_embed = decoder_embedder(data_batch["text_ids"][:, :-1])
encoder = tx.modules.UnidirectionalRNNEncoder(
hparams={"rnn_cell": config.enc_cell_hparams})
if config.decoder_hparams["type"] == "lstm":
decoder = tx.modules.BasicRNNDecoder(
vocab_size=train_data.vocab.size,
hparams={"rnn_cell": config.dec_cell_hparams})
decoder_initial_state_size = decoder.cell.state_size
elif config.decoder_hparams["type"] == 'transformer':
decoder = tx.modules.TransformerDecoder(
embedding=decoder_embedder.embedding,
hparams=config.trans_hparams)
decoder_initial_state_size = tf.TensorShape(
[1, config.dec_emb_hparams["dim"]])
else:
raise NotImplementedError
connector_mlp = tx.modules.MLPTransformConnector(
config.latent_dims * 2)
connector_stoch = tx.modules.ReparameterizedStochasticConnector(
decoder_initial_state_size)
_, ecdr_states = encoder(
input_embed,
sequence_length=data_batch["length"])
mean_logvar = connector_mlp(ecdr_states)
mean, logvar = tf.split(mean_logvar, 2, 1)
kl_loss = kl_dvg(mean, logvar)
dst = tfd.MultivariateNormalDiag(
loc=mean,
scale_diag=tf.exp(0.5 * logvar))
dcdr_states, latent_z = connector_stoch(dst)
# decoder
if config.decoder_hparams["type"] == "lstm":
# concat latent variable to input at every time step
latent_z = tf.expand_dims(latent_z, axis=1)
latent_z = tf.tile(latent_z, [1, tf.shape(output_embed)[1], 1])
output_embed = tf.concat([output_embed, latent_z], axis=2)
outputs, _, _ = decoder(
initial_state=dcdr_states,
decoding_strategy="train_greedy",
inputs=output_embed,
sequence_length=data_batch["length"]-1)
else:
outputs = decoder(
inputs=output_embed,
memory=dcdr_states,
memory_sequence_length=tf.ones(tf.shape(dcdr_states)[0]))
logits = outputs.logits
seq_lengths = data_batch["length"] - 1
# Losses & train ops
rc_loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=data_batch["text_ids"][:, 1:],
logits=logits,
sequence_length=data_batch["length"]-1)
# KL annealing
kl_weight = tf.placeholder(tf.float32, shape=())
nll = rc_loss + kl_weight * kl_loss
learning_rate = tf.placeholder(dtype=tf.float32, shape=(),
name='learning_rate')
train_op = tx.core.get_train_op(nll, learning_rate=learning_rate,
hparams=config.opt_hparams)
def _run_epoch(sess, epoch, mode_string, display=10):
if mode_string == 'train':
iterator.switch_to_train_data(sess)
elif mode_string == 'valid':
iterator.switch_to_val_data(sess)
elif mode_string == 'test':
iterator.switch_to_test_data(sess)
step = 0
start_time = time.time()
num_words = num_sents = 0
nll_ = 0.
kl_loss_ = rc_loss_ = 0.
while True:
try:
fetches = {"nll": nll,
"kl_loss": kl_loss,
"rc_loss": rc_loss,
"lengths": seq_lengths}
if mode_string == 'train':
fetches["train_op"] = train_op
opt_vars["kl_weight"] = min(
1.0, opt_vars["kl_weight"] + anneal_r)
kl_weight_ = opt_vars["kl_weight"]
else:
kl_weight_ = 1.0
mode = (tf.estimator.ModeKeys.TRAIN if mode_string == 'train'
else tf.estimator.ModeKeys.EVAL)
feed = {tx.global_mode(): mode,
kl_weight: kl_weight_,
learning_rate: opt_vars["learning_rate"]}
fetches_ = sess.run(fetches, feed_dict=feed)
batch_size = len(fetches_["lengths"])
num_sents += batch_size
num_words += sum(fetches_["lengths"])
nll_ += fetches_["nll"] * batch_size
kl_loss_ += fetches_["kl_loss"] * batch_size
rc_loss_ += fetches_["rc_loss"] * batch_size
if step % display == 0 and mode_string == 'train':
print('%s: epoch %d, step %d, nll %.4f, klw: %.4f, ' \
'KL %.4f, rc %.4f, log_ppl %.4f, ppl %.4f, ' \
'time elapsed: %.1fs' % \
(mode_string, epoch, step, nll_ / num_sents,
opt_vars["kl_weight"], kl_loss_ / num_sents,
rc_loss_ / num_sents, nll_ / num_words,
np.exp(nll_ / num_words), time.time() - start_time))
sys.stdout.flush()
step += 1
except tf.errors.OutOfRangeError:
print('\n%s: epoch %d, nll %.4f, KL %.4f, rc %.4f, ' \
'log_ppl %.4f, ppl %.4f\n' %
(mode_string, epoch, nll_ / num_sents,
kl_loss_ / num_sents, rc_loss_ / num_sents,
nll_ / num_words, np.exp(nll_ / num_words)))
break
return nll_ / num_sents, np.exp(nll_ / num_words)
def generate(sess, saver, fname=None):
if tf.train.checkpoint_exists(FLAGS.model):
saver.restore(sess, FLAGS.model)
else:
raise ValueError("cannot find checkpoint model")
batch_size = train_data.batch_size
dst = tfd.MultivariateNormalDiag(
loc=tf.zeros([batch_size, config.latent_dims]),
scale_diag=tf.ones([batch_size, config.latent_dims]))
dcdr_states, latent_z = connector_stoch(dst)
# to concatenate latent variable to input word embeddings
def _cat_embedder(ids):
embedding = decoder_embedder(ids)
return tf.concat([embedding, latent_z], axis=1)
vocab = train_data.vocab
start_tokens = tf.ones(batch_size, tf.int32) * vocab.bos_token_id
end_token = vocab.eos_token_id
if config.decoder_hparams["type"] == "lstm":
outputs, _, _ = decoder(
initial_state=dcdr_states,
decoding_strategy="infer_sample",
embedding=_cat_embedder,
max_decoding_length=100,
start_tokens=start_tokens,
end_token=end_token)
else:
outputs, _ = decoder(
memory=dcdr_states,
decoding_strategy="infer_sample",
memory_sequence_length=tf.ones(tf.shape(dcdr_states)[0]),
max_decoding_length=100,
start_tokens=start_tokens,
end_token=end_token)
sample_tokens = vocab.map_ids_to_tokens(outputs.sample_id)
sess.run(tf.tables_initializer())
mode_key = tf.estimator.ModeKeys.EVAL
feed = {tx.global_mode():mode_key}
sample_tokens_ = sess.run(sample_tokens, feed_dict=feed)
if fname is None:
fh = sys.stdout
else:
fh = open(fname, 'w', encoding='utf-8')
for sent in sample_tokens_:
sent = list(sent)
end_id = sent.index(vocab.eos_token)
fh.write(' '.join(sent[:end_id+1]) + '\n')
fh.close()
saver = tf.train.Saver()
with tf.Session() as sess:
# generate samples from prior
if FLAGS.mode == "predict":
generate(sess, saver, FLAGS.out)
return
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
# Counts trainable parameters
total_parameters = 0
for variable in tf.trainable_variables():
shape = variable.get_shape() # shape is an array of tf.Dimension
variable_parameters = 1
for dim in shape:
variable_parameters *= dim.value
total_parameters += variable_parameters
print("%d total parameters" % total_parameters)
best_nll = best_ppl = 0.
for epoch in range(config.num_epochs):
_, _ = _run_epoch(sess, epoch, 'train', display=200)
val_nll, _ = _run_epoch(sess, epoch, 'valid')
test_nll, test_ppl = _run_epoch(sess, epoch, 'test')
if val_nll < opt_vars['best_valid_nll']:
opt_vars['best_valid_nll'] = val_nll
opt_vars['steps_not_improved'] = 0
best_nll = test_nll
best_ppl = test_ppl
saver.save(sess, save_path)
else:
opt_vars['steps_not_improved'] += 1
if opt_vars['steps_not_improved'] == decay_ts:
old_lr = opt_vars['learning_rate']
opt_vars['learning_rate'] *= decay_factor
opt_vars['steps_not_improved'] = 0
new_lr = opt_vars['learning_rate']
print('-----\nchange lr, old lr: %f, new lr: %f\n-----' %
(old_lr, new_lr))
saver.restore(sess, save_path)
decay_cnt += 1
if decay_cnt == max_decay:
break
print('\nbest testing nll: %.4f, best testing ppl %.4f\n' %
(best_nll, best_ppl))
if __name__ == '__main__':
tf.app.run(main=_main)
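The validation-driven learning-rate schedule in `_main` (decay after `threshold` non-improving epochs, up to `max_decay` times, restoring the best checkpoint) can be isolated as a small state machine. A standalone sketch with made-up hyperparameter values:

```python
def lr_schedule_step(state, val_nll, decay_factor=0.5, threshold=2):
    """Update `state` after one validation pass; return True if lr decayed."""
    if val_nll < state['best_valid_nll']:
        state['best_valid_nll'] = val_nll
        state['steps_not_improved'] = 0
        return False
    state['steps_not_improved'] += 1
    if state['steps_not_improved'] == threshold:
        state['steps_not_improved'] = 0
        state['learning_rate'] *= decay_factor
        return True
    return False

state = {'learning_rate': 1e-3, 'best_valid_nll': 1e100, 'steps_not_improved': 0}
for nll in [10.0, 9.0, 9.5, 9.4]:  # improves twice, then stalls twice
    lr_schedule_step(state, nll)
print(state['learning_rate'])  # 0.0005: decayed once after two stalled epochs
```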
================================================
FILE: texar_repo/requirements.txt
================================================
tensorflow >= 1.7.0
tensorflow-gpu >= 1.7.0
tensorflow-probability >= 0.3.0
tensorflow-probability-gpu >= 0.3.0
funcsigs >= 1.0.2
================================================
FILE: texar_repo/setup.py
================================================
import setuptools
long_description = '''
Texar is an open-source toolkit based on Tensorflow,
aiming to support a broad set of machine learning tasks, especially text generation tasks,
such as machine translation, dialog, summarization, content manipulation, language modeling, and so on.
Texar is designed for both researchers and practitioners for fast prototyping and experimentation.
'''
setuptools.setup(
name="texar",
version="0.1",
url="https://github.com/asyml/texar",
description="Toolkit for Text Generation and Beyond",
long_description=long_description,
license='Apache License Version 2.0',
packages=setuptools.find_packages(),
platforms='any',
install_requires=[
'numpy',
'pyyaml',
'requests',
'funcsigs',
],
extras_require={
'tensorflow-cpu': ['tensorflow>=1.7.0', 'tensorflow-probability >= 0.3.0'],
'tensorflow-gpu': ['tensorflow-gpu>=1.7.0', 'tensorflow-probability-gpu >= 0.3.0']
},
package_data={
"texar": [
"../bin/utils/multi-bleu.perl",
]
},
classifiers=[
'Intended Audience :: Developers',
'Intended Audience :: Education',
'Intended Audience :: Science/Research',
'Operating System :: OS Independent',
'Programming Language :: Python',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
],
)
================================================
FILE: texar_repo/texar/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.module_base import *
from texar.hyperparams import *
from texar.context import *
from texar import modules
from texar import core
from texar import losses
from texar import models
from texar import data
from texar import evals
from texar import agents
from texar import run
from texar import utils
================================================
FILE: texar_repo/texar/agents/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various RL Agents
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.agents.pg_agent import *
from texar.agents.seq_pg_agent import *
from texar.agents.dqn_agent import *
from texar.agents.ac_agent import *
from texar.agents.agent_utils import *
try:
from texar.agents.agent_gym_utils import *
except ImportError:
pass
================================================
FILE: texar_repo/texar/agents/ac_agent.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Actor-critic agent.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
import numpy as np
from texar.agents.episodic_agent_base import EpisodicAgentBase
from texar.utils import utils
# pylint: disable=too-many-instance-attributes, protected-access
# pylint: disable=too-many-arguments
__all__ = [
"ActorCriticAgent"
]
class ActorCriticAgent(EpisodicAgentBase):
"""Actor-critic agent for episodic setting.
An actor-critic algorithm consists of several components:
- **Actor** is the policy to optimize. As a temporary implementation,\
here by default we use a :class:`~texar.agents.PGAgent` instance \
that wraps a `policy net` and provides proper interfaces to perform \
the role of an actor.
- **Critic** that provides learning signals to the actor. Again, as \
a temporary implementation, here by default we use a \
:class:`~texar.agents.DQNAgent` instance that wraps a `Q net` and \
provides proper interfaces to perform the role of a critic.
Args:
env_config: An instance of :class:`~texar.agents.EnvConfig` specifying
action space, observation space, and reward range, etc. Use
:func:`~texar.agents.get_gym_env_config` to create an EnvConfig
from a gym environment.
sess (optional): A tf session.
Can be `None` here and set later with `agent.sess = session`.
actor (optional): An instance of :class:`~texar.agents.PGAgent` that
performs as actor in the algorithm.
If not provided, an actor is created based on :attr:`hparams`.
actor_kwargs (dict, optional): Keyword arguments for actor
constructor. Note that the `hparams` argument for actor
constructor is specified in the "actor_hparams" field of
:attr:`hparams` and should not be included in `actor_kwargs`.
Ignored if :attr:`actor` is given.
critic (optional): An instance of :class:`~texar.agents.DQNAgent` that
performs as critic in the algorithm.
If not provided, a critic is created based on :attr:`hparams`.
critic_kwargs (dict, optional): Keyword arguments for critic
constructor. Note that the `hparams` argument for critic
constructor is specified in the "critic_hparams" field of
:attr:`hparams` and should not be included in `critic_kwargs`.
Ignored if :attr:`critic` is given.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
"""
def __init__(self,
env_config,
sess=None,
actor=None,
actor_kwargs=None,
critic=None,
critic_kwargs=None,
hparams=None):
EpisodicAgentBase.__init__(self, env_config=env_config, hparams=hparams)
self._sess = sess
self._num_actions = self._env_config.action_space.high - \
self._env_config.action_space.low
with tf.variable_scope(self.variable_scope):
if actor is None:
kwargs = utils.get_instance_kwargs(
actor_kwargs, self._hparams.actor_hparams)
kwargs.update(dict(env_config=env_config, sess=sess))
actor = utils.get_instance(
class_or_name=self._hparams.actor_type,
kwargs=kwargs,
module_paths=['texar.agents', 'texar.custom'])
self._actor = actor
if critic is None:
kwargs = utils.get_instance_kwargs(
critic_kwargs, self._hparams.critic_hparams)
kwargs.update(dict(env_config=env_config, sess=sess))
critic = utils.get_instance(
class_or_name=self._hparams.critic_type,
kwargs=kwargs,
module_paths=['texar.agents', 'texar.custom'])
self._critic = critic
if self._actor._discount_factor != self._critic._discount_factor:
raise ValueError('discount_factor of the actor and the critic '
'must be the same.')
self._discount_factor = self._actor._discount_factor
self._observs = []
self._actions = []
self._rewards = []
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values:
.. role:: python(code)
:language: python
.. code-block:: python
{
'actor_type': 'PGAgent',
'actor_hparams': None,
'critic_type': 'DQNAgent',
'critic_hparams': None,
'name': 'actor_critic_agent'
}
Here:
"actor_type" : str or class or instance
Actor. Can be class, its
name or module path, or a class instance. If class name is given,
the class must be from module :mod:`texar.agents` or
:mod:`texar.custom`. Ignored if an `actor` is given to
the agent constructor.
"actor_kwargs" : dict, optional
Keyword arguments for the actor class constructor. With the :attr:`actor_kwargs`
argument to the constructor, an actor is created with
:python:`actor_class(**actor_kwargs, hparams=actor_hparams)`.
"critic_type" : str or class or instance
Critic. Can be class, its
name or module path, or a class instance. If class name is given,
the class must be from module :mod:`texar.agents` or
:mod:`texar.custom`. Ignored if a `critic` is given to
the agent constructor.
"critic_kwargs" : dict, optional
Keyword arguments for the critic class constructor. With the :attr:`critic_kwargs`
argument to the constructor, a critic is created with
:python:`critic_class(**critic_kwargs, hparams=critic_hparams)`.
"name" : str
Name of the agent.
"""
return {
'actor_type': 'PGAgent',
'actor_hparams': None,
'critic_type': 'DQNAgent',
'critic_hparams': None,
'name': 'actor_critic_agent'
}
def _reset(self):
self._actor._reset()
self._critic._reset()
def _observe(self, reward, terminal, train_policy, feed_dict):
self._train_actor(
observ=self._observ,
action=self._action,
feed_dict=feed_dict)
self._critic._observe(reward, terminal, train_policy, feed_dict)
def _train_actor(self, observ, action, feed_dict):
qvalues = self._critic._qvalues_from_target(observ=observ)
advantage = qvalues[0][action] - np.mean(qvalues)
# TODO (bowen): should this be a function to customize?
feed_dict_ = {
self._actor._observ_inputs: [observ],
self._actor._action_inputs: [action],
self._actor._advantage_inputs: [advantage]
}
feed_dict_.update(feed_dict)
self._actor._train_policy(feed_dict=feed_dict_)
def get_action(self, observ, feed_dict=None):
self._observ = observ
self._action = self._actor.get_action(observ, feed_dict=feed_dict)
self._critic._update_observ_action(self._observ, self._action)
return self._action
@property
def sess(self):
"""The tf session.
"""
return self._sess
@sess.setter
def sess(self, session):
self._sess = session
self._actor._sess = session
self._critic._sess = session
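The actor's learning signal in `_train_actor` is the advantage of the chosen action's Q-value over the mean Q-value. Numerically, with made-up Q-values:

```python
from statistics import mean

qvalues = [1.0, 3.0, 2.0]   # hypothetical Q-values for 3 actions (batch of 1)
action = 1                  # the action the actor took
advantage = qvalues[action] - mean(qvalues)
print(advantage)  # 3.0 - 2.0 = 1.0: the action beat the average
```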
================================================
FILE: texar_repo/texar/agents/agent_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for reinforcement learning agents.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar.hyperparams import HParams
from texar.utils.variables import get_unique_named_variable_scope
# pylint: disable=too-many-instance-attributes
__all__ = [
"AgentBase"
]
class AgentBase(object):
"""
Base class inherited by RL agents.
Args:
TODO
"""
def __init__(self, hparams=None):
self._hparams = HParams(hparams, self.default_hparams())
name = self._hparams.name
self._variable_scope = get_unique_named_variable_scope(name)
self._unique_name = self._variable_scope.name.split("/")[-1]
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
TODO
"""
return {
'name': 'agent'
}
@property
def variable_scope(self):
"""The variable scope of the agent.
"""
return self._variable_scope
@property
def name(self):
"""The name of the module (not uniquified).
"""
return self._unique_name
@property
def hparams(self):
"""A :class:`~texar.hyperparams.HParams` instance. The hyperparameters
of the module.
"""
return self._hparams
================================================
FILE: texar_repo/texar/agents/agent_gym_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various agent utilities based on OpenAI Gym.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import gym
__all__ = [
"convert_gym_space",
"get_gym_env_config"
]
def convert_gym_space(spc):
"""Converts a :gym:`gym.Space <#spaces>` instance to a
:class:`~texar.agents.Space` instance.
Args:
spc: An instance of `gym.Space` or
:class:`~texar.agents.Space`.
"""
from texar.agents.agent_utils import Space
if isinstance(spc, Space):
return spc
if isinstance(spc, gym.spaces.Discrete):
return Space(shape=(), low=0, high=spc.n, dtype=spc.dtype)
elif isinstance(spc, gym.spaces.Box):
return Space(
shape=spc.shape, low=spc.low, high=spc.high, dtype=spc.dtype)
def get_gym_env_config(env):
"""Creates an instance of :class:`~texar.agents.EnvConfig`
from a :gym:`gym env <#environments>`.
Args:
env: An instance of OpenAI gym Environment.
Returns:
An instance of :class:`~texar.agents.EnvConfig`.
"""
from texar.agents.agent_utils import EnvConfig
return EnvConfig(
action_space=env.action_space,
observ_space=env.observation_space,
reward_range=env.reward_range)
================================================
FILE: texar_repo/texar/agents/agent_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various agent utilities.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=too-many-arguments, too-few-public-methods, no-member
# pylint: disable=invalid-name, wrong-import-position
import numpy as np
gym_utils = None
try:
from texar.agents import agent_gym_utils as gym_utils
except ImportError:
pass
__all__ = [
"Space",
"EnvConfig"
]
class Space(object):
"""Observation and action spaces. Describes valid actions and observations.
Similar to :gym:`gym.Space <#spaces>`.
Args:
shape (optional): Shape of the space, a tuple. If not
given, infers from :attr:`low` and :attr:`high`.
low (optional): Lower bound (inclusive) of each dimension of the
space. Must have
shape as specified by :attr:`shape`, and of the same shape
as :attr:`high` (if given). If `None`, set to `-inf` for each
dimension.
high (optional): Upper bound (inclusive) of each dimension of the
space. Must have
shape as specified by :attr:`shape`, and of the same shape
as :attr:`low` (if given). If `None`, set to `inf` for each
dimension.
dtype (optional): Data type of elements in the space. If not given,
infers from :attr:`low` (if given) or set to `float`.
Example:
.. code-block:: python
s = Space(low=0, high=10, dtype=np.int32)
#s.contains(2) == True
#s.contains(10) == True
#s.contains(11) == False
#s.shape == ()
s2 = Space(shape=(2,2), high=np.ones([2,2]), dtype=np.float)
#s2.low == [[-inf, -inf], [-inf, -inf]]
#s2.high == [[1., 1.], [1., 1.]]
"""
def __init__(self, shape=None, low=None, high=None, dtype=None):
if low is None:
low = -float('inf')
if high is None:
high = float('inf')
if shape is None:
low = np.asarray(low)
high = np.asarray(high)
if low.shape != high.shape:
raise ValueError('`low` and `high` must have the same shape.')
shape = low.shape
else:
shape = tuple(shape)
if np.isscalar(low):
low = low + np.zeros(shape, dtype=dtype)
if np.isscalar(high):
high = high + np.zeros(shape, dtype=dtype)
if shape != low.shape or shape != high.shape:
raise ValueError(
'Shape inconsistent: shape={}, low.shape={}, high.shape={}'
.format(shape, low.shape, high.shape))
if dtype is None:
dtype = low.dtype
dtype = np.dtype(dtype)
low = low.astype(dtype)
high = high.astype(dtype)
self._shape = shape
self._low = low
self._high = high
self._dtype = dtype
def contains(self, x):
"""Checks if x is contained in the space. Returns a `bool`.
"""
x = np.asarray(x)
dtype_match = True
if self._dtype.kind in np.typecodes['AllInteger']:
if x.dtype.kind not in np.typecodes['AllInteger']:
dtype_match = False
shape_match = x.shape == self._shape
low_match = (x >= self._low).all()
high_match = (x <= self._high).all()
return dtype_match and shape_match and low_match and high_match
@property
def shape(self):
"""Shape of the space.
"""
return self._shape
@property
def low(self):
"""Lower bound of the space.
"""
return self._low
@property
def high(self):
"""Upper bound of the space.
"""
return self._high
@property
def dtype(self):
"""Data type of the element.
"""
return self._dtype
class EnvConfig(object):
"""Configurations of an environment.
Args:
action_space: An instance of :class:`~texar.agents.Space` or
:gym:`gym.Space <#spaces>`, the action space.
observ_space: An instance of :class:`~texar.agents.Space` or
:gym:`gym.Space <#spaces>`, the observation space.
reward_range: A tuple corresponding to the min and max possible
rewards, e.g., `reward_range=(-1.0, 1.0)`.
"""
def __init__(self,
action_space,
observ_space,
reward_range):
if gym_utils:
action_space = gym_utils.convert_gym_space(action_space)
observ_space = gym_utils.convert_gym_space(observ_space)
self.action_space = action_space
self.action_dtype = action_space.dtype
self.action_shape = action_space.shape
self.observ_space = observ_space
self.observ_dtype = observ_space.dtype
self.observ_shape = observ_space.shape
self.reward_range = reward_range
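`Space.contains` combines three checks: dtype compatibility (integer spaces reject non-integer inputs), shape equality, and inclusive bounds. A dependency-free sketch of the scalar case (simplified for illustration; the real method handles arrays via NumPy dtypes):

```python
def contains_scalar(x, low, high, integer_space):
    # Integer spaces reject non-integer inputs (mirrors the dtype check);
    # bounds are inclusive on both ends, as in Space.contains.
    if integer_space and not isinstance(x, int):
        return False
    return low <= x <= high

print(contains_scalar(5, 0, 10, True))    # True
print(contains_scalar(5.0, 0, 10, True))  # False: float in an integer space
print(contains_scalar(15, 0, 10, True))   # False: above the upper bound
```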
================================================
FILE: texar_repo/texar/agents/agent_utils_test.py
================================================
#
"""
Unit tests for agent utilities.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=no-member, invalid-name, too-many-arguments
import numpy as np
import tensorflow as tf
from texar.agents.agent_utils import Space
class SpaceTest(tf.test.TestCase):
"""Tests the Space class.
"""
def _test_space(self, s, shape, low, high, dtype):
self.assertEqual(s.shape, shape)
self.assertEqual(s.low, low)
self.assertEqual(s.high, high)
self.assertEqual(s.dtype, dtype)
def test_space(self):
"""Tests descrete space.
"""
s = Space(shape=(), low=0, high=10, dtype=np.int32)
self._test_space(s, (), 0, 10, np.dtype(np.int32))
self.assertTrue(s.contains(5))
self.assertFalse(s.contains(5.))
self.assertFalse(s.contains(15))
s = Space(low=0, high=10, dtype=np.int32)
self._test_space(s, (), 0, 10, np.dtype(np.int32))
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/agents/dqn_agent.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Deep Q learning Agent.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import random
import numpy as np
import tensorflow as tf
import texar as tx
from texar.agents.episodic_agent_base import EpisodicAgentBase
from texar.utils import utils
from texar.core import optimization as opt
# pylint: disable=too-many-instance-attributes, too-many-arguments
# pylint: disable=invalid-name
__all__ = [
"DQNAgent"
]
class DQNAgent(EpisodicAgentBase):
"""Deep Q learning agent for episodic setting.
A Q learning algorithm consists of several components:
- A **Q-net** takes in a state and returns Q-value for action sampling.\
See :class:`~texar.modules.CategoricalQNet` for an example Q-net class\
and required interface.
- A **replay memory** manages past experience for Q-net updates. See\
:class:`~texar.core.DequeReplayMemory` for an example replay memory\
class and required interface.
- An **exploration** that specifies the exploration strategy used\
to train the Q-net. See\
:class:`~texar.core.EpsilonLinearDecayExploration` for an example\
class and required interface.
Args:
env_config: An instance of :class:`~texar.agents.EnvConfig` specifying
action space, observation space, and reward range, etc. Use
:func:`~texar.agents.get_gym_env_config` to create an EnvConfig
from a gym environment.
sess (optional): A tf session.
Can be `None` here and set later with `agent.sess = session`.
qnet (optional): A Q network that predicts Q values given states.
If not given, a Q network is created based on :attr:`hparams`.
target (optional): A target network to compute target Q values.
qnet_kwargs (dict, optional): Keyword arguments for qnet
constructor. Note that the `hparams` argument for network
constructor is specified in the "qnet_hparams" field of
:attr:`hparams` and should not be included in `qnet_kwargs`.
Ignored if :attr:`qnet` is given.
qnet_caller_kwargs (dict, optional): Keyword arguments for
calling `qnet` to get Q values. The `qnet` is called with
:python:`outputs=qnet(inputs=observation, **qnet_caller_kwargs)`
replay_memory (optional): A replay memory instance.
If not given, a replay memory is created based on :attr:`hparams`.
replay_memory_kwargs (dict, optional): Keyword arguments for
replay_memory constructor.
Ignored if :attr:`replay_memory` is given.
exploration (optional): An exploration instance used in the algorithm.
If not given, an exploration instance is created based on
:attr:`hparams`.
exploration_kwargs (dict, optional): Keyword arguments for exploration
class constructor. Ignored if :attr:`exploration` is given.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
"""
def __init__(self,
env_config,
sess=None,
qnet=None,
target=None,
qnet_kwargs=None,
qnet_caller_kwargs=None,
replay_memory=None,
replay_memory_kwargs=None,
exploration=None,
exploration_kwargs=None,
hparams=None):
EpisodicAgentBase.__init__(self, env_config, hparams)
self._sess = sess
self._cold_start_steps = self._hparams.cold_start_steps
self._sample_batch_size = self._hparams.sample_batch_size
self._update_period = self._hparams.update_period
self._discount_factor = self._hparams.discount_factor
self._target_update_strategy = self._hparams.target_update_strategy
self._num_actions = self._env_config.action_space.high - \
self._env_config.action_space.low
with tf.variable_scope(self.variable_scope):
if qnet is None:
kwargs = utils.get_instance_kwargs(
qnet_kwargs, self._hparams.qnet_hparams)
qnet = utils.check_or_get_instance(
ins_or_class_or_name=self._hparams.qnet_type,
kwargs=kwargs,
module_paths=['texar.modules', 'texar.custom'])
target = utils.check_or_get_instance(
ins_or_class_or_name=self._hparams.qnet_type,
kwargs=kwargs,
module_paths=['texar.modules', 'texar.custom'])
self._qnet = qnet
self._target = target
self._qnet_caller_kwargs = qnet_caller_kwargs or {}
if replay_memory is None:
kwargs = utils.get_instance_kwargs(
replay_memory_kwargs, self._hparams.replay_memory_hparams)
replay_memory = utils.check_or_get_instance(
ins_or_class_or_name=self._hparams.replay_memory_type,
kwargs=kwargs,
module_paths=['texar.core', 'texar.custom'])
self._replay_memory = replay_memory
if exploration is None:
kwargs = utils.get_instance_kwargs(
exploration_kwargs, self._hparams.exploration_hparams)
exploration = utils.check_or_get_instance(
ins_or_class_or_name=self._hparams.exploration_type,
kwargs=kwargs,
module_paths=['texar.core', 'texar.custom'])
self._exploration = exploration
self._build_graph()
self._observ = None
self._action = None
self._timestep = 0
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values:
.. role:: python(code)
:language: python
.. code-block:: python
{
'qnet_type': 'CategoricalQNet',
'qnet_hparams': None,
'replay_memory_type': 'DequeReplayMemory',
'replay_memory_hparams': None,
'exploration_type': 'EpsilonLinearDecayExploration',
'exploration_hparams': None,
'optimization': opt.default_optimization_hparams(),
'target_update_strategy': 'copy',
'cold_start_steps': 100,
'sample_batch_size': 32,
'update_period': 100,
'discount_factor': 0.95,
'name': 'dqn_agent'
}
Here:
"qnet_type" : str or class or instance
Q-value net. Can be class, its
name or module path, or a class instance. If class name is given,
the class must be from module :mod:`texar.modules` or
:mod:`texar.custom`. Ignored if a `qnet` is given to
the agent constructor.
"qnet_hparams" : dict, optional
Hyperparameters for the Q net. With the :attr:`qnet_kwargs`
argument to the constructor, a network is created with
:python:`qnet_class(**qnet_kwargs, hparams=qnet_hparams)`.
"replay_memory_type" : str or class or instance
Replay memory class. Can be class, its name or module path,
or a class instance.
If class name is given, the class must be from module
:mod:`texar.core` or :mod:`texar.custom`.
Ignored if a `replay_memory` is given to the agent constructor.
"replay_memory_hparams" : dict, optional
Hyperparameters for the replay memory. With the
:attr:`replay_memory_kwargs` argument to the constructor,
a network is created with
:python:`replay_memory_class(
**replay_memory_kwargs, hparams=replay_memory_hparams)`.
"exploration_type" : str or class or instance
Exploration class. Can be class,
its name or module path, or a class instance. If class name is
given, the class must be from module :mod:`texar.core` or
:mod:`texar.custom`. Ignored if an `exploration` is given to
the agent constructor.
"exploration_hparams" : dict, optional
Hyperparameters for the exploration class.
With the :attr:`exploration_kwargs` argument to the constructor,
a network is created with :python:`exploration_class(
**exploration_kwargs, hparams=exploration_hparams)`.
"optimization" : dict
Hyperparameters of optimization for updating the Q-net.
See :func:`~texar.core.default_optimization_hparams` for details.
"cold_start_steps": int
Number of initial steps during which the Q-net is not trained,
allowing the replay memory to accumulate experience first.
"sample_batch_size": int
The number of samples drawn from the replay memory for each
training update.
"target_update_strategy": string
- If **"copy"**, the target network is assigned with the parameter \
of Q-net every :attr:`"update_period"` steps.
- If **"tau"**, the target is updated by assigning
``(1 - 1/update_period) * target + 1/update_period * qnet``
"update_period": int
Frequency of updating the target network, i.e., updating
the target once for every "update_period" steps.
"discount_factor" : float
The discount factor of reward.
"name" : str
Name of the agent.
"""
return {
'qnet_type': 'CategoricalQNet',
'qnet_hparams': None,
'replay_memory_type': 'DequeReplayMemory',
'replay_memory_hparams': None,
'exploration_type': 'EpsilonLinearDecayExploration',
'exploration_hparams': None,
'optimization': opt.default_optimization_hparams(),
'target_update_strategy': 'copy',
'cold_start_steps': 100,
'sample_batch_size': 32,
'update_period': 100,
'discount_factor': 0.95,
'name': 'dqn_agent'
}
def _build_graph(self):
with tf.variable_scope(self.variable_scope):
self._observ_inputs = tf.placeholder(
dtype=self._env_config.observ_dtype,
shape=[None, ] + list(self._env_config.observ_shape),
name='observ_inputs')
self._action_inputs = tf.placeholder(
dtype=self._env_config.action_dtype,
shape=[None, self._num_actions],
name='action_inputs')
self._y_inputs = tf.placeholder(
dtype=tf.float32,
shape=[None, ],
name='y_inputs')
self._qnet_outputs = self._get_qnet_outputs(self._observ_inputs)
self._target_outputs = self._get_target_outputs(self._observ_inputs)
self._td_error = self._get_td_error(
qnet_qvalues=self._qnet_outputs['qvalues'],
actions=self._action_inputs,
y=self._y_inputs)
self._train_op = self._get_train_op()
if self._target_update_strategy == 'copy':
self._update_op = self._get_copy_update_op()
elif self._target_update_strategy == 'tau':
self._update_op = self._get_tau_update_op()
def _get_qnet_outputs(self, state_inputs):
return self._qnet(inputs=state_inputs, **self._qnet_caller_kwargs)
def _get_target_outputs(self, state_inputs):
return self._target(inputs=state_inputs, **self._qnet_caller_kwargs)
def _get_td_error(self, qnet_qvalues, actions, y):
return y - tf.reduce_sum(qnet_qvalues * tf.to_float(actions), axis=1)
def _get_train_op(self):
train_op = opt.get_train_op(
loss=tf.reduce_sum(self._td_error ** 2),
variables=self._qnet.trainable_variables,
hparams=self._hparams.optimization.todict())
return train_op
def _get_copy_update_op(self):
op = []
for i in range(len(self._qnet.trainable_variables)):
op.append(tf.assign(ref=self._target.trainable_variables[i],
value=self._qnet.trainable_variables[i]))
return op
def _get_tau_update_op(self):
tau = 1. / self._update_period
op = []
for i in range(len(self._qnet.trainable_variables)):
value_ = (1. - tau) * self._target.trainable_variables[i] + \
tau * self._qnet.trainable_variables[i]
op.append(tf.assign(
ref=self._target.trainable_variables[i], value=value_))
return op
def _observe(self, reward, terminal, train_policy, feed_dict):
if self._timestep > self._cold_start_steps and train_policy:
self._train_qnet(feed_dict)
action_one_hot = [0.] * self._num_actions
action_one_hot[self._action] = 1.
self._replay_memory.add(dict(
observ=self._observ,
action=action_one_hot,
reward=reward,
terminal=terminal,
next_observ=None))
self._timestep += 1
def _train_qnet(self, feed_dict):
minibatch = self._replay_memory.get(self._sample_batch_size)
observ_batch = np.array([data['observ'] for data in minibatch])
action_batch = np.array([data['action'] for data in minibatch])
reward_batch = np.array([data['reward'] for data in minibatch])
terminal_batch = np.array([data['terminal'] for data in minibatch])
next_observ_batch = \
np.array([data['next_observ'] for data in minibatch])
target_qvalue = self._sess.run(
self._target_outputs['qvalues'], feed_dict={
self._observ_inputs: next_observ_batch,
tx.global_mode(): tf.estimator.ModeKeys.PREDICT})
y_batch = reward_batch
for i in range(self._sample_batch_size):
if not terminal_batch[i]:
y_batch[i] += self._discount_factor * np.max(target_qvalue[i])
feed_dict_ = {
self._observ_inputs: observ_batch,
self._y_inputs: y_batch,
self._action_inputs: action_batch
}
feed_dict_.update(feed_dict or {})
self._sess.run(self._train_op, feed_dict=feed_dict_)
self._update_target(feed_dict)
def _update_target(self, feed_dict):
if self._target_update_strategy == 'tau' or (
self._target_update_strategy == 'copy' and
self._timestep % self._update_period == 0):
self._sess.run(self._update_op, feed_dict=feed_dict)
def _qvalues_from_qnet(self, observ):
return self._sess.run(
self._qnet_outputs['qvalues'],
feed_dict={self._observ_inputs: np.array([observ]),
tx.global_mode(): tf.estimator.ModeKeys.PREDICT})
def _qvalues_from_target(self, observ):
return self._sess.run(
self._target_outputs['qvalues'],
feed_dict={self._observ_inputs: np.array([observ]),
tx.global_mode(): tf.estimator.ModeKeys.PREDICT})
def _update_observ_action(self, observ, action):
self._observ = observ
self._action = action
if self._replay_memory.size() > 0:
self._replay_memory.last()['next_observ'] = self._observ
def _get_action(self, observ, feed_dict=None):
qvalue = self._qvalues_from_qnet(observ)
if random.random() < self._exploration.get_epsilon(self._timestep):
action = random.randrange(self._num_actions)
else:
action = np.argmax(qvalue)
self._update_observ_action(observ, action)
return action
def _reset(self):
self._observ = None
self._action = None
@property
def sess(self):
"""The tf session.
"""
return self._sess
@sess.setter
def sess(self, session):
self._sess = session
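The two update rules implemented above — the Q-learning target `y = r + gamma * max_a Q_target(s', a)` in `_train_qnet`, and the soft "tau" target update in `_get_tau_update_op` — can be illustrated with plain NumPy. This is a hypothetical sketch of the arithmetic only (function names `q_targets` and `tau_update` are illustrative, not part of Texar):

```python
import numpy as np

def q_targets(rewards, terminals, next_qvalues, discount=0.95):
    """Q-learning targets: y = r + discount * max_a Q_target(s', a)
    for non-terminal transitions, and y = r for terminal ones."""
    targets = np.asarray(rewards, dtype=float).copy()
    for i in range(len(targets)):
        if not terminals[i]:
            targets[i] += discount * np.max(next_qvalues[i])
    return targets

def tau_update(target_params, qnet_params, update_period=100):
    """Soft target update of the 'tau' strategy:
    target <- (1 - tau) * target + tau * qnet, with tau = 1/update_period."""
    tau = 1.0 / update_period
    return [(1.0 - tau) * t + tau * q
            for t, q in zip(target_params, qnet_params)]
```

With `update_period=100`, each soft update moves the target parameters only 1% of the way toward the Q-net, which is what makes "tau" a smoother alternative to the periodic hard "copy" strategy.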
================================================
FILE: texar_repo/texar/agents/episodic_agent_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for episodic reinforcement learning agents.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.agents.agent_base import AgentBase
# pylint: disable=too-many-instance-attributes
class EpisodicAgentBase(AgentBase):
"""Base class inherited by episodic RL agents.
An agent is a wrapper of the **training process** that trains a model
with RL algorithms. Agent itself does not create new trainable variables.
An episodic RL agent typically provides 3 interfaces, namely, :meth:`reset`,
:meth:`get_action` and :meth:`observe`, and is used as the following
example.
Example:
.. code-block:: python
env = SomeEnvironment(...)
agent = PGAgent(...)
while True:
# Starts one episode
agent.reset()
observ = env.reset()
while True:
action = agent.get_action(observ)
next_observ, reward, terminal = env.step(action)
agent.observe(reward, terminal)
observ = next_observ
if terminal:
break
Args:
env_config: An instance of :class:`~texar.agents.EnvConfig` specifying
action space, observation space, and reward range, etc. Use
:func:`~texar.agents.get_gym_env_config` to create an EnvConfig
from a gym environment.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
"""
def __init__(self, env_config, hparams=None):
AgentBase.__init__(self, hparams)
self._env_config = env_config
self._reset_tmplt_fn = tf.make_template(
"{}_reset".format(self.name), self._reset)
self._observe_tmplt_fn = tf.make_template(
"{}_observe".format(self.name), self._observe)
self._get_action_tmplt_fn = tf.make_template(
"{}_get_action".format(self.name), self._get_action)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"name": "agent"
}
"""
return {
'name': 'agent'
}
def reset(self):
"""Resets the states to begin new episode.
"""
self._reset_tmplt_fn()
def _reset(self):
raise NotImplementedError
def observe(self, reward, terminal, train_policy=True, feed_dict=None):
"""Observes experience from environment.
Args:
reward: Reward of the action. The configuration (e.g., shape) of
the reward is defined in :attr:`env_config`.
terminal (bool): Whether the episode is terminated.
train_policy (bool): Whether to update the policy for this step.
feed_dict (dict, optional): Extra values fed to the training
operation when it is run.
"""
return self._observe_tmplt_fn(reward, terminal, train_policy, feed_dict)
def _observe(self, reward, terminal, train_policy, feed_dict):
raise NotImplementedError
def get_action(self, observ, feed_dict=None):
"""Gets action according to observation.
Args:
observ: Observation from the environment.
Returns:
action from the policy.
"""
return self._get_action_tmplt_fn(observ, feed_dict)
def _get_action(self, observ, feed_dict):
raise NotImplementedError
@property
def env_config(self):
"""Environment configuration.
"""
return self._env_config
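The reset / get_action / observe contract described in the class docstring can be made concrete with a toy agent. The `RandomAgent` class below is purely illustrative (it is not part of Texar and uses no TensorFlow); it only demonstrates the calling pattern an environment loop relies on:

```python
import random

class RandomAgent:
    """Toy agent following the episodic-agent interface sketched above:
    reset() at episode start, get_action() per step, observe() per reward."""

    def __init__(self, num_actions, seed=0):
        self._num_actions = num_actions
        self._rng = random.Random(seed)
        self._episode_reward = 0.0

    def reset(self):
        # Clear per-episode state before a new episode begins.
        self._episode_reward = 0.0

    def get_action(self, observ):
        # A real agent conditions on `observ`; this toy one ignores it.
        return self._rng.randrange(self._num_actions)

    def observe(self, reward, terminal):
        # Accumulate experience; a real agent would also train here.
        self._episode_reward += reward
        return self._episode_reward
```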
================================================
FILE: texar_repo/texar/agents/pg_agent.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Policy Gradient agent.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=too-many-instance-attributes, too-many-arguments
import tensorflow as tf
from texar.agents.episodic_agent_base import EpisodicAgentBase
from texar.utils import utils
from texar.core import optimization as opt
from texar.losses import pg_losses as losses
from texar.losses.rewards import discount_reward
class PGAgent(EpisodicAgentBase):
"""Policy gradient agent for episodic setting. This agent supports
**un-batched** training, i.e., each time generates one action, takes one
observation, and updates the policy.
The policy must take in an observation of shape `[1] + observation_shape`,
where the first dimension 1 stands for batch dimension, and output a `dict`
containing:
- Key **"action"** whose value is a Tensor of shape \
`[1] + action_shape` containing a single action.
- One of keys "log_prob" or "dist":
- **"log_prob"**: A Tensor of shape `[1]`, the log probability of the \
"action".
- **"dist"**: A \
:tf_main:`tf.distributions.Distribution `\
with the `log_prob` interface and \
`log_prob = dist.log_prob(outputs["action"])`.
.. role:: python(code)
:language: python
Args:
env_config: An instance of :class:`~texar.agents.EnvConfig` specifying
action space, observation space, and reward range, etc. Use
:func:`~texar.agents.get_gym_env_config` to create an EnvConfig
from a gym environment.
sess (optional): A tf session.
Can be `None` here and set later with `agent.sess = session`.
policy (optional): A policy net that takes in observation and outputs
actions and probabilities.
If not given, a policy network is created based on :attr:`hparams`.
policy_kwargs (dict, optional): Keyword arguments for policy
constructor. Note that the `hparams` argument for network
constructor is specified in the "policy_hparams" field of
:attr:`hparams` and should not be included in `policy_kwargs`.
Ignored if :attr:`policy` is given.
policy_caller_kwargs (dict, optional): Keyword arguments for
calling the policy to get actions. The policy is called with
:python:`outputs=policy(inputs=observation, **policy_caller_kwargs)`
learning_rate (optional): Learning rate for policy optimization. If
not given, determine the learning rate from :attr:`hparams`.
See :func:`~texar.core.get_train_op` for more details.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
"""
def __init__(self,
env_config,
sess=None,
policy=None,
policy_kwargs=None,
policy_caller_kwargs=None,
learning_rate=None,
hparams=None):
EpisodicAgentBase.__init__(self, env_config, hparams)
self._sess = sess
self._lr = learning_rate
self._discount_factor = self._hparams.discount_factor
with tf.variable_scope(self.variable_scope):
if policy is None:
kwargs = utils.get_instance_kwargs(
policy_kwargs, self._hparams.policy_hparams)
policy = utils.check_or_get_instance(
self._hparams.policy_type,
kwargs,
module_paths=['texar.modules', 'texar.custom'])
self._policy = policy
self._policy_caller_kwargs = policy_caller_kwargs or {}
self._observs = []
self._actions = []
self._rewards = []
self._train_outputs = None
self._build_graph()
def _build_graph(self):
with tf.variable_scope(self.variable_scope):
self._observ_inputs = tf.placeholder(
dtype=self._env_config.observ_dtype,
shape=[None, ] + list(self._env_config.observ_shape),
name='observ_inputs')
self._action_inputs = tf.placeholder(
dtype=self._env_config.action_dtype,
shape=[None, ] + list(self._env_config.action_shape),
name='action_inputs')
self._advantage_inputs = tf.placeholder(
dtype=tf.float32,
shape=[None, ],
name='advantages_inputs')
self._outputs = self._get_policy_outputs()
self._pg_loss = self._get_pg_loss()
self._train_op = self._get_train_op()
def _get_policy_outputs(self):
outputs = self._policy(
inputs=self._observ_inputs, **self._policy_caller_kwargs)
return outputs
def _get_pg_loss(self):
if 'log_prob' in self._outputs:
log_probs = self._outputs['log_prob']
elif 'dist' in self._outputs:
log_probs = self._outputs['dist'].log_prob(self._action_inputs)
else:
raise ValueError('Outputs of the policy must have one of '
'"log_prob" or "dist".')
pg_loss = losses.pg_loss_with_log_probs(
log_probs=log_probs,
advantages=self._advantage_inputs,
average_across_timesteps=True,
sum_over_timesteps=False)
return pg_loss
def _get_train_op(self):
train_op = opt.get_train_op(
loss=self._pg_loss,
variables=self._policy.trainable_variables,
learning_rate=self._lr,
hparams=self._hparams.optimization.todict())
return train_op
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values:
.. role:: python(code)
:language: python
.. code-block:: python
{
'policy_type': 'CategoricalPolicyNet',
'policy_hparams': None,
'discount_factor': 0.95,
'normalize_reward': False,
'optimization': default_optimization_hparams(),
'name': 'pg_agent',
}
Here:
"policy_type" : str or class or instance
Policy net. Can be class, its name or module path, or a class
instance. If class name is given, the class must be from module
:mod:`texar.modules` or :mod:`texar.custom`. Ignored if a
`policy` is given to the agent constructor.
"policy_hparams" : dict, optional
Hyperparameters for the policy net. With the :attr:`policy_kwargs`
argument to the constructor, a network is created with
:python:`policy_class(**policy_kwargs, hparams=policy_hparams)`.
"discount_factor" : float
The discount factor of reward.
"normalize_reward" : bool
Whether to normalize the discounted reward, by
`(discounted_reward - mean) / std`.
"optimization" : dict
Hyperparameters of optimization for updating the policy net.
See :func:`~texar.core.default_optimization_hparams` for details.
"name" : str
Name of the agent.
"""
return {
'policy_type': 'CategoricalPolicyNet',
'policy_hparams': None,
'discount_factor': 0.95,
'normalize_reward': False,
'optimization': opt.default_optimization_hparams(),
'name': 'pg_agent',
}
def _reset(self):
self._observs = []
self._actions = []
self._rewards = []
def _get_action(self, observ, feed_dict):
fetches = {
"action": self._outputs['action']
}
feed_dict_ = {self._observ_inputs: [observ, ]}
feed_dict_.update(feed_dict or {})
vals = self._sess.run(fetches, feed_dict=feed_dict_)
action = vals['action']
action = action[0] # Removes the batch dimension
self._observs.append(observ)
self._actions.append(action)
return action
def _observe(self, reward, terminal, train_policy, feed_dict):
self._rewards.append(reward)
if terminal and train_policy:
self._train_policy(feed_dict=feed_dict)
def _train_policy(self, feed_dict=None):
"""Updates the policy.
Args:
TODO
"""
qvalues = discount_reward(
[self._rewards], discount=self._hparams.discount_factor,
normalize=self._hparams.normalize_reward)
qvalues = qvalues[0, :]
fetches = dict(loss=self._train_op)
feed_dict_ = {
self._observ_inputs: self._observs,
self._action_inputs: self._actions,
self._advantage_inputs: qvalues}
feed_dict_.update(feed_dict or {})
self._train_outputs = self._sess.run(fetches, feed_dict=feed_dict_)
@property
def sess(self):
"""The tf session.
"""
return self._sess
@sess.setter
def sess(self, session):
self._sess = session
@property
def policy(self):
"""The policy model.
"""
return self._policy
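`PGAgent._train_policy` above uses `texar.losses.rewards.discount_reward` to turn the per-step rewards of an episode into discounted returns that serve as advantages. A minimal sketch of that reward-to-go computation, assuming the standard backward recursion `G_t = r_t + discount * G_{t+1}` (the function name `discounted_returns` is illustrative, not the library implementation):

```python
import numpy as np

def discounted_returns(rewards, discount=0.95, normalize=False):
    """Reward-to-go: G_t = r_t + discount * G_{t+1}, computed backwards
    over one episode. Optionally standardized, mirroring the
    'normalize_reward' hyperparameter."""
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    if normalize:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns
```

Normalizing the returns (subtracting the mean, dividing by the standard deviation) does not change which actions are favored on average, but typically reduces gradient variance during policy updates.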
================================================
FILE: texar_repo/texar/agents/seq_agent_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for reinforcement learning agents for sequence prediction.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar.agents.agent_base import AgentBase
# pylint: disable=too-many-instance-attributes
class SeqAgentBase(AgentBase):
"""
Base class inherited by sequence prediction RL agents.
Args:
TODO
"""
def __init__(self, hparams=None):
AgentBase.__init__(self, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
TODO
"""
return {
'name': 'agent'
}
================================================
FILE: texar_repo/texar/agents/seq_pg_agent.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Policy Gradient agent for sequence prediction.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=too-many-instance-attributes, too-many-arguments, no-member
import tensorflow as tf
from texar.agents.seq_agent_base import SeqAgentBase
from texar.core import optimization as opt
from texar.losses.pg_losses import pg_loss_with_logits
from texar.losses.rewards import discount_reward
from texar.losses.entropy import sequence_entropy_with_logits
__all__ = [
"SeqPGAgent"
]
class SeqPGAgent(SeqAgentBase):
"""Policy Gradient agent for sequence prediction.
This is a wrapper of the **training process** that trains a model
with policy gradient. Agent itself does not create new trainable variables.
Args:
samples: An `int` Tensor of shape `[batch_size, max_time]` containing
sampled sequences from the model.
logits: A float Tensor of shape `[batch_size, max_time, vocab_size]`
containing the logits of samples from the model.
sequence_length: A Tensor of shape `[batch_size]`.
Time steps beyond the respective sequence lengths are masked out.
trainable_variables (optional): Trainable variables of the model to
update during training. If `None`, all trainable variables in the
graph are used.
learning_rate (optional): Learning rate for policy optimization. If
not given, determine the learning rate from :attr:`hparams`.
See :func:`~texar.core.get_train_op` for more details.
sess (optional): A tf session.
Can be `None` here and set later with `agent.sess = session`.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
Example:
.. code-block:: python
## Train a decoder with policy gradient
decoder = BasicRNNDecoder(...)
outputs, _, sequence_length = decoder(
decoding_strategy='infer_sample', ...)
sess = tf.Session()
agent = SeqPGAgent(
samples=outputs.sample_id,
logits=outputs.logits,
sequence_length=sequence_length,
sess=sess)
while training:
# Generate samples
vals = agent.get_samples()
# Evaluate reward
sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
reward_bleu = []
for y, y_ in zip(ground_truth, sample_text):
    reward_bleu.append(tx.evals.sentence_bleu(y, y_))
# Update
agent.observe(reward=reward_bleu)
"""
def __init__(self,
samples,
logits,
sequence_length,
trainable_variables=None,
learning_rate=None,
sess=None,
hparams=None):
SeqAgentBase.__init__(self, hparams)
self._lr = learning_rate
# Tensors
self._samples = samples
self._logits = logits
self._sequence_length = sequence_length
self._trainable_variables = trainable_variables
# Python values
self._samples_py = None
self._sequence_length_py = None
self._rewards = None
self._sess = sess
# For session partial run
self._partial_run_handle = None
self._qvalue_inputs_fed = False
self._build_graph()
def _build_graph(self):
with tf.variable_scope(self.variable_scope):
self._qvalue_inputs = tf.placeholder(
dtype=tf.float32,
shape=[None, None],
name='qvalue_inputs')
self._pg_loss = self._get_pg_loss()
self._train_op = self._get_train_op()
def _get_pg_loss(self):
loss_hparams = self._hparams.loss
pg_loss = pg_loss_with_logits(
actions=self._samples,
logits=self._logits,
sequence_length=self._sequence_length,
advantages=self._qvalue_inputs,
batched=True,
average_across_batch=loss_hparams.average_across_batch,
average_across_timesteps=loss_hparams.average_across_timesteps,
sum_over_batch=loss_hparams.sum_over_batch,
sum_over_timesteps=loss_hparams.sum_over_timesteps,
time_major=loss_hparams.time_major)
if self._hparams.entropy_weight > 0:
entropy = self._get_entropy()
pg_loss -= self._hparams.entropy_weight * entropy
return pg_loss
def _get_entropy(self):
loss_hparams = self._hparams.loss
return sequence_entropy_with_logits(
self._logits,
sequence_length=self._sequence_length,
average_across_batch=loss_hparams.average_across_batch,
average_across_timesteps=loss_hparams.average_across_timesteps,
sum_over_batch=loss_hparams.sum_over_batch,
sum_over_timesteps=loss_hparams.sum_over_timesteps,
time_major=loss_hparams.time_major)
def _get_train_op(self):
train_op = opt.get_train_op(
loss=self._pg_loss,
variables=self._trainable_variables,
learning_rate=self._lr,
hparams=self._hparams.optimization.todict())
return train_op
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values:
.. role:: python(code)
:language: python
.. code-block:: python
{
'discount_factor': 0.95,
'normalize_reward': False,
'entropy_weight': 0.,
'loss': {
'average_across_batch': True,
'average_across_timesteps': False,
'sum_over_batch': False,
'sum_over_timesteps': True,
'time_major': False
},
'optimization': default_optimization_hparams(),
'name': 'pg_agent',
}
Here:
"discount_factor" : float
The discount factor of reward.
"normalize_reward" : bool
Whether to normalize the discounted reward, by
`(discounted_reward - mean) / std`. Here `mean` and `std` are
over all time steps and all samples in the batch.
"entropy_weight" : float
The weight of entropy loss of the sample distribution, to encourage
maximizing the Shannon entropy. Set to 0 to disable the loss.
"loss" : dict
Extra keyword arguments for
:func:`~texar.losses.pg_loss_with_logits`, including the
reduce arguments (e.g., `average_across_batch`) and `time_major`.
"optimization" : dict
Hyperparameters of optimization for updating the policy net.
See :func:`~texar.core.default_optimization_hparams` for details.
"name" : str
Name of the agent.
"""
return {
'discount_factor': 0.95,
'normalize_reward': False,
'entropy_weight': 0.,
'loss': {
'average_across_batch': True,
'average_across_timesteps': False,
'sum_over_batch': False,
'sum_over_timesteps': True,
'time_major': False
},
'optimization': opt.default_optimization_hparams(),
'name': 'pg_agent',
}
def _get_partial_run_feeds(self, feeds=None):
if feeds is None:
feeds = []
feeds += [self._qvalue_inputs]
return feeds
def _setup_partial_run(self, fetches=None, feeds=None):
fetches_ = [self._samples, self._sequence_length, self._pg_loss,
self._train_op]
if fetches is not None:
for fet in fetches:
if fet not in fetches_:
fetches_.append(fet)
feeds = self._get_partial_run_feeds(feeds)
self._partial_run_handle = self._sess.partial_run_setup(
fetches_, feeds=feeds)
self._qvalue_inputs_fed = False
def _check_extra_fetches(self, extra_fetches):
fetch_values = None
if extra_fetches is not None:
fetch_values = list(extra_fetches.values())
if fetch_values is not None:
if self._samples in fetch_values:
raise ValueError(
"`samples` must not be included in `extra_fetches`. "
"It is added automatically.")
if self._sequence_length in fetch_values:
raise ValueError(
"`sequence_length` must not be included in `extra_fetches`."
" It is added automatically.")
if "samples" in extra_fetches:
raise ValueError(
"Key 'samples' is preserved and must not be used "
"in `extra_fetches`.")
if "sequence_length" in extra_fetches:
raise ValueError(
"Key 'sequence_length' is preserved and must not be used "
"in `extra_fetches`.")
def get_samples(self, extra_fetches=None, feed_dict=None):
"""Returns sequence samples and extra results.
Args:
extra_fetches (dict, optional): Extra tensors to fetch values,
besides `samples` and `sequence_length`. Same as the
`fetches` argument of
:tf_main:`tf.Session.run ` and
:tf_main:`partial_run `.
feed_dict (dict, optional): A `dict` that maps tensors to
values. Note that all placeholder values used in
:meth:`get_samples` and subsequent :meth:`observe` calls
should be fed here.
Returns:
A `dict` with keys **"samples"** and **"sequence_length"**
containing the fetched values of :attr:`samples` and
:attr:`sequence_length`, as well as other fetched values
as specified in :attr:`extra_fetches`.
Example:
.. code-block:: python
extra_fetches = {'truth_ids': data_batch['text_ids']}
vals = agent.get_samples(extra_fetches=extra_fetches)
sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
truth_text = tx.utils.map_ids_to_strs(vals['truth_ids'], vocab)
reward = reward_fn_in_python(truth_text, sample_text)
"""
if self._sess is None:
raise ValueError("`sess` must be specified before sampling.")
self._check_extra_fetches(extra_fetches)
# Sets up partial_run
fetch_values = None
if extra_fetches is not None:
fetch_values = list(extra_fetches.values())
feeds = None
if feed_dict is not None:
feeds = list(feed_dict.keys())
self._setup_partial_run(fetches=fetch_values, feeds=feeds)
# Runs the sampling
fetches = {
"samples": self._samples,
"sequence_length": self._sequence_length
}
if extra_fetches is not None:
fetches.update(extra_fetches)
feed_dict_ = feed_dict
vals = self._sess.partial_run(
self._partial_run_handle, fetches, feed_dict=feed_dict_)
self._samples_py = vals['samples']
self._sequence_length_py = vals['sequence_length']
return vals
def observe(self, reward, train_policy=True, compute_loss=True):
"""Observes the reward, and updates the policy or computes loss
accordingly.
Args:
reward: A Python array/list of shape `[batch_size]` containing
the reward for the samples generated in last call of
:meth:`get_samples`.
train_policy (bool): Whether to update the policy model according
to the reward.
compute_loss (bool): If `train_policy` is False, whether to
compute the policy gradient loss (without updating the
policy).
Returns:
If `train_policy` or `compute_loss` is True, returns the loss
(a python float scalar). Otherwise returns `None`.
"""
self._rewards = reward
if train_policy:
return self._train_policy()
elif compute_loss:
return self._evaluate_pg_loss()
else:
return None
def _get_qvalues(self):
qvalues = discount_reward(
self._rewards,
self._sequence_length_py,
discount=self._hparams.discount_factor,
normalize=self._hparams.normalize_reward)
return qvalues
def _evaluate_pg_loss(self):
fetches = {
"loss": self._pg_loss
}
feed_dict_ = None
if not self._qvalue_inputs_fed:
qvalues = self._get_qvalues()
feed_dict_ = {self._qvalue_inputs: qvalues}
vals = self._sess.partial_run(
self._partial_run_handle, fetches, feed_dict=feed_dict_)
self._qvalue_inputs_fed = True
return vals['loss']
def _train_policy(self):
"""Updates the policy.
"""
fetches = {
"loss": self._train_op,
}
feed_dict_ = None
if not self._qvalue_inputs_fed:
qvalues = self._get_qvalues()
feed_dict_ = {self._qvalue_inputs: qvalues}
vals = self._sess.partial_run(
self._partial_run_handle, fetches, feed_dict=feed_dict_)
self._qvalue_inputs_fed = True
return vals['loss']
@property
def sess(self):
"""The tf session.
"""
return self._sess
@sess.setter
def sess(self, sess):
self._sess = sess
@property
def pg_loss(self):
"""The scalar tensor of policy gradient loss.
"""
return self._pg_loss
@property
def sequence_length(self):
"""The tensor of sample sequence length, of shape `[batch_size]`.
"""
return self._sequence_length
@property
def samples(self):
"""The tensor of sequence samples.
"""
return self._samples
@property
def logits(self):
"""The tensor of sequence logits.
"""
return self._logits
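The per-timestep "q-values" that `_get_qvalues` obtains from `texar.losses.discount_reward` can be sketched in plain Python (the function name `discounted_qvalues` is illustrative, not the library's API): each sample's scalar reward is spread backwards over its time steps with exponential discounting, then optionally normalized by the mean and standard deviation over all steps and samples, matching the `discount_factor` and `normalize_reward` hyperparameters above.

```python
# Hedged sketch of the q-value computation behind observe(); the name
# discounted_qvalues is invented here for illustration.
def discounted_qvalues(rewards, sequence_lengths, discount=0.95, normalize=False):
    qvalues = []
    for reward, length in zip(rewards, sequence_lengths):
        # The last step receives the raw reward; earlier steps are discounted.
        qvalues.append([reward * discount ** (length - 1 - t) for t in range(length)])
    if normalize:
        # Normalize over all time steps and all samples, as the docstring says.
        flat = [q for row in qvalues for q in row]
        mean = sum(flat) / len(flat)
        std = (sum((q - mean) ** 2 for q in flat) / len(flat)) ** 0.5
        std = std if std > 0 else 1.0  # avoid division by zero when all rewards equal
        qvalues = [[(q - mean) / std for q in row] for row in qvalues]
    return qvalues
```

For example, a reward of 1.0 over a 3-step sample with discount 0.5 yields q-values [0.25, 0.5, 1.0].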
================================================
FILE: texar_repo/texar/agents/seq_pg_agent_test.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Unit tests for sequence prediction policy gradient agents.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar.modules.decoders.rnn_decoders import BasicRNNDecoder
from texar.agents import SeqPGAgent
from texar import context
class SeqPGAgentTest(tf.test.TestCase):
"""Tests :class:`texar.agents.SeqPGAgent`
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._vocab_size = 4
self._max_time = 8
self._batch_size = 16
self._emb_dim = 20
self._inputs = tf.random_uniform(
[self._batch_size, self._max_time, self._emb_dim],
maxval=1., dtype=tf.float32)
self._embedding = tf.random_uniform(
[self._vocab_size, self._emb_dim], maxval=1., dtype=tf.float32)
def test_seq_pg_agent(self):
"""Tests logits.
"""
decoder = BasicRNNDecoder(vocab_size=self._vocab_size)
outputs, _, sequence_length = decoder(
decoding_strategy="infer_greedy",
max_decoding_length=10,
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2)
agent = SeqPGAgent(
outputs.sample_id, outputs.logits, sequence_length,
decoder.trainable_variables)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
agent.sess = sess
feed_dict = {context.global_mode(): tf.estimator.ModeKeys.TRAIN}
for _ in range(2):
vals = agent.get_samples(feed_dict=feed_dict)
self.assertEqual(vals['samples'].shape[0], self._batch_size)
loss_1 = agent.observe([1.]*self._batch_size)
loss_2 = agent.observe(
[1.]*self._batch_size, train_policy=False)
self.assertEqual(loss_1.shape, ())
self.assertEqual(loss_2.shape, ())
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/context.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Global context manager that handles train/infer mode, etc
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import tensorflow as tf
__all__ = [
"global_mode",
"global_mode_train",
"global_mode_eval",
"global_mode_predict",
"valid_modes"
]
_GLOBAL_MODE_KEY = "GLOBAL_MODE"
def global_mode():
"""Returns the Tensor of global mode.
This is a placeholder with default value of
:tf_main:`tf.estimator.ModeKeys.TRAIN `.
Example:
.. code-block:: python
mode = session.run(global_mode())
# mode == tf.estimator.ModeKeys.TRAIN
mode = session.run(
global_mode(),
feed_dict={global_mode(): tf.estimator.ModeKeys.PREDICT})
# mode == tf.estimator.ModeKeys.PREDICT
"""
mode = tf.get_collection_ref(_GLOBAL_MODE_KEY)
if len(mode) < 1:
#mode_tensor = tf.placeholder(tf.string, name="global_mode")
mode_tensor = tf.placeholder_with_default(
input=tf.estimator.ModeKeys.TRAIN,
shape=(),
name="global_mode")
#mode_tensor = tf.constant(
# value=tf.estimator.ModeKeys.TRAIN,
# dtype=tf.string,
# name="global_mode")
mode.append(mode_tensor)
return mode[0]
def global_mode_train():
"""Returns a bool Tensor indicating whether the global mode is TRAIN.
Example:
.. code-block:: python
is_train = session.run(global_mode_train())
# is_train == True
is_train = session.run(
global_mode_train(),
feed_dict={global_mode(): tf.estimator.ModeKeys.PREDICT})
# is_train == False
"""
mode = global_mode()
return tf.equal(mode, tf.estimator.ModeKeys.TRAIN)
def global_mode_eval():
"""Returns a bool Tensor indicating whether the global mode is EVAL.
"""
mode = global_mode()
return tf.equal(mode, tf.estimator.ModeKeys.EVAL)
def global_mode_predict():
"""Returns a bool Tensor indicating whether the global mode is PREDICT.
"""
mode = global_mode()
return tf.equal(mode, tf.estimator.ModeKeys.PREDICT)
def valid_modes():
"""Returns a set of possible values of mode.
"""
return {tf.estimator.ModeKeys.TRAIN,
tf.estimator.ModeKeys.EVAL,
tf.estimator.ModeKeys.PREDICT}
================================================
FILE: texar_repo/texar/context_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for various context functionalities.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar import context
# pylint: disable=protected-access
class ContextTest(tf.test.TestCase):
"""Tests context.
"""
def test_global_mode(self):
"""Tests the mode context manager.
"""
global_mode = context.global_mode()
self.assertIsInstance(global_mode, tf.Tensor)
mode_train = context.global_mode_train()
mode_eval = context.global_mode_eval()
mode_predict = context.global_mode_predict()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
global_mode_ = sess.run(global_mode)
self.assertEqual(tf.compat.as_str(global_mode_),
tf.estimator.ModeKeys.TRAIN)
global_mode_, mode_train_, mode_eval_, mode_predict_ = sess.run(
[global_mode, mode_train, mode_eval, mode_predict],
feed_dict={context.global_mode(): tf.estimator.ModeKeys.TRAIN})
self.assertEqual(global_mode_, tf.estimator.ModeKeys.TRAIN)
self.assertTrue(mode_train_)
self.assertFalse(mode_eval_)
self.assertFalse(mode_predict_)
global_mode_, mode_train_, mode_eval_, mode_predict_ = sess.run(
[global_mode, mode_train, mode_eval, mode_predict],
feed_dict={context.global_mode(): tf.estimator.ModeKeys.EVAL})
self.assertEqual(global_mode_, tf.estimator.ModeKeys.EVAL)
self.assertFalse(mode_train_)
self.assertTrue(mode_eval_)
self.assertFalse(mode_predict_)
global_mode_, mode_train_, mode_eval_, mode_predict_ = sess.run(
[global_mode, mode_train, mode_eval, mode_predict],
feed_dict={context.global_mode():
tf.estimator.ModeKeys.PREDICT})
self.assertEqual(global_mode_, tf.estimator.ModeKeys.PREDICT)
self.assertFalse(mode_train_)
self.assertFalse(mode_eval_)
self.assertTrue(mode_predict_)
global_mode_values = tf.get_collection_ref(context._GLOBAL_MODE_KEY)
self.assertEqual(len(global_mode_values), 1)
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/core/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar core.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.core.layers import *
from texar.core.replay_memories import *
from texar.core.explorations import *
from texar.core.optimization import *
================================================
FILE: texar_repo/texar/core/explorations.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Classes and utilities for exploration in RL.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar.hyperparams import HParams
# pylint: disable=invalid-name
__all__ = [
"ExplorationBase",
"EpsilonLinearDecayExploration"
]
class ExplorationBase(object):
"""Base class inherited by all exploration classes.
Args:
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters are set to default values. See
:meth:`default_hparams` for the defaults.
"""
def __init__(self, hparams=None):
self._hparams = HParams(hparams, self.default_hparams())
@staticmethod
def default_hparams():
"""Returns a `dict` of hyperparameters and their default values.
.. code-block:: python
{
'name': 'exploration_base'
}
"""
return {
'name': 'exploration_base'
}
def get_epsilon(self, timestep):
"""Returns the epsilon value.
Args:
timestep (int): The time step.
Returns:
float: the epsilon value.
"""
raise NotImplementedError
@property
def hparams(self):
"""The hyperparameter.
"""
return self._hparams
class EpsilonLinearDecayExploration(ExplorationBase):
"""Decays epsilon linearly.
Args:
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters are set to default values. See
:meth:`default_hparams` for the defaults.
"""
def __init__(self, hparams=None):
ExplorationBase.__init__(self, hparams=hparams)
@staticmethod
def default_hparams():
"""Returns a `dict` of hyperparameters and their default values.
.. code-block:: python
{
'initial_epsilon': 0.1,
'final_epsilon': 0.0,
'decay_timesteps': 20000,
'start_timestep': 0,
'name': 'epsilon_linear_decay_exploration',
}
This specifies the decay process that starts at
"start_timestep" with the value "initial_epsilon", and decays for
steps "decay_timesteps" to reach the final epsilon value
"final_epsilon".
"""
return {
'name': 'epsilon_linear_decay_exploration',
'initial_epsilon': 0.1,
'final_epsilon': 0.0,
'decay_timesteps': 20000,
'start_timestep': 0
}
def get_epsilon(self, timestep):
nsteps = self._hparams.decay_timesteps
st = self._hparams.start_timestep
et = st + nsteps
if timestep <= st:
return self._hparams.initial_epsilon
if timestep > et:
return self._hparams.final_epsilon
r = (timestep - st) * 1.0 / nsteps
epsilon = (1 - r) * self._hparams.initial_epsilon + \
r * self._hparams.final_epsilon
return epsilon
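The decay schedule implemented by `get_epsilon` above reduces to a few lines of plain Python (the name `linear_epsilon` is illustrative): epsilon stays at `initial_epsilon` until `start_timestep`, interpolates linearly over `decay_timesteps` steps, and then stays at `final_epsilon`.

```python
# Standalone sketch of EpsilonLinearDecayExploration.get_epsilon, with the
# default hyperparameter values baked in as keyword arguments.
def linear_epsilon(timestep, initial=0.1, final=0.0, decay_timesteps=20000, start=0):
    if timestep <= start:
        return initial          # before the decay window
    if timestep > start + decay_timesteps:
        return final            # after the decay window
    r = (timestep - start) / decay_timesteps
    return (1 - r) * initial + r * final
```

With the defaults, `linear_epsilon(10000)` returns 0.05 (up to floating-point rounding), halfway through the decay window.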
================================================
FILE: texar_repo/texar/core/layers.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various neural network layers
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import copy
import tensorflow as tf
import tensorflow.contrib.rnn as rnn
from texar.hyperparams import HParams
from texar.utils import utils
from texar.utils.dtypes import is_str
from texar.utils.variables import add_variable
from texar.utils.mode import is_train_mode, switch_dropout
# pylint: disable=redefined-variable-type, invalid-name
# pylint: disable=too-many-branches, too-many-arguments, too-many-lines
# pylint: disable=protected-access
__all__ = [
"default_rnn_cell_hparams",
"get_rnn_cell",
"get_rnn_cell_trainable_variables",
"default_regularizer_hparams",
"get_regularizer",
"get_initializer",
"get_activation_fn",
"get_constraint_fn",
"get_layer",
"_ReducePooling1D",
"MaxReducePooling1D",
"AverageReducePooling1D",
"get_pooling_layer_hparams",
"MergeLayer",
"SequentialLayer",
"default_conv1d_kwargs",
"default_conv2d_kwargs",
"default_conv3d_kwargs",
"default_conv2d_transpose_kwargs",
"default_conv3d_transpose_kwargs",
"default_dense_kwargs",
"default_dropout_kwargs",
"default_flatten_kwargs",
"default_max_pooling1d_kwargs",
"default_max_pooling2d_kwargs",
"default_max_pooling3d_kwargs",
"default_separable_conv2d_kwargs",
"default_batch_normalization_kwargs",
"default_average_pooling1d_kwargs",
"default_average_pooling2d_kwargs",
"default_average_pooling3d_kwargs",
"layer_normalize",
]
def default_rnn_cell_hparams():
"""Returns a `dict` of RNN cell hyperparameters and their default values.
.. role:: python(code)
:language: python
.. code-block:: python
{
"type": "LSTMCell",
"kwargs": {
"num_units": 256
},
"num_layers": 1,
"dropout": {
"input_keep_prob": 1.0,
"output_keep_prob": 1.0,
"state_keep_prob": 1.0,
"variational_recurrent": False,
"input_size": []
},
"residual": False,
"highway": False,
}
Here:
"type" : str or cell class or cell instance
The RNN cell type. This can be
- The string name or full module path of a cell class. If class \
name is provided, the class must be in module \
:tf_main:`tf.nn.rnn_cell `, \
:tf_main:`tf.contrib.rnn `, or :mod:`texar.custom`.
- A cell class.
- An instance of a cell class. This is not valid if \
"num_layers" > 1.
For example
.. code-block:: python
"type": "LSTMCell" # class name
"type": "tensorflow.contrib.rnn.Conv1DLSTMCell" # module path
"type": "my_module.MyCell" # module path
"type": tf.nn.rnn_cell.GRUCell # class
"type": BasicRNNCell(num_units=100) # cell instance
"type": MyCell(...) # cell instance
"kwargs" : dict
Keyword arguments for the constructor of the cell class.
A cell is created by :python:`cell_class(**kwargs)`, where
`cell_class` is specified in "type" above.
Ignored if "type" is a cell instance.
"num_layers" : int
Number of cell layers. Each layer is a cell created as above, with
the same hyperparameters specified in "kwargs".
"dropout" : dict
Dropout applied to the cell in **each** layer. See
:tf_main:`DropoutWrapper ` for details of
the hyperparameters. If all "\*_keep_prob" = 1, no dropout is applied.
Specifically, if "variational_recurrent" = `True`,
the same dropout mask is applied across all time steps per run call.
If `True`, "input_size" is required, which is a list of input
size of each cell layer. The input size of a cell layer is the last
dimension size of its input tensor. For example, the
input size of the first layer is usually the dimension of
word embeddings, while the input sizes of subsequent layers
are usually the `num_units` of the preceding-layer cell. E.g.,
.. code-block:: python
# Assume embedding_dim = 100
"type": "LSTMCell",
"kwargs": { "num_units": 123 },
"num_layers": 3,
"dropout": {
"output_keep_prob": 0.5,
"variational_recurrent": True,
"input_size": [100, 123, 123]
}
"residual" : bool
If `True`, apply residual connection on the inputs and
outputs of cell in **each** layer except the first layer. Ignored
if "num_layers" = 1.
"highway" : bool
If True, apply highway connection on the inputs and
outputs of cell in each layer except the first layer. Ignored if
"num_layers" = 1.
"""
return {
"type": "LSTMCell",
"kwargs": {
"num_units": 256,
},
"num_layers": 1,
"dropout": {
"input_keep_prob": 1.0,
"output_keep_prob": 1.0,
"state_keep_prob": 1.0,
"variational_recurrent": False,
"input_size": [],
"@no_typecheck": [
"input_keep_prob", "output_keep_prob", "state_keep_prob"
]
},
"residual": False,
"highway": False,
"@no_typecheck": ["type"]
}
def get_rnn_cell(hparams=None, mode=None):
"""Creates an RNN cell.
See :func:`~texar.core.default_rnn_cell_hparams` for all
hyperparameters and default values.
Args:
hparams (dict or HParams, optional): Cell hyperparameters. Missing
hyperparameters are set to default values.
mode (optional): A Tensor taking value in
:tf_main:`tf.estimator.ModeKeys `, including
`TRAIN`, `EVAL`, and `PREDICT`. If `None`, dropout will be
controlled by :func:`texar.global_mode`.
Returns:
A cell instance.
Raises:
ValueError: If hparams["num_layers"]>1 and hparams["type"] is a class
instance.
ValueError: The cell is not an
:tf_main:`RNNCell ` instance.
"""
if hparams is None or isinstance(hparams, dict):
hparams = HParams(hparams, default_rnn_cell_hparams())
d_hp = hparams["dropout"]
if d_hp["variational_recurrent"] and \
len(d_hp["input_size"]) != hparams["num_layers"]:
raise ValueError(
"If variational_recurrent=True, input_size must be a list of "
"num_layers(%d) integers. Got len(input_size)=%d." %
(hparams["num_layers"], len(d_hp["input_size"])))
cells = []
cell_kwargs = hparams["kwargs"].todict()
num_layers = hparams["num_layers"]
for layer_i in range(num_layers):
# Create the basic cell
cell_type = hparams["type"]
if not is_str(cell_type) and not isinstance(cell_type, type):
if num_layers > 1:
raise ValueError(
"If 'num_layers'>1, then 'type' must be a cell class or "
"its name/module path, rather than a cell instance.")
cell_modules = ['tensorflow.nn.rnn_cell', 'tensorflow.contrib.rnn',
'texar.custom']
cell = utils.check_or_get_instance(
cell_type, cell_kwargs, cell_modules, rnn.RNNCell)
# Optionally add dropout
if d_hp["input_keep_prob"] < 1.0 or \
d_hp["output_keep_prob"] < 1.0 or \
d_hp["state_keep_prob"] < 1.0:
vr_kwargs = {}
if d_hp["variational_recurrent"]:
vr_kwargs = {
"variational_recurrent": True,
"input_size": d_hp["input_size"][layer_i],
"dtype": tf.float32
}
input_keep_prob = switch_dropout(d_hp["input_keep_prob"],
mode)
output_keep_prob = switch_dropout(d_hp["output_keep_prob"],
mode)
state_keep_prob = switch_dropout(d_hp["state_keep_prob"],
mode)
cell = rnn.DropoutWrapper(
cell=cell,
input_keep_prob=input_keep_prob,
output_keep_prob=output_keep_prob,
state_keep_prob=state_keep_prob,
**vr_kwargs)
# Optionally add residual and highway connections
if layer_i > 0:
if hparams["residual"]:
cell = rnn.ResidualWrapper(cell)
if hparams["highway"]:
cell = rnn.HighwayWrapper(cell)
cells.append(cell)
if hparams["num_layers"] > 1:
cell = rnn.MultiRNNCell(cells)
else:
cell = cells[0]
return cell
def get_rnn_cell_trainable_variables(cell):
"""Returns the list of trainable variables of an RNN cell.
Args:
cell: an instance of :tf_main:`RNNCell `.
Returns:
list: trainable variables of the cell.
"""
cell_ = cell
while True:
try:
return cell_.trainable_variables
except AttributeError:
# Cell wrappers (e.g., `DropoutWrapper`) cannot directly access
# `trainable_variables` as they don't initialize the superclass
# (tf==v1.3). So try to access through the cell in the wrapper.
cell_ = cell._cell # pylint: disable=protected-access
def default_regularizer_hparams():
"""Returns the hyperparameters and their default values of a variable
regularizer:
.. code-block:: python
{
"type": "L1L2",
"kwargs": {
"l1": 0.,
"l2": 0.
}
}
The default value corresponds to :tf_main:`L1L2 `
and, with `(l1=0, l2=0)`, disables regularization.
"""
return {
"type": "L1L2",
"kwargs": {
"l1": 0.,
"l2": 0.
}
}
def get_regularizer(hparams=None):
"""Returns a variable regularizer instance.
See :func:`~texar.core.default_regularizer_hparams` for all
hyperparameters and default values.
The "type" field can be a subclass
of :tf_main:`Regularizer `, its string name
or module path, or a class instance.
Args:
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters are set to default values.
Returns:
A :tf_main:`Regularizer ` instance.
`None` if :attr:`hparams` is `None` or taking the default
hyperparameter value.
Raises:
ValueError: The resulting regularizer is not an instance of
:tf_main:`Regularizer `.
"""
if hparams is None:
return None
if isinstance(hparams, dict):
hparams = HParams(hparams, default_regularizer_hparams())
rgl = utils.check_or_get_instance(
hparams.type, hparams.kwargs.todict(),
["tensorflow.keras.regularizers", "texar.custom"])
if not isinstance(rgl, tf.keras.regularizers.Regularizer):
raise ValueError("The regularizer must be an instance of "
"tf.keras.regularizers.Regularizer.")
if isinstance(rgl, tf.keras.regularizers.L1L2) and \
rgl.l1 == 0. and rgl.l2 == 0.:
return None
return rgl
def get_initializer(hparams=None):
"""Returns an initializer instance.
.. role:: python(code)
:language: python
Args:
hparams (dict or HParams, optional): Hyperparameters with the structure
.. code-block:: python
{
"type": "initializer_class_or_function",
"kwargs": {
#...
}
}
The "type" field can be a initializer class, its name or module
path, or class instance. If class name is provided, the class must
be from one the following modules:
:tf_main:`tf.initializers `,
:tf_main:`tf.keras.initializers `,
:tf_main:`tf < >`, and :mod:`texar.custom`. The class is created
by :python:`initializer_class(**kwargs)`. If a class instance
is given, "kwargs" is ignored and can be omitted.
Besides, the "type" field can also be an initialization function
called with :python:`initialization_fn(**kwargs)`. In this case
"type" can be the function, or its name or module path. If
function name is provided, the function must be from one of the
above modules or module `tf.contrib.layers`. If no
keyword argument is required, "kwargs" can be omitted.
Returns:
An initializer instance. `None` if :attr:`hparams` is `None`.
"""
if hparams is None:
return None
kwargs = hparams.get("kwargs", {})
if isinstance(kwargs, HParams):
kwargs = kwargs.todict()
modules = ["tensorflow.initializers", "tensorflow.keras.initializers",
"tensorflow", "texar.custom"]
try:
initializer = utils.check_or_get_instance(hparams["type"], kwargs,
modules)
except TypeError:
modules += ['tensorflow.contrib.layers']
initializer_fn = utils.get_function(hparams["type"], modules)
initializer = initializer_fn(**kwargs)
return initializer
def get_activation_fn(fn_name="identity", kwargs=None):
"""Returns an activation function `fn` with the signature
`output = fn(input)`.
If the function specified by :attr:`fn_name` has more than one arguments
without default values, then all these arguments except the input feature
argument must be specified in :attr:`kwargs`. Arguments with default values
can also be specified in :attr:`kwargs` to take values other than the
defaults. In this case a partial function is returned with the above
signature.
Args:
fn_name (str or callable): An activation function, or its name or
module path. The function can be:
- Built-in function defined in :tf_main:`tf < >` or \
:tf_main:`tf.nn `, e.g., :tf_main:`tf.identity `.
- User-defined activation functions in module :mod:`texar.custom`.
- External activation functions. Must provide the full module path,\
e.g., "my_module.my_activation_fn".
kwargs (optional): A `dict` or instance of :class:`~texar.HParams`
containing the keyword arguments of the activation function.
Returns:
An activation function. `None` if :attr:`fn_name` is `None`.
"""
if fn_name is None:
return None
fn_modules = ['tensorflow', 'tensorflow.nn', 'texar.custom', 'texar.core.layers']
activation_fn_ = utils.get_function(fn_name, fn_modules)
activation_fn = activation_fn_
# Make a partial function if necessary
if kwargs is not None:
if isinstance(kwargs, HParams):
kwargs = kwargs.todict()
def _partial_fn(features):
return activation_fn_(features, **kwargs)
activation_fn = _partial_fn
return activation_fn
def get_constraint_fn(fn_name="NonNeg"):
"""Returns a constraint function.
.. role:: python(code)
:language: python
The function must follow the signature:
:python:`w_ = constraint_fn(w)`.
Args:
fn_name (str or callable): The name or full path to a
constraint function, or the function itself.
The function can be:
- Built-in constraint functions defined in modules \
:tf_main:`tf.keras.constraints ` \
(e.g., :tf_main:`NonNeg `) \
or :tf_main:`tf < >` or :tf_main:`tf.nn ` \
(e.g., activation functions).
- User-defined function in :mod:`texar.custom`.
- Externally defined function. Must provide the full path, \
e.g., `"my_module.my_constraint_fn"`.
If a callable is provided, then it is returned directly.
Returns:
The constraint function. `None` if :attr:`fn_name` is `None`.
"""
if fn_name is None:
return None
fn_modules = ['tensorflow.keras.constraints', 'tensorflow',
'tensorflow.nn', 'texar.custom']
constraint_fn = utils.get_function(fn_name, fn_modules)
return constraint_fn
def get_layer(hparams):
"""Makes a layer instance.
The layer must be an instance of :tf_main:`tf.layers.Layer `.
Args:
hparams (dict or HParams): Hyperparameters of the layer, with
structure:
.. code-block:: python
{
"type": "LayerClass",
"kwargs": {
# Keyword arguments of the layer class
# ...
}
}
Here:
"type" : str or layer class or layer instance
The layer type. This can be
- The string name or full module path of a layer class. If \
the class name is provided, the class must be in module \
:tf_main:`tf.layers `, :mod:`texar.core`, \
or :mod:`texar.custom`.
- A layer class.
- An instance of a layer class.
For example
.. code-block:: python
"type": "Conv1D" # class name
"type": "texar.core.MaxReducePooling1D" # module path
"type": "my_module.MyLayer" # module path
"type": tf.layers.Conv2D # class
"type": Conv1D(filters=10, kernel_size=2) # cell instance
"type": MyLayer(...) # cell instance
"kwargs" : dict
A dictionary of keyword arguments for constructor of the
layer class. Ignored if :attr:`"type"` is a layer instance.
- Arguments named "activation" can be a callable, \
or a `str` of \
the name or module path to the activation function.
- Arguments named "\*_regularizer" and "\*_initializer" \
can be a class instance, or a `dict` of \
hyperparameters of \
respective regularizers and initializers. See
- Arguments named "\*_constraint" can be a callable, or a \
`str` of the name or full path to the constraint function.
Returns:
A layer instance. If hparams["type"] is a layer instance, returns it
directly.
Raises:
ValueError: If :attr:`hparams` is `None`.
ValueError: If the resulting layer is not an instance of
:tf_main:`tf.layers.Layer `.
"""
if hparams is None:
raise ValueError("`hparams` must not be `None`.")
layer_type = hparams["type"]
if not is_str(layer_type) and not isinstance(layer_type, type):
layer = layer_type
else:
layer_modules = ["tensorflow.layers", "texar.core", "texar.custom"]
layer_class = utils.check_or_get_class(layer_type, layer_modules)
if isinstance(hparams, dict):
default_kwargs = _layer_class_to_default_kwargs_map.get(layer_class,
{})
default_hparams = {"type": layer_type, "kwargs": default_kwargs}
hparams = HParams(hparams, default_hparams)
kwargs = {}
for k, v in hparams.kwargs.items():
if k.endswith('_regularizer'):
kwargs[k] = get_regularizer(v)
elif k.endswith('_initializer'):
kwargs[k] = get_initializer(v)
elif k.endswith('activation'):
kwargs[k] = get_activation_fn(v)
elif k.endswith('_constraint'):
kwargs[k] = get_constraint_fn(v)
else:
kwargs[k] = v
layer = utils.get_instance(layer_type, kwargs, layer_modules)
if not isinstance(layer, tf.layers.Layer):
raise ValueError("layer must be an instance of `tf.layers.Layer`.")
return layer
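The kwargs-resolution loop above dispatches on argument-name suffixes (`_regularizer`, `_initializer`, `activation`, `_constraint`). A minimal stand-alone sketch of that dispatch pattern, with toy resolver callables standing in for `get_regularizer` and friends (the resolver names here are assumptions, not Texar's API):

```python
def resolve_kwargs(raw_kwargs, resolvers):
    """Map each keyword argument through a resolver chosen by name suffix."""
    resolved = {}
    for name, value in raw_kwargs.items():
        for suffix, fn in resolvers.items():
            if name.endswith(suffix):
                resolved[name] = fn(value)
                break
        else:
            # No special suffix: pass the value through unchanged.
            resolved[name] = value
    return resolved
```

For example, `resolve_kwargs({'kernel_initializer': 'zeros', 'units': 8}, {'_initializer': make_init})` would run only `'zeros'` through `make_init` and pass `8` through untouched.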
def _compute_concat_output_shape(input_shape, axis):
"""Infers the output shape of concat given the input shape.
The code is adapted from the ConcatLayer of lasagne
(https://github.com/Lasagne/Lasagne/blob/master/lasagne/layers/merge.py)
Args:
input_shape (list): A list of shapes, each of which is in turn a
list or TensorShape.
axis (int): Axis of the concat operation.
Returns:
list: Output shape of concat.
"""
# The size of each axis of the output shape equals the first
# input size of respective axis that is not `None`
input_shape = [tf.TensorShape(s).as_list() for s in input_shape]
output_shape = [next((s for s in sizes if s is not None), None)
for sizes in zip(*input_shape)]
axis_sizes = [s[axis] for s in input_shape]
concat_axis_size = None if any(s is None for s in axis_sizes) \
else sum(axis_sizes)
output_shape[axis] = concat_axis_size
return output_shape
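The same shape-inference rule can be illustrated with a plain-Python sketch (no TensorFlow; `None` stands for an unknown dimension, as in `TensorShape`):

```python
def concat_output_shape(input_shapes, axis):
    """Infer the output shape of a concat over `axis` from a list of shapes."""
    # For every axis, take the first known (non-None) size among the inputs.
    output = [next((s for s in sizes if s is not None), None)
              for sizes in zip(*input_shapes)]
    # The concat axis is the sum of sizes, or None if any size is unknown.
    axis_sizes = [s[axis] for s in input_shapes]
    output[axis] = None if any(s is None for s in axis_sizes) else sum(axis_sizes)
    return output
```

This mirrors the test expectation below: concatenating `[None, 1, 2]`, `[64, 2, 2]`, `[None, 3, 2]` along axis 1 yields `[64, 6, 2]`.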
class _ReducePooling1D(tf.layers.Layer):
"""Pooling layer for arbitrary reduce functions for 1D inputs.
The same as `tf.python.layers.pooling._Pooling1D` except that the pooling
dimension is entirely reduced (i.e., `pool_size=length`).
This class is for code reuse, rather than an exposed API.
"""
def __init__(self, reduce_function, data_format='channels_last',
name=None, **kwargs):
super(_ReducePooling1D, self).__init__(name=name, **kwargs)
self._reduce_function = reduce_function
if data_format not in {'channels_last', 'channels_first'}:
raise ValueError("`data_format must be either 'channels_last' or` "
"'channels_first'. Got: {}".format(data_format))
self._data_format = data_format
def compute_output_shape(self, input_shape):
input_shape = tf.TensorShape(input_shape).as_list()
if self._data_format == 'channels_last':
return tf.TensorShape([input_shape[0], input_shape[2]])
else:
return tf.TensorShape([input_shape[0], input_shape[1]])
def call(self, inputs):
if self._data_format == 'channels_last':
return self._reduce_function(inputs, axis=1)
else:
return self._reduce_function(inputs, axis=2)
class MaxReducePooling1D(_ReducePooling1D):
"""A subclass of :tf_main:`tf.layers.Layer `.
Max Pooling layer for 1D inputs. The same as
:tf_main:`MaxPooling1D ` except that the pooling
dimension is entirely reduced (i.e., `pool_size=input_length`).
"""
def __init__(self, data_format='channels_last', name=None, **kwargs):
super(MaxReducePooling1D, self).__init__(
tf.reduce_max, data_format=data_format, name=name, **kwargs)
class AverageReducePooling1D(_ReducePooling1D):
"""A subclass of :tf_main:`tf.layers.Layer `.
Average Pooling layer for 1D inputs. The same as
:tf_main:`AveragePooling1D ` except that the
pooling dimension is entirely reduced (i.e., `pool_size=input_length`).
"""
def __init__(self, data_format='channels_last', name=None, **kwargs):
super(AverageReducePooling1D, self).__init__(
tf.reduce_mean, data_format=data_format, name=name, **kwargs)
_POOLING_TO_REDUCE = {
"MaxPooling1D": "MaxReducePooling1D",
"AveragePooling1D": "AverageReducePooling1D",
tf.layers.MaxPooling1D: MaxReducePooling1D,
tf.layers.AveragePooling1D: AverageReducePooling1D
}
def get_pooling_layer_hparams(hparams):
"""Creates pooling layer hparams `dict` usable for :func:`get_layer`.
If the :attr:`hparams` sets `'pool_size'` to `None`, the layer will be
changed to the respective reduce-pooling layer. For example,
:class:`tf.layers.MaxPooling1D ` is replaced with
:class:`~texar.core.MaxReducePooling1D`.
"""
if isinstance(hparams, HParams):
hparams = hparams.todict()
new_hparams = copy.copy(hparams)
kwargs = new_hparams.get('kwargs', None)
if kwargs and kwargs.get('pool_size', None) is None:
pool_type = hparams['type']
new_hparams['type'] = _POOLING_TO_REDUCE.get(pool_type, pool_type)
kwargs.pop('pool_size', None)
kwargs.pop('strides', None)
kwargs.pop('padding', None)
return new_hparams
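The conversion performed above can be sketched with plain dicts (a simplified stand-in for the HParams-based version; `to_reduce_pooling` is a hypothetical name):

```python
POOLING_TO_REDUCE = {
    "MaxPooling1D": "MaxReducePooling1D",
    "AveragePooling1D": "AverageReducePooling1D",
}

def to_reduce_pooling(hparams):
    """If 'pool_size' is None, swap in the reduce-pooling variant and
    drop the arguments that no longer apply."""
    new_hparams = dict(hparams)
    kwargs = new_hparams.get('kwargs')
    if kwargs and kwargs.get('pool_size') is None:
        new_hparams['type'] = POOLING_TO_REDUCE.get(
            new_hparams['type'], new_hparams['type'])
        new_hparams['kwargs'] = {k: v for k, v in kwargs.items()
                                 if k not in ('pool_size', 'strides', 'padding')}
    return new_hparams
```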
class MergeLayer(tf.layers.Layer):
"""A subclass of :tf_main:`tf.layers.Layer `.
A layer that consists of multiple layers in parallel. Input is fed to
each of the parallel layers, and the outputs are merged with a
specified mode.
Args:
layers (list, optional): A list of :tf_main:`tf.layers.Layer
` instances, or a list of hyperparameter dicts
each of which specifies the type and kwargs of a layer (see
the `hparams` argument of :func:`get_layer`).
If `None`, this layer degenerates to a merging operator that merges
inputs directly.
mode (str): Mode of the merge op. This can be:
- :attr:`'concat'`: Concatenates layer outputs along one axis. \
Tensors must have the same shape except for the dimension \
specified in `axis`, which can have different sizes.
- :attr:`'elemwise_sum'`: Outputs element-wise sum.
- :attr:`'elemwise_mul'`: Outputs element-wise product.
- :attr:`'sum'`: Computes the sum of layer outputs along the \
dimension given by `axis`. E.g., given `axis=1`, \
two tensors of shape `[a, b]` and `[a, c]` respectively \
will result in a merged tensor of shape `[a]`.
- :attr:`'mean'`: Computes the mean of layer outputs along the \
dimension given in `axis`.
- :attr:`'prod'`: Computes the product of layer outputs along the \
dimension given in `axis`.
- :attr:`'max'`: Computes the maximum of layer outputs along the \
dimension given in `axis`.
- :attr:`'min'`: Computes the minimum of layer outputs along the \
dimension given in `axis`.
- :attr:`'and'`: Computes the `logical and` of layer outputs along \
the dimension given in `axis`.
- :attr:`'or'`: Computes the `logical or` of layer outputs along \
the dimension given in `axis`.
- :attr:`'logsumexp'`: Computes \
log(sum(exp(elements across the dimension of layer outputs)))
axis (int): The axis to use in merging. Ignored in modes
:attr:`'elemwise_sum'` and :attr:`'elemwise_mul'`.
trainable (bool): Whether the layer should be trained.
name (str, optional): Name of the layer.
"""
def __init__(self,
layers=None,
mode='concat',
axis=1,
trainable=True,
name=None,
**kwargs):
super(MergeLayer, self).__init__(
trainable=trainable, name=name, **kwargs)
self._mode = mode
self._axis = axis
self._layers = None
if layers is not None:
if len(layers) == 0:
raise ValueError(
"'layers' must be either None or a non-empty list.")
self._layers = []
for layer in layers:
if isinstance(layer, tf.layers.Layer):
self._layers.append(layer)
else:
self._layers.append(get_layer(hparams=layer))
# Keeps track of whether trainable variables have been created
self._vars_built = False
def compute_output_shape(self, input_shape):
if self._layers is None:
_shapes = input_shape
if not isinstance(_shapes, (list, tuple)):
_shapes = [_shapes]
else:
_shapes = []
for layer in self._layers:
layer_output_shape = layer.compute_output_shape(input_shape)
_shapes.append(layer_output_shape)
_shapes = [tf.TensorShape(s) for s in _shapes]
if self._mode == 'concat':
output_shape = _compute_concat_output_shape(_shapes, self._axis)
elif self._mode in ['sum', 'mean', 'prod', 'max', 'min',
'and', 'or', 'logsumexp']:
output_shape = _compute_concat_output_shape(_shapes, self._axis)
output_shape.pop(self._axis)
elif self._mode in ['elemwise_sum', 'elemwise_mul']:
# Simply infer the output shape as the input shape of highest rank
_ranks = [s.ndims for s in _shapes]
max_rank = max(_ranks)
max_ranked_shapes = []
for i, s in enumerate(_shapes):
if _ranks[i] == max_rank:
max_ranked_shapes.append(s.as_list())
# Grab the first size of each axis that is not `None`
output_shape = [next((s for s in sizes if s is not None), None)
for sizes in zip(*max_ranked_shapes)]
else:
raise ValueError("Unknown merge mode: '%s'" % self._mode)
return tf.TensorShape(output_shape)
def _collect_weights(self):
"""Collects (non-)trainable weights of each of the parallel layers.
"""
if self._layers is None:
return
for layer in self._layers:
if self.trainable:
add_variable(
layer._trainable_weights, self._trainable_weights)
else:
add_variable(
layer._trainable_weights, self._non_trainable_weights)
add_variable(
layer._non_trainable_weights, self._non_trainable_weights)
def call(self, inputs):
if self._layers is None:
layer_outputs = inputs
if not isinstance(layer_outputs, (list, tuple)):
layer_outputs = [layer_outputs]
else:
layer_outputs = []
for layer in self._layers:
layer_output = layer(inputs)
layer_outputs.append(layer_output)
if self._mode == 'concat':
outputs = tf.concat(values=layer_outputs, axis=self._axis)
elif self._mode == 'elemwise_sum':
outputs = layer_outputs[0]
for i in range(1, len(layer_outputs)):
outputs = tf.add(outputs, layer_outputs[i])
elif self._mode == 'elemwise_mul':
outputs = layer_outputs[0]
for i in range(1, len(layer_outputs)):
outputs = tf.multiply(outputs, layer_outputs[i])
elif self._mode == 'sum':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_sum(_concat, axis=self._axis)
elif self._mode == 'mean':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_mean(_concat, axis=self._axis)
elif self._mode == 'prod':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_prod(_concat, axis=self._axis)
elif self._mode == 'max':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_max(_concat, axis=self._axis)
elif self._mode == 'min':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_min(_concat, axis=self._axis)
elif self._mode == 'and':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_all(_concat, axis=self._axis)
elif self._mode == 'or':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_any(_concat, axis=self._axis)
elif self._mode == 'logsumexp':
_concat = tf.concat(values=layer_outputs, axis=self._axis)
outputs = tf.reduce_logsumexp(_concat, axis=self._axis)
else:
raise ValueError("Unknown merge mode: '%s'" % self._mode)
if not self.built or not self._vars_built:
self._collect_weights()
self._vars_built = True
return outputs
@property
def layers(self):
"""The list of parallel layers.
"""
return self._layers
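The merge modes in `call` boil down to either concatenation or an element-wise / reduce operation over the parallel outputs. A minimal pure-Python sketch for equal-length 1-D vectors (not Texar's implementation, just the merging arithmetic):

```python
def merge(outputs, mode):
    """Merge a list of equal-length 1-D vectors with the given mode."""
    if mode == 'concat':
        return [x for vec in outputs for x in vec]
    if mode == 'elemwise_sum':
        return [sum(vals) for vals in zip(*outputs)]
    if mode == 'elemwise_mul':
        out = list(outputs[0])
        for vec in outputs[1:]:
            out = [a * b for a, b in zip(out, vec)]
        return out
    raise ValueError("Unknown merge mode: '%s'" % mode)
```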
class SequentialLayer(tf.layers.Layer):
"""A subclass of :tf_main:`tf.layers.Layer `.
A layer that consists of multiple layers connected sequentially.
Args:
layers (list): A list of :tf_main:`tf.layers.Layer
` instances, or a list of hyperparameter dicts
each of which specifies the type and kwargs of a layer (see
the `hparams` argument of :func:`get_layer`). The layers are
connected sequentially.
"""
def __init__(self,
layers,
trainable=True,
name=None,
**kwargs):
super(SequentialLayer, self).__init__(
trainable=trainable, name=name, **kwargs)
if len(layers) == 0:
raise ValueError("'layers' must be a non-empty list.")
self._layers = []
for layer in layers:
if isinstance(layer, tf.layers.Layer):
self._layers.append(layer)
else:
self._layers.append(get_layer(hparams=layer))
# Keeps track of whether trainable variables have been created
self._vars_built = False
def compute_output_shape(self, input_shape):
input_shape = tf.TensorShape(input_shape)
for layer in self._layers:
output_shape = layer.compute_output_shape(input_shape)
input_shape = output_shape
return output_shape
def _collect_weights(self):
"""Collects (non-)trainable weights of each of the layers.
"""
for layer in self._layers:
if self.trainable:
add_variable(
layer._trainable_weights, self._trainable_weights)
else:
add_variable(
layer._trainable_weights, self._non_trainable_weights)
add_variable(
layer._non_trainable_weights, self._non_trainable_weights)
def call(self, inputs, mode=None): # pylint: disable=arguments-differ
training = is_train_mode(mode)
outputs = inputs
for layer in self._layers:
if isinstance(layer, tf.layers.Dropout) or \
isinstance(layer, tf.layers.BatchNormalization):
outputs = layer(outputs, training=training)
else:
outputs = layer(outputs)
if not self.built or not self._vars_built:
self._collect_weights()
self._vars_built = True
return outputs
@property
def layers(self):
"""The list of layers connected sequentially.
"""
return self._layers
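At its core, `SequentialLayer.call` threads each layer's output into the next layer. The composition pattern can be sketched with ordinary callables (a toy illustration, not the Texar class):

```python
def sequential(fns):
    """Compose callables left-to-right, feeding each output to the next."""
    def composed(x):
        for fn in fns:
            x = fn(x)
        return x
    return composed
```

For example, `sequential([f, g])(x)` computes `g(f(x))`.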
def _common_default_conv_dense_kwargs():
"""Returns the default keyword argument values that are common to
convolution layers.
"""
return {
"activation": None,
"use_bias": True,
"kernel_initializer": {
"type": "glorot_uniform_initializer",
"kwargs": {}
},
"bias_initializer": {
"type": "zeros_initializer",
"kwargs": {}
},
"kernel_regularizer": default_regularizer_hparams(),
"bias_regularizer": default_regularizer_hparams(),
"activity_regularizer": default_regularizer_hparams(),
"kernel_constraint": None,
"bias_constraint": None,
"trainable": True,
"name": None
}
def default_conv1d_kwargs():
"""Returns the default keyword argument values of the constructor
of 1D-convolution layer class
:tf_main:`tf.layers.Conv1D `.
.. code-block:: python
{
"filters": 100,
"kernel_size": 3,
"strides": 1,
"padding": 'valid',
"data_format": 'channels_last',
"dilation_rate": 1
"activation": "identity",
"use_bias": True,
"kernel_initializer": {
"type": "glorot_uniform_initializer",
"kwargs": {}
},
"bias_initializer": {
"type": "zeros_initializer",
"kwargs": {}
},
"kernel_regularizer": {
"type": "L1L2",
"kwargs": {
"l1": 0.,
"l2": 0.
}
},
"bias_regularizer": {
# same as in "kernel_regularizer"
# ...
},
"activity_regularizer": {
# same as in "kernel_regularizer"
# ...
},
"kernel_constraint": None,
"bias_constraint": None,
"trainable": True,
"name": None
}
"""
kwargs = _common_default_conv_dense_kwargs()
kwargs.update({
"kernel_size": 3,
"filters": 100,
"strides": 1,
"dilation_rate": 1,
"data_format": "channels_last"
})
return kwargs
def default_conv2d_kwargs():
"""TODO
"""
return {}
def default_conv3d_kwargs():
"""TODO
"""
return {}
def default_conv2d_transpose_kwargs():
"""TODO
"""
return {}
def default_conv3d_transpose_kwargs():
"""TODO
"""
return {}
def default_dense_kwargs():
"""Returns the default keyword argument values of the constructor
of the dense layer class :tf_main:`tf.layers.Dense `.
.. code-block:: python
{
"units": 256,
"activation": "identity",
"use_bias": True,
"kernel_initializer": {
"type": "glorot_uniform_initializer",
"kwargs": {}
},
"bias_initializer": {
"type": "zeros_initializer",
"kwargs": {}
},
"kernel_regularizer": {
"type": "L1L2",
"kwargs": {
"l1": 0.,
"l2": 0.
}
},
"bias_regularizer": {
# same as in "kernel_regularizer"
# ...
},
"activity_regularizer": {
# same as in "kernel_regularizer"
# ...
},
"kernel_constraint": None,
"bias_constraint": None,
"trainable": True,
"name": None
}
"""
kwargs = _common_default_conv_dense_kwargs()
kwargs.update({
"units": 256
})
return kwargs
def default_dropout_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_flatten_kwargs():
"""TODO
"""
return {}
def default_max_pooling1d_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_max_pooling2d_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_max_pooling3d_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_separable_conv2d_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_batch_normalization_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_average_pooling1d_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_average_pooling2d_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
def default_average_pooling3d_kwargs():
"""TODO
"""
return {}
#raise NotImplementedError
_layer_class_to_default_kwargs_map = {
tf.layers.Conv1D: default_conv1d_kwargs(),
tf.layers.Conv2D: default_conv2d_kwargs(),
tf.layers.Conv3D: default_conv3d_kwargs(),
tf.layers.Conv2DTranspose: default_conv2d_transpose_kwargs(),
tf.layers.Conv3DTranspose: default_conv3d_transpose_kwargs(),
tf.layers.Dense: default_dense_kwargs(),
tf.layers.Dropout: default_dropout_kwargs(),
tf.layers.Flatten: default_flatten_kwargs(),
tf.layers.MaxPooling1D: default_max_pooling1d_kwargs(),
tf.layers.MaxPooling2D: default_max_pooling2d_kwargs(),
tf.layers.MaxPooling3D: default_max_pooling3d_kwargs(),
tf.layers.SeparableConv2D: default_separable_conv2d_kwargs(),
tf.layers.BatchNormalization: default_batch_normalization_kwargs(),
tf.layers.AveragePooling1D: default_average_pooling1d_kwargs(),
tf.layers.AveragePooling2D: default_average_pooling2d_kwargs(),
tf.layers.AveragePooling3D: default_average_pooling3d_kwargs(),
}
def layer_normalize(inputs,
scope=None):
'''Applies layer normalization, averaging over the last dimension.
Args:
inputs: A tensor with 2 or more dimensions, where the first
dimension has `batch_size`.
scope: Optional scope for `variable_scope`.
Returns:
A tensor with the same shape and data dtype as `inputs`.
'''
return tf.contrib.layers.layer_norm(
inputs=inputs, begin_norm_axis=-1, begin_params_axis=-1, scope=scope
)
def gelu(input_tensor):
"""Gaussian Error Linear Unit.
This is a smoother version of the RELU.
Original paper: https://arxiv.org/abs/1606.08415
Args:
input_tensor: float Tensor to perform activation.
Returns:
`input_tensor` with the GELU activation applied.
"""
cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
return input_tensor * cdf
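The same erf-based formula can be checked with a scalar version using only the standard library (a reference sketch, independent of TensorFlow):

```python
import math

def gelu_scalar(x):
    """Exact (erf-based) GELU for a single float: x * Phi(x),
    where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Note the behavior: `gelu_scalar(0.0)` is exactly 0, large positive inputs pass through nearly unchanged, and large negative inputs are squashed close to 0.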
================================================
FILE: texar_repo/texar/core/layers_test.py
================================================
#
"""
Unit tests for various layers.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
import tensorflow.contrib.rnn as rnn
import texar as tx
from texar import context
from texar.hyperparams import HParams
from texar.core import layers
# pylint: disable=no-member, protected-access, invalid-name
# pylint: disable=redefined-variable-type
class GetRNNCellTest(tf.test.TestCase):
"""Tests RNN cell creator.
"""
def test_get_rnn_cell(self):
"""Tests :func:`texar.core.layers.get_rnn_cell`.
"""
emb_dim = 4
num_units = 64
# Given instance
hparams = {
"type": rnn.LSTMCell(num_units)
}
cell = layers.get_rnn_cell(hparams)
self.assertTrue(isinstance(cell, rnn.LSTMCell))
# Given class
hparams = {
"type": rnn.LSTMCell,
"kwargs": {"num_units": 10}
}
cell = layers.get_rnn_cell(hparams)
self.assertTrue(isinstance(cell, rnn.LSTMCell))
# Given string, and complex hyperparameters
keep_prob_x = tf.placeholder(
name='keep_prob', shape=[], dtype=tf.float32)
hparams = {
"type": "tensorflow.contrib.rnn.GRUCell",
"kwargs": {
"num_units": num_units
},
"num_layers": 2,
"dropout": {
"input_keep_prob": 0.8,
"state_keep_prob": keep_prob_x,
"variational_recurrent": True,
"input_size": [emb_dim, num_units]
},
"residual": True,
"highway": True
}
hparams_ = HParams(hparams, layers.default_rnn_cell_hparams())
cell = layers.get_rnn_cell(hparams_)
batch_size = 16
inputs = tf.zeros([batch_size, emb_dim], dtype=tf.float32)
output, state = cell(inputs,
cell.zero_state(batch_size, dtype=tf.float32))
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
feed_dict = {
keep_prob_x: 1.0,
context.global_mode(): tf.estimator.ModeKeys.TRAIN
}
output_, state_ = sess.run([output, state], feed_dict=feed_dict)
self.assertEqual(output_.shape[0], batch_size)
if isinstance(state_, (list, tuple)):
self.assertEqual(state_[0].shape[0], batch_size)
self.assertEqual(state_[0].shape[1],
hparams_.kwargs.num_units)
else:
self.assertEqual(state_.shape[0], batch_size)
self.assertEqual(state_.shape[1],
hparams_.kwargs.num_units)
def test_switch_dropout(self):
"""Tests dropout mode.
"""
emb_dim = 4
num_units = 64
hparams = {
"kwargs": {
"num_units": num_units
},
"num_layers": 2,
"dropout": {
"input_keep_prob": 0.8,
},
}
mode = tf.placeholder(tf.string)
hparams_ = HParams(hparams, layers.default_rnn_cell_hparams())
cell = layers.get_rnn_cell(hparams_, mode)
batch_size = 16
inputs = tf.zeros([batch_size, emb_dim], dtype=tf.float32)
output, state = cell(inputs,
cell.zero_state(batch_size, dtype=tf.float32))
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
output_train, _ = sess.run(
[output, state],
feed_dict={mode: tf.estimator.ModeKeys.TRAIN})
self.assertEqual(output_train.shape[0], batch_size)
output_test, _ = sess.run(
[output, state],
feed_dict={mode: tf.estimator.ModeKeys.EVAL})
self.assertEqual(output_test.shape[0], batch_size)
class GetActivationFnTest(tf.test.TestCase):
"""Tests :func:`texar.core.layers.get_activation_fn`.
"""
def test_get_activation_fn(self):
"""Tests.
"""
fn = layers.get_activation_fn()
self.assertEqual(fn, tf.identity)
fn = layers.get_activation_fn('relu')
self.assertEqual(fn, tf.nn.relu)
inputs = tf.random_uniform([64, 100], -5, 20, dtype=tf.float32)
fn = layers.get_activation_fn('leaky_relu')
fn_output = fn(inputs)
ref_output = tf.nn.leaky_relu(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
fn_output_, ref_output_ = sess.run([fn_output, ref_output])
np.testing.assert_array_equal(fn_output_, ref_output_)
fn = layers.get_activation_fn('leaky_relu', kwargs={'alpha': 0.1})
fn_output = fn(inputs)
ref_output = tf.nn.leaky_relu(inputs, alpha=0.1)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
fn_output_, ref_output_ = sess.run([fn_output, ref_output])
np.testing.assert_array_equal(fn_output_, ref_output_)
class GetLayerTest(tf.test.TestCase):
"""Tests layer creator.
"""
def test_get_layer(self):
"""Tests :func:`texar.core.layers.get_layer`.
"""
hparams = {
"type": "Conv1D"
}
layer = layers.get_layer(hparams)
self.assertTrue(isinstance(layer, tf.layers.Conv1D))
hparams = {
"type": "MergeLayer",
"kwargs": {
"layers": [
{"type": "Conv1D"},
{"type": "Conv1D"}
]
}
}
layer = layers.get_layer(hparams)
self.assertTrue(isinstance(layer, tx.core.MergeLayer))
hparams = {
"type": tf.layers.Conv1D
}
layer = layers.get_layer(hparams)
self.assertTrue(isinstance(layer, tf.layers.Conv1D))
hparams = {
"type": tf.layers.Conv1D(filters=10, kernel_size=2)
}
layer = layers.get_layer(hparams)
self.assertTrue(isinstance(layer, tf.layers.Conv1D))
class ReducePoolingLayerTest(tf.test.TestCase):
"""Tests reduce pooling layer.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._batch_size = 64
self._seq_length = 16
self._emb_dim = 100
def test_max_reduce_pooling_layer(self):
"""Tests :class:`texar.core.MaxReducePooling1D`.
"""
pool_layer = layers.MaxReducePooling1D()
inputs = tf.random_uniform(
[self._batch_size, self._seq_length, self._emb_dim])
output_shape = pool_layer.compute_output_shape(inputs.get_shape())
output = pool_layer(inputs)
output_reduce = tf.reduce_max(inputs, axis=1)
self.assertEqual(output.get_shape(), output_shape)
self.assertEqual(output.get_shape(), [self._batch_size, self._emb_dim])
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
output_, output_reduce_ = sess.run([output, output_reduce])
np.testing.assert_array_equal(output_, output_reduce_)
def test_average_reduce_pooling_layer(self):
"""Tests :class:`texar.core.AverageReducePooling1D`.
"""
pool_layer = layers.AverageReducePooling1D()
inputs = tf.random_uniform(
[self._batch_size, self._seq_length, self._emb_dim])
output_shape = pool_layer.compute_output_shape(inputs.get_shape())
output = pool_layer(inputs)
output_reduce = tf.reduce_mean(inputs, axis=1)
self.assertEqual(output.get_shape(), output_shape)
self.assertEqual(output.get_shape(), [self._batch_size, self._emb_dim])
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
output_, output_reduce_ = sess.run([output, output_reduce])
np.testing.assert_array_equal(output_, output_reduce_)
class MergeLayerTest(tf.test.TestCase):
"""Tests MergeLayer.
"""
def test_output_shape(self):
"""Tests MergeLayer.compute_output_shape function.
"""
input_shapes = [[None, 1, 2], [64, 2, 2], [None, 3, 2]]
concat_layer = layers.MergeLayer(mode='concat', axis=1)
concat_output_shape = concat_layer.compute_output_shape(input_shapes)
self.assertEqual(concat_output_shape, [64, 6, 2])
sum_layer = layers.MergeLayer(mode='sum', axis=1)
sum_output_shape = sum_layer.compute_output_shape(input_shapes)
self.assertEqual(sum_output_shape, [64, 2])
input_shapes = [[None, 5, 2], [64, None, 2], [2]]
esum_layer = layers.MergeLayer(mode='elemwise_sum')
esum_output_shape = esum_layer.compute_output_shape(input_shapes)
self.assertEqual(esum_output_shape, [64, 5, 2])
def test_layer_logics(self):
"""Test the logic of MergeLayer.
"""
layers_ = []
layers_.append(tf.layers.Conv1D(filters=200, kernel_size=3))
layers_.append(tf.layers.Conv1D(filters=200, kernel_size=4))
layers_.append(tf.layers.Conv1D(filters=200, kernel_size=5))
layers_.append(tf.layers.Dense(200))
layers_.append(tf.layers.Dense(200))
m_layer = layers.MergeLayer(layers_)
inputs = tf.zeros([64, 16, 1024], dtype=tf.float32)
outputs = m_layer(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_ = sess.run(outputs)
self.assertEqual(outputs_.shape[0], 64)
self.assertEqual(outputs_.shape[2], 200)
self.assertEqual(
outputs_.shape,
m_layer.compute_output_shape(inputs.shape.as_list()))
def test_trainable_variables(self):
"""Test the trainable_variables of the layer.
"""
layers_ = []
layers_.append(tf.layers.Conv1D(filters=200, kernel_size=3))
layers_.append(tf.layers.Conv1D(filters=200, kernel_size=4))
layers_.append(tf.layers.Conv1D(filters=200, kernel_size=5))
layers_.append(tf.layers.Dense(200))
layers_.append(tf.layers.Dense(200))
m_layer = layers.MergeLayer(layers_)
inputs = tf.zeros([64, 16, 1024], dtype=tf.float32)
_ = m_layer(inputs)
num_vars = sum([len(layer.trainable_variables) for layer in layers_])
self.assertEqual(num_vars, len(m_layer.trainable_variables))
class SequentialLayerTest(tf.test.TestCase):
"""Tests sequential layer.
"""
def test_seq_layer(self):
"""Test sequential layer.
"""
layers_ = []
layers_.append(tf.layers.Dense(100))
layers_.append(tf.layers.Dense(200))
seq_layer = layers.SequentialLayer(layers_)
output_shape = seq_layer.compute_output_shape([None, 10])
self.assertEqual(output_shape[1].value, 200)
inputs = tf.zeros([10, 20], dtype=tf.float32)
outputs = seq_layer(inputs)
num_vars = sum([len(layer.trainable_variables) for layer in layers_])
self.assertEqual(num_vars, len(seq_layer.trainable_variables))
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_ = sess.run(outputs)
self.assertEqual(outputs_.shape[0], 10)
self.assertEqual(outputs_.shape[1], 200)
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/core/optimization.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various optimization related utilities.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import re
import tensorflow as tf
from texar.hyperparams import HParams
from texar.utils import utils
# pylint: disable=too-many-arguments, no-member
__all__ = [
"default_optimization_hparams",
"get_optimizer_fn",
"get_learning_rate_decay_fn",
"get_gradient_clip_fn",
"get_optimizer",
"get_train_op",
"AdamWeightDecayOptimizer",
]
def default_optimization_hparams():
"""Returns a `dict` of default hyperparameters of training op
and their default values
.. role:: python(code)
:language: python
.. code-block:: python
{
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001
}
},
"learning_rate_decay": {
"type": "",
"kwargs": {},
"min_learning_rate": 0.,
"start_decay_step": 0,
"end_decay_step": inf
},
"gradient_clip": {
"type": "",
"kwargs": {}
},
"gradient_noise_scale": None,
"name": None
}
Here:
"optimizer" : dict
Hyperparameters of a :tf_main:`tf.train.Optimizer `.
- **"type"** specifies the optimizer class. This can be
- The string name or full module path of an optimizer class. \
If the class name is provided, the class must be in module \
:tf_main:`tf.train `, \
:tf_main:`tf.contrib.opt ` or :mod:`texar.custom` \
, :mod:`texar.core.optimization`
- An optimizer class.
- An instance of an optimizer class.
For example
.. code-block:: python
"type": "AdamOptimizer" # class name
"type": "my_module.MyOptimizer" # module path
"type": tf.contrib.opt.AdamWOptimizer # class
"type": my_module.MyOptimizer # class
"type": GradientDescentOptimizer(learning_rate=0.1) # instance
"type": MyOptimizer(...) # instance
- **"kwargs"** is a `dict` specifying keyword arguments for creating \
the optimizer class instance, with :python:`opt_class(**kwargs)`. \
Ignored if "type" is a class instance.
"learning_rate_decay" : dict
Hyperparameters of learning rate decay function. The learning rate
starts decay from :attr:`"start_decay_step"` and keeps unchanged after
:attr:`"end_decay_step"` or reaching :attr:`"min_learning_rate"`.
The decay function is specified in "type" and "kwargs".
- "type" can be a decay function or its name or module path. If \
function name is provided, it must be from module \
:tf_main:`tf.train ` or :mod:`texar.custom`, \
:mod:`texar.core.optimization`.
- "kwargs" is a `dict` of keyword arguments for the function \
excluding arguments named "global_step" and "learning_rate".
The function is called with
:python:`lr = decay_fn(learning_rate=lr, global_step=offset_step,
**kwargs)`, where `offset_step` is the global step offset as above.
The only exception is :tf_main:`tf.train.piecewise_constant
` which is called with
:python:`lr = piecewise_constant(x=offset_step, **kwargs)`.
"gradient_clip" : dict
Hyperparameters of gradient clipping. The gradient clipping function
takes a list of `(gradients, variables)` tuples and returns a list
of `(clipped_gradients, variables)` tuples. Typical examples include
:tf_main:`tf.clip_by_global_norm `,
:tf_main:`tf.clip_by_value `,
:tf_main:`tf.clip_by_norm `,
:tf_main:`tf.clip_by_average_norm `, etc.
"type" specifies the gradient clip function, and can be a function,
or its name or module path. If function name is provided, the
function must be from module :tf_main:`tf < >` or :mod:`texar.custom`,
:mod:`texar.core.optimization`.
"kwargs" specifies keyword arguments to the function, except arguments
named "t" or "t_list".
The function is called with
:python:`clipped_grads, _ = clip_fn(t_list=grads, **kwargs)`
(e.g., for :tf_main:`tf.clip_by_global_norm `) or
:python:`clipped_grads = [clip_fn(t=grad, **kwargs) for grad in grads]`
(e.g., for :tf_main:`tf.clip_by_value `).
"gradient_noise_scale" : float, optional
Adds 0-mean normal noise scaled by this value to gradient.
"""
return {
"optimizer": {
"type": "AdamOptimizer",
"kwargs": {
"learning_rate": 0.001
}
},
"learning_rate_decay": {
"type": "",
"kwargs": {},
"min_learning_rate": 0.,
"start_decay_step": 0,
"end_decay_step": utils.MAX_SEQ_LENGTH,
},
"gradient_clip": {
"type": "",
"kwargs": {}
},
"gradient_noise_scale": None,
# TODO(zhiting): allow module-level control of gradient_multipliers
"name": None
}
def get_optimizer_fn(hparams=None):
"""Returns a function `optimizer_fn` of making optimizer instance, along
with the optimizer class.
.. role:: python(code)
:language: python
The function has the signature
:python:`optimizer_fn(learning_rate=None) -> optimizer class instance`
See the :attr:`"optimizer"` field of
:meth:`~texar.core.default_optimization_hparams` for all
hyperparameters and default values.
The optimizer class must be a subclass of
:tf_main:`tf.train.Optimizer `.
Args:
hparams (dict or HParams, optional): hyperparameters. Missing
hyperparameters are set to default values automatically.
Returns:
- If hparams["type"] is a string or optimizer class, returns\
`(optimizer_fn, optimizer class)`,
- If hparams["type"] is an optimizer instance, returns \
`(the optimizer instance, optimizer class)`
"""
if hparams is None or isinstance(hparams, dict):
hparams = HParams(
hparams, default_optimization_hparams()["optimizer"])
opt = hparams["type"]
if isinstance(opt, tf.train.Optimizer):
return opt, type(opt)
opt_modules = ['tensorflow.train',
'tensorflow.contrib.opt',
'texar.core.optimization',
'texar.custom']
try:
opt_class = utils.check_or_get_class(opt, opt_modules,
tf.train.Optimizer)
except TypeError:
raise ValueError(
"Unrecognized optimizer. Must be string name of the "
"optimizer class, or the class which is a subclass of "
"tf.train.Optimizer, or an instance of the subclass of "
"Optimizer.")
def _get_opt(learning_rate=None):
opt_kwargs = hparams["kwargs"].todict()
fn_args = set(utils.get_args(opt_class.__init__))
if 'learning_rate' in fn_args and learning_rate is not None:
opt_kwargs["learning_rate"] = learning_rate
return opt_class(**opt_kwargs)
return _get_opt, opt_class
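The string-to-class lookup that `get_optimizer_fn` delegates to `utils.check_or_get_class` can be sketched in plain Python; this `check_or_get_class` and the `fake_tf_train` module below are illustrative stand-ins, not texar's implementation:

```python
import inspect
import types

def check_or_get_class(name_or_class, modules):
    # If already a class, pass it through; otherwise search the given
    # modules for an attribute with that name that is a class.
    if inspect.isclass(name_or_class):
        return name_or_class
    for module in modules:
        cls = getattr(module, name_or_class, None)
        if cls is not None and inspect.isclass(cls):
            return cls
    raise ValueError("Unrecognized class: %s" % name_or_class)

# A fake module standing in for tensorflow.train.
fake_tf_train = types.SimpleNamespace(
    AdamOptimizer=type("AdamOptimizer", (), {}))
opt_class = check_or_get_class("AdamOptimizer", [fake_tf_train])
```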
def get_learning_rate_decay_fn(hparams=None):
"""Creates learning rate decay function based on the hyperparameters.
See the :attr:`learning_rate_decay` field in
:meth:`~texar.core.default_optimization_hparams` for all
hyperparameters and default values.
Args:
hparams (dict or HParams, optional): hyperparameters. Missing
hyperparameters are set to default values automatically.
Returns:
function or None: If hparams["type"] is specified, returns a
function that takes `(learning_rate, step, **kwargs)` and
returns a decayed learning rate. If
hparams["type"] is empty, returns `None`.
"""
if hparams is None or isinstance(hparams, dict):
hparams = HParams(
hparams, default_optimization_hparams()["learning_rate_decay"])
fn_type = hparams["type"]
if fn_type is None or fn_type == "":
return None
fn_modules = ["tensorflow.train", "texar.custom"]
decay_fn = utils.get_function(fn_type, fn_modules)
fn_kwargs = hparams["kwargs"]
if isinstance(fn_kwargs, HParams):
fn_kwargs = fn_kwargs.todict()
start_step = tf.to_int32(hparams["start_decay_step"])
end_step = tf.to_int32(hparams["end_decay_step"])
def lr_decay_fn(learning_rate, global_step):
"""Learning rate decay function.
Args:
learning_rate (float or Tensor): The original learning rate.
global_step (int or scalar int Tensor): optimization step counter.
Returns:
scalar float Tensor: decayed learning rate.
"""
offset_global_step = tf.maximum(
tf.minimum(tf.to_int32(global_step), end_step) - start_step, 0)
if decay_fn == tf.train.piecewise_constant:
decayed_lr = decay_fn(x=offset_global_step, **fn_kwargs)
else:
fn_kwargs_ = {
"learning_rate": learning_rate,
"global_step": offset_global_step}
fn_kwargs_.update(fn_kwargs)
decayed_lr = utils.call_function_with_redundant_kwargs(
decay_fn, fn_kwargs_)
decayed_lr = tf.maximum(decayed_lr, hparams["min_learning_rate"])
return decayed_lr
return lr_decay_fn
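The arithmetic inside `lr_decay_fn` (clamp the step into `[start_decay_step, end_decay_step]`, offset it by the start step, apply the decay function, then floor at `min_learning_rate`) can be sketched without TensorFlow, here using natural exponential decay as the example decay function:

```python
import math

def decayed_lr(lr, step, start=0, end=100, decay_rate=0.5, decay_steps=1,
               min_lr=0.0):
    # Clamp the global step into [start, end] and offset by `start`.
    offset = max(min(step, end) - start, 0)
    # Natural exponential decay, mirroring tf.train.natural_exp_decay.
    lr = lr * math.exp(-decay_rate * offset / decay_steps)
    # Floor the result at min_learning_rate.
    return max(lr, min_lr)
```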
def get_gradient_clip_fn(hparams=None):
"""Creates a gradient clipping function based on the hyperparameters.
See the :attr:`gradient_clip` field in
:meth:`~texar.core.default_optimization_hparams` for all
hyperparameters and default values.
The gradient clipping function takes a list of `(gradients, variables)`
tuples and returns a list of `(clipped_gradients, variables)` tuples.
Typical examples include
:tf_main:`tf.clip_by_global_norm `,
:tf_main:`tf.clip_by_value `,
:tf_main:`tf.clip_by_norm `,
:tf_main:`tf.clip_by_average_norm `, etc.
Args:
hparams (dict or HParams, optional): hyperparameters. Missing
hyperparameters are set to default values automatically.
Returns:
function or `None`: If hparams["type"] is specified, returns
the respective function. If hparams["type"] is empty,
returns `None`.
"""
if hparams is None or isinstance(hparams, dict):
hparams = HParams(
hparams, default_optimization_hparams()["gradient_clip"])
fn_type = hparams["type"]
if fn_type is None or fn_type == "":
return None
fn_modules = ["tensorflow", "texar.custom"]
clip_fn = utils.get_function(fn_type, fn_modules)
clip_fn_args = utils.get_args(clip_fn)
fn_kwargs = hparams["kwargs"]
if isinstance(fn_kwargs, HParams):
fn_kwargs = fn_kwargs.todict()
def grad_clip_fn(grads_and_vars):
"""Gradient clipping function.
Args:
grads_and_vars (list): A list of `(gradients, variables)` tuples.
Returns:
list: A list of `(clipped_gradients, variables)` tuples.
"""
grads, vars_ = zip(*grads_and_vars)
if clip_fn == tf.clip_by_global_norm:
clipped_grads, _ = clip_fn(t_list=grads, **fn_kwargs)
elif 't_list' in clip_fn_args:
clipped_grads = clip_fn(t_list=grads, **fn_kwargs)
elif 't' in clip_fn_args: # e.g., tf.clip_by_value
clipped_grads = [clip_fn(t=grad, **fn_kwargs) for grad in grads]
else:
raise ValueError(
"Unsupported gradient clip function: %s" % fn_type)
return list(zip(clipped_grads, vars_))
return grad_clip_fn
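What `grad_clip_fn` does in the global-norm case can be sketched with plain floats standing in for gradient tensors; this is a simplified mirror of `tf.clip_by_global_norm`, not the TF implementation:

```python
import math

def clip_by_global_norm(grads, clip_norm):
    # Rescale all gradients by clip_norm / global_norm when the global
    # L2 norm exceeds clip_norm; otherwise leave them unchanged.
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= clip_norm:
        return list(grads)
    scale = clip_norm / global_norm
    return [g * scale for g in grads]

# (gradient, variable) pairs; the global norm here is sqrt(9 + 16) = 5.
grads_and_vars = [(3.0, "w"), (4.0, "b")]
grads, vars_ = zip(*grads_and_vars)
clipped = list(zip(clip_by_global_norm(grads, clip_norm=1.0), vars_))
```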
def _get_static_lr(learning_rate=None, optimizer_class=None, hparams=None):
"""Return the base static learning_rate.
A helper function for creating the optimization function.
"""
hparams = HParams(hparams, default_optimization_hparams())
opt_hparams = hparams['optimizer']
if learning_rate is None:
learning_rate = opt_hparams["kwargs"].get("learning_rate", None)
if learning_rate is None:
# Try to get learning_rate from the default value of the
# optimizer's argument
opt_argspec = utils.get_default_arg_values(optimizer_class.__init__)
learning_rate = opt_argspec.get("learning_rate", None)
return learning_rate
def get_optimizer(learning_rate=None, global_step=None, hparams=None):
"""Creates an optimizer instance.
Args:
learning_rate (float or Tensor, optional): If `None`, learning rate
specified in :attr:`hparams`, or the default learning rate
of the optimizer will be used (if exists).
global_step (optional): A scalar int Tensor. Step counter to update on
each step unless :attr:`increment_global_step` is `False`.
Learning rate decay uses :attr:`global_step`.
If `None`, it will be fetched from the default graph (see
:tf_main:`tf.train.get_global_step ` for
more details). If it has not been created, no step will be
incremented with each weight update.
hparams (dict or HParams, optional): hyperparameters. Missing
hyperparameters are set to default values automatically. See
:func:`~texar.core.default_optimization_hparams` for
all hyperparameters and default values.
Returns:
optimizer: the tf.train.Optimizer instance specified in hparams.
"""
hparams = HParams(hparams, default_optimization_hparams())
opt_hparams = hparams["optimizer"]
optimizer_fn, optimizer_class = get_optimizer_fn(opt_hparams)
static_lr = _get_static_lr(learning_rate, optimizer_class, hparams)
lr_decay_fn = get_learning_rate_decay_fn(hparams["learning_rate_decay"])
if lr_decay_fn is not None:
learning_rate = lr_decay_fn(learning_rate=static_lr,
global_step=global_step)
else:
learning_rate = static_lr
tf.summary.scalar("learning_rate", learning_rate)
optimizer = optimizer_fn(learning_rate=learning_rate)
return optimizer
def get_train_op(loss, variables=None,
optimizer=None, learning_rate=None,
global_step=None, increment_global_step=True, hparams=None):
"""Creates a training op.
This is a wrapper of :tf_main:`tf.contrib.layers.optimize_loss
`.
Args:
loss: A scalar Tensor representing the loss to minimize.
variables (optional): A list of Variables to optimize. If
`None`, all trainable variables are used.
optimizer (optional): A tf.train.Optimizer instance. If `None`,
use the setting in `hparams` to create the optimizer.
learning_rate (float or Tensor, optional): If `None`, learning rate
specified in :attr:`hparams`, or the default learning rate
of the optimizer will be used (if exists).
global_step (optional): A scalar int Tensor. Step counter to update on
each step unless :attr:`increment_global_step` is `False`.
Learning rate decay uses :attr:`global_step`.
If `None`, it will be fetched from the default graph (see
:tf_main:`tf.train.get_global_step ` for
more details). If it has not been created, no step will be
incremented with each weight update.
increment_global_step (bool): Whether to increment
:attr:`global_step`. This is useful if the :attr:`global_step` is
used in multiple training ops per training step (e.g. to optimize
different parts of the model) to avoid incrementing
:attr:`global_step` more times than necessary.
hparams (dict or HParams, optional): hyperparameters. Missing
hyperparameters are set to default values automatically. See
:func:`~texar.core.default_optimization_hparams` for
all hyperparameters and default values.
Returns:
train_op: the operator used for variables optimization.
"""
hparams = HParams(hparams, default_optimization_hparams())
grad_clip_fn = get_gradient_clip_fn(hparams["gradient_clip"])
if not isinstance(optimizer, tf.train.Optimizer):
opt_hparams = hparams["optimizer"]
optimizer_fn, optimizer_class = get_optimizer_fn(opt_hparams)
learning_rate = _get_static_lr(learning_rate, optimizer_class, hparams)
lr_decay_fn = get_learning_rate_decay_fn(
hparams["learning_rate_decay"])
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
global_step=global_step,
learning_rate=learning_rate,
optimizer=optimizer_fn,
gradient_noise_scale=hparams["gradient_noise_scale"],
clip_gradients=grad_clip_fn,
learning_rate_decay_fn=lr_decay_fn,
variables=variables,
name=hparams["name"],
increment_global_step=increment_global_step)
else:
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
global_step=global_step,
learning_rate=None,
optimizer=optimizer,
gradient_noise_scale=hparams["gradient_noise_scale"],
clip_gradients=grad_clip_fn,
variables=variables,
name=hparams["name"],
increment_global_step=increment_global_step)
return train_op
class AdamWeightDecayOptimizer(tf.train.Optimizer):
"""
A basic Adam optimizer that includes "correct" L2 weight decay.
Copied from the google BERT repo.
Except that in the `apply_gradients` function, we add support for
incrementing the passed global step parameter, to make it more
compatible with the tf.train.Optimizer implementation.
"""
def __init__(self,
learning_rate,
weight_decay_rate=0.0,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-6,
exclude_from_weight_decay=None,
name="AdamWeightDecayOptimizer"):
"""Constructs an AdamWeightDecayOptimizer."""
super(AdamWeightDecayOptimizer, self).__init__(False, name)
self.learning_rate = learning_rate
self.weight_decay_rate = weight_decay_rate
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.exclude_from_weight_decay = exclude_from_weight_decay
# pylint: disable=too-many-locals
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""See base class."""
# pylint: disable=redefined-argument-from-local
with tf.name_scope(name, self._name) as name:
assignments = []
for (grad, param) in grads_and_vars:
if grad is None or param is None:
continue
param_name = self._get_variable_name(param.name)
m = tf.get_variable(
name=param_name + "/adam_m",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
v = tf.get_variable(
name=param_name + "/adam_v",
shape=param.shape.as_list(),
dtype=tf.float32,
trainable=False,
initializer=tf.zeros_initializer())
# Standard Adam update.
next_m = (tf.multiply(self.beta_1, m)\
+ tf.multiply(1.0 - self.beta_1,
grad))
next_v = (tf.multiply(self.beta_2, v)\
+ tf.multiply(1.0 - self.beta_2,
tf.square(grad)))
update = next_m / (tf.sqrt(next_v) + self.epsilon)
# Just adding the square of the weights to the loss function is
# *not* the correct way of using L2 regularization/weight decay
# with Adam, since that will interact with the m and v
# parameters in strange ways.
# Instead we want to decay the weights in a manner that doesn't
# interact with the m/v parameters.
# This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.
if self._do_use_weight_decay(param_name):
update += self.weight_decay_rate * param
update_with_lr = self.learning_rate * update
next_param = param - update_with_lr
assignments.extend(
[param.assign(next_param),
m.assign(next_m),
v.assign(next_v)])
update_ops = assignments
if global_step is None:
apply_updates = self._finish(update_ops, name)
else:
with tf.control_dependencies([self._finish(update_ops,
"update")]):
with tf.colocate_with(global_step):
apply_updates = tf.assign_add(global_step, 1, name=name)
return apply_updates
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if not self.weight_decay_rate:
return False
if self.exclude_from_weight_decay:
for r in self.exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
def _get_variable_name(self, param_name):
"""Get the variable name from the tensor name."""
m = re.match("^(.*):\\d+$", param_name)
if m is not None:
param_name = m.group(1)
return param_name
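The decoupled weight-decay update implemented above can be sketched for a single scalar parameter. Note that, like the BERT optimizer it is copied from, this sketch applies no bias correction to `m` and `v`; the decay term is added to the update directly rather than to the loss:

```python
import math

def adamw_step(param, grad, m, v, lr=0.001, beta_1=0.9, beta_2=0.999,
               epsilon=1e-6, weight_decay_rate=0.01):
    # Standard Adam moment updates (no bias correction, as above).
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad * grad
    # Decoupled weight decay: added to the update, not the loss.
    update = m / (math.sqrt(v) + epsilon) + weight_decay_rate * param
    return param - lr * update, m, v

param, m, v = 1.0, 0.0, 0.0
param, m, v = adamw_step(param, grad=0.5, m=m, v=v)
```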
================================================
FILE: texar_repo/texar/core/optimization_test.py
================================================
#
"""
Unit tests for various optimization related utilities.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
import texar.core.optimization as opt
from texar.utils import utils
class OptimizationTest(tf.test.TestCase):
"""Tests optimization.
"""
def test_get_optimizer(self):
"""Tests get_optimizer.
"""
default_optimizer_fn, optimizer_class = opt.get_optimizer_fn(
opt.default_optimization_hparams()["optimizer"])
default_optimizer = default_optimizer_fn(1.0)
self.assertTrue(issubclass(optimizer_class, tf.train.Optimizer))
self.assertIsInstance(default_optimizer, tf.train.AdamOptimizer)
hparams = {
"type": "MomentumOptimizer",
"kwargs": {
"learning_rate": 0.001,
"momentum": 0.9,
"use_nesterov": True
}
}
momentum_optimizer_fn, _ = opt.get_optimizer_fn(hparams)
momentum_optimizer = momentum_optimizer_fn()
self.assertIsInstance(momentum_optimizer, tf.train.MomentumOptimizer)
hparams = {
"type": tf.train.MomentumOptimizer,
"kwargs": {
"momentum": 0.9,
"use_nesterov": True
}
}
momentum_optimizer_fn, _ = opt.get_optimizer_fn(hparams)
momentum_optimizer = momentum_optimizer_fn(0.001)
self.assertIsInstance(momentum_optimizer, tf.train.MomentumOptimizer)
hparams = {
"type": tf.train.MomentumOptimizer(0.001, 0.9)
}
momentum_optimizer, _ = opt.get_optimizer_fn(hparams)
self.assertIsInstance(momentum_optimizer, tf.train.MomentumOptimizer)
def test_get_learning_rate_decay_fn(self): # pylint: disable=too-many-locals
"""Tests get_learning_rate_decay_fn.
"""
default_lr_decay_fn = opt.get_learning_rate_decay_fn(
opt.default_optimization_hparams()["learning_rate_decay"])
self.assertIsNone(default_lr_decay_fn)
boundaries = [2, 4]
values = [0.1, 0.01, 0.001]
hparams = {
"type": "piecewise_constant",
"kwargs": {
"boundaries": boundaries,
"values": values
},
"min_learning_rate": 0.05,
"start_decay_step": 1,
"end_decay_step": utils.MAX_SEQ_LENGTH,
}
pc_lr_decay_fn = opt.get_learning_rate_decay_fn(hparams)
global_step = 1
pc_lr = pc_lr_decay_fn(learning_rate=1., global_step=global_step)
pc_lr_true = tf.train.piecewise_constant(
global_step-hparams["start_decay_step"], boundaries, values)
hparams["type"] = "natural_exp_decay"
hparams["kwargs"] = {
"decay_steps": 1,
"decay_rate": 0.5
}
ned_lr_decay_fn = opt.get_learning_rate_decay_fn(hparams)
ned_lr = ned_lr_decay_fn(learning_rate=1., global_step=global_step)
ned_lr_true = tf.train.natural_exp_decay(
1., global_step-hparams["start_decay_step"],
hparams["kwargs"]["decay_steps"], hparams["kwargs"]["decay_rate"])
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
pc_lr_, pc_lr_true_, ned_lr_, ned_lr_true_ = sess.run(
[pc_lr, pc_lr_true, ned_lr, ned_lr_true])
self.assertEqual(pc_lr_, pc_lr_true_)
self.assertEqual(ned_lr_, ned_lr_true_)
def test_get_gradient_clip_fn(self): # pylint: disable=too-many-locals
"""Tests get_gradient_clip_fn.
"""
default_grad_clip_fn = opt.get_gradient_clip_fn(
opt.default_optimization_hparams()["gradient_clip"])
self.assertIsNone(default_grad_clip_fn)
grads = [tf.random_uniform([10, 10], -1., 1.) for _ in range(5)]
grads_and_vars = list(zip(grads, range(5)))
hparams = {
"type": "clip_by_global_norm",
"kwargs": {
"clip_norm": 0.1
}
}
gn_grad_clip_fn = opt.get_gradient_clip_fn(hparams)
gn_grads_and_vars = gn_grad_clip_fn(grads_and_vars)
gn_grads, _ = zip(*gn_grads_and_vars)
gn_grads_true, _ = tf.clip_by_global_norm(
grads, hparams["kwargs"]["clip_norm"])
hparams = {
"type": "clip_by_value",
"kwargs": {
"clip_value_min": -0.01,
"clip_value_max": 0.01
}
}
v_grad_clip_fn = opt.get_gradient_clip_fn(hparams)
v_grads_and_vars = v_grad_clip_fn(grads_and_vars)
v_grads, _ = zip(*v_grads_and_vars)
v_grads_true = tf.clip_by_value(grads,
hparams["kwargs"]["clip_value_min"],
hparams["kwargs"]["clip_value_max"])
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
gn_grads_, gn_grads_true_, v_grads_, v_grads_true_ = sess.run(
[gn_grads, gn_grads_true, v_grads, v_grads_true])
np.testing.assert_array_equal(gn_grads_, gn_grads_true_)
np.testing.assert_array_equal(v_grads_, v_grads_true_)
def test_get_train_op(self):
"""Tests get_train_op.
"""
var = tf.Variable(0.)
loss = tf.nn.l2_loss(var)
train_op = opt.get_train_op(loss)
self.assertTrue(tf.contrib.framework.is_tensor(train_op))
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/core/replay_memories.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Classes and utilities for replay memory in RL.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from collections import deque
import random
from texar.hyperparams import HParams
__all__ = [
"ReplayMemoryBase",
"DequeReplayMemory"
]
class ReplayMemoryBase(object):
"""Base class of replay memory inherited by all replay memory classes.
Args:
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters are set to default values. See
:meth:`default_hparams` for the defaults.
"""
def __init__(self, hparams=None):
self._hparams = HParams(hparams, self.default_hparams())
@staticmethod
def default_hparams():
"""Returns a `dict` of hyperparameters and their default values.
.. code-block:: python
{
'name': 'replay_memory'
}
"""
return {
'name': 'replay_memory'
}
def add(self, element):
"""Inserts a memory entry.
"""
raise NotImplementedError
def get(self, size):
"""Pops a memory entry.
"""
raise NotImplementedError
def last(self):
"""Returns the latest element in the memory.
"""
raise NotImplementedError
def size(self):
"""Returns the current size of the memory.
"""
raise NotImplementedError
class DequeReplayMemory(ReplayMemoryBase):
"""A deque-based replay memory that accepts new memory entries and deletes
the oldest entry when exceeding the capacity. Memory entries are
accessed in random order.
Args:
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters are set to default values. See
:meth:`default_hparams` for the defaults.
"""
def __init__(self, hparams=None):
ReplayMemoryBase.__init__(self, hparams)
self.deque = deque()
self.capacity = self._hparams.capacity
@staticmethod
def default_hparams():
"""Returns a `dict` of hyperparameters and their default values.
.. code-block:: python
{
'capacity': 80000,
'name': 'deque_replay_memory',
}
Here:
"capacity" : int
Maximum size of memory kept. The oldest memories are deleted once
the capacity is exceeded.
"""
return {
'name': 'deque_replay_memory',
'capacity': 80000
}
def add(self, element):
"""Appends an element to the memory and deletes the oldest entry if
the capacity is exceeded.
"""
self.deque.append(element)
if len(self.deque) > self.capacity:
self.deque.popleft()
#TODO(zhiting): is it okay to have stand alone random generator ?
def get(self, size):
"""Randomly samples :attr:`size` entries from the memory. Returns
a list.
"""
return random.sample(self.deque, size)
def last(self):
"""Returns the latest element in the memory.
"""
return self.deque[-1]
def size(self):
"""Returns the current size of the memory.
"""
return len(self.deque)
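A standalone sketch of the same behavior using only the standard library. One difference worth noting: `deque(maxlen=...)` evicts the oldest entry automatically, whereas the class above calls `popleft` explicitly:

```python
from collections import deque
import random

# Bounded replay memory: capacity 3, oldest entries evicted automatically.
memory = deque(maxlen=3)
for step in range(5):
    memory.append({"step": step})

# Random access, mirroring DequeReplayMemory.get(size).
sample = random.sample(list(memory), 2)
```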
================================================
FILE: texar_repo/texar/data/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.data.data_utils import *
from texar.data.data import *
from texar.data.data_decoders import *
from texar.data.vocabulary import *
from texar.data.embedding import *
================================================
FILE: texar_repo/texar/data/data/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library data inputs.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.data.data.data_base import *
from texar.data.data.scalar_data import *
from texar.data.data.text_data_base import *
from texar.data.data.mono_text_data import *
from texar.data.data.paired_text_data import *
from texar.data.data.multi_aligned_data import *
from texar.data.data.data_iterators import *
from texar.data.data.dataset_utils import *
================================================
FILE: texar_repo/texar/data/data/data_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base data class that is inherited by all data classes.
A data class defines data reading, parsing, batching, and other
preprocessing operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar.hyperparams import HParams
from texar.data.data import dataset_utils as dsutils
from texar.data.data_utils import count_file_lines
__all__ = [
"DataBase"
]
class DataBase(object):
"""Base class inherited by all data classes.
"""
def __init__(self, hparams):
self._hparams = HParams(hparams, self.default_hparams())
@staticmethod
def default_hparams():
"""Returns a dictionary of default hyperparameters.
.. code-block:: python
{
"num_epochs": 1,
"batch_size": 64,
"allow_smaller_final_batch": True,
"shuffle": True,
"shuffle_buffer_size": None,
"shard_and_shuffle": False,
"num_parallel_calls": 1,
"prefetch_buffer_size": 0,
"max_dataset_size": -1,
"seed": None,
"name": "data",
}
Here:
"num_epochs" : int
Number of times the dataset should be repeated. An
:tf_main:`OutOfRangeError ` signal will
be raised after the whole repeated dataset has been iterated
through.
E.g., for training data, set it to 1 (default) so that you
will get the signal after each epoch of training. Set to -1
to repeat the dataset indefinitely.
"batch_size" : int
Batch size, i.e., the number of consecutive elements of the
dataset to combine in a single batch.
"allow_smaller_final_batch" : bool
Whether to allow the final batch to be smaller if there are
insufficient elements left. If `False`, the final batch is
discarded if it is smaller than batch size. Note that,
if `False`, `output_shapes` of the resulting dataset
will have a **static** batch_size dimension equal to
"batch_size".
"shuffle" : bool
Whether to randomly shuffle the elements of the dataset.
"shuffle_buffer_size" : int
The buffer size for data shuffling. The larger, the better
the resulting data is mixed.
If `None` (default), buffer size is set to the size of the
whole dataset (i.e., shuffling is maximally effective).
"shard_and_shuffle" : bool
Whether to first shard the dataset and then shuffle each
block respectively. Useful when the whole data is too large to
be loaded efficiently into the memory.
If `True`, :attr:`shuffle_buffer_size` must be specified to
determine the size of each shard.
"num_parallel_calls" : int
Number of elements from the datasets to process in parallel.
"prefetch_buffer_size" : int
The maximum number of elements that will be buffered when
prefetching.
"max_dataset_size" : int
Maximum number of instances to include in
the dataset. If set to `-1` or greater than the size of
dataset, all instances will be included. This constraint is
imposed after data shuffling and filtering.
"seed" : int, optional
The random seed for shuffle.
Note that if a seed is set, the shuffle order will be exactly
the same every time you go through the (repeated) dataset.
For example, consider a dataset with elements [1, 2, 3], with
"num_epochs"`=2` and some fixed seed, the resulting sequence
can be: 2 1 3, 1 3 2 | 2 1 3, 1 3 2, ... That is, the orders are
different **within** every `num_epochs`, but are the same
**across** the `num_epochs`.
"name" : str
Name of the data.
"""
return {
"name": "data",
"num_epochs": 1,
"batch_size": 64,
"allow_smaller_final_batch": True,
"shuffle": True,
"shuffle_buffer_size": None,
"shard_and_shuffle": False,
"num_parallel_calls": 1,
"prefetch_buffer_size": 0,
"max_dataset_size": -1,
"seed": None
}
@staticmethod
def _make_batch(dataset, hparams, padded_batch=False, padding_values=None):
dataset = dataset.repeat(hparams.num_epochs)
batch_size = hparams["batch_size"]
if hparams["allow_smaller_final_batch"]:
if padded_batch:
dataset = dataset.padded_batch(
batch_size, dataset.output_shapes,
padding_values=padding_values)
else:
dataset = dataset.batch(batch_size)
else:
dataset = dataset.apply(
tf.contrib.data.padded_batch_and_drop_remainder(
batch_size, dataset.output_shapes,
padding_values=padding_values))
return dataset
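The two batching modes `_make_batch` switches between can be sketched with plain lists: keep a smaller final batch, or drop the remainder. `make_batches` is an illustrative helper, not texar's API:

```python
def make_batches(items, batch_size, allow_smaller_final_batch=True):
    # When the final smaller batch is not allowed, only take as many
    # items as fill complete batches (mirroring *_drop_remainder).
    end = len(items) if allow_smaller_final_batch \
        else (len(items) // batch_size) * batch_size
    return [items[i:i + batch_size] for i in range(0, end, batch_size)]
```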
@staticmethod
def _shuffle_dataset(dataset, hparams, dataset_files):
dataset_size = None
shuffle_buffer_size = hparams["shuffle_buffer_size"]
if hparams["shard_and_shuffle"]:
if shuffle_buffer_size is None:
raise ValueError(
"Dataset hyperparameter 'shuffle_buffer_size' "
"must not be `None` if 'shard_and_shuffle'=`True`.")
dataset_size = count_file_lines(dataset_files)
if shuffle_buffer_size >= dataset_size:
raise ValueError(
"Dataset size (%d) <= shuffle_buffer_size (%d). Set "
"shuffle_and_shard to `False`." %
(dataset_size, shuffle_buffer_size))
#TODO(zhiting): Use a different seed?
dataset = dataset.apply(dsutils.random_shard_dataset(
dataset_size, shuffle_buffer_size, hparams["seed"]))
dataset = dataset.shuffle(shuffle_buffer_size + 16, # add a margin
seed=hparams["seed"])
elif hparams["shuffle"]:
if shuffle_buffer_size is None:
dataset_size = count_file_lines(dataset_files)
shuffle_buffer_size = dataset_size
dataset = dataset.shuffle(shuffle_buffer_size, seed=hparams["seed"])
return dataset, dataset_size
@property
def num_epochs(self):
"""Number of epochs.
"""
return self._hparams.num_epochs
@property
def batch_size(self):
"""The batch size.
"""
return self._hparams.batch_size
@property
def hparams(self):
"""A :class:`~texar.HParams` instance of the
data hyperparameters.
"""
return self._hparams
@property
def name(self):
"""Name of the module.
"""
return self._hparams.name
================================================
FILE: texar_repo/texar/data/data/data_iterators.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various data iterator classes.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
import texar as tx
from texar.utils.variables import get_unique_named_variable_scope
__all__ = [
"DataIteratorBase",
"DataIterator",
"TrainTestDataIterator",
"FeedableDataIterator",
"TrainTestFeedableDataIterator"
]
class DataIteratorBase(object):
"""Base class for all data iterator classes to inherit. A data iterator
is a wrapper of :tf_main:`tf.data.Iterator `, and can
switch between and iterate through **multiple** datasets.
Args:
datasets: Datasets to iterate through. This can be:
- A single instance of :tf_main:`tf.data.Dataset ` \
or instance of subclass of :class:`~texar.data.DataBase`.
- A `dict` that maps dataset name to \
instance of :tf_main:`tf.data.Dataset ` or \
subclass of :class:`~texar.data.DataBase`.
- A `list` of instances of subclasses of \
:class:`texar.data.DataBase`. The name of instances \
(:attr:`texar.data.DataBase.name`) must be unique.
"""
def __init__(self, datasets):
self._default_dataset_name = 'data'
if isinstance(datasets, (tf.data.Dataset, tx.data.DataBase)):
datasets = {self._default_dataset_name: datasets}
elif isinstance(datasets, (list, tuple)):
if any(not isinstance(d, tx.data.DataBase) for d in datasets):
raise ValueError("`datasets` must be a non-empty list of "
"`texar.data.DataBase` instances.")
num_datasets = len(datasets)
datasets = {d.name: d for d in datasets}
if len(datasets) < num_datasets:
raise ValueError("Names of datasets must be unique.")
_datasets = {}
for k, v in datasets.items(): # pylint: disable=invalid-name
_datasets[k] = v if isinstance(v, tf.data.Dataset) else v.dataset
self._datasets = _datasets
if len(self._datasets) <= 0:
raise ValueError("`datasets` must not be empty.")
@property
def num_datasets(self):
"""Number of datasets.
"""
return len(self._datasets)
@property
def dataset_names(self):
"""A list of dataset names.
"""
return list(self._datasets.keys())
class DataIterator(DataIteratorBase):
"""Data iterator that switches and iterates through multiple datasets.
This is a wrapper of TF reinitializable :tf_main:`iterator `.
Args:
datasets: Datasets to iterate through. This can be:
- A single instance of :tf_main:`tf.data.Dataset ` \
or instance of subclass of :class:`~texar.data.DataBase`.
- A `dict` that maps dataset name to \
instance of :tf_main:`tf.data.Dataset ` or \
subclass of :class:`~texar.data.DataBase`.
- A `list` of instances of subclasses of \
:class:`texar.data.DataBase`. The name of instances \
(:attr:`texar.data.DataBase.name`) must be unique.
Example:
.. code-block:: python
train_data = MonoTextData(hparams_train)
test_data = MonoTextData(hparams_test)
iterator = DataIterator({'train': train_data, 'test': test_data})
batch = iterator.get_next()
sess = tf.Session()
for _ in range(200): # Run 200 epochs of train/test
# Starts iterating through training data from the beginning
iterator.switch_to_dataset(sess, 'train')
while True:
try:
train_batch_ = sess.run(batch)
except tf.errors.OutOfRangeError:
print("End of training epoch.")
# Starts iterating through test data from the beginning
iterator.switch_to_dataset(sess, 'test')
while True:
try:
test_batch_ = sess.run(batch)
except tf.errors.OutOfRangeError:
print("End of test epoch.")
"""
def __init__(self, datasets):
DataIteratorBase.__init__(self, datasets)
self._variable_scope = get_unique_named_variable_scope('data_iterator')
with tf.variable_scope(self._variable_scope):
first_dataset = self._datasets[sorted(self.dataset_names)[0]]
self._iterator = tf.data.Iterator.from_structure(
first_dataset.output_types, first_dataset.output_shapes)
self._iterator_init_ops = {
name: self._iterator.make_initializer(d)
for name, d in self._datasets.items()
}
def switch_to_dataset(self, sess, dataset_name=None):
"""Re-initializes the iterator of a given dataset and starts iterating
over the dataset (from the beginning).
Args:
sess: The current tf session.
dataset_name (optional): Name of the dataset. If not provided,
there must be only one Dataset.
"""
if dataset_name is None:
if self.num_datasets > 1:
raise ValueError("`dataset_name` is required if there is "
"more than one dataset.")
dataset_name = next(iter(self._datasets))
if dataset_name not in self._datasets:
raise ValueError("Dataset not found: {}".format(dataset_name))
sess.run(self._iterator_init_ops[dataset_name])
def get_next(self):
"""Returns the next element of the activated dataset.
"""
return self._iterator.get_next()
class TrainTestDataIterator(DataIterator):
"""Data iterator that alternates between train, val, and test datasets.
:attr:`train`, :attr:`val`, and :attr:`test` can be instance of
either :tf_main:`tf.data.Dataset ` or subclass of
:class:`~texar.data.DataBase`. At least one of them must be provided.
This is a wrapper of :class:`~texar.data.DataIterator`.
Args:
train (optional): Training data.
val (optional): Validation data.
test (optional): Test data.
Example:
.. code-block:: python
train_data = MonoTextData(hparams_train)
val_data = MonoTextData(hparams_val)
iterator = TrainTestDataIterator(train=train_data, val=val_data)
batch = iterator.get_next()
sess = tf.Session()
for _ in range(200): # Run 200 epochs of train/val
# Starts iterating through training data from the beginning
iterator.switch_to_train_data(sess)
while True:
try:
train_batch_ = sess.run(batch)
except tf.errors.OutOfRangeError:
print("End of training epoch.")
# Starts iterating through val data from the beginning
iterator.switch_to_val_data(sess)
while True:
try:
val_batch_ = sess.run(batch)
except tf.errors.OutOfRangeError:
print("End of val epoch.")
"""
def __init__(self, train=None, val=None, test=None):
dataset_dict = {}
self._train_name = 'train'
self._val_name = 'val'
self._test_name = 'test'
if train is not None:
dataset_dict[self._train_name] = train
if val is not None:
dataset_dict[self._val_name] = val
if test is not None:
dataset_dict[self._test_name] = test
if len(dataset_dict) == 0:
raise ValueError("At least one of `train`, `val`, and `test` "
"must be provided.")
DataIterator.__init__(self, dataset_dict)
def switch_to_train_data(self, sess):
"""Starts to iterate through training data (from the beginning).
Args:
sess: The current tf session.
"""
if self._train_name not in self._datasets:
raise ValueError("Training data not provided.")
self.switch_to_dataset(sess, self._train_name)
def switch_to_val_data(self, sess):
"""Starts to iterate through val data (from the beginning).
Args:
sess: The current tf session.
"""
if self._val_name not in self._datasets:
raise ValueError("Val data not provided.")
self.switch_to_dataset(sess, self._val_name)
def switch_to_test_data(self, sess):
"""Starts to iterate through test data (from the beginning).
Args:
sess: The current tf session.
"""
if self._test_name not in self._datasets:
raise ValueError("Test data not provided.")
self.switch_to_dataset(sess, self._test_name)
class FeedableDataIterator(DataIteratorBase):
"""Data iterator that iterates through **multiple** datasets and switches
between datasets.
The iterator can switch to a dataset and resume from where we
left off last time we visited the dataset. This is a wrapper of TF
feedable :tf_main:`iterator `.
Args:
datasets: Datasets to iterate through. This can be:
- A single instance of :tf_main:`tf.data.Dataset ` \
or instance of subclass of :class:`~texar.data.DataBase`.
- A `dict` that maps dataset name to \
instance of :tf_main:`tf.data.Dataset ` or \
subclass of :class:`~texar.data.DataBase`.
- A `list` of instances of subclasses of \
:class:`texar.data.DataBase`. The name of instances \
(:attr:`texar.data.DataBase.name`) must be unique.
Example:
.. code-block:: python
train_data = MonoTextData(hparams={'num_epochs': 200, ...})
test_data = MonoTextData(hparams_test)
iterator = FeedableDataIterator({'train': train_data,
'test': test_data})
batch = iterator.get_next()
sess = tf.Session()
def _eval_epoch(): # Iterate through test data for one epoch
# Initialize and start from beginning of test data
iterator.initialize_dataset(sess, 'test')
while True:
try:
feed_dict = { # Read from test data
iterator.handle: iterator.get_handle(sess, 'test')
}
test_batch_ = sess.run(batch, feed_dict=feed_dict)
except tf.errors.OutOfRangeError:
print("End of test epoch.")
# Initialize and start from beginning of training data
iterator.initialize_dataset(sess, 'train')
step = 0
while True:
try:
feed_dict = { # Read from training data
iterator.handle: iterator.get_handle(sess, 'train')
}
train_batch_ = sess.run(batch, feed_dict=feed_dict)
step += 1
if step % 200 == 0: # Evaluate periodically
_eval_epoch()
except tf.errors.OutOfRangeError:
print("End of training.")
"""
def __init__(self, datasets):
DataIteratorBase.__init__(self, datasets)
self._variable_scope = get_unique_named_variable_scope(
'feedable_data_iterator')
with tf.variable_scope(self._variable_scope):
self._handle = tf.placeholder(tf.string, shape=[], name='handle')
first_dataset = self._datasets[sorted(self.dataset_names)[0]]
self._iterator = tf.data.Iterator.from_string_handle(
self._handle, first_dataset.output_types,
first_dataset.output_shapes)
self._dataset_iterators = {
name: dataset.make_initializable_iterator()
for name, dataset in self._datasets.items()
}
def get_handle(self, sess, dataset_name=None):
"""Returns a dataset handle used to feed the
:attr:`handle` placeholder to fetch data from the dataset.
Args:
sess: The current tf session.
dataset_name (optional): Name of the dataset. If not provided,
there must be only one Dataset.
Returns:
A string handle to be fed to the :attr:`handle` placeholder.
Example:
.. code-block:: python
next_element = iterator.get_next()
train_handle = iterator.get_handle(sess, 'train')
# Gets the next training element
ne_ = sess.run(next_element,
feed_dict={iterator.handle: train_handle})
"""
if dataset_name is None:
if self.num_datasets > 1:
raise ValueError("`dataset_name` is required if there is "
"more than one dataset.")
dataset_name = next(iter(self._datasets))
if dataset_name not in self._datasets:
raise ValueError("Dataset not found: {}".format(dataset_name))
return sess.run(self._dataset_iterators[dataset_name].string_handle())
def restart_dataset(self, sess, dataset_name=None):
"""Restarts datasets so that next iteration will fetch data from
the beginning of the datasets.
Args:
sess: The current tf session.
dataset_name (optional): A dataset name or a list of dataset names
that specifies which dataset(s) to restart. If `None`, all
datasets are restarted.
"""
self.initialize_dataset(sess, dataset_name)
def initialize_dataset(self, sess, dataset_name=None):
"""Initializes datasets. A dataset must be initialized before being
used.
Args:
sess: The current tf session.
dataset_name (optional): A dataset name or a list of dataset names
that specifies which dataset(s) to initialize. If `None`, all
datasets are initialized.
"""
if dataset_name is None:
dataset_name = self.dataset_names
if not isinstance(dataset_name, (tuple, list)):
dataset_name = [dataset_name]
for name in dataset_name:
sess.run(self._dataset_iterators[name].initializer)
def get_next(self):
"""Returns the next element of the activated dataset.
"""
return self._iterator.get_next()
@property
def handle(self):
"""The handle placeholder that can be fed with a dataset handle to
fetch data from the dataset.
"""
return self._handle
class TrainTestFeedableDataIterator(FeedableDataIterator):
"""Feedable data iterator that alternates between train, val, and test
datasets.
This is a wrapper of :class:`~texar.data.FeedableDataIterator`.
The iterator can switch to a dataset and resume from where it left off
the last time that dataset was visited.
:attr:`train`, :attr:`val`, and :attr:`test` can be instance of
either :tf_main:`tf.data.Dataset ` or subclass of
:class:`~texar.data.DataBase`. At least one of them must be provided.
Args:
train (optional): Training data.
val (optional): Validation data.
test (optional): Test data.
Example:
.. code-block:: python
train_data = MonoTextData(hparams={'num_epochs': 200, ...})
test_data = MonoTextData(hparams_test)
iterator = TrainTestFeedableDataIterator(train=train_data,
test=test_data)
batch = iterator.get_next()
sess = tf.Session()
def _eval_epoch(): # Iterate through test data for one epoch
# Initialize and start from beginning of test data
iterator.initialize_test_dataset(sess)
while True:
try:
feed_dict = { # Read from test data
iterator.handle: iterator.get_test_handle(sess)
}
test_batch_ = sess.run(batch, feed_dict=feed_dict)
except tf.errors.OutOfRangeError:
print("End of test epoch.")
# Initialize and start from beginning of training data
iterator.initialize_train_dataset(sess)
step = 0
while True:
try:
feed_dict = { # Read from training data
iterator.handle: iterator.get_train_handle(sess)
}
train_batch_ = sess.run(batch, feed_dict=feed_dict)
step += 1
if step % 200 == 0: # Evaluate periodically
_eval_epoch()
except tf.errors.OutOfRangeError:
print("End of training.")
"""
def __init__(self, train=None, val=None, test=None):
dataset_dict = {}
self._train_name = 'train'
self._val_name = 'val'
self._test_name = 'test'
if train is not None:
dataset_dict[self._train_name] = train
if val is not None:
dataset_dict[self._val_name] = val
if test is not None:
dataset_dict[self._test_name] = test
if len(dataset_dict) == 0:
raise ValueError("At least one of `train`, `val`, and `test` "
"must be provided.")
FeedableDataIterator.__init__(self, dataset_dict)
def get_train_handle(self, sess):
"""Returns the handle of the training dataset. The handle can be used
to feed the :attr:`handle` placeholder to fetch training data.
Args:
sess: The current tf session.
Returns:
A string handle to be fed to the :attr:`handle` placeholder.
Example:
.. code-block:: python
next_element = iterator.get_next()
train_handle = iterator.get_train_handle(sess)
# Gets the next training element
ne_ = sess.run(next_element,
feed_dict={iterator.handle: train_handle})
"""
if self._train_name not in self._datasets:
raise ValueError("Training data not provided.")
return self.get_handle(sess, self._train_name)
def get_val_handle(self, sess):
"""Returns the handle of the validation dataset. The handle can be used
to feed the :attr:`handle` placeholder to fetch validation data.
Args:
sess: The current tf session.
Returns:
A string handle to be fed to the :attr:`handle` placeholder.
"""
if self._val_name not in self._datasets:
raise ValueError("Val data not provided.")
return self.get_handle(sess, self._val_name)
def get_test_handle(self, sess):
"""Returns the handle of the test dataset. The handle can be used
to feed the :attr:`handle` placeholder to fetch test data.
Args:
sess: The current tf session.
Returns:
A string handle to be fed to the :attr:`handle` placeholder.
"""
if self._test_name not in self._datasets:
raise ValueError("Test data not provided.")
return self.get_handle(sess, self._test_name)
def restart_train_dataset(self, sess):
"""Restarts the training dataset so that next iteration will fetch
data from the beginning of the training dataset.
Args:
sess: The current tf session.
"""
if self._train_name not in self._datasets:
raise ValueError("Training data not provided.")
self.restart_dataset(sess, self._train_name)
def restart_val_dataset(self, sess):
"""Restarts the validation dataset so that next iteration will fetch
data from the beginning of the validation dataset.
Args:
sess: The current tf session.
"""
if self._val_name not in self._datasets:
raise ValueError("Val data not provided.")
self.restart_dataset(sess, self._val_name)
def restart_test_dataset(self, sess):
"""Restarts the test dataset so that next iteration will fetch
data from the beginning of the test dataset.
Args:
sess: The current tf session.
"""
if self._test_name not in self._datasets:
raise ValueError("Test data not provided.")
self.restart_dataset(sess, self._test_name)
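`DataIteratorBase.__init__` above normalizes its `datasets` argument (a single dataset, a list, or a dict) into one name-to-dataset mapping with unique, non-empty names. The following is a framework-free sketch of that normalization, with plain dicts standing in for `tf.data.Dataset`/`DataBase` objects; the `'name'` key here is an illustrative stand-in for the `DataBase.name` attribute, not the real API.

```python
# Sketch of DataIteratorBase's argument normalization: a single dataset,
# a list of named datasets, or a dict are all reduced to {name: dataset}.
def normalize_datasets(datasets, default_name='data'):
    if isinstance(datasets, dict):
        named = dict(datasets)
    elif isinstance(datasets, (list, tuple)):
        # 'name' key mimics DataBase.name; duplicate names collapse in the
        # dict comprehension, which is how uniqueness is detected.
        named = {d['name']: d for d in datasets}
        if len(named) < len(datasets):
            raise ValueError("Names of datasets must be unique.")
    else:
        # A single dataset gets the default name.
        named = {default_name: datasets}
    if not named:
        raise ValueError("`datasets` must not be empty.")
    return named

print(sorted(normalize_datasets({'train': 1, 'test': 2})))  # ['test', 'train']
print(list(normalize_datasets(42)))                         # ['data']
```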
================================================
FILE: texar_repo/texar/data/data/data_iterators_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for data iterator related operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=no-member, invalid-name
import tempfile
import numpy as np
import tensorflow as tf
import texar as tx
class DataIteratorTest(tf.test.TestCase):
"""Tests data iterators.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
# Create data
train_text = list(np.linspace(1, 1000, num=1000, dtype=np.int64))
train_text = [str(x) for x in train_text]
train_text_file = tempfile.NamedTemporaryFile()
train_text_file.write('\n'.join(train_text).encode("utf-8"))
train_text_file.flush()
self._train_text_file = train_text_file
test_text = list(np.linspace(1001, 2000, num=1000, dtype=np.int64))
test_text = [str(x) for x in test_text]
test_text_file = tempfile.NamedTemporaryFile()
test_text_file.write('\n'.join(test_text).encode("utf-8"))
test_text_file.flush()
self._test_text_file = test_text_file
vocab_list = train_text + test_text
vocab_file = tempfile.NamedTemporaryFile()
vocab_file.write('\n'.join(vocab_list).encode("utf-8"))
vocab_file.flush()
self._vocab_file = vocab_file
self._vocab_size = len(vocab_list)
self._train_hparams = {
"num_epochs": 2,
"batch_size": 1,
"shuffle": False,
"dataset": {
"files": self._train_text_file.name,
"vocab_file": self._vocab_file.name,
"bos_token": '',
"eos_token": ''
},
"name": "train"
}
self._test_hparams = {
"num_epochs": 1,
"batch_size": 1,
"shuffle": False,
"dataset": {
"files": self._test_text_file.name,
"vocab_file": self._vocab_file.name,
"bos_token": '',
"eos_token": ''
},
"name": "test"
}
def test_iterator_single_dataset(self):
"""Tests iterating over a single dataset.
"""
data = tx.data.MonoTextData(self._test_hparams)
iterator = tx.data.DataIterator(data)
data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
for _ in range(2):
iterator.switch_to_dataset(sess)
i = 1001
while True:
try:
data_batch_ = sess.run(data_batch)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i))
i += 1
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
self.assertEqual(i, 2001)
break
def test_iterator_multi_datasets(self):
"""Tests iterating over multiple datasets.
"""
train_data = tx.data.MonoTextData(self._train_hparams)
test_data = tx.data.MonoTextData(self._test_hparams)
iterator = tx.data.DataIterator([train_data, test_data])
data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
for _ in range(2):
# Iterates over train data
iterator.switch_to_dataset(sess, train_data.name)
i = 0
while True:
try:
data_batch_ = sess.run(data_batch)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i+1))
i = (i+1) % 1000
except tf.errors.OutOfRangeError:
print('Train data limit reached')
self.assertEqual(i, 0)
break
# Iterates over test data
iterator.switch_to_dataset(sess, test_data.name)
i = 1001
while True:
try:
data_batch_ = sess.run(data_batch)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i))
i += 1
except tf.errors.OutOfRangeError:
print('Test data limit reached')
self.assertEqual(i, 2001)
break
def test_train_test_data_iterator(self):
"""Tests :class:`texar.data.TrainTestDataIterator`
"""
train_data = tx.data.MonoTextData(self._train_hparams)
test_data = tx.data.MonoTextData(self._test_hparams)
iterator = tx.data.TrainTestDataIterator(train=train_data,
test=test_data)
data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
for _ in range(2):
iterator.switch_to_train_data(sess)
i = 0
while True:
try:
data_batch_ = sess.run(data_batch)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i+1))
i = (i+1) % 1000
except tf.errors.OutOfRangeError:
print('Train data limit reached')
self.assertEqual(i, 0)
break
iterator.switch_to_test_data(sess)
i = 1001
while True:
try:
data_batch_ = sess.run(data_batch)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i))
i += 1
except tf.errors.OutOfRangeError:
print('Test data limit reached')
self.assertEqual(i, 2001)
break
def test_feedable_iterator_multi_datasets(self):
"""Tests iterating over multiple datasets with the
:class:`FeedableDataIterator`.
"""
train_data = tx.data.MonoTextData(self._train_hparams)
test_data = tx.data.MonoTextData(self._test_hparams)
iterator = tx.data.FeedableDataIterator([train_data, test_data])
data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
iterator.initialize_dataset(sess)
for _ in range(2):
# Iterates over train data
iterator.restart_dataset(sess, train_data.name)
data_handle = iterator.get_handle(sess, train_data.name)
i = 0
while True:
try:
feed_dict = {iterator.handle: data_handle}
data_batch_ = sess.run(data_batch, feed_dict=feed_dict)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i+1))
i = (i+1) % 1000
except tf.errors.OutOfRangeError:
print('Train data limit reached')
self.assertEqual(i, 0)
break
# Iterates over test data
iterator.restart_dataset(sess, test_data.name)
data_handle = iterator.get_handle(sess, test_data.name)
i = 1001
while True:
try:
feed_dict = {iterator.handle: data_handle}
data_batch_ = sess.run(data_batch, feed_dict=feed_dict)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i))
i += 1
except tf.errors.OutOfRangeError:
print('Test data limit reached')
self.assertEqual(i, 2001)
break
def test_train_test_feedable_data_iterator(self):
"""Tests :class:`texar.data.TrainTestFeedableDataIterator`
"""
train_data = tx.data.MonoTextData(self._train_hparams)
test_data = tx.data.MonoTextData(self._test_hparams)
iterator = tx.data.TrainTestFeedableDataIterator(train=train_data,
test=test_data)
data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
for _ in range(2):
iterator.restart_train_dataset(sess)
i = 0
while True:
try:
feed_dict = {
iterator.handle: iterator.get_train_handle(sess)
}
data_batch_ = sess.run(data_batch, feed_dict=feed_dict)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i+1))
i = (i+1) % 1000
except tf.errors.OutOfRangeError:
print('Train data limit reached')
self.assertEqual(i, 0)
break
iterator.restart_test_dataset(sess)
i = 1001
while True:
try:
feed_dict = {
iterator.handle: iterator.get_test_handle(sess)
}
data_batch_ = sess.run(data_batch, feed_dict=feed_dict)
self.assertEqual(
tf.compat.as_text(data_batch_['text'][0][0]),
str(i))
i += 1
except tf.errors.OutOfRangeError:
print('Test data limit reached')
self.assertEqual(i, 2001)
break
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/data/data/dataset_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various utilities specific to dataset processing.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import six
import tensorflow as tf
import numpy as np
from texar.utils import utils
# pylint: disable=invalid-name, too-many-arguments
__all__ = [
"_DataSpec",
"_connect_name",
"maybe_tuple",
"make_partial",
"make_chained_transformation",
"make_combined_transformation",
"random_shard_dataset",
]
class _DataSpec(object):
"""Dataset specification. Used to pass necessary info to
user-defined transformation functions.
Args:
dataset: Instance of :tf_main:`tf.data.Dataset `.
dataset_size (int): Number of data samples.
decoder: A (list of) data decoder.
vocab: A (list of) :class:`~texar.data.Vocab` instance.
embedding: A (list of) :class:`~texar.data.Embedding` instance.
**kwargs: Any remaining dataset-specific fields.
"""
def __init__(self, dataset=None, dataset_size=None, decoder=None,
vocab=None, embedding=None, **kwargs):
kwargs['dataset'] = dataset
kwargs['dataset_size'] = dataset_size
kwargs['decoder'] = decoder
kwargs['vocab'] = vocab
kwargs['embedding'] = embedding
self.__dict__.update(kwargs)
def add_spec(self, **kwargs):
"""Adds new field(s).
"""
self.__dict__.update(kwargs)
def get_ith_data_spec(self, i):
"""Returns an instance of :class:`_DataSpec` that contains the
`i`-th specifications.
"""
kwargs = {}
for k, v in six.iteritems(self.__dict__):
kwargs[k] = v[i] if isinstance(v, (tuple, list)) else v
return _DataSpec(**kwargs)
def set_ith_data_spec(self, i, data_spec, total_count):
"""Sets the `i`-th specification to respective values in
:attr:`data_spec`.
"""
for k, v in six.iteritems(data_spec.__dict__):
if k in self.__dict__:
v_ = self.__dict__[k]
if isinstance(v_, (tuple, list)):
v_[i] = v
else:
new_v_ = [v_] * total_count
new_v_[i] = v
self.__dict__[k] = new_v_
else:
v_ = [None] * total_count
v_[i] = v
self.__dict__[k] = v_
def _make_length_filter_fn(length_name, max_length):
"""Returns a predicate function which takes in a data sample
and returns a bool indicating whether the sample passes the length
filter, i.e., its length is at most :attr:`max_length`.
"""
def _filter_fn(data):
return data[length_name] <= max_length
return _filter_fn
def _make_smaller_batch_filter_fn(batch_size):
"""Returns a predicate function which takes in batched data
and returns a bool indicating whether the batch is of size
:attr:`batch_size`.
"""
def _filter_fn(data):
if isinstance(data, (list, tuple)):
return _filter_fn(data[0])
elif isinstance(data, dict):
return _filter_fn(data[next(iter(data))])
else:
return tf.equal(tf.shape(data)[0], batch_size)
return _filter_fn
def _make_combined_filter_fn(filter_fns, mode="and"):
"""Returns a new predicate function that combines multiple
predicate functions with certain mode.
Returns `None` if all elements in :attr:`filter_fns` are `None`.
Args:
filter_fns (list): Filter functions to combine. `None` functions are
ignored.
mode (str): A mode from `{"and", "or"}`.
"""
if not any(filter_fns):
return None
def _combined_fn(data):
outputs = []
for fn in filter_fns:
if fn:
outputs.append(fn(data))
if mode == "and":
return tf.reduce_all(outputs)
elif mode == "or":
return tf.reduce_any(outputs)
else:
raise ValueError("Unknown mode: {}".format(mode))
return _combined_fn
def _connect_name(lhs_name, rhs_name):
if not lhs_name:
return rhs_name
if not rhs_name:
return lhs_name
return "{}_{}".format(lhs_name, rhs_name)
def maybe_tuple(data):
"""Returns `tuple(data)` if :attr:`data` contains more than one element.
Used to wrap `map_func` inputs.
"""
data = tuple(data)
data = data if len(data) > 1 else data[0]
return data
def make_partial(fn, *args, **kwargs):
"""Returns a new function with single argument by freezing other arguments
of :attr:`fn`.
"""
def _new_fn(data):
return fn(data, *args, **kwargs)
return _new_fn
def name_prefix_fn(name_prefix):
"""Returns a function that appends a prefix to field names.
"""
def _prefix_fn(data):
transformed_data = {}
for name, value in six.iteritems(data):
new_name = _connect_name(name_prefix, name)
transformed_data[new_name] = value
return transformed_data
return _prefix_fn
def make_chained_transformation(tran_fns, *args, **kwargs):
"""Returns a dataset transformation function that applies a list of
transformations sequentially.
Args:
tran_fns (list): A list of dataset transformation functions.
*args: Extra arguments for each of the transformation function.
**kwargs: Extra keyword arguments for each of the transformation
function.
Returns:
A transformation function to be used in
:tf_main:`tf.data.Dataset.map `.
"""
def _chained_fn(data):
for tran_fns_i in tran_fns:
data = tran_fns_i(data, *args, **kwargs)
return data
return _chained_fn
def make_combined_transformation(tran_fns, name_prefix=None, *args, **kwargs):
"""Returns a dataset transformation function that applies
transformations to each component of the data.
The data to be transformed must be a tuple of the same length
as :attr:`tran_fns`.
Args:
tran_fns (list): A list of elements where each element is a
transformation function or a list of transformation functions.
name_prefix (list, optional): Prefix to the field names of each
component of the data, to prevent fields with the same name
in different components from overriding each other. If not `None`,
must be of the same length as :attr:`tran_fns`.
*args: Extra arguments for each of the transformation function.
**kwargs: Extra keyword arguments for each of the transformation
function.
Returns:
A transformation function to be used in
:tf_main:`tf.data.Dataset.map `.
"""
if name_prefix and len(name_prefix) != len(tran_fns):
raise ValueError("`name_prefix`, if provided, must be of the same "
"length as `tran_fns`.")
def _combined_fn(data):
transformed_data = {}
for i, tran_fns_i in enumerate(tran_fns):
data_i = data[i]
# Process data_i
if not isinstance(tran_fns_i, (list, tuple)):
tran_fns_i = [tran_fns_i]
for tran_fns_ij in tran_fns_i:
data_i = tran_fns_ij(data_i, *args, **kwargs)
# Add to dict by appending name prefix
for name, value in six.iteritems(data_i):
new_name = name
if name_prefix:
new_name = _connect_name(name_prefix[i], name)
if new_name in transformed_data:
raise ValueError(
"Field name already exists: {}".format(new_name))
transformed_data[new_name] = value
return transformed_data
return _combined_fn
def random_shard_dataset(dataset_size, shard_size, seed=None):
"""Returns a dataset transformation function that randomly shards a
dataset.
"""
num_shards = utils.ceildiv(dataset_size, shard_size)
boundaries = np.linspace(0, dataset_size, num=num_shards, endpoint=False,
dtype=np.int64) #pylint: disable=no-member
def _shard_fn(dataset):
sharded_dataset = (
tf.data.Dataset.from_tensor_slices(boundaries)
.shuffle(num_shards, seed=seed)
.flat_map(lambda lb: dataset.skip(lb).take(shard_size)))
return sharded_dataset
return _shard_fn
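`make_chained_transformation` above composes per-element functions into a single map function before handing it to `tf.data.Dataset.map`. The composition itself is framework-free, so it can be exercised on plain Python values; this sketch reuses the same implementation outside TF, with illustrative lambda transformations.

```python
# Standalone copy of make_chained_transformation's composition logic:
# apply each transformation in order, threading the result through.
def make_chained_transformation(tran_fns, *args, **kwargs):
    def _chained_fn(data):
        for fn in tran_fns:
            data = fn(data, *args, **kwargs)
        return data
    return _chained_fn

# Same transformations as the unit test below, without a TF session:
chained = make_chained_transformation([
    lambda x: x + 100,
    lambda x: x + 1000,
    lambda x: x + 10000,
])
print([chained(x) for x in range(3)])  # [11100, 11101, 11102]
```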
================================================
FILE: texar_repo/texar/data/data/dataset_utils_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for data utils.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
from texar.data.data import dataset_utils as dsutils
# pylint: disable=invalid-name
class TransformationTest(tf.test.TestCase):
"""Tests various transformation utilities.
"""
def test_make_chained_transformation(self):
"""Tests :func:`texar.data.make_chained_transformation`
"""
original_data = np.arange(0, 10)
dataset = tf.data.Dataset.from_tensor_slices(original_data)
def _tran_a(data):
return data + 100
def _tran_b(data):
return data + 1000
def _tran_c(data):
return data + 10000
chained_tran = dsutils.make_chained_transformation(
[_tran_a, _tran_b, _tran_c])
dataset = dataset.map(chained_tran)
iterator = dataset.make_one_shot_iterator()
elem = iterator.get_next()
with self.test_session() as sess:
data_ = []
while True:
try:
data_.append(sess.run(elem))
except tf.errors.OutOfRangeError:
break
self.assertEqual(len(data_), len(original_data))
data_ = [elem_ - 11100 for elem_ in data_]
self.assertEqual(data_, original_data.tolist())
if __name__ == "__main__":
tf.test.main()
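`random_shard_dataset` in `dataset_utils.py` derives its shard start offsets from `utils.ceildiv` and `np.linspace(..., endpoint=False)` truncated to integers, then reads `shard_size` items from each offset. A stdlib-only sketch of that boundary arithmetic (no TF or NumPy); note that when `shard_size` does not divide `dataset_size`, adjacent shards overlap.

```python
# Boundary arithmetic mirroring random_shard_dataset:
# num_shards = ceildiv(dataset_size, shard_size); boundaries follow
# linspace(0, dataset_size, num_shards, endpoint=False) truncated to ints.
def shard_boundaries(dataset_size, shard_size):
    num_shards = -(-dataset_size // shard_size)  # ceildiv
    step = dataset_size / num_shards             # linspace step (endpoint=False)
    return [int(i * step) for i in range(num_shards)]

print(shard_boundaries(10, 4))  # [0, 3, 6] -> shards [0:4], [3:7], [6:10] overlap
print(shard_boundaries(12, 4))  # [0, 4, 8] -> shards tile exactly
```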
================================================
FILE: texar_repo/texar/data/data/mono_text_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Mono text data class that defines data reading, parsing, batching, and other
preprocessing operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar.utils import utils
from texar.utils.dtypes import is_callable
from texar.data.data_utils import count_file_lines
from texar.data.data import dataset_utils as dsutils
from texar.data.data.text_data_base import TextDataBase
from texar.data.data_decoders import TextDataDecoder, VarUttTextDataDecoder
from texar.data.vocabulary import Vocab, SpecialTokens
from texar.data.embedding import Embedding
# pylint: disable=invalid-name, arguments-differ, protected-access, no-member
__all__ = [
"_default_mono_text_dataset_hparams",
"MonoTextData"
]
class _LengthFilterMode(object): # pylint: disable=no-init, too-few-public-methods
"""Options of length filter mode.
"""
TRUNC = "truncate"
DISCARD = "discard"
def _default_mono_text_dataset_hparams():
"""Returns hyperparameters of a mono text dataset with default values.
See :meth:`texar.MonoTextData.default_hparams` for details.
"""
return {
"files": [],
"compression_type": None,
"vocab_file": "",
"embedding_init": Embedding.default_hparams(),
"delimiter": " ",
"max_seq_length": None,
"length_filter_mode": "truncate",
"pad_to_max_seq_length": False,
"bos_token": SpecialTokens.BOS,
"eos_token": SpecialTokens.EOS,
"other_transformations": [],
"variable_utterance": False,
"utterance_delimiter": "|||",
"max_utterance_cnt": 5,
"data_name": None,
"@no_typecheck": ["files"]
}
class MonoTextData(TextDataBase):
"""Text data processor that reads a single set of text files. This can be
used for, e.g., language models, auto-encoders, etc.
Args:
hparams: A `dict` or instance of :class:`~texar.HParams` containing
hyperparameters. See :meth:`default_hparams` for the defaults.
By default, the processor reads raw data files, performs tokenization,
batching and other pre-processing steps, and results in a TF Dataset
whose element is a python `dict` including three fields:
- "text":
A string Tensor of shape `[batch_size, max_time]` containing
the **raw** text tokens. `max_time` is the length of the longest
sequence in the batch.
Short sequences in the batch are padded with **empty string**.
BOS and EOS tokens are added as per
:attr:`hparams`. Out-of-vocabulary tokens are **NOT** replaced
with UNK.
- "text_ids":
An `int64` Tensor of shape `[batch_size, max_time]`
containing the token indexes.
- "length":
An `int` Tensor of shape `[batch_size]` containing the
length of each sequence in the batch (including BOS and
EOS if added).
If :attr:`'variable_utterance'` is set to `True` in :attr:`hparams`, the
resulting dataset has elements with four fields:
- "text":
A string Tensor of shape
`[batch_size, max_utterance, max_time]`, where *max_utterance* is
either the maximum number of utterances in each element of the
batch, or :attr:`max_utterance_cnt` as specified in :attr:`hparams`.
- "text_ids":
An `int64` Tensor of shape
`[batch_size, max_utterance, max_time]` containing the token
indexes.
- "length":
An `int` Tensor of shape `[batch_size, max_utterance]`
containing the length of each sequence in the batch.
- "utterance_cnt":
An `int` Tensor of shape `[batch_size]` containing
the number of utterances of each element in the batch.
The above field names can be accessed through :attr:`text_name`,
:attr:`text_id_name`, :attr:`length_name`, and
:attr:`utterance_cnt_name`, respectively.
Example:
.. code-block:: python
hparams={
'dataset': { 'files': 'data.txt', 'vocab_file': 'vocab.txt' },
'batch_size': 1
}
data = MonoTextData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()
iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
# 'text': [['', 'example', 'sequence', '']],
# 'text_ids': [[1, 5, 10, 2]],
# 'length': [4]
# }
"""
def __init__(self, hparams):
TextDataBase.__init__(self, hparams)
with tf.name_scope(self.name, self.default_hparams()["name"]):
self._make_data()
@staticmethod
def default_hparams():
"""Returns a dictionary of default hyperparameters:
.. code-block:: python
{
# (1) Hyperparams specific to text dataset
"dataset": {
"files": [],
"compression_type": None,
"vocab_file": "",
"embedding_init": {},
"delimiter": " ",
"max_seq_length": None,
"length_filter_mode": "truncate",
"pad_to_max_seq_length": False,
"bos_token": "",
"eos_token": "",
"other_transformations": [],
"variable_utterance": False,
"utterance_delimiter": "|||",
"max_utterance_cnt": 5,
"data_name": None,
}
# (2) General hyperparams
"num_epochs": 1,
"batch_size": 64,
"allow_smaller_final_batch": True,
"shuffle": True,
"shuffle_buffer_size": None,
"shard_and_shuffle": False,
"num_parallel_calls": 1,
"prefetch_buffer_size": 0,
"max_dataset_size": -1,
"seed": None,
"name": "mono_text_data",
# (3) Bucketing
"bucket_boundaries": [],
"bucket_batch_sizes": None,
"bucket_length_fn": None,
}
Here:
1. For the hyperparameters in the :attr:`"dataset"` field:
"files" : str or list
A (list of) text file path(s).
Each line contains a single text sequence.
"compression_type" : str, optional
One of "" (no compression), "ZLIB", or "GZIP".
"vocab_file": str
Path to vocabulary file. Each line of the file should contain
one vocabulary token.
Used to create an instance of :class:`~texar.data.Vocab`.
"embedding_init" : dict
The hyperparameters for pre-trained embedding loading and
initialization.
The structure and default values are defined in
:meth:`texar.data.Embedding.default_hparams`.
"delimiter" : str
The delimiter to split each line of the text files into tokens.
"max_seq_length" : int, optional
Maximum length of output sequences. Data samples exceeding the
length will be truncated or discarded according to
:attr:`"length_filter_mode"`. The length does not include
any added
:attr:`"bos_token"` or :attr:`"eos_token"`. If `None` (default),
no filtering is performed.
"length_filter_mode" : str
Either "truncate" or "discard". If "truncate" (default),
tokens exceeding the :attr:`"max_seq_length"` will be truncated.
If "discard", data samples longer than the
:attr:`"max_seq_length"`
will be discarded.
"pad_to_max_seq_length" : bool
If `True`, pad all data instances to length
:attr:`"max_seq_length"`.
Raises error if :attr:`"max_seq_length"` is not provided.
"bos_token" : str
The Begin-Of-Sequence token prepended to each sequence.
Set to an empty string to avoid prepending.
"eos_token" : str
The End-Of-Sequence token appended to each sequence.
Set to an empty string to avoid appending.
"other_transformations" : list
A list of transformation functions or function names/paths to
further transform each single data instance.
(More documentation to be added.)
"variable_utterance" : bool
If `True`, each line of the text file is considered to contain
multiple sequences (utterances) separated by
:attr:`"utterance_delimiter"`.
For example, in dialog data, each line can contain a series of
dialog history utterances. See the example in
`examples/hierarchical_dialog` for a use case.
"utterance_delimiter" : str
The delimiter used to split each line into utterances. Should
not be the same as :attr:`"delimiter"`. Used only when
:attr:`"variable_utterance"``==True`.
"max_utterance_cnt" : int
Maximally allowed number of utterances in a data instance.
Extra utterances are truncated out.
"data_name" : str
Name of the dataset.
2. For the **general** hyperparameters, see
:meth:`texar.data.DataBase.default_hparams` for details.
3. **Bucketing** is to group elements of the dataset together by length
and then pad and batch. (See more at
:tf_main:`bucket_by_sequence_length
`). For bucketing
hyperparameters:
"bucket_boundaries" : list
An int list containing the upper length boundaries of the
buckets.
Set to an empty list (default) to disable bucketing.
"bucket_batch_sizes" : list
An int list containing batch size per bucket. Length should be
`len(bucket_boundaries) + 1`.
If `None`, every bucket will have the same batch size specified
in :attr:`batch_size`.
"bucket_length_fn" : str or callable
A function that maps a dataset element to a `tf.int32` scalar,
determining the length of the element.
This can be a function, or the name or full module path to the
function. If function name is given, the function must be in the
:mod:`texar.custom` module.
If `None` (default), length is determined by the number of
tokens (including BOS and EOS if added) of the element.
"""
hparams = TextDataBase.default_hparams()
hparams["name"] = "mono_text_data"
hparams.update({
"dataset": _default_mono_text_dataset_hparams()
})
return hparams
@staticmethod
def make_vocab(hparams):
"""Reads vocab file and returns an instance of
:class:`texar.data.Vocab`.
"""
bos_token = utils.default_str(
hparams["bos_token"], SpecialTokens.BOS)
eos_token = utils.default_str(
hparams["eos_token"], SpecialTokens.EOS)
vocab = Vocab(hparams["vocab_file"],
bos_token=bos_token, eos_token=eos_token)
return vocab
@staticmethod
def make_embedding(emb_hparams, token_to_id_map):
"""Optionally loads embedding from file (if provided), and returns
an instance of :class:`texar.data.Embedding`.
"""
embedding = None
if emb_hparams["file"] is not None and len(emb_hparams["file"]) > 0:
embedding = Embedding(token_to_id_map, emb_hparams)
return embedding
@staticmethod
def _make_mono_text_dataset(dataset_hparams):
dataset = tf.data.TextLineDataset(
dataset_hparams["files"],
compression_type=dataset_hparams["compression_type"])
return dataset
@staticmethod
def _make_other_transformations(other_trans_hparams, data_spec):
"""Creates a list of transformation functions based on the
hyperparameters.
Args:
other_trans_hparams (list): A list of transformation functions,
names, or full paths.
data_spec: An instance of :class:`texar.data._DataSpec` to
be passed to transformation functions.
Returns:
A list of transformation functions.
"""
other_trans = []
for tran in other_trans_hparams:
if not is_callable(tran):
tran = utils.get_function(tran, ["texar.custom"])
other_trans.append(dsutils.make_partial(tran, data_spec))
return other_trans
@staticmethod
def _make_processor(dataset_hparams, data_spec, chained=True,
name_prefix=None):
# Create data decoder
max_seq_length = None
if dataset_hparams["length_filter_mode"] == "truncate":
max_seq_length = dataset_hparams["max_seq_length"]
if not dataset_hparams["variable_utterance"]:
decoder = TextDataDecoder(
delimiter=dataset_hparams["delimiter"],
bos_token=dataset_hparams["bos_token"],
eos_token=dataset_hparams["eos_token"],
max_seq_length=max_seq_length,
token_to_id_map=data_spec.vocab.token_to_id_map)
else:
decoder = VarUttTextDataDecoder( # pylint: disable=redefined-variable-type
sentence_delimiter=dataset_hparams["utterance_delimiter"],
delimiter=dataset_hparams["delimiter"],
bos_token=dataset_hparams["bos_token"],
eos_token=dataset_hparams["eos_token"],
max_seq_length=max_seq_length,
max_utterance_cnt=dataset_hparams["max_utterance_cnt"],
token_to_id_map=data_spec.vocab.token_to_id_map)
# Create other transformations
data_spec.add_spec(decoder=decoder)
other_trans = MonoTextData._make_other_transformations(
dataset_hparams["other_transformations"], data_spec)
if name_prefix:
other_trans.append(dsutils.name_prefix_fn(name_prefix))
data_spec.add_spec(name_prefix=name_prefix)
if chained:
chained_tran = dsutils.make_chained_transformation(
[decoder] + other_trans)
return chained_tran, data_spec
else:
return decoder, other_trans, data_spec
@staticmethod
def _make_length_filter(dataset_hparams, length_name, decoder):
filter_mode = dataset_hparams["length_filter_mode"]
max_length = dataset_hparams["max_seq_length"]
filter_fn = None
if filter_mode == _LengthFilterMode.DISCARD and max_length is not None:
max_length += decoder.added_length
filter_fn = dsutils._make_length_filter_fn(length_name,
max_length)
return filter_fn
def _process_dataset(self, dataset, hparams, data_spec):
chained_tran, data_spec = self._make_processor(
hparams["dataset"], data_spec,
name_prefix=hparams["dataset"]["data_name"])
num_parallel_calls = hparams["num_parallel_calls"]
dataset = dataset.map(
lambda *args: chained_tran(dsutils.maybe_tuple(args)),
num_parallel_calls=num_parallel_calls)
# Filters by length
length_name = dsutils._connect_name(
data_spec.name_prefix,
data_spec.decoder.length_tensor_name)
filter_fn = self._make_length_filter(
hparams["dataset"], length_name, data_spec.decoder)
if filter_fn:
dataset = dataset.filter(filter_fn)
# Truncates data count
dataset = dataset.take(hparams["max_dataset_size"])
return dataset, data_spec
def _make_bucket_length_fn(self):
length_fn = self._hparams.bucket_length_fn
if not length_fn:
length_fn = lambda x: x[self.length_name]
elif not is_callable(length_fn):
# pylint: disable=redefined-variable-type
length_fn = utils.get_function(length_fn, ["texar.custom"])
return length_fn
@staticmethod
def _make_padded_text_and_id_shapes(dataset, dataset_hparams, decoder,
text_name, text_id_name):
max_length = dataset_hparams['max_seq_length']
if max_length is None:
raise ValueError("hparams 'max_seq_length' must be specified "
"when 'pad_to_max_seq_length' is True.")
max_length += decoder.added_length
padded_shapes = dataset.output_shapes
def _get_new_shape(name):
dim = len(padded_shapes[name])
if not dataset_hparams['variable_utterance']:
if dim != 1:
raise ValueError(
"Unable to pad data '%s' to max seq length. Expected "
"1D Tensor, but got %dD Tensor." % (name, dim))
return tf.TensorShape(max_length)
else:
if dim != 2:
raise ValueError(
"Unable to pad data '%s' to max seq length. Expected "
"2D Tensor, but got %dD Tensor." % (name, dim))
return tf.TensorShape([padded_shapes[name][0], max_length])
text_and_id_shapes = {}
if text_name in padded_shapes:
text_and_id_shapes[text_name] = _get_new_shape(text_name)
if text_id_name in padded_shapes:
text_and_id_shapes[text_id_name] = _get_new_shape(text_id_name)
return text_and_id_shapes
def _make_padded_shapes(self, dataset, decoder):
if not self._hparams.dataset.pad_to_max_seq_length:
return None
text_and_id_shapes = MonoTextData._make_padded_text_and_id_shapes(
dataset, self._hparams.dataset, decoder,
self.text_name, self.text_id_name)
padded_shapes = dataset.output_shapes
padded_shapes.update(text_and_id_shapes)
return padded_shapes
def _make_data(self):
dataset_hparams = self._hparams.dataset
# Create vocab and embedding
self._vocab = self.make_vocab(dataset_hparams)
self._embedding = self.make_embedding(
dataset_hparams["embedding_init"], self._vocab.token_to_id_map_py)
# Create and shuffle dataset
dataset = self._make_mono_text_dataset(dataset_hparams)
dataset, dataset_size = self._shuffle_dataset(
dataset, self._hparams, self._hparams.dataset.files)
self._dataset_size = dataset_size
# Processing
data_spec = dsutils._DataSpec(dataset=dataset,
dataset_size=self._dataset_size,
vocab=self._vocab,
embedding=self._embedding)
dataset, data_spec = self._process_dataset(dataset, self._hparams,
data_spec)
self._data_spec = data_spec
self._decoder = data_spec.decoder
# Batching
length_fn = self._make_bucket_length_fn()
padded_shapes = self._make_padded_shapes(dataset, self._decoder)
dataset = self._make_batch(
dataset, self._hparams, length_fn, padded_shapes)
# Prefetching
if self._hparams.prefetch_buffer_size > 0:
dataset = dataset.prefetch(self._hparams.prefetch_buffer_size)
self._dataset = dataset
def list_items(self):
"""Returns the list of item names that the data can produce.
Returns:
A list of strings.
"""
return list(self._dataset.output_types.keys())
@property
def dataset(self):
"""The dataset, an instance of
:tf_main:`TF dataset `.
"""
return self._dataset
def dataset_size(self):
"""Returns the number of data instances in the data files.
Note that this is the total data count in the raw files, before any
filtering and truncation.
"""
if not self._dataset_size:
# pylint: disable=attribute-defined-outside-init
self._dataset_size = count_file_lines(
self._hparams.dataset.files)
return self._dataset_size
@property
def vocab(self):
"""The vocabulary, an instance of :class:`~texar.data.Vocab`.
"""
return self._vocab
@property
def embedding_init_value(self):
"""The `Tensor` containing the embedding value loaded from file.
`None` if embedding is not specified.
"""
if self._embedding is None:
return None
return self._embedding.word_vecs
@property
def text_name(self):
"""The name of text tensor, "text" by default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix,
self._data_spec.decoder.text_tensor_name)
return name
@property
def length_name(self):
"""The name of length tensor, "length" by default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix,
self._data_spec.decoder.length_tensor_name)
return name
@property
def text_id_name(self):
"""The name of text index tensor, "text_ids" by default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix,
self._data_spec.decoder.text_id_tensor_name)
return name
@property
def utterance_cnt_name(self):
"""The name of utterance count tensor, "utterance_cnt" by default.
"""
if not self._hparams.dataset.variable_utterance:
raise ValueError("`utterance_cnt_name` is not defined.")
name = dsutils._connect_name(
self._data_spec.name_prefix,
self._data_spec.decoder.utterance_cnt_tensor_name)
return name
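The `text_name`, `length_name`, and `text_id_name` properties above join an optional `data_name` prefix with the decoder's tensor name, which is why the tests in the next file expect items like `"data_text_ids"` when `data_name="data"`. A minimal sketch of that joining behavior (`connect_name` is a hypothetical stand-in for `dsutils._connect_name`):

```python
def connect_name(prefix, name):
    """Join an optional data_name prefix with a tensor name, as in list_items."""
    if not prefix:
        return name
    return "{}_{}".format(prefix, name)

print(connect_name("data", "text_ids"))  # data_text_ids
print(connect_name(None, "length"))      # length
```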
================================================
FILE: texar_repo/texar/data/data/mono_text_data_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for data related operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tempfile
import copy
import numpy as np
import tensorflow as tf
import texar as tx
# pylint: disable=too-many-locals, protected-access, too-many-branches
# pylint: disable=invalid-name
class MonoTextDataTest(tf.test.TestCase):
"""Tests text data class.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
# Create test data
vocab_list = ['word', '词']
vocab_file = tempfile.NamedTemporaryFile()
vocab_file.write('\n'.join(vocab_list).encode("utf-8"))
vocab_file.flush()
self._vocab_file = vocab_file
self._vocab_size = len(vocab_list)
text = ['This is a test sentence .', '词 词 。']
text_file = tempfile.NamedTemporaryFile()
text_file.write('\n'.join(text).encode("utf-8"))
text_file.flush()
self._text_file = text_file
self._hparams = {
"num_epochs": 50,
"batch_size": 3,
"dataset": {
"files": self._text_file.name,
"vocab_file": self._vocab_file.name,
}
}
def _run_and_test(self,
hparams,
test_batch_size=False,
length_inc=None):
# Construct database
text_data = tx.data.MonoTextData(hparams)
self.assertEqual(text_data.vocab.size,
self._vocab_size + len(text_data.vocab.special_tokens))
iterator = text_data.dataset.make_initializable_iterator()
text_data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
while True:
try:
data_batch_ = sess.run(text_data_batch)
self.assertEqual(set(data_batch_.keys()),
set(text_data.list_items()))
if test_batch_size:
self.assertEqual(len(data_batch_['text']),
hparams['batch_size'])
if length_inc:
for i in range(len(data_batch_['text'])):
text_ = data_batch_['text'][i].tolist()
self.assertEqual(
text_.index(b'') + 1,
data_batch_['length'][i] - length_inc)
max_seq_length = text_data.hparams.dataset.max_seq_length
mode = text_data.hparams.dataset.length_filter_mode
if max_seq_length == 6:
max_l = max_seq_length
max_l += text_data._decoder.added_length
for length in data_batch_['length']:
self.assertLessEqual(length, max_l)
if mode == "discard":
for length in data_batch_['length']:
self.assertEqual(length, 5)
elif mode == "truncate":
num_length_6 = 0
for length in data_batch_['length']:
num_length_6 += int(length == 6)
self.assertGreater(num_length_6, 0)
else:
raise ValueError("Unknown mode: %s" % mode)
if text_data.hparams.dataset.pad_to_max_seq_length:
max_l = max_seq_length + text_data._decoder.added_length
for x in data_batch_['text']:
self.assertEqual(len(x), max_l)
for x in data_batch_['text_ids']:
self.assertEqual(len(x), max_l)
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
break
def test_default_setting(self):
"""Tests the logic of MonoTextData.
"""
self._run_and_test(self._hparams)
def test_batching(self):
"""Tests different batching.
"""
# dis-allow smaller final batch
hparams = copy.copy(self._hparams)
hparams.update({"allow_smaller_final_batch": False})
self._run_and_test(hparams, test_batch_size=True)
def test_bucketing(self):
"""Tests bucketing.
"""
hparams = copy.copy(self._hparams)
hparams.update({
"bucket_boundaries": [7],
"bucket_batch_sizes": [6, 4]})
text_data = tx.data.MonoTextData(hparams)
iterator = text_data.dataset.make_initializable_iterator()
text_data_batch = iterator.get_next()
hparams.update({
"bucket_boundaries": [7],
"bucket_batch_sizes": [7, 7],
"allow_smaller_final_batch": False})
text_data_1 = tx.data.MonoTextData(hparams)
iterator_1 = text_data_1.dataset.make_initializable_iterator()
text_data_batch_1 = iterator_1.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
sess.run(iterator_1.initializer)
while True:
try:
# Run the logics
data_batch_, data_batch_1_ = sess.run(
[text_data_batch, text_data_batch_1])
length_ = data_batch_['length'][0]
if length_ < 7:
last_batch_size = hparams['num_epochs'] % 6
self.assertTrue(
len(data_batch_['text']) == 6 or
len(data_batch_['text']) == last_batch_size)
else:
last_batch_size = hparams['num_epochs'] % 4
self.assertTrue(
len(data_batch_['text']) == 4 or
len(data_batch_['text']) == last_batch_size)
self.assertEqual(len(data_batch_1_['text']), 7)
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
break
def test_shuffle(self):
"""Tests different shuffle strategies.
"""
hparams = copy.copy(self._hparams)
hparams.update({
"shard_and_shuffle": True,
"shuffle_buffer_size": 1})
self._run_and_test(hparams)
def test_prefetch(self):
"""Tests prefetching.
"""
hparams = copy.copy(self._hparams)
hparams.update({"prefetch_buffer_size": 2})
self._run_and_test(hparams)
def test_other_transformations(self):
"""Tests use of other transformations
"""
def _transform(x, data_specs): # pylint: disable=invalid-name
x[data_specs.decoder.length_tensor_name] += 1
return x
hparams = copy.copy(self._hparams)
hparams["dataset"].update(
{"other_transformations": [_transform, _transform]})
self._run_and_test(hparams, length_inc=2)
def test_list_items(self):
"""Tests the item names of the output data.
"""
text_data = tx.data.MonoTextData(self._hparams)
self.assertSetEqual(set(text_data.list_items()),
{"text", "text_ids", "length"})
hparams = copy.copy(self._hparams)
hparams["dataset"]["data_name"] = "data"
text_data = tx.data.MonoTextData(hparams)
self.assertSetEqual(set(text_data.list_items()),
{"data_text", "data_text_ids", "data_length"})
def test_length_discard(self):
"""Tests discarding lengthy sequences.
"""
hparams = copy.copy(self._hparams)
hparams["dataset"].update({"max_seq_length": 4,
"length_filter_mode": "discard"})
self._run_and_test(hparams)
def test_length_truncate(self):
"""Tests truncation.
"""
hparams = copy.copy(self._hparams)
hparams["dataset"].update({"max_seq_length": 4,
"length_filter_mode": "truncate"})
hparams["shuffle"] = False
hparams["allow_smaller_final_batch"] = False
self._run_and_test(hparams)
def test_pad_to_max_length(self):
"""Tests padding.
"""
hparams = copy.copy(self._hparams)
hparams["dataset"].update({"max_seq_length": 10,
"length_filter_mode": "truncate",
"pad_to_max_seq_length": True})
self._run_and_test(hparams)
class VarUttMonoTextDataTest(tf.test.TestCase):
"""Tests variable utterance text data class.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
# Create test data
vocab_list = ['word', 'sentence', '词', 'response', 'dialog', '1', '2']
vocab_file = tempfile.NamedTemporaryFile()
vocab_file.write('\n'.join(vocab_list).encode("utf-8"))
vocab_file.flush()
self._vocab_file = vocab_file
self._vocab_size = len(vocab_list)
text = [
'This is a dialog 1 sentence . ||| This is a dialog 1 sentence . '
'||| This is yet another dialog 1 sentence .', #//
'This is a dialog 2 sentence . ||| '
'This is also a dialog 2 sentence . ', #//
'词 词 词 ||| word', #//
'This This', #//
'1 1 1 ||| 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ||| 1 1 1 ||| 2'
]
text_file = tempfile.NamedTemporaryFile()
text_file.write('\n'.join(text).encode("utf-8"))
text_file.flush()
self._text_file = text_file
self._hparams = {
"num_epochs": 50,
"batch_size": 3,
"shuffle": False,
"dataset": {
"files": self._text_file.name,
"vocab_file": self._vocab_file.name,
"variable_utterance": True,
"max_utterance_cnt": 3,
"max_seq_length": 10
}
}
def _run_and_test(self, hparams):
# Construct database
text_data = tx.data.MonoTextData(hparams)
self.assertEqual(text_data.vocab.size,
self._vocab_size + len(text_data.vocab.special_tokens))
iterator = text_data.dataset.make_initializable_iterator()
text_data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
while True:
try:
# Run the logics
data_batch_ = sess.run(text_data_batch)
self.assertEqual(set(data_batch_.keys()),
set(text_data.list_items()))
# Test utterance count
utt_ind = np.sum(data_batch_["text_ids"], 2) != 0
utt_cnt = np.sum(utt_ind, 1)
self.assertListEqual(
data_batch_[text_data.utterance_cnt_name].tolist(),
utt_cnt.tolist())
if text_data.hparams.dataset.pad_to_max_seq_length:
max_l = text_data.hparams.dataset.max_seq_length
max_l += text_data._decoder.added_length
for x in data_batch_['text']:
for xx in x:
self.assertEqual(len(xx), max_l)
for x in data_batch_['text_ids']:
for xx in x:
self.assertEqual(len(xx), max_l)
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
break
def test_default_setting(self):
"""Tests the logic of the text data.
"""
self._run_and_test(self._hparams)
def test_pad_to_max_length(self):
"""Tests padding.
"""
hparams = copy.copy(self._hparams)
hparams["dataset"].update({"max_seq_length": 20,
"length_filter_mode": "truncate",
"pad_to_max_seq_length": True})
self._run_and_test(hparams)
if __name__ == "__main__":
tf.test.main()
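The two `length_filter_mode` behaviors these tests exercise ("truncate" vs "discard") can be illustrated with a pure-Python sketch over token lists; `filter_by_length` is a hypothetical helper for illustration, not part of the Texar API:

```python
def filter_by_length(sequences, max_seq_length, mode="truncate"):
    """Truncate over-long token sequences, or drop them entirely."""
    if mode == "truncate":
        return [seq[:max_seq_length] for seq in sequences]
    if mode == "discard":
        return [seq for seq in sequences if len(seq) <= max_seq_length]
    raise ValueError("Unknown length_filter_mode: %s" % mode)

seqs = [["a", "b", "c"], ["a", "b", "c", "d", "e"]]
print(filter_by_length(seqs, 4, "truncate"))  # second sequence cut to 4 tokens
print(filter_by_length(seqs, 4, "discard"))   # second sequence dropped
```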
================================================
FILE: texar_repo/texar/data/data/multi_aligned_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Data consisting of multiple aligned parts.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import copy
import tensorflow as tf
from texar.hyperparams import HParams
from texar.utils import utils
from texar.utils.dtypes import is_str, is_callable
from texar.data.data.text_data_base import TextDataBase
from texar.data.data.scalar_data import ScalarData
from texar.data.data.mono_text_data import _default_mono_text_dataset_hparams
from texar.data.data.scalar_data import _default_scalar_dataset_hparams
from texar.data.data.mono_text_data import MonoTextData
from texar.data.data_utils import count_file_lines
from texar.data.data import dataset_utils as dsutils
from texar.data.vocabulary import Vocab, SpecialTokens
from texar.data.embedding import Embedding
# pylint: disable=invalid-name, arguments-differ
# pylint: disable=protected-access, too-many-instance-attributes
__all__ = [
"_default_dataset_hparams",
"MultiAlignedData"
]
class _DataTypes(object): # pylint: disable=no-init, too-few-public-methods
"""Enumeration of data types.
"""
TEXT = "text"
INT = "int"
FLOAT = "float"
def _is_text_data(data_type):
return data_type == _DataTypes.TEXT
def _is_scalar_data(data_type):
return data_type == _DataTypes.INT or data_type == _DataTypes.FLOAT
def _default_dataset_hparams(data_type=None):
"""Returns hyperparameters of a dataset with default values.
See :meth:`texar.data.MultiAlignedData.default_hparams` for details.
"""
if not data_type or _is_text_data(data_type):
hparams = _default_mono_text_dataset_hparams()
hparams.update({
"data_type": _DataTypes.TEXT,
"vocab_share_with": None,
"embedding_init_share_with": None,
"processing_share_with": None,
})
elif _is_scalar_data(data_type):
hparams = _default_scalar_dataset_hparams()
return hparams
class MultiAlignedData(TextDataBase):
"""Data consisting of multiple aligned parts.
Args:
hparams (dict): Hyperparameters. See :meth:`default_hparams` for the
defaults.
The processor can read any number of parallel fields as specified in
the "datasets" list of :attr:`hparams`, and result in a TF Dataset whose
element is a python `dict` containing data fields from each of the
specified datasets. Fields from a text dataset have names prefixed by
its "data_name". Fields from a scalar dataset are specified by its
"data_name".
Example:
.. code-block:: python
hparams={
'datasets': [
{'files': 'a.txt', 'vocab_file': 'v.a', 'data_name': 'x'},
{'files': 'b.txt', 'vocab_file': 'v.b', 'data_name': 'y'},
{'files': 'c.txt', 'data_type': 'int', 'data_name': 'z'}
]
'batch_size': 1
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()
iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
# 'x_text': [['', 'x', 'sequence', '']],
# 'x_text_ids': [['1', '5', '10', '2']],
# 'x_length': [4]
# 'y_text': [['', 'y', 'sequence', '1', '']],
# 'y_text_ids': [['1', '6', '10', '20', '2']],
# 'y_length': [5],
# 'z': [1000]
# }
"""
def __init__(self, hparams):
TextDataBase.__init__(self, hparams)
# Defaultizes hparams of each dataset
datasets_hparams = self._hparams.datasets
defaultized_datasets_hparams = []
for ds_hpms in datasets_hparams:
data_type = ds_hpms.get("data_type", None)
defaultized_ds_hpms = HParams(ds_hpms,
_default_dataset_hparams(data_type))
defaultized_datasets_hparams.append(defaultized_ds_hpms)
self._hparams.datasets = defaultized_datasets_hparams
with tf.name_scope(self.name, self.default_hparams()["name"]):
self._make_data()
@staticmethod
def default_hparams():
"""Returns a dictionary of default hyperparameters.
.. code-block:: python
{
# (1) Hyperparams specific to text dataset
"datasets": []
# (2) General hyperparams
"num_epochs": 1,
"batch_size": 64,
"allow_smaller_final_batch": True,
"shuffle": True,
"shuffle_buffer_size": None,
"shard_and_shuffle": False,
"num_parallel_calls": 1,
"prefetch_buffer_size": 0,
"max_dataset_size": -1,
"seed": None,
"name": "multi_aligned_data",
}
Here:
1. "datasets" is a list of `dict` each of which specifies a
text or scalar dataset. The :attr:`"data_name"` field of each dataset
is used as the name prefix of the data fields from the respective
dataset. The :attr:`"data_name"` fields of different datasets must
be distinct.
- For scalar dataset, the allowed hyperparameters and default \
values are the same as the "dataset" field of \
:meth:`texar.data.ScalarData.default_hparams`. Note that \
:attr:`"data_type"` must be explicitly specified \
(either "int" or "float"). \
- For text dataset, the allowed hyperparameters and default values\
are the same as the "dataset" field of \
:meth:`texar.data.MonoTextData.default_hparams`, with several \
extra hyperparameters:
"data_type" : str
The type of the dataset, one of {"text", "int", "float"}.
If set to "int" or "float", the dataset is considered to be
a scalar dataset. If not specified or set to "text", the
dataset is considered to be a text dataset.
"vocab_share_with" : int, optional
Share the vocabulary of a preceding text dataset with the
specified index in the list (starting from 0). The
specified dataset must be a text dataset, and must have
an index smaller than the current dataset.
If specified, the vocab file of current dataset is ignored.
Default is `None` which disables the vocab sharing.
"embedding_init_share_with": int, optional
Share the embedding initial value of a preceding text
dataset with the specified index in the list (starting
from 0).
The specified dataset must be a text dataset, and must have
an index smaller than the current dataset.
If specified, the :attr:`"embedding_init"` field of
the current dataset is ignored. Default is `None` which
disables the initial value sharing.
"processing_share_with" : int, optional
Share the processing configurations of a preceding text
dataset with the specified index in the list (starting
from 0).
The specified dataset must be a text dataset, and must have
an index smaller than the current dataset.
If specified, relevant field of the current dataset are
ignored, including "delimiter", "bos_token", "eos_token",
and "other_transformations". Default is `None` which
disables the processing sharing.
2. For the **general** hyperparameters, see
:meth:`texar.data.DataBase.default_hparams` for details.
"""
hparams = TextDataBase.default_hparams()
hparams["name"] = "multi_aligned_data"
hparams["datasets"] = []
return hparams
@staticmethod
def _raise_sharing_error(err_data, shr_data, hparam_name):
raise ValueError(
"Must only share specifications with a preceding dataset. "
"Dataset %d has '%s=%d'" % (err_data, hparam_name, shr_data))
@staticmethod
def make_vocab(hparams):
"""Makes a list of vocabs based on the hparams.
Args:
hparams (list): A list of dataset hyperparameters.
Returns:
A list of :class:`texar.data.Vocab` instances. Some instances
may be the same objects if they are set to be shared and have
the same other configs.
"""
if not isinstance(hparams, (list, tuple)):
hparams = [hparams]
vocabs = []
for i, hparams_i in enumerate(hparams):
if not _is_text_data(hparams_i["data_type"]):
vocabs.append(None)
continue
proc_shr = hparams_i["processing_share_with"]
if proc_shr is not None:
bos_token = hparams[proc_shr]["bos_token"]
eos_token = hparams[proc_shr]["eos_token"]
else:
bos_token = hparams_i["bos_token"]
eos_token = hparams_i["eos_token"]
bos_token = utils.default_str(
bos_token, SpecialTokens.BOS)
eos_token = utils.default_str(
eos_token, SpecialTokens.EOS)
vocab_shr = hparams_i["vocab_share_with"]
if vocab_shr is not None:
if vocab_shr >= i:
MultiAlignedData._raise_sharing_error(
i, vocab_shr, "vocab_share_with")
if not vocabs[vocab_shr]:
raise ValueError("Cannot share vocab with dataset %d which "
"does not have a vocab." % vocab_shr)
if bos_token == vocabs[vocab_shr].bos_token and \
eos_token == vocabs[vocab_shr].eos_token:
vocab = vocabs[vocab_shr]
else:
vocab = Vocab(hparams[vocab_shr]["vocab_file"],
bos_token=bos_token,
eos_token=eos_token)
else:
vocab = Vocab(hparams_i["vocab_file"],
bos_token=bos_token,
eos_token=eos_token)
vocabs.append(vocab)
return vocabs
@staticmethod
def make_embedding(hparams, vocabs):
"""Optionally loads embeddings from files (if provided), and
returns respective :class:`texar.data.Embedding` instances.
"""
if not isinstance(hparams, (list, tuple)):
hparams = [hparams]
embs = []
for i, hparams_i in enumerate(hparams):
if not _is_text_data(hparams_i["data_type"]):
embs.append(None)
continue
emb_shr = hparams_i["embedding_init_share_with"]
if emb_shr is not None:
if emb_shr >= i:
MultiAlignedData._raise_sharing_error(
i, emb_shr, "embedding_init_share_with")
if not embs[emb_shr]:
raise ValueError("Cannot share embedding with dataset %d "
"which does not have an embedding." %
emb_shr)
if emb_shr != hparams_i["vocab_share_with"]:
raise ValueError("'embedding_init_share_with' != "
"vocab_share_with. embedding_init can "
"be shared only when vocab is shared.")
emb = embs[emb_shr]
else:
emb = None
emb_file = hparams_i["embedding_init"]["file"]
if emb_file and emb_file != "":
emb = Embedding(vocabs[i].token_to_id_map_py,
hparams_i["embedding_init"])
embs.append(emb)
return embs
def _make_dataset(self):
datasets = []
for _, hparams_i in enumerate(self._hparams.datasets):
dtype = hparams_i.data_type
if _is_text_data(dtype) or _is_scalar_data(dtype):
dataset = tf.data.TextLineDataset(
hparams_i.files,
compression_type=hparams_i.compression_type)
datasets.append(dataset)
else:
raise ValueError("Unknown data type: %s" % hparams_i.data_type)
return tf.data.Dataset.zip(tuple(datasets))
#@staticmethod
#def _get_name_prefix(dataset_hparams):
# def _dtype_conflict(dtype_1, dtype_2):
# conflict = ((dtype_1 == dtype_2) or
# (dtype_1 in {_DataTypes.INT, _DataTypes.FLOAT} and
# dtype_2 in {_DataTypes.INT, _DataTypes.FLOAT}))
# return conflict
# name_prefix = [hpms["data_name"] for hpms in dataset_hparams]
# name_prefix_dict = {}
# for i, np in enumerate(name_prefix):
# ids = name_prefix_dict.get(np, [])
# for j in ids:
# if _dtype_conflict(dataset_hparams[j]["data_type"],
# dataset_hparams[i]["data_type"]):
# raise ValueError(
# "'data_name' of the datasets with compatible "
# "data_types cannot be the same: %d-th dataset and "
# "%d-th dataset have the same name '%s'" %
# (i, j, name_prefix[i]))
# ids.append(i)
# name_prefix_dict[np] = ids
# return name_prefix
@staticmethod
def _get_name_prefix(dataset_hparams):
name_prefix = [hpms["data_name"] for hpms in dataset_hparams]
for i in range(1, len(name_prefix)):
if name_prefix[i] in name_prefix[:i]:
raise ValueError("Data name duplicated: %s" % name_prefix[i])
return name_prefix
@staticmethod
def _make_processor(dataset_hparams, data_spec, name_prefix):
processors = []
for i, hparams_i in enumerate(dataset_hparams):
data_spec_i = data_spec.get_ith_data_spec(i)
data_type = hparams_i["data_type"]
if _is_text_data(data_type):
tgt_proc_hparams = hparams_i
proc_shr = hparams_i["processing_share_with"]
if proc_shr is not None:
tgt_proc_hparams = copy.copy(dataset_hparams[proc_shr])
try:
tgt_proc_hparams["variable_utterance"] = \
hparams_i["variable_utterance"]
except TypeError:
tgt_proc_hparams.variable_utterance = \
hparams_i["variable_utterance"]
processor, data_spec_i = MonoTextData._make_processor(
tgt_proc_hparams, data_spec_i)
elif _is_scalar_data(data_type):
processor, data_spec_i = ScalarData._make_processor(
hparams_i, data_spec_i, name_prefix='')
else:
raise ValueError("Unsupported data type: %s" % data_type)
processors.append(processor)
data_spec.set_ith_data_spec(i, data_spec_i, len(dataset_hparams))
tran_fn = dsutils.make_combined_transformation(
processors, name_prefix=name_prefix)
data_spec.add_spec(name_prefix=name_prefix)
return tran_fn, data_spec
@staticmethod
def _make_length_filter(dataset_hparams, length_name, decoder):
filter_fns = []
for i, hpms in enumerate(dataset_hparams):
if not _is_text_data(hpms["data_type"]):
filter_fn = None
else:
filter_fn = MonoTextData._make_length_filter(
hpms, length_name[i], decoder[i])
filter_fns.append(filter_fn)
combined_filter_fn = dsutils._make_combined_filter_fn(filter_fns)
return combined_filter_fn
def _process_dataset(self, dataset, hparams, data_spec):
name_prefix = self._get_name_prefix(hparams["datasets"])
# pylint: disable=attribute-defined-outside-init
self._name_to_id = {v:k for k, v in enumerate(name_prefix)}
tran_fn, data_spec = self._make_processor(
hparams["datasets"], data_spec, name_prefix)
num_parallel_calls = hparams["num_parallel_calls"]
dataset = dataset.map(
lambda *args: tran_fn(dsutils.maybe_tuple(args)),
num_parallel_calls=num_parallel_calls)
# Filters by length
def _get_length_name(i):
if not _is_text_data(hparams["datasets"][i]["data_type"]):
return None
name = dsutils._connect_name(
data_spec.name_prefix[i],
data_spec.decoder[i].length_tensor_name)
return name
filter_fn = self._make_length_filter(
hparams["datasets"],
[_get_length_name(i) for i in range(len(hparams["datasets"]))],
data_spec.decoder)
if filter_fn:
dataset = dataset.filter(filter_fn)
# Truncates data count
dataset = dataset.take(hparams["max_dataset_size"])
return dataset, data_spec
def _make_bucket_length_fn(self):
length_fn = self._hparams.bucket_length_fn
if not length_fn:
# Uses the length of the first text data
i = -1
for i, hparams_i in enumerate(self._hparams.datasets):
if _is_text_data(hparams_i["data_type"]):
break
if i < 0:
raise ValueError("Undefined `length_fn`.")
length_fn = lambda x: x[self.length_name(i)]
elif not is_callable(length_fn):
# pylint: disable=redefined-variable-type
length_fn = utils.get_function(length_fn, ["texar.custom"])
return length_fn
def _make_padded_shapes(self, dataset, decoders):
padded_shapes = dataset.output_shapes
for i, hparams_i in enumerate(self._hparams.datasets):
if not _is_text_data(hparams_i["data_type"]):
continue
if not hparams_i["pad_to_max_seq_length"]:
continue
text_and_id_shapes = MonoTextData._make_padded_text_and_id_shapes(
dataset, hparams_i, decoders[i],
self.text_name(i), self.text_id_name(i))
padded_shapes.update(text_and_id_shapes)
return padded_shapes
def _make_data(self):
self._vocab = self.make_vocab(self._hparams.datasets)
self._embedding = self.make_embedding(self._hparams.datasets,
self._vocab)
# Create dataset
dataset = self._make_dataset()
dataset, dataset_size = self._shuffle_dataset(
dataset, self._hparams, self._hparams.datasets[0].files)
self._dataset_size = dataset_size
# Processing
data_spec = dsutils._DataSpec(dataset=dataset,
dataset_size=self._dataset_size,
vocab=self._vocab,
embedding=self._embedding)
dataset, data_spec = self._process_dataset(
dataset, self._hparams, data_spec)
self._data_spec = data_spec
self._decoder = data_spec.decoder
# Batching
length_fn = self._make_bucket_length_fn()
padded_shapes = self._make_padded_shapes(dataset, self._decoder)
dataset = self._make_batch(
dataset, self._hparams, length_fn, padded_shapes)
# Prefetching
if self._hparams.prefetch_buffer_size > 0:
dataset = dataset.prefetch(self._hparams.prefetch_buffer_size)
self._dataset = dataset
def list_items(self):
"""Returns the list of item names that the data can produce.
Returns:
A list of strings.
"""
return list(self._dataset.output_types.keys())
@property
def dataset(self):
"""The dataset.
"""
return self._dataset
def dataset_size(self):
"""Returns the number of data instances in the dataset.
Note that this is the total data count in the raw files, before any
filtering and truncation.
"""
if not self._dataset_size:
# pylint: disable=attribute-defined-outside-init
self._dataset_size = count_file_lines(
self._hparams.datasets[0].files)
return self._dataset_size
def _maybe_name_to_id(self, name_or_id):
if is_str(name_or_id):
if name_or_id not in self._name_to_id:
raise ValueError("Unknown data name: {}".format(name_or_id))
return self._name_to_id[name_or_id]
return name_or_id
def vocab(self, name_or_id):
"""Returns the :class:`~texar.data.Vocab` of text dataset by its name
or id. `None` if the dataset is not of text type.
Args:
name_or_id (str or int): Data name or the index of text dataset.
"""
i = self._maybe_name_to_id(name_or_id)
return self._vocab[i]
def embedding_init_value(self, name_or_id):
"""Returns the `Tensor` of embedding init value of the
dataset by its name or id. `None` if the dataset is not of text type.
"""
i = self._maybe_name_to_id(name_or_id)
return self._embedding[i]
def text_name(self, name_or_id):
"""The name of text tensor of text dataset by its name or id. If the
dataset is not of text type, returns `None`.
"""
i = self._maybe_name_to_id(name_or_id)
if not _is_text_data(self._hparams.datasets[i]["data_type"]):
return None
name = dsutils._connect_name(
self._data_spec.name_prefix[i],
self._data_spec.decoder[i].text_tensor_name)
return name
def length_name(self, name_or_id):
"""The name of length tensor of text dataset by its name or id. If the
dataset is not of text type, returns `None`.
"""
i = self._maybe_name_to_id(name_or_id)
if not _is_text_data(self._hparams.datasets[i]["data_type"]):
return None
name = dsutils._connect_name(
self._data_spec.name_prefix[i],
self._data_spec.decoder[i].length_tensor_name)
return name
def text_id_name(self, name_or_id):
"""The name of length tensor of text dataset by its name or id. If the
dataset is not of text type, returns `None`.
"""
i = self._maybe_name_to_id(name_or_id)
if not _is_text_data(self._hparams.datasets[i]["data_type"]):
return None
name = dsutils._connect_name(
self._data_spec.name_prefix[i],
self._data_spec.decoder[i].text_id_tensor_name)
return name
def utterance_cnt_name(self, name_or_id):
"""The name of utterance count tensor of text dataset by its name or id.
If the dataset is not variable utterance text data, returns `None`.
"""
i = self._maybe_name_to_id(name_or_id)
if not _is_text_data(self._hparams.datasets[i]["data_type"]) or \
not self._hparams.datasets[i]["variable_utterance"]:
return None
name = dsutils._connect_name(
self._data_spec.name_prefix[i],
self._data_spec.decoder[i].utterance_cnt_tensor_name)
return name
def data_name(self, name_or_id):
"""The name of the data tensor of scalar dataset by its name or id.
If the dataset is not a scalar data, returns `None`.
"""
i = self._maybe_name_to_id(name_or_id)
if not _is_scalar_data(self._hparams.datasets[i]["data_type"]):
return None
name = dsutils._connect_name(
self._data_spec.name_prefix[i],
self._data_spec.decoder[i].data_tensor_name)
return name
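The `data_name` uniqueness rule enforced by `_get_name_prefix` above can be illustrated in isolation. The following is a standalone sketch (not part of the Texar sources, no Texar imports): each dataset's "data_name" must be unique because it becomes the prefix of that dataset's output fields (e.g. "0_text", "label").

```python
# Standalone sketch of MultiAlignedData._get_name_prefix's duplicate check.
def get_name_prefix(dataset_hparams):
    name_prefix = [hpms["data_name"] for hpms in dataset_hparams]
    for i in range(1, len(name_prefix)):
        # Compare against *all* preceding names (name_prefix[:i]),
        # including the immediately preceding one.
        if name_prefix[i] in name_prefix[:i]:
            raise ValueError("Data name duplicated: %s" % name_prefix[i])
    return name_prefix

print(get_name_prefix([{"data_name": "0"},
                       {"data_name": "1"},
                       {"data_name": "label"}]))
```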
================================================
FILE: texar_repo/texar/data/data/multi_aligned_data_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for data related operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tempfile
import copy
import numpy as np
import tensorflow as tf
import texar as tx
# pylint: disable=too-many-locals, too-many-branches, protected-access
class MultiAlignedDataTest(tf.test.TestCase):
"""Tests multi aligned text data class.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
# Create test data
vocab_list = ['This', 'is', 'a', 'word', '词']
vocab_file = tempfile.NamedTemporaryFile()
vocab_file.write('\n'.join(vocab_list).encode("utf-8"))
vocab_file.flush()
self._vocab_file = vocab_file
self._vocab_size = len(vocab_list)
text_0 = ['This is a sentence from source .', '词 词 。 source']
text_0_file = tempfile.NamedTemporaryFile()
text_0_file.write('\n'.join(text_0).encode("utf-8"))
text_0_file.flush()
self._text_0_file = text_0_file
text_1 = ['This is a sentence from target .', '词 词 。 target']
text_1_file = tempfile.NamedTemporaryFile()
text_1_file.write('\n'.join(text_1).encode("utf-8"))
text_1_file.flush()
self._text_1_file = text_1_file
text_2 = [
'This is a sentence from dialog . ||| dialog ',
'词 词 。 ||| 词 dialog']
text_2_file = tempfile.NamedTemporaryFile()
text_2_file.write('\n'.join(text_2).encode("utf-8"))
text_2_file.flush()
self._text_2_file = text_2_file
int_3 = [0, 1]
int_3_file = tempfile.NamedTemporaryFile()
int_3_file.write(('\n'.join([str(_) for _ in int_3])).encode("utf-8"))
int_3_file.flush()
self._int_3_file = int_3_file
# Construct database
self._hparams = {
"num_epochs": 123,
"batch_size": 23,
"datasets": [
{ # dataset 0
"files": [self._text_0_file.name],
"vocab_file": self._vocab_file.name,
"bos_token": "",
"data_name": "0"
},
{ # dataset 1
"files": [self._text_1_file.name],
"vocab_share_with": 0,
"eos_token": "",
"data_name": "1"
},
{ # dataset 2
"files": [self._text_2_file.name],
"vocab_file": self._vocab_file.name,
"processing_share_with": 0,
"variable_utterance": True,
"data_name": "2"
},
{ # dataset 3
"files": self._int_3_file.name,
"data_type": "int",
"data_name": "label"
},
]
}
def _run_and_test(self, hparams, discard_did=None):
# Construct database
text_data = tx.data.MultiAlignedData(hparams)
self.assertEqual(
text_data.vocab(0).size,
self._vocab_size + len(text_data.vocab(0).special_tokens))
iterator = text_data.dataset.make_initializable_iterator()
text_data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
while True:
try:
# Run the logics
data_batch_ = sess.run(text_data_batch)
self.assertEqual(set(data_batch_.keys()),
set(text_data.list_items()))
self.assertEqual(text_data.utterance_cnt_name('2'),
'2_utterance_cnt')
text_0 = data_batch_['0_text']
text_1 = data_batch_['1_text']
text_2 = data_batch_['2_text']
int_3 = data_batch_['label']
# pylint: disable=invalid-name
for t0, t1, t2, i3 in zip(text_0, text_1, text_2, int_3):
np.testing.assert_array_equal(
t0[:2], t1[1:3])
np.testing.assert_array_equal(
t0[:3], t2[0][:3])
if t0[0].startswith(b'This'):
self.assertEqual(i3, 0)
else:
self.assertEqual(i3, 1)
if discard_did is not None:
hpms = text_data._hparams.datasets[discard_did]
max_l = hpms.max_seq_length
max_l += text_data._decoder[discard_did].added_length
for i in range(2):
for length in data_batch_[text_data.length_name(i)]:
self.assertLessEqual(length, max_l)
for lengths in data_batch_[text_data.length_name(2)]:
for length in lengths:
self.assertLessEqual(length, max_l)
for i, hpms in enumerate(text_data._hparams.datasets):
if hpms.data_type != "text":
continue
max_l = hpms.max_seq_length
mode = hpms.length_filter_mode
if max_l is not None and mode == "truncate":
max_l += text_data._decoder[i].added_length
for length in data_batch_[text_data.length_name(i)]:
self.assertLessEqual(length, max_l)
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
break
def test_default_setting(self):
"""Tests the logics of the text data.
"""
self._run_and_test(self._hparams)
def test_length_filter(self):
"""Tests filtering by length.
"""
hparams = copy.copy(self._hparams)
hparams["datasets"][0].update(
{"max_seq_length": 4,
"length_filter_mode": "discard"})
hparams["datasets"][1].update(
{"max_seq_length": 2,
"length_filter_mode": "truncate"})
self._run_and_test(hparams, discard_did=0)
if __name__ == "__main__":
tf.test.main()
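The sharing constraints that `make_vocab` and `make_embedding` enforce via `_raise_sharing_error` can also be shown in isolation. The following is a standalone sketch (not part of the Texar sources, no Texar imports): a dataset may share its vocab only with a *text* dataset at a smaller index in the "datasets" list.

```python
# Standalone sketch of the "share only with a preceding text dataset" rule.
def check_vocab_sharing(datasets):
    for i, hpms in enumerate(datasets):
        shr = hpms.get("vocab_share_with")
        if shr is None:
            continue
        if shr >= i:
            # Sharing must point backwards in the list.
            raise ValueError(
                "Must only share specifications with a preceding dataset. "
                "Dataset %d has 'vocab_share_with=%d'" % (i, shr))
        if datasets[shr].get("data_type", "text") != "text":
            # Scalar (int/float) datasets have no vocab to share.
            raise ValueError("Cannot share vocab with dataset %d which "
                             "does not have a vocab." % shr)

# Mirrors the test fixture above: dataset 1 shares vocab with dataset 0.
check_vocab_sharing([
    {"data_name": "0"},
    {"data_name": "1", "vocab_share_with": 0},
    {"data_name": "label", "data_type": "int"},
])
```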
================================================
FILE: texar_repo/texar/data/data/paired_text_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Paired text data that consists of source text and target text.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import copy
import tensorflow as tf
from texar.utils import utils
from texar.utils.dtypes import is_callable
from texar.data.data.mono_text_data import _default_mono_text_dataset_hparams
from texar.data.data.text_data_base import TextDataBase
from texar.data.data.mono_text_data import MonoTextData
from texar.data.data_utils import count_file_lines
from texar.data.data import dataset_utils as dsutils
from texar.data.vocabulary import Vocab, SpecialTokens
from texar.data.embedding import Embedding
# pylint: disable=invalid-name, arguments-differ, not-context-manager
# pylint: disable=protected-access, too-many-arguments
__all__ = [
"_default_paired_text_dataset_hparams",
"PairedTextData"
]
def _default_paired_text_dataset_hparams():
"""Returns hyperparameters of a paired text dataset with default values.
See :meth:`texar.data.PairedTextData.default_hparams` for details.
"""
source_hparams = _default_mono_text_dataset_hparams()
source_hparams["bos_token"] = None
source_hparams["data_name"] = "source"
target_hparams = _default_mono_text_dataset_hparams()
target_hparams.update(
{
"vocab_share": False,
"embedding_init_share": False,
"processing_share": False,
"data_name": "target"
}
)
return {
"source_dataset": source_hparams,
"target_dataset": target_hparams
}
# pylint: disable=too-many-instance-attributes, too-many-public-methods
class PairedTextData(TextDataBase):
"""Text data processor that reads parallel source and target text.
This can be used in, e.g., seq2seq models.
Args:
hparams (dict): Hyperparameters. See :meth:`default_hparams` for the
defaults.
By default, the processor reads raw data files, performs tokenization,
batching and other pre-processing steps, and results in a TF Dataset
whose element is a python `dict` including six fields:
- "source_text":
A string Tensor of shape `[batch_size, max_time]` containing
the **raw** text tokens of source sequences. `max_time` is the
length of the longest sequence in the batch.
Short sequences in the batch are padded with **empty string**.
By default only EOS token is appended to each sequence.
Out-of-vocabulary tokens are **NOT** replaced with UNK.
- "source_text_ids":
An `int64` Tensor of shape `[batch_size, max_time]`
containing the token indexes of source sequences.
- "source_length":
An `int` Tensor of shape `[batch_size]` containing the
length of each source sequence in the batch (including BOS and/or
EOS if added).
- "target_text":
A string Tensor as "source_text" but for target sequences. By
default both BOS and EOS are added.
- "target_text_ids":
An `int64` Tensor as "source_text_ids" but for target sequences.
- "target_length":
An `int` Tensor of shape `[batch_size]` as "source_length" but for
target sequences.
If :attr:`'variable_utterance'` is set to `True` in :attr:`'source_dataset'`
and/or :attr:`'target_dataset'` of :attr:`hparams`, the corresponding
fields "source_*" and/or "target_*" are respectively changed to contain
variable utterance text data, as in :class:`~texar.data.MonoTextData`.
The above field names can be accessed through :attr:`source_text_name`,
:attr:`source_text_id_name`, :attr:`source_length_name`,
:attr:`source_utterance_cnt_name`, and those prefixed with `target_`,
respectively.
Example:
.. code-block:: python
hparams={
'source_dataset': {'files': 's', 'vocab_file': 'vs'},
'target_dataset': {'files': ['t1', 't2'], 'vocab_file': 'vt'},
'batch_size': 1
}
data = PairedTextData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()
iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
# 'source_text': [['source', 'sequence', '']],
# 'source_text_ids': [[5, 10, 2]],
# 'source_length': [3]
# 'target_text': [['', 'target', 'sequence', '1', '']],
# 'target_text_ids': [[1, 6, 10, 20, 2]],
# 'target_length': [5]
# }
"""
def __init__(self, hparams):
TextDataBase.__init__(self, hparams)
with tf.name_scope(self.name, self.default_hparams()["name"]):
self._make_data()
@staticmethod
def default_hparams():
"""Returns a dicitionary of default hyperparameters.
.. code-block:: python
{
# (1) Hyperparams specific to text dataset
"source_dataset": {
"files": [],
"compression_type": None,
"vocab_file": "",
"embedding_init": {},
"delimiter": " ",
"max_seq_length": None,
"length_filter_mode": "truncate",
"pad_to_max_seq_length": False,
"bos_token": None,
"eos_token": "",
"other_transformations": [],
"variable_utterance": False,
"utterance_delimiter": "|||",
"max_utterance_cnt": 5,
"data_name": "source",
},
"target_dataset": {
# ...
# Same fields are allowed as in "source_dataset" with the
# same default values, except the
# following new fields/values:
"bos_token": ""
"vocab_share": False,
"embedding_init_share": False,
"processing_share": False,
"data_name": "target"
}
# (2) General hyperparams
"num_epochs": 1,
"batch_size": 64,
"allow_smaller_final_batch": True,
"shuffle": True,
"shuffle_buffer_size": None,
"shard_and_shuffle": False,
"num_parallel_calls": 1,
"prefetch_buffer_size": 0,
"max_dataset_size": -1,
"seed": None,
"name": "paired_text_data",
# (3) Bucketing
"bucket_boundaries": [],
"bucket_batch_sizes": None,
"bucket_length_fn": None,
}
Here:
1. Hyperparameters in the :attr:`"source_dataset"` and
attr:`"target_dataset"` fields have the same definition as those
in :meth:`texar.data.MonoTextData.default_hparams`, for source and
target text, respectively.
For the new hyperparameters in "target_dataset":
"vocab_share" : bool
Whether to share the vocabulary of source.
If `True`, the vocab file of target is ignored.
"embedding_init_share" : bool
Whether to share the embedding initial value of source. If
`True`, :attr:`"embedding_init"` of target is ignored.
:attr:`"vocab_share"` must be true to share the embedding
initial value.
"processing_share" : bool
Whether to share the processing configurations of source,
including
"delimiter", "bos_token", "eos_token", and
"other_transformations".
2. For the **general** hyperparameters, see
:meth:`texar.data.DataBase.default_hparams` for details.
3. For **bucketing** hyperparameters, see
:meth:`texar.data.MonoTextData.default_hparams` for details, except
that the default bucket_length_fn is the maximum sequence length
of source and target sequences.
"""
hparams = TextDataBase.default_hparams()
hparams["name"] = "paired_text_data"
hparams.update(_default_paired_text_dataset_hparams())
return hparams
@staticmethod
def make_vocab(src_hparams, tgt_hparams):
"""Reads vocab files and returns source vocab and target vocab.
Args:
src_hparams (dict or HParams): Hyperparameters of source dataset.
tgt_hparams (dict or HParams): Hyperparameters of target dataset.
Returns:
A pair of :class:`texar.data.Vocab` instances. The two instances
may be the same objects if source and target vocabs are shared
and have the same other configs.
"""
src_vocab = MonoTextData.make_vocab(src_hparams)
if tgt_hparams["processing_share"]:
tgt_bos_token = src_hparams["bos_token"]
tgt_eos_token = src_hparams["eos_token"]
else:
tgt_bos_token = tgt_hparams["bos_token"]
tgt_eos_token = tgt_hparams["eos_token"]
tgt_bos_token = utils.default_str(tgt_bos_token,
SpecialTokens.BOS)
tgt_eos_token = utils.default_str(tgt_eos_token,
SpecialTokens.EOS)
if tgt_hparams["vocab_share"]:
if tgt_bos_token == src_vocab.bos_token and \
tgt_eos_token == src_vocab.eos_token:
tgt_vocab = src_vocab
else:
tgt_vocab = Vocab(src_hparams["vocab_file"],
bos_token=tgt_bos_token,
eos_token=tgt_eos_token)
else:
tgt_vocab = Vocab(tgt_hparams["vocab_file"],
bos_token=tgt_bos_token,
eos_token=tgt_eos_token)
return src_vocab, tgt_vocab
@staticmethod
def make_embedding(src_emb_hparams, src_token_to_id_map,
tgt_emb_hparams=None, tgt_token_to_id_map=None,
emb_init_share=False):
"""Optionally loads source and target embeddings from files
(if provided), and returns respective :class:`texar.data.Embedding`
instances.
"""
src_embedding = MonoTextData.make_embedding(src_emb_hparams,
src_token_to_id_map)
if emb_init_share:
tgt_embedding = src_embedding
else:
tgt_emb_file = tgt_emb_hparams["file"]
tgt_embedding = None
if tgt_emb_file is not None and tgt_emb_file != "":
tgt_embedding = Embedding(tgt_token_to_id_map, tgt_emb_hparams)
return src_embedding, tgt_embedding
def _make_dataset(self):
src_dataset = tf.data.TextLineDataset(
self._hparams.source_dataset.files,
compression_type=self._hparams.source_dataset.compression_type)
tgt_dataset = tf.data.TextLineDataset(
self._hparams.target_dataset.files,
compression_type=self._hparams.target_dataset.compression_type)
return tf.data.Dataset.zip((src_dataset, tgt_dataset))
@staticmethod
def _get_name_prefix(src_hparams, tgt_hparams):
name_prefix = [
src_hparams["data_name"], tgt_hparams["data_name"]]
if name_prefix[0] == name_prefix[1]:
raise ValueError("'data_name' of source and target "
"datasets cannot be the same.")
return name_prefix
@staticmethod
def _make_processor(src_hparams, tgt_hparams, data_spec, name_prefix):
# Create source data decoder
data_spec_i = data_spec.get_ith_data_spec(0)
src_decoder, src_trans, data_spec_i = MonoTextData._make_processor(
src_hparams, data_spec_i, chained=False)
data_spec.set_ith_data_spec(0, data_spec_i, 2)
# Create target data decoder
tgt_proc_hparams = tgt_hparams
if tgt_hparams["processing_share"]:
tgt_proc_hparams = copy.copy(src_hparams)
try:
tgt_proc_hparams["variable_utterance"] = \
tgt_hparams["variable_utterance"]
except TypeError:
tgt_proc_hparams.variable_utterance = \
tgt_hparams["variable_utterance"]
data_spec_i = data_spec.get_ith_data_spec(1)
tgt_decoder, tgt_trans, data_spec_i = MonoTextData._make_processor(
tgt_proc_hparams, data_spec_i, chained=False)
data_spec.set_ith_data_spec(1, data_spec_i, 2)
tran_fn = dsutils.make_combined_transformation(
[[src_decoder] + src_trans, [tgt_decoder] + tgt_trans],
name_prefix=name_prefix)
data_spec.add_spec(name_prefix=name_prefix)
return tran_fn, data_spec
@staticmethod
def _make_length_filter(src_hparams, tgt_hparams,
src_length_name, tgt_length_name,
src_decoder, tgt_decoder):
src_filter_fn = MonoTextData._make_length_filter(
src_hparams, src_length_name, src_decoder)
tgt_filter_fn = MonoTextData._make_length_filter(
tgt_hparams, tgt_length_name, tgt_decoder)
combined_filter_fn = dsutils._make_combined_filter_fn(
[src_filter_fn, tgt_filter_fn])
return combined_filter_fn
def _process_dataset(self, dataset, hparams, data_spec):
name_prefix = PairedTextData._get_name_prefix(
hparams["source_dataset"], hparams["target_dataset"])
tran_fn, data_spec = self._make_processor(
hparams["source_dataset"], hparams["target_dataset"],
data_spec, name_prefix=name_prefix)
num_parallel_calls = hparams["num_parallel_calls"]
dataset = dataset.map(
lambda *args: tran_fn(dsutils.maybe_tuple(args)),
num_parallel_calls=num_parallel_calls)
# Filters by length
src_length_name = dsutils._connect_name(
data_spec.name_prefix[0],
data_spec.decoder[0].length_tensor_name)
tgt_length_name = dsutils._connect_name(
data_spec.name_prefix[1],
data_spec.decoder[1].length_tensor_name)
filter_fn = self._make_length_filter(
hparams["source_dataset"], hparams["target_dataset"],
src_length_name, tgt_length_name,
data_spec.decoder[0], data_spec.decoder[1])
if filter_fn:
dataset = dataset.filter(filter_fn)
# Truncates data count
dataset = dataset.take(hparams["max_dataset_size"])
return dataset, data_spec
def _make_bucket_length_fn(self):
length_fn = self._hparams.bucket_length_fn
if not length_fn:
length_fn = lambda x: tf.maximum(
x[self.source_length_name], x[self.target_length_name])
elif not is_callable(length_fn):
# pylint: disable=redefined-variable-type
length_fn = utils.get_function(length_fn, ["texar.custom"])
return length_fn
def _make_padded_shapes(self, dataset, src_decoder, tgt_decoder):
src_text_and_id_shapes = {}
if self._hparams.source_dataset.pad_to_max_seq_length:
src_text_and_id_shapes = \
MonoTextData._make_padded_text_and_id_shapes(
dataset, self._hparams.source_dataset, src_decoder,
self.source_text_name, self.source_text_id_name)
tgt_text_and_id_shapes = {}
if self._hparams.target_dataset.pad_to_max_seq_length:
tgt_text_and_id_shapes = \
MonoTextData._make_padded_text_and_id_shapes(
dataset, self._hparams.target_dataset, tgt_decoder,
self.target_text_name, self.target_text_id_name)
padded_shapes = dataset.output_shapes
padded_shapes.update(src_text_and_id_shapes)
padded_shapes.update(tgt_text_and_id_shapes)
return padded_shapes
def _make_data(self):
self._src_vocab, self._tgt_vocab = self.make_vocab(
self._hparams.source_dataset, self._hparams.target_dataset)
tgt_hparams = self._hparams.target_dataset
if not tgt_hparams.vocab_share and tgt_hparams.embedding_init_share:
raise ValueError("embedding_init can be shared only when vocab "
"is shared. Got `vocab_share=False, "
"emb_init_share=True`.")
self._src_embedding, self._tgt_embedding = self.make_embedding(
self._hparams.source_dataset.embedding_init,
self._src_vocab.token_to_id_map_py,
self._hparams.target_dataset.embedding_init,
self._tgt_vocab.token_to_id_map_py,
self._hparams.target_dataset.embedding_init_share)
# Create dataset
dataset = self._make_dataset()
dataset, dataset_size = self._shuffle_dataset(
dataset, self._hparams, self._hparams.source_dataset.files)
self._dataset_size = dataset_size
# Processing.
data_spec = dsutils._DataSpec(
dataset=dataset, dataset_size=self._dataset_size,
vocab=[self._src_vocab, self._tgt_vocab],
embedding=[self._src_embedding, self._tgt_embedding])
dataset, data_spec = self._process_dataset(
dataset, self._hparams, data_spec)
self._data_spec = data_spec
self._decoder = data_spec.decoder
self._src_decoder = data_spec.decoder[0]
self._tgt_decoder = data_spec.decoder[1]
# Batching
length_fn = self._make_bucket_length_fn()
padded_shapes = self._make_padded_shapes(
dataset, self._src_decoder, self._tgt_decoder)
dataset = self._make_batch(
dataset, self._hparams, length_fn, padded_shapes)
# Prefetching
if self._hparams.prefetch_buffer_size > 0:
dataset = dataset.prefetch(self._hparams.prefetch_buffer_size)
self._dataset = dataset
def list_items(self):
"""Returns the list of item names that the data can produce.
Returns:
A list of strings.
"""
return list(self._dataset.output_types.keys())
@property
def dataset(self):
"""The dataset.
"""
return self._dataset
def dataset_size(self):
"""Returns the number of data instances in the dataset.
Note that this is the total data count in the raw files, before any
filtering and truncation.
"""
if not self._dataset_size:
# pylint: disable=attribute-defined-outside-init
self._dataset_size = count_file_lines(
self._hparams.source_dataset.files)
return self._dataset_size
@property
def vocab(self):
"""A pair instances of :class:`~texar.data.Vocab` that are source
and target vocabs, respectively.
"""
return self._src_vocab, self._tgt_vocab
@property
def source_vocab(self):
"""The source vocab, an instance of :class:`~texar.data.Vocab`.
"""
return self._src_vocab
@property
def target_vocab(self):
"""The target vocab, an instance of :class:`~texar.data.Vocab`.
"""
return self._tgt_vocab
@property
def source_embedding_init_value(self):
"""The `Tensor` containing the embedding value of source data
loaded from file. `None` if embedding is not specified.
"""
if self._src_embedding is None:
return None
return self._src_embedding.word_vecs
@property
def target_embedding_init_value(self):
"""The `Tensor` containing the embedding value of target data
loaded from file. `None` if embedding is not specified.
"""
if self._tgt_embedding is None:
return None
return self._tgt_embedding.word_vecs
def embedding_init_value(self):
"""A pair of `Tensor` containing the embedding values of source and
target data loaded from file.
"""
src_emb = self.source_embedding_init_value
tgt_emb = self.target_embedding_init_value
return src_emb, tgt_emb
@property
def source_text_name(self):
"""The name of the source text tensor, "source_text" by default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix[0],
self._src_decoder.text_tensor_name)
return name
@property
def source_length_name(self):
"""The name of the source length tensor, "source_length" by default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix[0],
self._src_decoder.length_tensor_name)
return name
@property
def source_text_id_name(self):
"""The name of the source text index tensor, "source_text_ids" by
default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix[0],
self._src_decoder.text_id_tensor_name)
return name
@property
def source_utterance_cnt_name(self):
"""The name of the source text utterance count tensor,
"source_utterance_cnt" by default.
"""
if not self._hparams.source_dataset.variable_utterance:
raise ValueError(
"`utterance_cnt_name` of source data is undefined.")
name = dsutils._connect_name(
self._data_spec.name_prefix[0],
self._src_decoder.utterance_cnt_tensor_name)
return name
@property
def target_text_name(self):
"""The name of the target text tensor, "target_text" bt default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix[1],
self._tgt_decoder.text_tensor_name)
return name
@property
def target_length_name(self):
"""The name of the target length tensor, "target_length" by default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix[1],
self._tgt_decoder.length_tensor_name)
return name
@property
def target_text_id_name(self):
"""The name of the target text index tensor, "target_text_ids" by
default.
"""
name = dsutils._connect_name(
self._data_spec.name_prefix[1],
self._tgt_decoder.text_id_tensor_name)
return name
@property
def target_utterance_cnt_name(self):
"""The name of the target text utterance count tensor,
"target_utterance_cnt" by default.
"""
if not self._hparams.target_dataset.variable_utterance:
raise ValueError(
"`utterance_cnt_name` of target data is undefined.")
name = dsutils._connect_name(
self._data_spec.name_prefix[1],
self._tgt_decoder.utterance_cnt_tensor_name)
return name
@property
def text_name(self):
"""The name of text tensor, "text" by default.
"""
return self._src_decoder.text_tensor_name
@property
def length_name(self):
"""The name of length tensor, "length" by default.
"""
return self._src_decoder.length_tensor_name
@property
def text_id_name(self):
"""The name of text index tensor, "text_ids" by default.
"""
return self._src_decoder.text_id_tensor_name
@property
def utterance_cnt_name(self):
"""The name of the text utterance count tensor, "utterance_cnt" by
default.
"""
if self._hparams.source_dataset.variable_utterance:
return self._src_decoder.utterance_cnt_tensor_name
if self._hparams.target_dataset.variable_utterance:
return self._tgt_decoder.utterance_cnt_tensor_name
raise ValueError("`utterance_cnt_name` is not defined.")
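The `_make_bucket_length_fn` and `_make_batch` logic above buckets each example by the max of its source and target lengths. A minimal pure-Python sketch of that bucketing rule (illustrative only, not Texar or TensorFlow code; the dict keys mirror the default `source_length`/`target_length` tensor names):

```python
# Illustrative sketch: the default bucket_length_fn takes the max of the
# source and target lengths; bucket_by_sequence_length then assigns each
# example to the first bucket whose boundary exceeds that length.

def default_length(example):
    # mirrors: lambda x: tf.maximum(x[source_length_name], x[target_length_name])
    return max(example["source_length"], example["target_length"])

def bucket_id(length, boundaries):
    # an example of length L goes to the first bucket with boundary > L;
    # lengths beyond the last boundary fall in the final bucket
    for i, b in enumerate(boundaries):
        if length < b:
            return i
    return len(boundaries)

example = {"source_length": 7, "target_length": 12}
print(bucket_id(default_length(example), [5, 10, 15]))  # 2
```

With boundaries `[5, 10, 15]` this yields four buckets; an example whose longer side is 12 lands in bucket 2, and anything of length 15 or more falls in the last bucket.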
================================================
FILE: texar_repo/texar/data/data/paired_text_data_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for data related operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tempfile
import copy
import numpy as np
import tensorflow as tf
import texar as tx
from texar.data import SpecialTokens
# pylint: disable=too-many-locals, too-many-branches, protected-access
# pylint: disable=invalid-name
class PairedTextDataTest(tf.test.TestCase):
"""Tests paired text data class.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
# Create test data
vocab_list = ['This', 'is', 'a', 'word', '词']
vocab_file = tempfile.NamedTemporaryFile()
vocab_file.write('\n'.join(vocab_list).encode("utf-8"))
vocab_file.flush()
self._vocab_file = vocab_file
self._vocab_size = len(vocab_list)
src_text = ['This is a sentence from source .', '词 词 。 source']
src_text_file = tempfile.NamedTemporaryFile()
src_text_file.write('\n'.join(src_text).encode("utf-8"))
src_text_file.flush()
self._src_text_file = src_text_file
tgt_text = ['This is a sentence from target .', '词 词 。 target']
tgt_text_file = tempfile.NamedTemporaryFile()
tgt_text_file.write('\n'.join(tgt_text).encode("utf-8"))
tgt_text_file.flush()
self._tgt_text_file = tgt_text_file
self._hparams = {
"num_epochs": 50,
"batch_size": 3,
"source_dataset": {
"files": [self._src_text_file.name],
"vocab_file": self._vocab_file.name,
},
"target_dataset": {
"files": self._tgt_text_file.name,
"vocab_share": True,
"eos_token": ""
}
}
def _run_and_test(self, hparams, proc_shr=False, length_inc=None,
discard_src=False):
# Construct database
text_data = tx.data.PairedTextData(hparams)
self.assertEqual(
text_data.source_vocab.size,
self._vocab_size + len(text_data.source_vocab.special_tokens))
iterator = text_data.dataset.make_initializable_iterator()
text_data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
if proc_shr:
tgt_eos = b'<EOS>'
else:
tgt_eos = b'<TARGET_EOS>'
while True:
try:
# Run the logics
data_batch_ = sess.run(text_data_batch)
self.assertEqual(set(data_batch_.keys()),
set(text_data.list_items()))
# Test matching
src_text = data_batch_['source_text']
tgt_text = data_batch_['target_text']
if proc_shr:
for src, tgt in zip(src_text, tgt_text):
np.testing.assert_array_equal(src[:3], tgt[:3])
else:
for src, tgt in zip(src_text, tgt_text):
np.testing.assert_array_equal(src[:3], tgt[1:4])
self.assertTrue(
tgt_eos in data_batch_['target_text'][0])
if length_inc:
for i in range(len(data_batch_['source_text'])):
text_ = data_batch_['source_text'][i].tolist()
self.assertEqual(
text_.index(b'<EOS>') + 1,
data_batch_['source_length'][i] - length_inc[0])
for i in range(len(data_batch_['target_text'])):
text_ = data_batch_['target_text'][i].tolist()
self.assertEqual(
text_.index(tgt_eos) + 1,
data_batch_['target_length'][i] - length_inc[1])
if discard_src:
src_hparams = text_data.hparams.source_dataset
max_l = src_hparams.max_seq_length
max_l += text_data._decoder[0].added_length
for l in data_batch_[text_data.source_length_name]:
self.assertLessEqual(l, max_l)
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
break
def test_default_setting(self):
"""Tests the logics of the text data.
"""
self._run_and_test(self._hparams)
def test_shuffle(self):
"""Tests toggling shuffle.
"""
hparams = copy.copy(self._hparams)
hparams["shuffle"] = False
self._run_and_test(hparams)
def test_processing_share(self):
"""Tests sharing processing.
"""
hparams = copy.copy(self._hparams)
hparams["target_dataset"]["processing_share"] = True
self._run_and_test(hparams, proc_shr=True)
def test_other_transformations(self):
"""Tests use of other transformations
"""
def _transform(x, data_specs): # pylint: disable=invalid-name
x[data_specs.decoder.length_tensor_name] += 1
return x
hparams = copy.copy(self._hparams)
hparams["source_dataset"].update(
{"other_transformations": [_transform, _transform]})
hparams["target_dataset"].update(
{"other_transformations": [_transform]})
self._run_and_test(hparams, length_inc=(2, 1))
def test_length_filter(self):
"""Tests filtering by length.
"""
hparams = copy.copy(self._hparams)
hparams["source_dataset"].update(
{"max_seq_length": 4,
"length_filter_mode": "discard"})
self._run_and_test(hparams, discard_src=True)
#def test_sequence_length(self):
# hparams = {
# "batch_size": 64,
# "num_epochs": 1,
# "shuffle": False,
# "allow_smaller_final_batch": False,
# "source_dataset": {
# "files": "../../../data/yelp/sentiment.dev.sort.0",
# "vocab_file": "../../../data/yelp/vocab",
# "bos_token": SpecialTokens.BOS,
# "eos_token": SpecialTokens.EOS,
# },
# "target_dataset": {
# "files": "../../../data/yelp/sentiment.dev.sort.1",
# "vocab_share": True,
# },
# }
# data = tx.data.PairedTextData(hparams)
# iterator = tx.data.TrainTestDataIterator(val=data)
# text_data_batch = iterator.get_next()
# with self.test_session() as sess:
# sess.run(tf.global_variables_initializer())
# sess.run(tf.local_variables_initializer())
# sess.run(tf.tables_initializer())
# iterator.switch_to_val_data(sess)
# while True:
# try:
# data_batch_ = sess.run(text_data_batch)
# src = data_batch_["source_text_ids"]
# src_len = data_batch_["source_length"]
# self.assertEqual(src.shape[1], np.max(src_len))
# tgt = data_batch_["target_text_ids"]
# tgt_len = data_batch_["target_length"]
# self.assertEqual(tgt.shape[1], np.max(tgt_len))
# except tf.errors.OutOfRangeError:
# break
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/data/data/scalar_data.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various data classes that define data reading, parsing, batching, and other
preprocessing operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar.data.data_utils import count_file_lines
from texar.data.data import dataset_utils as dsutils
from texar.data.data.data_base import DataBase
from texar.data.data.mono_text_data import MonoTextData
from texar.data.data_decoders import ScalarDataDecoder
# pylint: disable=invalid-name, arguments-differ, not-context-manager
__all__ = [
"_default_scalar_dataset_hparams",
"ScalarData"
]
def _default_scalar_dataset_hparams():
"""Returns hyperparameters of a scalar dataset with default values.
See :meth:`texar.data.ScalarData.default_hparams` for details.
"""
return {
"files": [],
"compression_type": None,
"data_type": "int",
"data_name": None,
"other_transformations": [],
"@no_typecheck": ["files"]
}
class ScalarData(DataBase):
"""Scalar data where each line of the files is a scalar (int or float),
e.g., a data label.
Args:
hparams (dict): Hyperparameters. See :meth:`default_hparams` for the
defaults.
The processor reads and processes raw data and results in a TF dataset
whose element is a python `dict` including one field. The field name is
specified in :attr:`hparams["dataset"]["data_name"]`. If not specified,
the default name is `"data"`. The field name can be accessed through
:attr:`data_name`.
This field is a Tensor of shape `[batch_size]` containing a batch of
scalars, of either int or float type as specified in :attr:`hparams`.
Example:
.. code-block:: python
hparams={
'dataset': { 'files': 'data.txt', 'data_name': 'label' },
'batch_size': 2
}
data = ScalarData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()
iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
# 'label': [2, 9]
# }
"""
def __init__(self, hparams):
DataBase.__init__(self, hparams)
with tf.name_scope(self.name, self.default_hparams()["name"]):
self._make_data()
@staticmethod
def default_hparams():
"""Returns a dicitionary of default hyperparameters.
.. code-block:: python
{
# (1) Hyperparams specific to scalar dataset
"dataset": {
"files": [],
"compression_type": None,
"data_type": "int",
"other_transformations": [],
"data_name": None,
}
# (2) General hyperparams
"num_epochs": 1,
"batch_size": 64,
"allow_smaller_final_batch": True,
"shuffle": True,
"shuffle_buffer_size": None,
"shard_and_shuffle": False,
"num_parallel_calls": 1,
"prefetch_buffer_size": 0,
"max_dataset_size": -1,
"seed": None,
"name": "scalar_data",
}
Here:
1. For the hyperparameters in the :attr:`"dataset"` field:
"files" : str or list
A (list of) file path(s).
Each line contains a single scalar number.
"compression_type" : str, optional
One of "" (no compression), "ZLIB", or "GZIP".
"data_type" : str
The scalar type. Currently supports "int" and "float".
"other_transformations" : list
A list of transformation functions or function names/paths to
further transform each single data instance.
(More documentation to be added.)
"data_name" : str
Name of the dataset.
2. For the **general** hyperparameters, see
:meth:`texar.data.DataBase.default_hparams` for details.
"""
hparams = DataBase.default_hparams()
hparams["name"] = "scalar_data"
hparams.update({
"dataset": _default_scalar_dataset_hparams()
})
return hparams
@staticmethod
def _get_dtype(dtype_hparam):
if dtype_hparam == "int":
dtype = tf.int32
elif dtype_hparam == "float":
dtype = tf.float32
else:
raise ValueError("Unknown data type: " + dtype_hparam)
return dtype
@staticmethod
def _make_processor(dataset_hparams, data_spec, chained=True,
name_prefix=None):
# Create data decoder
decoder = ScalarDataDecoder(
ScalarData._get_dtype(dataset_hparams["data_type"]),
data_name=name_prefix)
# Create other transformations
data_spec.add_spec(decoder=decoder)
# pylint: disable=protected-access
other_trans = MonoTextData._make_other_transformations(
dataset_hparams["other_transformations"], data_spec)
data_spec.add_spec(name_prefix=name_prefix)
if chained:
chained_tran = dsutils.make_chained_transformation(
[decoder] + other_trans)
return chained_tran, data_spec
else:
return decoder, other_trans, data_spec
def _process_dataset(self, dataset, hparams, data_spec):
chained_tran, data_spec = self._make_processor(
hparams["dataset"], data_spec,
name_prefix=hparams["dataset"]["data_name"])
num_parallel_calls = hparams["num_parallel_calls"]
dataset = dataset.map(
lambda *args: chained_tran(dsutils.maybe_tuple(args)),
num_parallel_calls=num_parallel_calls)
# Truncates data count
dataset = dataset.take(hparams["max_dataset_size"])
return dataset, data_spec
def _make_data(self):
dataset_hparams = self._hparams.dataset
# Create and shuffle dataset
dataset = MonoTextData._make_mono_text_dataset(dataset_hparams)
dataset, dataset_size = self._shuffle_dataset(
dataset, self._hparams, self._hparams.dataset.files)
self._dataset_size = dataset_size
# Processing
# pylint: disable=protected-access
data_spec = dsutils._DataSpec(dataset=dataset,
dataset_size=self._dataset_size)
dataset, data_spec = self._process_dataset(dataset, self._hparams,
data_spec)
self._data_spec = data_spec
self._decoder = data_spec.decoder # pylint: disable=no-member
# Batching
dataset = self._make_batch(dataset, self._hparams)
# Prefetching
if self._hparams.prefetch_buffer_size > 0:
dataset = dataset.prefetch(self._hparams.prefetch_buffer_size)
self._dataset = dataset
def list_items(self):
"""Returns the list of item names that the data can produce.
Returns:
A list of strings.
"""
return list(self._dataset.output_types.keys())
@property
def dataset(self):
"""The dataset.
"""
return self._dataset
def dataset_size(self):
"""Returns the number of data instances in the dataset.
Note that this is the total data count in the raw files, before any
filtering and truncation.
"""
if not self._dataset_size:
# pylint: disable=attribute-defined-outside-init
self._dataset_size = count_file_lines(
self._hparams.dataset.files)
return self._dataset_size
@property
def data_name(self):
"""The name of the data tensor, "data" by default if not specified in
:attr:`hparams`.
"""
return self._decoder.data_tensor_name
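The decode step that `ScalarData` delegates to `ScalarDataDecoder` amounts to casting each raw line to the configured dtype. A pure-Python sketch of that behavior (illustrative only, not the Texar API; `decode_scalar` is a hypothetical name):

```python
# Illustrative sketch: each line of a scalar data file holds one number as
# text, and the decoder's only real work is casting it to the configured
# dtype, mirroring tf.string_to_number / tf.cast in ScalarDataDecoder.

def decode_scalar(raw, data_type="int"):
    caster = int if data_type == "int" else float
    return caster(raw.strip())

print(decode_scalar("42\n"))          # 42
print(decode_scalar("3.5", "float"))  # 3.5
```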
================================================
FILE: texar_repo/texar/data/data/scalar_data_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for data related operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import copy
import tempfile
import numpy as np
import tensorflow as tf
import texar as tx
class ScalarDataTest(tf.test.TestCase):
"""Tests scalar data class.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
# Create test data
# pylint: disable=no-member
int_data = np.linspace(0, 100, num=101, dtype=np.int32).tolist()
int_data = [str(i) for i in int_data]
int_file = tempfile.NamedTemporaryFile()
int_file.write('\n'.join(int_data).encode("utf-8"))
int_file.flush()
self._int_file = int_file
self._int_hparams = {
"num_epochs": 1,
"batch_size": 1,
"shuffle": False,
"dataset": {
"files": self._int_file.name,
"data_type": "int",
"data_name": "label"
}
}
self._float_hparams = {
"num_epochs": 1,
"batch_size": 1,
"shuffle": False,
"dataset": {
"files": self._int_file.name,
"data_type": "float",
"data_name": "feat"
}
}
def _run_and_test(self, hparams):
# Construct database
scalar_data = tx.data.ScalarData(hparams)
self.assertEqual(scalar_data.list_items()[0],
hparams["dataset"]["data_name"])
iterator = scalar_data.dataset.make_initializable_iterator()
data_batch = iterator.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
i = 0
while True:
try:
# Run the logics
data_batch_ = sess.run(data_batch)
self.assertEqual(set(data_batch_.keys()),
set(scalar_data.list_items()))
value = data_batch_[scalar_data.data_name][0]
self.assertEqual(i, value)
i += 1
# pylint: disable=no-member
if hparams["dataset"]["data_type"] == "int":
self.assertTrue(isinstance(value, np.int32))
else:
self.assertTrue(isinstance(value, np.float32))
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
break
def test_default_setting(self):
"""Tests the logics of ScalarData.
"""
self._run_and_test(self._int_hparams)
self._run_and_test(self._float_hparams)
def test_shuffle(self):
"""Tests results of toggling shuffle.
"""
hparams = copy.copy(self._int_hparams)
hparams["batch_size"] = 10
scalar_data = tx.data.ScalarData(hparams)
iterator = scalar_data.dataset.make_initializable_iterator()
data_batch = iterator.get_next()
hparams_sfl = copy.copy(hparams)
hparams_sfl["shuffle"] = True
scalar_data_sfl = tx.data.ScalarData(hparams_sfl)
iterator_sfl = scalar_data_sfl.dataset.make_initializable_iterator()
data_batch_sfl = iterator_sfl.get_next()
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())
sess.run(iterator.initializer)
sess.run(iterator_sfl.initializer)
vals = []
vals_sfl = []
while True:
try:
# Run the logics
data_batch_, data_batch_sfl_ = sess.run([data_batch,
data_batch_sfl])
vals += data_batch_[scalar_data.data_name].tolist()
vals_sfl += data_batch_sfl_[scalar_data.data_name].tolist()
except tf.errors.OutOfRangeError:
print('Done -- epoch limit reached')
break
self.assertEqual(len(vals), len(vals_sfl))
self.assertSetEqual(set(vals), set(vals_sfl))
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/data/data/text_data_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base text data class that is inherited by all text data classes.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar.data.data.data_base import DataBase
from texar.data.data import dataset_utils as dsutils
# pylint: disable=protected-access, arguments-differ
__all__ = [
"TextDataBase"
]
class TextDataBase(DataBase): # pylint: disable=too-few-public-methods
"""Base class inheritted by all text data classes.
"""
def __init__(self, hparams):
DataBase.__init__(self, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of default hyperparameters.
See the specific subclasses for the details.
"""
hparams = DataBase.default_hparams()
hparams.update({
"bucket_boundaries": [],
"bucket_batch_sizes": None,
"bucket_length_fn": None})
return hparams
@staticmethod
def _make_batch(dataset, hparams, element_length_func,
padded_shapes=None, padding_values=None):
dataset = dataset.repeat(hparams.num_epochs)
batch_size = hparams["batch_size"]
bucket_boundaries = hparams["bucket_boundaries"]
if padded_shapes is None:
padded_shapes = dataset.output_shapes
if len(bucket_boundaries) == 0:
if hparams["allow_smaller_final_batch"]:
dataset = dataset.padded_batch(
batch_size, padded_shapes, padding_values=padding_values)
else:
dataset = dataset.apply(
tf.contrib.data.padded_batch_and_drop_remainder(
batch_size, padded_shapes,
padding_values=padding_values))
else:
bucket_batch_size = hparams["bucket_batch_sizes"]
if bucket_batch_size is None:
bucket_batch_size = [batch_size] * (len(bucket_boundaries) + 1)
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(
element_length_func, bucket_boundaries, bucket_batch_size,
padded_shapes=padded_shapes, padding_values=padding_values))
if not hparams["allow_smaller_final_batch"]:
if len(set(bucket_batch_size)) > 1:
raise ValueError(
"Batch size of every bucket must be the same if "
"smaller final batch is not allowed.")
batch_size = bucket_batch_size[0]
filter_fn = dsutils._make_smaller_batch_filter_fn(batch_size)
dataset = dataset.filter(
lambda *args: filter_fn(dsutils.maybe_tuple(args)))
return dataset
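When no `bucket_boundaries` are given, `_make_batch` above falls back to `padded_batch`, which pads every sequence in a batch to the longest one. A minimal pure-Python sketch of that padding semantics (illustrative only, not TensorFlow code):

```python
# Illustrative sketch: padded_batch pads each sequence in a batch with a
# pad value until all sequences match the length of the longest one.

def padded_batch(sequences, pad_value=0):
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

print(padded_batch([[1, 2], [3, 4, 5]]))  # [[1, 2, 0], [3, 4, 5]]
```

The optional `padded_shapes` argument in the real `_make_batch` serves the same role as fixing `max_len` up front, e.g. when `pad_to_max_seq_length` is enabled.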
================================================
FILE: texar_repo/texar/data/data_decoders.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Helper functions and classes for decoding text data which are used after
reading raw text data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
from tensorflow.contrib.slim.python.slim.data import data_decoder
from texar.data.vocabulary import SpecialTokens
# pylint: disable=too-many-instance-attributes, too-many-arguments,
# pylint: disable=no-member, invalid-name
__all__ = [
"ScalarDataDecoder",
"TextDataDecoder",
"VarUttTextDataDecoder"
]
def _append_token(token):
return token is not None and token != ""
class ScalarDataDecoder(data_decoder.DataDecoder):
"""A data decoder that decodes a scalar, e.g., int label or float number.
The only operation is to cast the data into a specified data type.
Args:
dtype: A :tf_main:`tf DType <DType>` that data is cast into. Can be
`tf.int32` or `tf.float32`.
data_name (str): Name of the decoded data.
"""
def __init__(self, dtype=tf.int32, data_name="data"):
self._dtype = dtype
self._data_name = data_name
if self._data_name is None:
self._data_name = "data"
def __call__(self, data):
outputs = self.decode(data, self.list_items())
return dict(zip(self.list_items(), outputs))
def decode(self, data, items):
"""Decodes the data to return the tensors specified by the list of
items.
Args:
data: The scalar data to decode.
items: A list of strings, each of which is the name of the resulting
tensors to retrieve.
Returns:
A list of tensors, each of which corresponds to each item.
"""
data = tf.reshape(data, shape=[])
if data.dtype is tf.string:
decoded_data = tf.string_to_number(data, out_type=self._dtype)
else:
decoded_data = tf.cast(data, self._dtype)
outputs = {
self._data_name: decoded_data
}
return [outputs[item] for item in items]
def list_items(self):
"""Returns the list of item names that the decoder can produce.
Returns:
A list of strings that can be passed to :meth:`decode()`.
"""
return [self._data_name]
@property
def data_tensor_name(self):
"""The name of the data tensor.
"""
return self._data_name
class TextDataDecoder(data_decoder.DataDecoder):
"""A text data decoder that decodes raw text data.
Operations include splitting on word or character level, truncation,
inserting special tokens, mapping text units to indexes, etc.
Args:
split_level (str): The name of split level on which text sequence is
split. Either "word" or "char".
delimiter (str): The delimiter character used when splitting on word
level.
bos_token (str, optional): Special token added to the beginning of
sequences. If it is `None` (default) or an empty string, no
BOS token is added.
eos_token (str, optional): Special token added to the end of
sequences. If it is `None` (default) or an empty string, no EOS
token is added.
max_seq_length (int, optional): Maximum length of output sequences.
Tokens exceeding the maximum length will be truncated. The length
does not include any added bos_token and eos_token. If not
given, no truncation is performed.
token_to_id_map (optional): A
:class:`~tensorflow.contrib.lookup.HashTable` instance that maps
token strings to integer indexes. If not given, the decoder will
not decode text into indexes. :attr:`bos_token` and
:attr:`eos_token` (if given) should have entries in the
:attr:`token_to_id_map` (if given).
text_tensor_name (str): Name of the text tensor results. Used as a
key to retrieve the text tensor.
length_tensor_name (str): Name of the text length tensor results.
text_id_tensor_name (str): Name of the text index tensor results.
"""
def __init__(self,
split_level="word",
delimiter=" ",
bos_token=None,
eos_token=None,
max_seq_length=None,
token_to_id_map=None,
text_tensor_name="text",
length_tensor_name="length",
text_id_tensor_name="text_ids"):
self._split_level = split_level
self._delimiter = delimiter
self._bos_token = bos_token
self._eos_token = eos_token
self._max_seq_length = max_seq_length
self._token_to_id_map = token_to_id_map
self._text_tensor_name = text_tensor_name
self._text_id_tensor_name = text_id_tensor_name
self._length_tensor_name = length_tensor_name
self._added_length = 0
def __call__(self, data):
outputs = self.decode(data, self.list_items())
return dict(zip(self.list_items(), outputs))
def decode(self, data, items):
"""Decodes the data to return the tensors specified by the list of
items.
Args:
data: The text data to decode.
items: A list of strings, each of which is the name of the resulting
tensors to retrieve.
Returns:
A list of tensors, each of which corresponds to each item. If
`token_to_id_map` is not given when constructing the decoder,
returns `None` for the token index item.
"""
# Split
if self._split_level == "word":
tokens = tf.string_split([data], delimiter=self._delimiter).values
elif self._split_level == "char":
raise NotImplementedError
else:
raise ValueError("Unknown split level: %s" % self._split_level)
# Truncate
if self._max_seq_length is not None:
tokens = tokens[:self._max_seq_length]
# Add BOS/EOS tokens
if _append_token(self._bos_token):
tokens = tf.concat([[self._bos_token], tokens], axis=0)
self._added_length += 1
if _append_token(self._eos_token):
tokens = tf.concat([tokens, [self._eos_token]], axis=0)
self._added_length += 1
# Map to index
token_ids = None
if self._token_to_id_map is not None:
token_ids = self._token_to_id_map.lookup(tokens)
outputs = {
self._text_tensor_name: tokens,
self._length_tensor_name: tf.size(tokens),
self._text_id_tensor_name: token_ids
}
return [outputs[item] for item in items]
def list_items(self):
"""Returns the list of item names that the decoder can produce.
Returns:
A list of strings that can be passed to :meth:`decode()`.
"""
return [self._text_tensor_name,
self._length_tensor_name,
self._text_id_tensor_name]
@property
def text_tensor_name(self):
"""The name of text tensor.
"""
return self._text_tensor_name
@text_tensor_name.setter
def text_tensor_name(self, name):
self._text_tensor_name = name
@property
def length_tensor_name(self):
"""The name of length tensor.
"""
return self._length_tensor_name
@length_tensor_name.setter
def length_tensor_name(self, name):
self._length_tensor_name = name
@property
def text_id_tensor_name(self):
"""The name of text index tensor.
"""
return self._text_id_tensor_name
@text_id_tensor_name.setter
def text_id_tensor_name(self, name):
self._text_id_tensor_name = name
@property
def added_length(self):
"""The added text length due to appended bos and eos tokens.
"""
return self._added_length
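The `decode` pipeline of `TextDataDecoder` above is: split on the delimiter, truncate to `max_seq_length`, then prepend/append the BOS/EOS tokens (truncation happens before the special tokens are added, which is why the reported length excludes them only via `added_length`). A pure-Python sketch of that order of operations (illustrative only; `decode_text` is a hypothetical name, not Texar's API):

```python
# Illustrative sketch of TextDataDecoder.decode: split -> truncate ->
# add BOS/EOS. Note truncation applies before the special tokens are
# added, so they are never cut off.

def decode_text(line, bos=None, eos=None, max_seq_length=None, delimiter=" "):
    tokens = line.split(delimiter)
    if max_seq_length is not None:
        tokens = tokens[:max_seq_length]
    if bos:
        tokens = [bos] + tokens
    if eos:
        tokens = tokens + [eos]
    return tokens, len(tokens)

print(decode_text("a b c d", bos="<BOS>", eos="<EOS>", max_seq_length=3))
# (['<BOS>', 'a', 'b', 'c', '<EOS>'], 5)
```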
class VarUttTextDataDecoder(data_decoder.DataDecoder):
"""A text data decoder that decodes raw text data. Each data is considered
to be multiple sentences concatenated by a delimiter.
Operations include splitting on word or character level, truncation,
inserting special tokens, mapping text units to indexes, etc.
Args:
split_level (str): The name of split level on which text sequence is
split. Either "word" or "char".
delimiter (str): The delimiter character used when splitting on word
level.
bos_token (str, optional): Special token added to the beginning of
sequences. If it is `None` (default) or an empty string, no
BOS token is added.
eos_token (str, optional): Special token added to the end of
sequences. If it is `None` (default) or an empty string, no EOS
token is added.
max_seq_length (int): Maximum length of each sequence.
Tokens exceeding the maximum length will be truncated. Additional
padding will be done to ensure all output sequences reach this
length. The length does not include any added bos_token or
eos_token.
max_utterance_cnt (int): Maximum number of sequences.
Additional empty sentences will be added to
ensure the respective dimension of the output tensor has size
:attr:`max_utterance_cnt`. The output item named by
:meth:`utterance_cnt_tensor_name` contains the actual number of
utterance in the data.
token_to_id_map (optional): A
:class:`~tensorflow.contrib.lookup.HashTable` instance that maps
token strings to integer indexes. If not given, the decoder will
not decode text into indexes. :attr:`bos_token` and
:attr:`eos_token` (if given) should have entries in the
:attr:`token_to_id_map` (if given).
text_tensor_name (str): Name of the text tensor results. Used as a
key to retrieve the text tensor.
length_tensor_name (str): Name of the text length tensor results.
text_id_tensor_name (str): Name of the text index tensor results.
"""
def __init__(self,
split_level="word",
delimiter=" ",
sentence_delimiter="|||",
bos_token=None,
eos_token=None,
max_seq_length=None,
max_utterance_cnt=None,
token_to_id_map=None,
text_tensor_name="text",
length_tensor_name="length",
text_id_tensor_name="text_ids",
utterance_cnt_tensor_name="utterance_cnt"):
self._split_level = split_level
self._delimiter = delimiter
self._bos_token = bos_token
self._eos_token = eos_token
self._max_seq_length = max_seq_length
self._token_to_id_map = token_to_id_map
self._text_tensor_name = text_tensor_name
self._text_id_tensor_name = text_id_tensor_name
self._length_tensor_name = length_tensor_name
self._utterance_cnt_tensor_name = utterance_cnt_tensor_name
self._sentence_delimiter = sentence_delimiter
self._max_utterance_cnt = max_utterance_cnt
self._added_length = 0
def __call__(self, data):
outputs = self.decode(data, self.list_items())
return dict(zip(self.list_items(), outputs))
def decode(self, data, items): # pylint: disable=too-many-locals
"""Decodes the data to return the tensors specified by the list of
items.
Args:
data: The text data to decode.
items: A list of strings, each of which is the name of the resulting
tensors to retrieve.
Returns:
A list of tensors, each of which corresponds to each item. If
`token_to_id_map` is not given when constructing the decoder,
returns `None` for the token index item.
"""
sentences = tf.string_split([data],
delimiter=self._sentence_delimiter).values
# Truncate utterances
if self._max_utterance_cnt:
sentences = sentences[:self._max_utterance_cnt]
utterance_cnt = tf.shape(sentences)[0]
# Get (max) sentence length
def _get_sent_length(s):
raw_length = tf.size(
tf.string_split([s], delimiter=self._delimiter).values)
if self._max_seq_length:
return tf.minimum(raw_length, self._max_seq_length)
else:
return raw_length
raw_sent_length = tf.map_fn(
_get_sent_length, sentences, dtype=tf.int32)
sent_length = self._max_seq_length
if not sent_length:
sent_length = tf.reduce_max(raw_sent_length)
if _append_token(self._eos_token):
raw_sent_length += 1
sent_length += 1
self._added_length += 1
if _append_token(self._bos_token):
raw_sent_length += 1
sent_length += 1
self._added_length += 1
def _trunc_and_pad(s, pad_token, max_length):
if self._max_seq_length:
s = s[:self._max_seq_length]
if _append_token(self._bos_token):
s = np.append([self._bos_token], s)
if _append_token(self._eos_token):
s = np.append(s, [self._eos_token])
s = np.append(s, [pad_token]*(max_length-s.size))
return s
# Split each sentence to tokens, and pad them to a same length.
# This is necessary to treat all sentences as a single tensor.
split_sentences = tf.map_fn(
lambda s: tf.py_func(
_trunc_and_pad,
[
tf.string_split([s], delimiter=self._delimiter).values,
SpecialTokens.PAD,
sent_length
],
tf.string),
sentences, dtype=tf.string
)
split_sentences = tf.reshape(split_sentences,
[utterance_cnt, sent_length])
# Map to index
token_ids = None
if self._token_to_id_map is not None:
token_ids = self._token_to_id_map.lookup(split_sentences)
outputs = {
self._text_tensor_name: split_sentences,
self._length_tensor_name: raw_sent_length,
self._utterance_cnt_tensor_name: tf.shape(sentences)[0],
self._text_id_tensor_name: token_ids
}
return [outputs[item] for item in items]
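The `_trunc_and_pad` helper is the core of this method; a standalone sketch of the same logic on plain Python lists (`trunc_and_pad` is a hypothetical name, no TF or numpy) shows the order of operations: truncate first, then wrap in BOS/EOS, then right-pad to the fixed length.

```python
def trunc_and_pad(tokens, pad_token, max_length,
                  bos=None, eos=None, max_seq_length=None):
    """Sketch of _trunc_and_pad: truncate a token list, wrap it in
    BOS/EOS, then right-pad so every sentence has the same length."""
    if max_seq_length is not None:
        tokens = tokens[:max_seq_length]
    if bos:
        tokens = [bos] + tokens
    if eos:
        tokens = tokens + [eos]
    return tokens + [pad_token] * (max_length - len(tokens))
```

Padding all sentences of an utterance to a common length is what lets the decoder stack them into one `[utterance_cnt, sent_length]` tensor.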
def list_items(self):
"""Returns the list of item names that the decoder can produce.
Returns:
A list of strings that can be passed to :meth:`decode()`.
"""
return [
self._text_tensor_name,
self._length_tensor_name,
self._text_id_tensor_name,
self._utterance_cnt_tensor_name
]
@property
def text_tensor_name(self):
"""The name of text tensor.
"""
return self._text_tensor_name
@text_tensor_name.setter
def text_tensor_name(self, name):
self._text_tensor_name = name
@property
def utterance_cnt_tensor_name(self):
"""The name of the utterance count tensor.
"""
return self._utterance_cnt_tensor_name
@property
def length_tensor_name(self):
"""The name of length tensor.
"""
return self._length_tensor_name
@length_tensor_name.setter
def length_tensor_name(self, name):
self._length_tensor_name = name
@property
def text_id_tensor_name(self):
"""The name of text index tensor.
"""
return self._text_id_tensor_name
@text_id_tensor_name.setter
def text_id_tensor_name(self, name):
self._text_id_tensor_name = name
@property
def added_length(self):
"""The added text length due to appended bos and eos tokens.
"""
return self._added_length
================================================
FILE: texar_repo/texar/data/data_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various utilities specific to data processing.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import sys
import tarfile
import zipfile
import collections
import numpy as np
from six.moves import urllib
import requests
import tensorflow as tf
from texar.utils import utils_io
# pylint: disable=invalid-name, too-many-branches
__all__ = [
"maybe_download",
"read_words",
"make_vocab",
"count_file_lines"
]
Py3 = sys.version_info[0] == 3
def maybe_download(urls, path, filenames=None, extract=False):
"""Downloads a set of files.
Args:
urls: A (list of) urls to download files.
path (str): The destination path to save the files.
filenames: A (list of) strings of the file names. If given,
must have the same length with :attr:`urls`. If `None`,
filenames are extracted from :attr:`urls`.
extract (bool): Whether to extract compressed files.
Returns:
A list of paths to the downloaded files.
"""
utils_io.maybe_create_dir(path)
if not isinstance(urls, (list, tuple)):
urls = [urls]
if filenames is not None:
if not isinstance(filenames, (list, tuple)):
filenames = [filenames]
if len(urls) != len(filenames):
raise ValueError(
'`filenames` must have the same number of elements as `urls`.')
result = []
for i, url in enumerate(urls):
if filenames is not None:
filename = filenames[i]
elif 'drive.google.com' in url:
filename = _extract_google_drive_file_id(url)
else:
filename = url.split('/')[-1]
# If downloading from GitHub, remove suffix ?raw=True
# from local filename
if filename.endswith("?raw=true"):
filename = filename[:-9]
filepath = os.path.join(path, filename)
result.append(filepath)
if not tf.gfile.Exists(filepath):
if 'drive.google.com' in url:
filepath = _download_from_google_drive(url, filename, path)
else:
filepath = _download(url, filename, path)
if extract:
tf.logging.info('Extract %s', filepath)
if tarfile.is_tarfile(filepath):
with tarfile.open(filepath, 'r') as tfile:
tfile.extractall(path)
elif zipfile.is_zipfile(filepath):
with zipfile.ZipFile(filepath) as zfile:
zfile.extractall(path)
else:
tf.logging.info("Unknown compression type. Only .tar.gz, "
".tar.bz2, .tar, and .zip are supported")
return result
def _download(url, filename, path):
def _progress(count, block_size, total_size):
percent = float(count * block_size) / float(total_size) * 100.
# pylint: disable=cell-var-from-loop
sys.stdout.write('\r>> Downloading %s %.1f%%' %
(filename, percent))
sys.stdout.flush()
filepath = os.path.join(path, filename)
filepath, _ = urllib.request.urlretrieve(url, filepath, _progress)
print()
statinfo = os.stat(filepath)
print('Successfully downloaded {} {} bytes.'.format(
filename, statinfo.st_size))
return filepath
def _extract_google_drive_file_id(url):
# id is between `/d/` and '/'
url_suffix = url[url.find('/d/')+3:]
file_id = url_suffix[:url_suffix.find('/')]
return file_id
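The filename-derivation logic of `maybe_download` can be condensed into one hedged sketch (`derive_filename` is an illustrative name, not Texar API): Google Drive URLs yield the file id between `/d/` and the next `/`, other URLs yield the last path component with any `?raw=true` suffix stripped.

```python
def derive_filename(url):
    """Sketch of how maybe_download picks a local filename when no
    explicit filename is given."""
    if 'drive.google.com' in url:
        # file id sits between '/d/' and the following '/'
        suffix = url[url.find('/d/') + 3:]
        return suffix[:suffix.find('/')]
    name = url.split('/')[-1]
    if name.endswith('?raw=true'):
        # GitHub raw links carry this suffix; strip it locally
        name = name[:-len('?raw=true')]
    return name
```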
def _download_from_google_drive(url, filename, path):
"""Adapted from `https://github.com/saurabhshri/gdrive-downloader`
"""
def _get_confirm_token(response):
for key, value in response.cookies.items():
if key.startswith('download_warning'):
return value
return None
file_id = _extract_google_drive_file_id(url)
gurl = "https://docs.google.com/uc?export=download"
sess = requests.Session()
response = sess.get(gurl, params={'id': file_id}, stream=True)
token = _get_confirm_token(response)
if token:
params = {'id': file_id, 'confirm': token}
response = sess.get(gurl, params=params, stream=True)
filepath = os.path.join(path, filename)
CHUNK_SIZE = 32768
with tf.gfile.GFile(filepath, "wb") as f:
for chunk in response.iter_content(CHUNK_SIZE):
if chunk:
f.write(chunk)
print('Successfully downloaded {}.'.format(filename))
return filepath
def read_words(filename, newline_token=None):
"""Reads word from a file.
Args:
filename (str): Path to the file.
newline_token (str, optional): The token to replace the original newline
token "\\\\n". For example,
`newline_token=tx.data.SpecialTokens.EOS`.
If `None`, no replacement is performed.
Returns:
A list of words.
"""
with tf.gfile.GFile(filename, "r") as f:
if Py3:
if newline_token is None:
return f.read().split()
else:
return f.read().replace("\n", newline_token).split()
else:
if newline_token is None:
return f.read().decode("utf-8").split()
else:
return (f.read().decode("utf-8")
.replace("\n", newline_token).split())
def make_vocab(filenames, max_vocab_size=-1, newline_token=None,
return_type="list", return_count=False):
"""Builds vocab of the files.
Args:
filenames (str): A (list of) files.
max_vocab_size (int): Maximum size of the vocabulary. Low-frequency
words exceeding the limit will be discarded.
Set to `-1` (default) if no truncation is wanted.
newline_token (str, optional): The token to replace the original newline
token "\\\\n". For example,
`newline_token=tx.data.SpecialTokens.EOS`.
If `None`, no replacement is performed.
return_type (str): Either "list" or "dict". If "list" (default), this
function returns a list of words sorted by frequency. If "dict",
this function returns a dict mapping words to their index sorted
by frequency.
return_count (bool): Whether to return word counts. If `True` and
:attr:`return_type` is "dict", then a count dict is returned, which
is a mapping from words to their frequency.
Returns:
- If :attr:`return_count` is False, returns a list or dict containing \
the vocabulary words.
- If :attr:`return_count` is True, returns a pair of list or dict \
`(a, b)`, where `a` is a list or dict containing the vocabulary \
words, and `b` is a list or dict containing the word counts.
"""
if not isinstance(filenames, (list, tuple)):
filenames = [filenames]
words = []
for fn in filenames:
words += read_words(fn, newline_token=newline_token)
counter = collections.Counter(words)
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
words, counts = list(zip(*count_pairs))
if max_vocab_size >= 0:
words = words[:max_vocab_size]
counts = counts[:max_vocab_size]
if return_type == "list":
if not return_count:
return words
else:
return words, counts
elif return_type == "dict":
word_to_id = dict(zip(words, range(len(words))))
if not return_count:
return word_to_id
else:
word_to_count = dict(zip(words, counts))
return word_to_id, word_to_count
else:
raise ValueError("Unknown return_type: {}".format(return_type))
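The counting-and-sorting core of `make_vocab` can be exercised in isolation; `make_vocab_from_words` below is a hypothetical helper that reproduces the `(-count, word)` sort key, which orders by descending frequency and breaks ties alphabetically.

```python
import collections

def make_vocab_from_words(words, max_vocab_size=-1):
    """Sketch of the core of make_vocab: count words, sort by descending
    frequency with ties broken alphabetically, optionally truncate."""
    counter = collections.Counter(words)
    # (-count, word) sorts frequent words first, ties alphabetically
    pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    words, counts = zip(*pairs)
    if max_vocab_size >= 0:
        words, counts = words[:max_vocab_size], counts[:max_vocab_size]
    return list(words), list(counts)
```

The deterministic tie-break matters: it keeps vocab files reproducible across runs, so saved token ids remain valid.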
def count_file_lines(filenames):
"""Counts the number of lines in the file(s).
"""
def _count_lines(fn):
with open(fn, "rb") as f:
i = -1
for i, _ in enumerate(f):
pass
return i + 1
if not isinstance(filenames, (list, tuple)):
filenames = [filenames]
num_lines = np.sum([_count_lines(fn) for fn in filenames])
return num_lines
================================================
FILE: texar_repo/texar/data/data_utils_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for data utils.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tempfile
import tensorflow as tf
from texar.data import data_utils
class CountFileLinesTest(tf.test.TestCase):
"""Tests :func:`texar.data.data_utils.count_file_lines`.
"""
def test_load_glove(self):
"""Tests the load_glove function.
"""
file_1 = tempfile.NamedTemporaryFile(mode="w+")
num_lines = data_utils.count_file_lines(file_1.name)
self.assertEqual(num_lines, 0)
file_2 = tempfile.NamedTemporaryFile(mode="w+")
file_2.write('\n'.join(['x']*5))
file_2.flush()
num_lines = data_utils.count_file_lines(
[file_1.name, file_2.name, file_2.name])
self.assertEqual(num_lines, 0+5+5)
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/data/embedding.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Helper functions and classes for embedding processing.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from tensorflow import gfile
import numpy as np
from texar.utils import utils
from texar.hyperparams import HParams
__all__ = [
"load_word2vec",
"load_glove",
"Embedding"
]
def load_word2vec(filename, vocab, word_vecs):
"""Loads embeddings in the word2vec binary format which has a header line
containing the number of vectors and their dimensionality (two integers),
followed with number-of-vectors lines each of which is formatted as
'<word-string> <embedding-vector>'.
Args:
filename (str): Path to the embedding file.
vocab (dict): A dictionary that maps token strings to integer index.
Tokens not in :attr:`vocab` are not read.
word_vecs: A 2D numpy array of shape `[vocab_size, embed_dim]`
which is updated as reading from the file.
Returns:
The updated :attr:`word_vecs`.
"""
with gfile.GFile(filename, "rb") as fin:
header = fin.readline()
vocab_size, vector_size = [int(s) for s in header.split()]
if vector_size != word_vecs.shape[1]:
raise ValueError("Inconsistent word vector sizes: %d vs %d" %
(vector_size, word_vecs.shape[1]))
binary_len = np.dtype('float32').itemsize * vector_size
for _ in np.arange(vocab_size):
chars = []
while True:
char = fin.read(1)
if char == b' ':
break
if char != b'\n':
chars.append(char)
word = b''.join(chars)
word = tf.compat.as_text(word)
if word in vocab:
word_vecs[vocab[word]] = np.fromstring(
fin.read(binary_len), dtype='float32')
else:
fin.read(binary_len)
return word_vecs
def load_glove(filename, vocab, word_vecs):
"""Loads embeddings in the glove text format in which each line is
'<word-string> <embedding-vector>'. Dimensions of the embedding vector
are separated with whitespace characters.
Args:
filename (str): Path to the embedding file.
vocab (dict): A dictionary that maps token strings to integer index.
Tokens not in :attr:`vocab` are not read.
word_vecs: A 2D numpy array of shape `[vocab_size, embed_dim]`
which is updated as reading from the file.
Returns:
The updated :attr:`word_vecs`.
"""
with gfile.GFile(filename) as fin:
for line in fin:
vec = line.strip().split()
if len(vec) == 0:
continue
word, vec = vec[0], vec[1:]
word = tf.compat.as_text(word)
if word not in vocab:
continue
if len(vec) != word_vecs.shape[1]:
raise ValueError("Inconsistent word vector sizes: %d vs %d" %
(len(vec), word_vecs.shape[1]))
word_vecs[vocab[word]] = np.array([float(v) for v in vec])
return word_vecs
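The parsing loop of `load_glove` can be sketched without numpy or TF; `load_glove_lines` (a hypothetical name) works on in-memory lines and plain lists but keeps the same behavior: blank lines and out-of-vocab words are skipped, and a wrong vector length raises `ValueError`.

```python
def load_glove_lines(lines, vocab, dim):
    """Sketch of load_glove's parsing loop on in-memory lines, where
    each line is a word followed by `dim` whitespace-separated floats."""
    word_vecs = {w: [0.0] * dim for w in vocab}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue                      # skip blank lines
        word, vec = parts[0], parts[1:]
        if word not in vocab:
            continue                      # only read in-vocab words
        if len(vec) != dim:
            raise ValueError("Inconsistent word vector sizes")
        word_vecs[word] = [float(v) for v in vec]
    return word_vecs
```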
class Embedding(object):
"""Embedding class that loads token embedding vectors from file. Token
embeddings not in the embedding file are initialized as specified in
:attr:`hparams`.
Args:
vocab (dict): A dictionary that maps token strings to integer index.
read_fn: Callable that takes `(filename, vocab, word_vecs)` and
returns the updated `word_vecs`. E.g.,
:func:`~texar.data.embedding.load_word2vec` and
:func:`~texar.data.embedding.load_glove`.
"""
def __init__(self, vocab, hparams=None):
self._hparams = HParams(hparams, self.default_hparams())
# Initialize embeddings
init_fn_kwargs = self._hparams.init_fn.kwargs.todict()
if "shape" in init_fn_kwargs or "size" in init_fn_kwargs:
raise ValueError("Argument 'shape' or 'size' must not be "
"specified. They are inferred automatically.")
init_fn = utils.get_function(
self._hparams.init_fn.type,
["numpy.random", "numpy", "texar.custom"])
try:
self._word_vecs = init_fn(size=[len(vocab), self._hparams.dim],
**init_fn_kwargs)
except TypeError:
self._word_vecs = init_fn(shape=[len(vocab), self._hparams.dim],
**init_fn_kwargs)
# Optionally read embeddings from file
if self._hparams.file is not None and self._hparams.file != "":
read_fn = utils.get_function(
self._hparams.read_fn,
["texar.data.embedding", "texar.data", "texar.custom"])
self._word_vecs = \
read_fn(self._hparams.file, vocab, self._word_vecs)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values:
.. role:: python(code)
:language: python
.. code-block:: python
{
"file": "",
"dim": 50,
"read_fn": "load_word2vec",
"init_fn": {
"type": "numpy.random.uniform",
"kwargs": {
"low": -0.1,
"high": 0.1,
}
},
}
Here:
"file" : str
Path to the embedding file. If not provided, all embeddings are
initialized with the initialization function.
"dim": int
Dimension size of each embedding vector
"read_fn" : str or callable
Function to read the embedding file. This can be the function,
or its string name or full module path. E.g.,
.. code-block:: python
"read_fn": texar.data.load_word2vec
"read_fn": "load_word2vec"
"read_fn": "texar.data.load_word2vec"
"read_fn": "my_module.my_read_fn"
If function string name is used, the function must be in
one of the modules: :mod:`texar.data` or :mod:`texar.custom`.
The function must have the same signature as with
:func:`load_word2vec`.
"init_fn" : dict
Hyperparameters of the initialization function used to initialize
embedding of tokens missing in the embedding
file.
The function must accept argument named `size` or `shape` to
specify the output shape, and return a numpy array of the shape.
The `dict` has the following fields:
"type" : str or callable
The initialization function. Can be either the function,
or its string name or full module path.
"kwargs" : dict
Keyword arguments for calling the function. The function
is called with :python:`init_fn(size=[.., ..], **kwargs)`.
"""
return {
"file": "",
"dim": 50,
"read_fn": "load_word2vec",
"init_fn": {
"type": "numpy.random.uniform",
"kwargs": {
"low": -0.1,
"high": 0.1,
},
},
"@no_typecheck": ["read_fn", "init_fn"]
}
@property
def word_vecs(self):
"""2D numpy array of shape `[vocab_size, embedding_dim]`.
"""
return self._word_vecs
@property
def vector_size(self):
"""The embedding dimention size.
"""
return self._hparams.dim
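The two-phase setup in `Embedding.__init__` — random initialization for every token, followed by selective overwrite from a pretrained file — can be sketched with a plain dict; `init_embeddings` is an illustrative stand-in, not Texar API.

```python
import random

def init_embeddings(vocab, dim, low=-0.1, high=0.1, pretrained=None):
    """Sketch of Embedding's setup: initialize each token uniformly in
    [low, high] (the default init_fn), then overwrite tokens present in
    the optional pretrained vectors (the read_fn step)."""
    word_vecs = {w: [random.uniform(low, high) for _ in range(dim)]
                 for w in vocab}
    if pretrained:
        for w, vec in pretrained.items():
            if w in word_vecs:          # ignore words outside the vocab
                word_vecs[w] = list(vec)
    return word_vecs
```

Tokens missing from the pretrained file keep their random vectors, which is exactly why the initialization step runs first.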
================================================
FILE: texar_repo/texar/data/embedding_test.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Unit tests for embedding related operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import sys
import tempfile
import numpy as np
import tensorflow as tf
from texar.data import embedding
Py3 = sys.version_info[0] == 3 # pylint: disable=invalid-name
class EmbeddingTest(tf.test.TestCase):
"""Tests embedding related operations.
"""
def test_load_glove(self):
"""Tests the load_glove function.
"""
word_vec_lines = ["word 1.2 3.4 5.6", "词 1. 3. 5."]
glove_file = tempfile.NamedTemporaryFile(mode="w+")
if Py3:
glove_file.write('\n'.join(word_vec_lines))
else:
glove_file.write('\n'.join(word_vec_lines).encode("utf-8"))
glove_file.flush()
vocab = {"word": 0, "词": 1}
word_vecs = np.zeros([2, 3])
word_vecs = embedding.load_glove(glove_file.name, vocab, word_vecs)
self.assertEqual(word_vecs.shape[0], 2)
self.assertEqual(word_vecs.shape[1], 3)
np.testing.assert_array_equal(word_vecs[0], [1.2, 3.4, 5.6])
np.testing.assert_array_equal(word_vecs[1], [1., 3., 5.])
def test_load_word2vec(self):
"""Tests the load_word2vec function.
"""
header = "2 3"
words = ["word", "词"]
vec = np.array([1.2, 3.4, 5.6], dtype='float32')
w2v_file = tempfile.NamedTemporaryFile()
w2v_file.write(tf.compat.as_bytes(header + "\n"))
for word in words:
w2v_file.write(tf.compat.as_bytes(word + " "))
w2v_file.write(vec.tostring() + b'\n')
w2v_file.flush()
vocab = {"word": 0, "词": 1}
word_vecs = np.zeros([2, 3])
word_vecs = embedding.load_word2vec(w2v_file.name, vocab, word_vecs)
self.assertEqual(word_vecs.shape[0], 2)
self.assertEqual(word_vecs.shape[1], 3)
np.testing.assert_array_equal(word_vecs[0], vec)
np.testing.assert_array_equal(word_vecs[1], vec)
def test_embedding(self):
"""Tests :class:`texar.data.embedding.Embedding`.
"""
vocab = {"word": 0, "词": 1}
emb = embedding.Embedding(vocab)
self.assertEqual(len(emb.word_vecs), len(vocab))
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/data/vocabulary.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Helper functions and classes for vocabulary processing.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import warnings
from collections import defaultdict
import tensorflow as tf
from tensorflow import gfile
import numpy as np
from texar.utils.utils import dict_lookup
# pylint: disable=too-few-public-methods, invalid-name
# pylint: disable=too-many-instance-attributes, too-many-arguments
__all__ = [
"SpecialTokens",
"Vocab"
]
class SpecialTokens(object):
"""Special tokens, including :attr:`PAD`, :attr:`BOS`, :attr:`EOS`,
:attr:`UNK`. These tokens will by default have token ids 0, 1, 2, 3,
respectively.
"""
PAD = ""
BOS = ""
EOS = ""
UNK = ""
def _make_defaultdict(keys, values, default_value):
"""Creates a python defaultdict.
Args:
keys (list): Keys of the dictionary.
values (list): Values correspond to keys. The two lists :attr:`keys` and
:attr:`values` must be of the same length.
default_value: default value returned when key is missing.
Returns:
defaultdict: A python `defaultdict` instance that maps keys to values.
"""
dict_ = defaultdict(lambda: default_value)
for k, v in zip(keys, values):
dict_[k] = v
return dict_
class Vocab(object):
"""Vocabulary class that loads vocabulary from file, and maintains mapping
tables between token strings and indexes.
Each line of the vocab file should contain one vocabulary token, e.g.,::
vocab_token_1
vocab token 2
vocab token | 3 .
...
Args:
filename (str): Path to the vocabulary file where each line contains
one token.
bos_token (str): A special token that will be added to the beginning of
sequences.
eos_token (str): A special token that will be added to the end of
sequences.
unk_token (str): A special token that will replace all unknown tokens
(tokens not included in the vocabulary).
pad_token (str): A special token that is used to do padding.
"""
def __init__(self,
filename,
pad_token=SpecialTokens.PAD,
bos_token=SpecialTokens.BOS,
eos_token=SpecialTokens.EOS,
unk_token=SpecialTokens.UNK):
self._filename = filename
self._pad_token = pad_token
self._bos_token = bos_token
self._eos_token = eos_token
self._unk_token = unk_token
self._id_to_token_map, self._token_to_id_map, \
self._id_to_token_map_py, self._token_to_id_map_py = \
self.load(self._filename)
def load(self, filename):
"""Loads the vocabulary from the file.
Args:
filename (str): Path to the vocabulary file.
Returns:
A tuple of TF and python mapping tables between word string and
index, (:attr:`id_to_token_map`, :attr:`token_to_id_map`,
:attr:`id_to_token_map_py`, :attr:`token_to_id_map_py`), where
:attr:`id_to_token_map` and :attr:`token_to_id_map` are
TF :tf_main:`HashTable ` instances,
and :attr:`id_to_token_map_py` and
:attr:`token_to_id_map_py` are python `defaultdict` instances.
"""
with gfile.GFile(filename) as vocab_file:
# Converts to 'unicode' (Python 2) or 'str' (Python 3)
vocab = list(tf.compat.as_text(line.strip()) for line in vocab_file)
warnings.simplefilter("ignore", UnicodeWarning)
if self._bos_token in vocab:
raise ValueError("Special begin-of-seq token already exists in the "
"vocabulary: '%s'" % self._bos_token)
if self._eos_token in vocab:
raise ValueError("Special end-of-seq token already exists in the "
"vocabulary: '%s'" % self._eos_token)
if self._unk_token in vocab:
raise ValueError("Special UNK token already exists in the "
"vocabulary: '%s'" % self._unk_token)
if self._pad_token in vocab:
raise ValueError("Special padding token already exists in the "
"vocabulary: '%s'" % self._pad_token)
warnings.simplefilter("default", UnicodeWarning)
# Places _pad_token at the beginning to make sure it takes index 0.
vocab = [self._pad_token, self._bos_token, self._eos_token,
self._unk_token] + vocab
# Must make sure this is consistent with the above line
unk_token_idx = 3
vocab_size = len(vocab)
vocab_idx = np.arange(vocab_size)
# Creates TF maps
id_to_token_map = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
vocab_idx, vocab, key_dtype=tf.int64, value_dtype=tf.string),
self._unk_token)
token_to_id_map = tf.contrib.lookup.HashTable(
tf.contrib.lookup.KeyValueTensorInitializer(
vocab, vocab_idx, key_dtype=tf.string, value_dtype=tf.int64),
unk_token_idx)
# Creates python maps to interface with python code
id_to_token_map_py = _make_defaultdict(
vocab_idx, vocab, self._unk_token)
token_to_id_map_py = _make_defaultdict(
vocab, vocab_idx, unk_token_idx)
return id_to_token_map, token_to_id_map, \
id_to_token_map_py, token_to_id_map_py
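The map construction in `load` can be condensed into a pure-Python sketch; `build_vocab_maps` is a hypothetical helper, and the `<PAD>`/`<BOS>`/`<EOS>`/`<UNK>` strings are assumed defaults. Prepending the four special tokens pins `<PAD>` to id 0 and `<UNK>` to id 3, and `defaultdict` supplies the UNK fallback that the TF `HashTable`s implement via their default values.

```python
from collections import defaultdict

def build_vocab_maps(tokens, pad="<PAD>", bos="<BOS>",
                     eos="<EOS>", unk="<UNK>"):
    """Sketch of Vocab.load's map construction: special tokens are
    prepended so <PAD> gets id 0, and unknown lookups fall back to
    the <UNK> id / string via defaultdict."""
    vocab = [pad, bos, eos, unk] + list(tokens)
    unk_id = 3                            # must match the list above
    token_to_id = defaultdict(lambda: unk_id,
                              {t: i for i, t in enumerate(vocab)})
    id_to_token = defaultdict(lambda: unk,
                              {i: t for i, t in enumerate(vocab)})
    return token_to_id, id_to_token
```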
def map_ids_to_tokens(self, ids):
"""Maps ids into text tokens.
The returned tokens are a Tensor.
Args:
ids: An `int` tensor of token ids.
Returns:
A tensor of text tokens of the same shape.
"""
return self.id_to_token_map.lookup(tf.to_int64(ids))
def map_tokens_to_ids(self, tokens):
"""Maps text tokens into ids.
The returned ids are a Tensor.
Args:
tokens: A tensor of text tokens.
Returns:
A tensor of token ids of the same shape.
"""
return self.token_to_id_map.lookup(tokens)
def map_ids_to_tokens_py(self, ids):
"""Maps ids into text tokens.
The input :attr:`ids` and returned tokens are both python
arrays or list.
Args:
ids: An `int` numpy array or (possibly nested) list of token ids.
Returns:
A numpy array of text tokens of the same shape as :attr:`ids`.
"""
return dict_lookup(self.id_to_token_map_py, ids, self.unk_token)
def map_tokens_to_ids_py(self, tokens):
"""Maps text tokens into ids.
The input :attr:`tokens` and returned ids are both python
arrays or list.
Args:
tokens: A numpy array or (possibly nested) list of text tokens.
Returns:
A numpy array of token ids of the same shape as :attr:`tokens`.
"""
return dict_lookup(self.token_to_id_map_py, tokens, self.unk_token_id)
@property
def id_to_token_map(self):
"""The :tf_main:`HashTable `instance that
maps from token index to the string form.
"""
return self._id_to_token_map
@property
def token_to_id_map(self):
"""The :tf_main:`HashTable ` instance
that maps from token string to the index.
"""
return self._token_to_id_map
@property
def id_to_token_map_py(self):
"""The python `defaultdict` instance that maps from token index to the
string form.
"""
return self._id_to_token_map_py
@property
def token_to_id_map_py(self):
"""The python `defaultdict` instance that maps from token string to the
index.
"""
return self._token_to_id_map_py
@property
def size(self):
"""The vocabulary size.
"""
return len(self.token_to_id_map_py)
@property
def bos_token(self):
"""A string of the special token indicating the beginning of sequence.
"""
return self._bos_token
@property
def bos_token_id(self):
"""The `int` index of the special token indicating the beginning
of sequence.
"""
return self.token_to_id_map_py[self._bos_token]
@property
def eos_token(self):
"""A string of the special token indicating the end of sequence.
"""
return self._eos_token
@property
def eos_token_id(self):
"""The `int` index of the special token indicating the end
of sequence.
"""
return self.token_to_id_map_py[self._eos_token]
@property
def unk_token(self):
"""A string of the special token indicating unknown token.
"""
return self._unk_token
@property
def unk_token_id(self):
"""The `int` index of the special token indicating unknown token.
"""
return self.token_to_id_map_py[self._unk_token]
@property
def pad_token(self):
"""A string of the special token indicating padding token. The
default padding token is an empty string.
"""
return self._pad_token
@property
def pad_token_id(self):
"""The `int` index of the special token indicating padding token.
"""
return self.token_to_id_map_py[self._pad_token]
@property
def special_tokens(self):
"""The list of special tokens
[:attr:`pad_token`, :attr:`bos_token`, :attr:`eos_token`,
:attr:`unk_token`].
"""
return [self._pad_token, self._bos_token, self._eos_token,
self._unk_token]
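As a standalone sketch of the `*_map_py` lookups above: the python-side maps are plain `defaultdict`s whose default value is the `<UNK>` id, so any unseen token falls back to the unknown token. The token order below is illustrative only; the real `Vocab` class derives ids from the vocabulary file plus the special tokens.

```python
from collections import defaultdict

# Hypothetical mini-vocabulary: special tokens first, then regular words.
special_tokens = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"]
words = ["word", "词"]
tokens = special_tokens + words

unk_id = tokens.index("<UNK>")
# Unseen tokens map to the <UNK> id, mirroring token_to_id_map_py.
token_to_id = defaultdict(lambda: unk_id,
                          {tok: i for i, tok in enumerate(tokens)})

print(token_to_id["word"])    # 4
print(token_to_id["unseen"])  # 3  (falls back to <UNK>)
```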
================================================
FILE: texar_repo/texar/data/vocabulary_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for vocabulary related operations.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tempfile
import tensorflow as tf
from texar.data import vocabulary
# pylint: disable=protected-access
class VocabularyTest(tf.test.TestCase):
"""Tests vocabulary related operations.
"""
def test_make_defaultdict(self):
"""Tests the _make_defaultdict function.
"""
keys = ['word', '词']
values = [0, 1]
default_value = -1
dict_ = vocabulary._make_defaultdict(keys, values, default_value)
self.assertEqual(len(dict_), 2)
self.assertEqual(dict_['word'], 0)
self.assertEqual(dict_['词'], 1)
self.assertEqual(dict_['sth_else'], -1)
def test_vocab_construction(self):
"""Test vocabulary construction.
"""
vocab_list = ['word', '词']
vocab_file = tempfile.NamedTemporaryFile()
vocab_file.write('\n'.join(vocab_list).encode("utf-8"))
vocab_file.flush()
vocab = vocabulary.Vocab(vocab_file.name)
self.assertEqual(vocab.size, len(vocab_list) + 4)
self.assertEqual(
set(vocab.token_to_id_map_py.keys()),
set(['word', '词'] + vocab.special_tokens))
# Tests UNK token
unk_token_id = vocab.token_to_id_map_py['new']
unk_token_text = vocab.id_to_token_map_py[unk_token_id]
self.assertEqual(unk_token_text, vocab.unk_token)
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/evals/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library evals.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.evals.bleu_moses import *
from texar.evals.bleu import *
from texar.evals.metrics import *
================================================
FILE: texar_repo/texar/evals/bleu.py
================================================
# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Modifications copyright (C) 2018 Texar
# ==============================================================================
"""
Python implementation of BLEU and smoothed BLEU adapted from:
`https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py`
This module provides a Python implementation of BLEU and smoothed BLEU.
Smooth BLEU is computed following the method outlined in the paper:
(Lin et al. 2004) ORANGE: a method for evaluating automatic evaluation
metrics for machine translation.
Chin-Yew Lin, Franz Josef Och. COLING 2004.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
import collections
import math
from texar.utils.dtypes import compat_as_text, is_str
# pylint: disable=invalid-name, too-many-branches, too-many-locals
# pylint: disable=too-many-arguments
__all__ = [
"sentence_bleu",
"corpus_bleu"
]
def _get_ngrams(segment, max_order):
"""Extracts all n-grams up to a given maximum order from an input segment.
Args:
segment: text segment from which n-grams will be extracted.
max_order: maximum length in tokens of the n-grams returned by this
method.
Returns:
The Counter containing all n-grams up to max_order in segment
with a count of how many times each n-gram occurred.
"""
ngram_counts = collections.Counter()
for order in range(1, max_order + 1):
for i in range(0, len(segment) - order + 1):
ngram = tuple(segment[i:i+order])
ngram_counts[ngram] += 1
return ngram_counts
def _maybe_str_to_list(list_or_str):
if is_str(list_or_str):
return list_or_str.split()
return list_or_str
def _lowercase(str_list):
return [str_.lower() for str_ in str_list]
def sentence_bleu(references, hypothesis, max_order=4, lowercase=False,
smooth=False, return_all=False):
"""Calculates BLEU score of a hypothesis sentence.
Args:
references: A list of references for the hypothesis.
Each reference can be either a list of string tokens, or a string
containing tokenized tokens separated with whitespaces.
List can also be numpy array.
hypothesis: A hypothesis sentence.
Each hypothesis can be either a list of string tokens, or a
string containing tokenized tokens separated with whitespaces.
List can also be numpy array.
lowercase (bool): If `True`, lowercase reference and hypothesis tokens.
max_order (int): Maximum n-gram order to use when computing BLEU score.
smooth (bool): Whether or not to apply (Lin et al. 2004) smoothing.
return_all (bool): If `True`, returns BLEU and all n-gram precisions.
Returns:
If :attr:`return_all` is `False` (default), returns a float32
BLEU score.
If :attr:`return_all` is `True`, returns a list of float32 scores:
`[BLEU] + n-gram precisions`, which is of length :attr:`max_order`+1.
"""
return corpus_bleu(
[references], [hypothesis], max_order=max_order, lowercase=lowercase,
smooth=smooth, return_all=return_all)
def corpus_bleu(list_of_references, hypotheses, max_order=4, lowercase=False,
smooth=False, return_all=True):
"""Computes corpus-level BLEU score.
Args:
list_of_references: A list of lists of references for each hypothesis.
Each reference can be either a list of string tokens, or a string
containing tokenized tokens separated with whitespaces.
List can also be numpy array.
hypotheses: A list of hypothesis sentences.
Each hypothesis can be either a list of string tokens, or a
string containing tokenized tokens separated with whitespaces.
List can also be numpy array.
lowercase (bool): If `True`, lowercase reference and hypothesis tokens.
max_order (int): Maximum n-gram order to use when computing BLEU score.
smooth (bool): Whether or not to apply (Lin et al. 2004) smoothing.
return_all (bool): If `True`, returns BLEU and all n-gram precisions.
Returns:
If :attr:`return_all` is `False`, returns a float32
BLEU score.
If :attr:`return_all` is `True` (default), returns a list of float32
scores: `[BLEU] + n-gram precisions`, which is of length
:attr:`max_order` + 1.
"""
list_of_references = compat_as_text(list_of_references)
hypotheses = compat_as_text(hypotheses)
matches_by_order = [0] * max_order
possible_matches_by_order = [0] * max_order
reference_length = 0
hypothesis_length = 0
for (references, hypothesis) in zip(list_of_references, hypotheses):
reference_length += min(len(r) for r in references)
hypothesis_length += len(hypothesis)
merged_ref_ngram_counts = collections.Counter()
for reference in references:
reference = _maybe_str_to_list(reference)
if lowercase:
reference = _lowercase(reference)
merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
hypothesis = _maybe_str_to_list(hypothesis)
if lowercase:
hypothesis = _lowercase(hypothesis)
hypothesis_ngram_counts = _get_ngrams(hypothesis, max_order)
overlap = hypothesis_ngram_counts & merged_ref_ngram_counts
for ngram in overlap:
matches_by_order[len(ngram)-1] += overlap[ngram]
for order in range(1, max_order+1):
possible_matches = len(hypothesis) - order + 1
if possible_matches > 0:
possible_matches_by_order[order-1] += possible_matches
precisions = [0] * max_order
for i in range(0, max_order):
if smooth:
precisions[i] = ((matches_by_order[i] + 1.) /
(possible_matches_by_order[i] + 1.))
else:
if possible_matches_by_order[i] > 0:
precisions[i] = (float(matches_by_order[i]) /
possible_matches_by_order[i])
else:
precisions[i] = 0.0
if min(precisions) > 0:
p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
geo_mean = math.exp(p_log_sum)
else:
geo_mean = 0
ratio = float(hypothesis_length) / reference_length
if ratio > 1.0:
bp = 1.
else:
try:
bp = math.exp(1 - 1. / ratio)
except ZeroDivisionError:
bp = math.exp(1 - 1. / (ratio + 1e-8))
bleu = geo_mean * bp
if return_all:
return [bleu * 100] + [p * 100 for p in precisions]
else:
return bleu * 100
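To make the clipped n-gram matching in `corpus_bleu` concrete, here is a standalone sketch of the counting step: a re-implementation of `_get_ngrams` plus the `Counter` intersection used for clipping, independent of the texar package. The sentences are made up for illustration.

```python
import collections

def get_ngrams(segment, max_order):
    # Mirrors _get_ngrams above: count all n-grams up to max_order.
    counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            counts[tuple(segment[i:i + order])] += 1
    return counts

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()

# Clipped matches: intersecting Counters takes the per-ngram minimum,
# so a hypothesis n-gram cannot be credited more times than it
# appears in the reference.
overlap = get_ngrams(hyp, 2) & get_ngrams(ref, 2)
unigram_matches = sum(c for ng, c in overlap.items() if len(ng) == 1)
bigram_matches = sum(c for ng, c in overlap.items() if len(ng) == 2)
print(unigram_matches, bigram_matches)  # 5 3
```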
================================================
FILE: texar_repo/texar/evals/bleu_moses.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The BLEU metric.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
import os
from io import open # pylint: disable=redefined-builtin
import shutil
import re
import subprocess
import tempfile
import numpy as np
import tensorflow as tf
from texar.utils.dtypes import compat_as_text
# pylint: disable=too-many-locals, no-member, redefined-variable-type
__all__ = [
"sentence_bleu_moses",
"corpus_bleu_moses"
]
def _maybe_list_to_str(list_or_str):
if isinstance(list_or_str, (tuple, list, np.ndarray)):
return ' '.join(list_or_str)
return list_or_str
def _parse_multi_bleu_ret(bleu_str, return_all=False):
bleu_score = re.search(r"BLEU = (.+?),", bleu_str).group(1)
bleu_score = np.float32(bleu_score)
if return_all:
bleus = re.search(r", (.+?)/(.+?)/(.+?)/(.+?) ", bleu_str)
bleus = [bleus.group(group_idx) for group_idx in range(1, 5)]
bleus = [np.float32(b) for b in bleus]
bleu_score = [bleu_score] + bleus
return bleu_score
def sentence_bleu_moses(references, hypothesis, lowercase=False,
return_all=False):
"""Calculates BLEU score of a hypothesis sentence using the
**MOSES multi-bleu.perl** script.
Args:
references: A list of references for the hypothesis.
Each reference can be either a string, or a list of string tokens.
List can also be numpy array.
hypothesis: A hypothesis sentence.
The hypothesis can be either a string, or a list of string tokens.
List can also be numpy array.
lowercase (bool): If `True`, pass the "-lc" flag to the multi-bleu
script.
return_all (bool): If `True`, returns BLEU and all n-gram precisions.
Returns:
If :attr:`return_all` is `False` (default), returns a float32
BLEU score.
If :attr:`return_all` is `True`, returns a list of 5 float32 scores:
`[BLEU, 1-gram precision, ..., 4-gram precision]`.
"""
return corpus_bleu_moses(
[references], [hypothesis], lowercase=lowercase, return_all=return_all)
def corpus_bleu_moses(list_of_references, hypotheses, lowercase=False,
return_all=False):
"""Calculates corpus-level BLEU score using the
**MOSES multi-bleu.perl** script.
Args:
list_of_references: A list of lists of references for each hypothesis.
Each reference can be either a string, or a list of string tokens.
List can also be numpy array.
hypotheses: A list of hypothesis sentences.
Each hypothesis can be either a string, or a list of string tokens.
List can also be numpy array.
lowercase (bool): If `True`, pass the "-lc" flag to the multi-bleu
script.
return_all (bool): If `True`, returns BLEU and all n-gram precisions.
Returns:
If :attr:`return_all` is `False` (default), returns a float32
BLEU score.
If :attr:`return_all` is `True`, returns a list of 5 float32 scores:
`[BLEU, 1-gram precision, ..., 4-gram precision]`.
"""
list_of_references = compat_as_text(list_of_references)
hypotheses = compat_as_text(hypotheses)
if np.size(hypotheses) == 0:
return np.float32(0.) # pylint: disable=no-member
# Get multi-bleu.perl
cur_dir = os.path.dirname(os.path.realpath(__file__))
multi_bleu_path = os.path.abspath(
os.path.join(cur_dir, "..", "..", "bin", "utils", "multi-bleu.perl"))
# Create a temporary folder containing hypothesis and reference files
result_path = tempfile.mkdtemp()
# Create hypothesis file
hfile_path = os.path.join(result_path, 'hyp')
hyps = [_maybe_list_to_str(h) for h in hypotheses]
with open(hfile_path, 'w', encoding='utf-8') as hfile:
text = "\n".join(hyps)
hfile.write(text)
hfile.write("\n")
# Create reference files
max_nrefs = max([len(refs) for refs in list_of_references])
rfile_path = os.path.join(result_path, 'ref')
for rid in range(max_nrefs):
with open(rfile_path + '%d'%rid, 'w', encoding='utf-8') as rfile:
for refs in list_of_references:
if rid < len(refs):
ref = _maybe_list_to_str(refs[rid])
rfile.write(ref + "\n")
else:
rfile.write("\n")
# Calculate BLEU
multi_bleu_cmd = [multi_bleu_path]
if lowercase:
multi_bleu_cmd += ["-lc"]
multi_bleu_cmd += [rfile_path]
with open(hfile_path, "r") as hyp_input:
try:
multi_bleu_ret = subprocess.check_output(
multi_bleu_cmd, stdin=hyp_input, stderr=subprocess.STDOUT)
multi_bleu_ret = multi_bleu_ret.decode("utf-8")
bleu_score = _parse_multi_bleu_ret(multi_bleu_ret, return_all)
except subprocess.CalledProcessError as error:
if error.output is not None:
tf.logging.warning(
"multi-bleu.perl returned non-zero exit code")
tf.logging.warning(error.output)
if return_all:
bleu_score = [np.float32(0.0)] * 5
else:
bleu_score = np.float32(0.0)
shutil.rmtree(result_path)
return np.float32(bleu_score)
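The regex parsing in `_parse_multi_bleu_ret` can be exercised without running the Perl script. The sample output line below is hypothetical (its format is assumed from the regexes above, with scores borrowed from the test file), but it shows what each regex extracts:

```python
import re

# Hypothetical multi-bleu.perl output line (format assumed from the
# regexes in _parse_multi_bleu_ret above).
bleu_str = ("BLEU = 63.02, 87.5/77.3/60.0/38.9 "
            "(BP=1.000, ratio=1.042, hyp_len=25, ref_len=24)")

# Overall BLEU score comes from the "BLEU = <score>," prefix.
score = float(re.search(r"BLEU = (.+?),", bleu_str).group(1))
# The four slash-separated numbers are the 1- to 4-gram precisions.
match = re.search(r", (.+?)/(.+?)/(.+?)/(.+?) ", bleu_str)
precisions = [float(match.group(i)) for i in range(1, 5)]
print(score, precisions)  # 63.02 [87.5, 77.3, 60.0, 38.9]
```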
================================================
FILE: texar_repo/texar/evals/bleu_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for bleu.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
from texar.evals.bleu_moses import sentence_bleu_moses, corpus_bleu_moses
from texar.evals.bleu import sentence_bleu, corpus_bleu
# pylint: disable=too-many-locals, too-many-arguments
class BLEUTest(tf.test.TestCase):
"""Tests the bleu functions.
"""
def _test_sentence_bleu(self, references, hypothesis, lowercase,
true_bleu):
bleu = sentence_bleu_moses(references=references,
hypothesis=hypothesis,
lowercase=lowercase)
self.assertAlmostEqual(bleu, true_bleu, places=2)
bleu = sentence_bleu(references=references,
hypothesis=hypothesis,
lowercase=lowercase)
self.assertAlmostEqual(bleu, true_bleu, places=0)
def test_sentence_strings(self):
"""Tests hypothesis as strings.
"""
hypothesis = \
"this is a test sentence to evaluate the good bleu score . 词"
references = ["this is a test sentence to evaluate the bleu score ."]
self._test_sentence_bleu(
references, hypothesis, lowercase=False, true_bleu=67.03)
def test_sentence_list(self):
"""Tests hypothesis as a list of tokens.
"""
hypothesis = \
"this is a test sentence to evaluate the good bleu score . 词"
hypothesis = hypothesis.split()
references = ["this is a test sentence to evaluate the bleu score ."]
references = [references[0].split()]
self._test_sentence_bleu(
references, hypothesis, lowercase=False, true_bleu=67.03)
def test_sentence_multi_references(self):
"""Tests multiple references.
"""
hypothesis = \
"this is a test sentence to evaluate the good bleu score . 词"
references = ["this is a test sentence to evaluate the bleu score .",
"this is a test sentence to evaluate the good score ."]
self._test_sentence_bleu(
references, hypothesis, lowercase=False, true_bleu=76.12)
def test_sentence_numpy(self):
"""Tests with numpy format.
"""
hypothesis = \
"this is a test sentence to evaluate the good bleu score . 词"
hypothesis = np.array(hypothesis.split())
references = ["this is a test sentence to evaluate the bleu score .",
"this is a test sentence to evaluate the good score ."]
references = np.array([np.array(r.split()) for r in references])
self._test_sentence_bleu(
references, hypothesis, lowercase=False, true_bleu=76.12)
def _test_corpus_bleu(self, list_of_references, hypotheses, lowercase,
return_all, true_bleu):
bleu = corpus_bleu_moses(list_of_references=list_of_references,
hypotheses=hypotheses,
lowercase=lowercase,
return_all=return_all)
if not return_all:
self.assertAlmostEqual(bleu, true_bleu, places=2)
else:
for ret, true in zip(bleu, true_bleu):
self.assertAlmostEqual(ret, true, places=2)
bleu = corpus_bleu(list_of_references=list_of_references,
hypotheses=hypotheses,
lowercase=lowercase,
return_all=return_all)
if not return_all:
self.assertAlmostEqual(bleu, true_bleu, places=0)
else:
for ret, true in zip(bleu, true_bleu):
self.assertAlmostEqual(ret, true, places=0)
def test_corpus_strings(self):
"""Tests corpus level BLEU.
"""
hypotheses = [
"this is a test sentence to evaluate the good bleu score . 词",
"i believe that that the script is 词 perfectly correct ."
]
list_of_references = [
["this is a test sentence to evaluate the bleu score .",
"this is a test sentence to evaluate the good score ."],
["i believe that the script is perfectly correct .".split()]
]
self._test_corpus_bleu(list_of_references, hypotheses,
False, False, 63.02)
self._test_corpus_bleu(list_of_references, hypotheses,
False, True, [63.02, 87.5, 77.3, 60.0, 38.9])
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/evals/metrics.py
================================================
"""
Various metrics.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
import tensorflow as tf
__all__ = [
"accuracy",
"binary_clas_accuracy"
]
def accuracy(labels, preds):
"""Calculates the accuracy of predictions.
Args:
labels: The ground truth values. A Tensor of the same shape of
:attr:`preds`.
preds: A Tensor of any shape containing the predicted values.
Returns:
A float scalar Tensor containing the accuracy.
"""
labels = tf.cast(labels, preds.dtype)
return tf.reduce_mean(tf.to_float(tf.equal(preds, labels)))
def binary_clas_accuracy(pos_preds=None, neg_preds=None):
"""Calculates the accuracy of binary predictions.
Args:
pos_preds (optional): A Tensor of any shape containing the
predicted values on positive data (i.e., ground truth labels are
`1`).
neg_preds (optional): A Tensor of any shape containing the
predicted values on negative data (i.e., ground truth labels are
`0`).
Returns:
A float scalar Tensor containing the accuracy.
"""
pos_accu = accuracy(tf.ones_like(pos_preds), pos_preds)
neg_accu = accuracy(tf.zeros_like(neg_preds), neg_preds)
psize = tf.to_float(tf.size(pos_preds))
nsize = tf.to_float(tf.size(neg_preds))
accu = (pos_accu * psize + neg_accu * nsize) / (psize + nsize)
return accu
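A TF-free sketch of the same arithmetic, using plain Python lists instead of tensors (for illustration only; the real functions operate on Tensors inside a graph):

```python
def accuracy(labels, preds):
    # Fraction of positions where prediction equals label.
    matches = [float(l == p) for l, p in zip(labels, preds)]
    return sum(matches) / len(matches)

def binary_clas_accuracy(pos_preds, neg_preds):
    # Size-weighted average of accuracy on positive (label 1)
    # and negative (label 0) data, as in the function above.
    pos_accu = accuracy([1] * len(pos_preds), pos_preds)
    neg_accu = accuracy([0] * len(neg_preds), neg_preds)
    n_pos, n_neg = len(pos_preds), len(neg_preds)
    return (pos_accu * n_pos + neg_accu * n_neg) / (n_pos + n_neg)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))        # 0.75
print(binary_clas_accuracy([1, 1, 0, 1], [0, 0]))  # 0.833...
```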
================================================
FILE: texar_repo/texar/hyperparams.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Hyperparameter manager
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import copy
import json
from texar.utils.dtypes import is_callable
__all__ = [
"HParams"
]
def _type_name(value):
return type(value).__name__
class HParams(object):
"""A class that maintains hyperparameters for configing Texar modules.
The class has several useful features:
- **Auto-completion of missing values.** Users can specify only a subset of\
hyperparameters they care about. Other hyperparameters will automatically\
take the default values. The auto-completion performs **recursively** so \
that hyperparameters taking `dict` values will also be auto-completed. \
**All Texar modules** provide a \
:meth:`default_hparams` containing allowed hyperparameters and their \
default values. For example
.. code-block:: python
## Recursive auto-completion
default_hparams = {"a": 1, "b": {"c": 2, "d": 3}}
hparams = {"b": {"c": 22}}
hparams_ = HParams(hparams, default_hparams)
hparams_.todict() == {"a": 1, "b": {"c": 22, "d": 3}}
# "a" and "d" are auto-completed
## All Texar modules have built-in `default_hparams`
hparams = {"dropout_rate": 0.1}
emb = tx.modules.WordEmbedder(hparams=hparams, ...)
emb.hparams.todict() == {
"dropout_rate": 0.1, # provided value
"dim": 100 # default value
...
}
- **Automatic typecheck.** For most hyperparameters, the provided value must\
have the same or a compatible dtype as the default value. HParams performs\
the necessary typechecks, and raises an error if an improper dtype is\
provided. Also, hyperparameters not listed in `default_hparams` are not\
allowed, except for "kwargs" as detailed below.
- **Flexible dtype for specified hyperparameters.** Some hyperparameters\
may allow different dtypes of values.
- Hyperparameters named "type" are not typechecked.\
For example, in :func:`~texar.core.get_rnn_cell`, hyperparameter \
`"type"` can take value of an RNNCell class, its string name of module \
path, or an RNNCell class instance. (String name or module path is \
allowed so that users can specify the value in YAML config files.)
- For other hyperparameters, list them\
in the "@no_typecheck" field in `default_hparams` to skip typecheck. \
For example, in :func:`~texar.core.get_rnn_cell`, hyperparameter \
"*_keep_prob" can be set to either a `float` or a `tf.placeholder`.
- **Special flexibility of keyword argument hyperparameters.** \
Hyperparameters named "kwargs" are used as keyword arguments for a class\
constructor or a function call. Such hyperparameters take a `dict`, and \
users can add arbitrary valid keyword arguments to the dict. For example:
.. code-block:: python
default_rnn_cell_hparams = {
"type": "LSTMCell",
"kwargs": {"num_units": 256}
# Other hyperparameters
...
}
my_hparams = {
"kwargs": {
"num_units": 123,
"forget_bias": 0.0, # Other valid keyword arguments
"activation": "tf.nn.relu" # for LSTMCell constructor
}
}
_ = HParams(my_hparams, default_rnn_cell_hparams)
- **Rich interfaces.** An HParams instance provides rich interfaces for\
accessing, updating, or adding hyperparameters.
.. code-block:: python
hparams = HParams(my_hparams, default_hparams)
# Access
hparams.type == hparams["type"]
# Update
hparams.type = "GRUCell"
hparams.kwargs = { "num_units": 100 }
hparams.kwargs.num_units == 100
# Add new
hparams.add_hparam("index", 1)
hparams.index == 1
# Convert to `dict` (recursively)
type(hparams.todict()) == dict
# I/O
with open("hparams.dump", 'wb') as f:
pickle.dump(hparams, f)
with open("hparams.dump", 'rb') as f:
hparams_loaded = pickle.load(f)
Args:
hparams: A `dict` or an `HParams` instance containing hyperparameters.
If `None`, all hyperparameters are set to default values.
default_hparams (dict): Hyperparameters with default values. If `None`,
Hyperparameters are fully defined by :attr:`hparams`.
allow_new_hparam (bool): If `False` (default), :attr:`hparams` cannot
contain hyperparameters that are not included in
:attr:`default_hparams`, except for the case of :attr:`"kwargs"` as
above.
"""
# - The default hyperparameters in :attr:`"kwargs"` are used (for typecheck\
# and complementing missing hyperparameters) only when :attr:`"type"` \
# takes default value (i.e., missing in :attr:`hparams` or set to \
# the same value with the default). In this case :attr:`kwargs` allows to \
# contain new keys not included in :attr:`default_hparams["kwargs"]`.
#
# - If :attr:`"type"` is set to another \
# value and :attr:`"kwargs"` is missing in :attr:`hparams`, \
# :attr:`"kwargs"` is set to an empty dictionary.
def __init__(self, hparams, default_hparams, allow_new_hparam=False):
if isinstance(hparams, HParams):
hparams = hparams.todict()
if default_hparams is not None:
parsed_hparams = self._parse(
hparams, default_hparams, allow_new_hparam)
else:
parsed_hparams = self._parse(hparams, hparams)
super(HParams, self).__setattr__('_hparams', parsed_hparams)
@staticmethod
def _parse(hparams, # pylint: disable=too-many-branches, too-many-statements
default_hparams,
allow_new_hparam=False):
"""Parses hyperparameters.
Args:
hparams (dict): Hyperparameters. If `None`, all hyperparameters are
set to default values.
default_hparams (dict): Hyperparameters with default values.
If `None`, hyperparameters are fully defined by :attr:`hparams`.
allow_new_hparam (bool): If `False` (default), :attr:`hparams`
cannot contain hyperparameters that are not included in
:attr:`default_hparams`, except the case of :attr:`"kwargs"`.
Return:
A dictionary of parsed hyperparameters. Returns `None` if both
:attr:`hparams` and :attr:`default_hparams` are `None`.
Raises:
ValueError: If :attr:`hparams` is not `None` and
:attr:`default_hparams` is `None`.
ValueError: If :attr:`default_hparams` contains "kwargs" but does
not contain "type".
"""
if hparams is None and default_hparams is None:
return None
if hparams is None:
return HParams._parse(default_hparams, default_hparams)
if default_hparams is None:
raise ValueError("`default_hparams` cannot be `None` if `hparams` "
"is not `None`.")
no_typecheck_names = default_hparams.get("@no_typecheck", [])
if "kwargs" in default_hparams and "type" not in default_hparams:
raise ValueError("Ill-defined hyperparameter structure: 'kwargs' "
"must accompany with 'type'.")
parsed_hparams = copy.deepcopy(default_hparams)
# Parse recursively for params of type dictionary that are missing
# in `hparams`.
for name, value in default_hparams.items():
if name not in hparams and isinstance(value, dict):
if name == "kwargs" and "type" in hparams and \
hparams["type"] != default_hparams["type"]:
# Set params named "kwargs" to empty dictionary if "type"
# takes value other than default.
parsed_hparams[name] = HParams({}, {})
else:
parsed_hparams[name] = HParams(value, value)
# Parse hparams
for name, value in hparams.items():
if name not in default_hparams:
if allow_new_hparam:
parsed_hparams[name] = HParams._parse_value(value, name)
continue
else:
raise ValueError(
"Unknown hyperparameter: %s. Only hyperparameters "
"named 'kwargs' hyperparameters can contain new "
"entries undefined in default hyperparameters." % name)
if value is None:
parsed_hparams[name] = \
HParams._parse_value(parsed_hparams[name])
default_value = default_hparams[name]
if default_value is None:
parsed_hparams[name] = HParams._parse_value(value)
continue
# Parse recursively for params of type dictionary.
if isinstance(value, dict):
if name not in no_typecheck_names \
and not isinstance(default_value, dict):
raise ValueError(
"Hyperparameter '%s' must have type %s, got %s" %
(name, _type_name(default_value), _type_name(value)))
if name == "kwargs":
if "type" in hparams and \
hparams["type"] != default_hparams["type"]:
# Leave "kwargs" as-is if "type" takes value
# other than default.
parsed_hparams[name] = HParams(value, value)
else:
# Allow new hyperparameters if "type" takes default
# value
parsed_hparams[name] = HParams(
value, default_value, allow_new_hparam=True)
elif name in no_typecheck_names:
parsed_hparams[name] = HParams(value, value)
else:
parsed_hparams[name] = HParams(
value, default_value, allow_new_hparam)
continue
# Do not type-check hyperparameter named "type" and accompanied
# with "kwargs"
if name == "type" and "kwargs" in default_hparams:
parsed_hparams[name] = value
continue
if name in no_typecheck_names:
parsed_hparams[name] = value
elif isinstance(value, type(default_value)):
parsed_hparams[name] = value
elif is_callable(value) and is_callable(default_value):
parsed_hparams[name] = value
else:
try:
parsed_hparams[name] = type(default_value)(value)
except TypeError:
raise ValueError(
"Hyperparameter '%s' must have type %s, got %s" %
(name, _type_name(default_value), _type_name(value)))
return parsed_hparams
@staticmethod
def _parse_value(value, name=None):
if isinstance(value, dict) and (name is None or name != "kwargs"):
return HParams(value, None)
else:
return value
def __getattr__(self, name):
"""Retrieves the value of the hyperparameter.
"""
if name == '_hparams':
return super(HParams, self).__getattribute__('_hparams')
if name not in self._hparams:
# Raise AttributeError to allow copy.deepcopy, etc
raise AttributeError("Unknown hyperparameter: %s" % name)
return self._hparams[name]
def __getitem__(self, name):
"""Retrieves the value of the hyperparameter.
"""
return self.__getattr__(name)
def __setattr__(self, name, value):
"""Sets the value of the hyperparameter.
"""
if name not in self._hparams:
raise ValueError(
"Unknown hyperparameter: %s. Only the `kwargs` "
"hyperparameters can contain new entries undefined "
"in default hyperparameters." % name)
self._hparams[name] = self._parse_value(value, name)
def items(self):
"""Returns the list of hyperparam `(name, value)` pairs
"""
return iter(self)
def keys(self):
"""Returns the list of hyperparam names
"""
return self._hparams.keys()
def __iter__(self):
for name, value in self._hparams.items():
yield name, value
def __len__(self):
return len(self._hparams)
def __contains__(self, name):
return name in self._hparams
def __str__(self):
"""Return a string of the hparams.
"""
hparams_dict = self.todict()
return json.dumps(hparams_dict, sort_keys=True, indent=2)
def get(self, name, default=None):
"""Returns the hyperparameter value for the given name. If name is not
available then returns :attr:`default`.
Args:
name (str): the name of hyperparameter.
default: the value to be returned in case name does not exist.
"""
try:
return self.__getattr__(name)
except AttributeError:
return default
def add_hparam(self, name, value):
"""Adds a new hyperparameter.
"""
if (name in self._hparams) or hasattr(self, name):
raise ValueError("Hyperparameter name already exists: %s" % name)
self._hparams[name] = self._parse_value(value, name)
def todict(self):
"""Returns a copy of hyperparameters as a dictionary.
"""
dict_ = copy.deepcopy(self._hparams)
for name, value in self._hparams.items():
if isinstance(value, HParams):
dict_[name] = value.todict()
return dict_
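The recursive auto-completion that `_parse` implements can be sketched with plain dicts. This toy version (using the example values from the class docstring) skips the typechecking and the special `"type"`/`"kwargs"` handling of the real class:

```python
import copy

# Sketch of HParams-style recursive auto-completion over plain dicts:
# user-provided values override defaults; missing keys keep defaults.
def complete(hparams, defaults):
    out = copy.deepcopy(defaults)
    for name, value in (hparams or {}).items():
        if isinstance(value, dict) and isinstance(out.get(name), dict):
            # Recurse into nested dicts so partial overrides work.
            out[name] = complete(value, out[name])
        else:
            out[name] = value
    return out

defaults = {"a": 1, "b": {"c": 2, "d": 3}}
merged = complete({"b": {"c": 22}}, defaults)
print(merged)  # {'a': 1, 'b': {'c': 22, 'd': 3}}  ("a", "d" auto-completed)
```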
================================================
FILE: texar_repo/texar/hyperparams_test.py
================================================
"""
Unit tests of :class:`HParams`.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import copy
import pickle
import tempfile
import tensorflow as tf
from texar.hyperparams import HParams
# pylint: disable=no-member
class HParamsTest(tf.test.TestCase):
"""Tests hyperparameter related operations.
"""
def test_hparams(self):
"""Tests the HParams class.
"""
default_hparams = {
"str": "str",
"list": ['item1', 'item2'],
"dict": {
"key1": "value1",
"key2": "value2"
},
"nested_dict": {
"dict_l2": {
"key1_l2": "value1_l2"
}
},
"type": "type",
"kwargs": {
"arg1": "argv1"
},
}
# Test HParams.items() function
hparams_ = HParams(None, default_hparams)
names = []
for name, _ in hparams_.items():
names.append(name)
self.assertEqual(set(names), set(default_hparams.keys()))
hparams = {
"dict": {"key1": "new_value"},
"kwargs": {"arg2": "argv2"}
}
hparams_ = HParams(hparams, default_hparams)
# Test HParams construction
self.assertEqual(hparams_.str, default_hparams["str"])
self.assertEqual(hparams_.list, default_hparams["list"])
self.assertEqual(hparams_.dict.key1, hparams["dict"]["key1"])
self.assertEqual(hparams_.kwargs.arg2, hparams["kwargs"]["arg2"])
self.assertEqual(hparams_.nested_dict.dict_l2.key1_l2,
default_hparams["nested_dict"]["dict_l2"]["key1_l2"])
self.assertEqual(len(hparams_), len(default_hparams))
new_hparams = copy.deepcopy(default_hparams)
new_hparams["dict"]["key1"] = hparams["dict"]["key1"]
new_hparams["kwargs"].update(hparams["kwargs"])
self.assertEqual(hparams_.todict(), new_hparams)
self.assertTrue("dict" in hparams_)
self.assertIsNone(hparams_.get('not_existed_name', None))
self.assertEqual(hparams_.get('str'), default_hparams['str'])
# Test HParams update related operations
hparams_.str = "new_str"
hparams_.dict = {"key3": "value3"}
self.assertEqual(hparams_.str, "new_str")
self.assertEqual(hparams_.dict.key3, "value3")
hparams_.add_hparam("added_str", "added_str")
hparams_.add_hparam("added_dict", {"key4": "value4"})
hparams_.kwargs.add_hparam("added_arg", "added_argv")
self.assertEqual(hparams_.added_str, "added_str")
self.assertEqual(hparams_.added_dict.todict(), {"key4": "value4"})
self.assertEqual(hparams_.kwargs.added_arg, "added_argv")
# Test HParams I/O
hparams_file = tempfile.NamedTemporaryFile()
pickle.dump(hparams_, hparams_file)
with open(hparams_file.name, 'rb') as hparams_file:
hparams_loaded = pickle.load(hparams_file)
self.assertEqual(hparams_loaded.todict(), hparams_.todict())
def test_typecheck(self):
"""Tests type-check functionality.
"""
def _foo():
pass
def _bar():
pass
default_hparams = {
"fn": _foo,
"fn_2": _foo
}
hparams = {
"fn": _foo,
"fn_2": _bar
}
hparams_ = HParams(hparams, default_hparams)
self.assertEqual(hparams_.fn, default_hparams["fn"])
def test_type_kwargs(self):
"""The the special cases involving "type" and "kwargs"
hyperparameters.
"""
default_hparams = {
"type": "type_name",
"kwargs": {
"arg1": "argv1"
}
}
hparams = {
"type": "type_name"
}
hparams_ = HParams(hparams, default_hparams)
self.assertEqual(hparams_.kwargs.todict(), default_hparams["kwargs"])
hparams = {
"type": "type_name",
"kwargs": {
"arg2": "argv2"
}
}
hparams_ = HParams(hparams, default_hparams)
full_kwargs = {}
full_kwargs.update(default_hparams["kwargs"])
full_kwargs.update(hparams["kwargs"])
self.assertEqual(hparams_.kwargs.todict(), full_kwargs)
hparams = {
"kwargs": {
"arg2": "argv2"
}
}
hparams_ = HParams(hparams, default_hparams)
self.assertEqual(hparams_.kwargs.todict(), full_kwargs)
hparams = {
"type": "type_name2"
}
hparams_ = HParams(hparams, default_hparams)
self.assertEqual(hparams_.kwargs.todict(), {})
hparams = {
"type": "type_name2",
"kwargs": {
"arg3": "argv3"
}
}
hparams_ = HParams(hparams, default_hparams)
self.assertEqual(hparams_.kwargs.todict(), hparams["kwargs"])
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/losses/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar losses.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.losses.losses_utils import *
from texar.losses.mle_losses import *
from texar.losses.pg_losses import *
from texar.losses.adv_losses import *
from texar.losses.rewards import *
from texar.losses.entropy import *
================================================
FILE: texar_repo/texar/losses/adv_losses.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Adversarial losses.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
def binary_adversarial_losses(real_data,
fake_data,
discriminator_fn,
mode="max_real"):
"""Computes adversarial losses of real/fake binary discrimination game.
.. role:: python(code)
:language: python
Args:
real_data (Tensor or array): Real data of shape
`[num_real_examples, ...]`.
fake_data (Tensor or array): Fake data of shape
`[num_fake_examples, ...]`. `num_real_examples` does not
necessarily equal `num_fake_examples`.
discriminator_fn: A callable takes data (e.g., :attr:`real_data` and
:attr:`fake_data`) and returns the logits of being real. The
signature of `discriminator_fn` must be:
:python:`logits, ... = discriminator_fn(data)`.
The return value of `discriminator_fn` can be the logits, or
a tuple where the logits are the first element.
mode (str): Mode of the generator loss. Either "max_real" or "min_fake".
- **"max_real"** (default): minimizing the generator loss is to\
maximize the probability of fake data being classified as real.
- **"min_fake"**: minimizing the generator loss is to minimize the\
probability of fake data being classified as fake.
Returns:
A tuple `(generator_loss, discriminator_loss)` each of which is
a scalar Tensor, loss to be minimized.
"""
real_logits = discriminator_fn(real_data)
if isinstance(real_logits, (list, tuple)):
real_logits = real_logits[0]
real_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
logits=real_logits, labels=tf.ones_like(real_logits)))
fake_logits = discriminator_fn(fake_data)
if isinstance(fake_logits, (list, tuple)):
fake_logits = fake_logits[0]
fake_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
logits=fake_logits, labels=tf.zeros_like(fake_logits)))
d_loss = real_loss + fake_loss
if mode == "min_fake":
g_loss = - fake_loss
elif mode == "max_real":
g_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
logits=fake_logits, labels=tf.ones_like(fake_logits)))
else:
raise ValueError("Unknown mode: %s. Only 'min_fake' and 'max_real' "
"are allowed.")
return g_loss, d_loss
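The loss values these formulas produce can be checked with a scalar, pure-Python sketch. This is illustrative only (floats instead of Tensors); `sigmoid_ce` mirrors TensorFlow's numerically stable sigmoid cross-entropy formula:

```python
import math

def sigmoid_ce(logit, label):
    """Stable sigmoid cross-entropy: max(x, 0) - x*z + log(1 + e^-|x|)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def binary_adv_losses(real_logits, fake_logits, mode="max_real"):
    """Scalar sketch of binary_adversarial_losses (not the Texar code)."""
    real_loss = sum(sigmoid_ce(l, 1.0) for l in real_logits) / len(real_logits)
    fake_loss = sum(sigmoid_ce(l, 0.0) for l in fake_logits) / len(fake_logits)
    d_loss = real_loss + fake_loss
    if mode == "min_fake":
        g_loss = -fake_loss
    elif mode == "max_real":
        # Fake data relabeled as real for the generator objective.
        g_loss = sum(sigmoid_ce(l, 1.0) for l in fake_logits) / len(fake_logits)
    else:
        raise ValueError("Unknown mode: %s" % mode)
    return g_loss, d_loss

# A "dumb" discriminator that always outputs logit 0 gives
# g_loss = log(2) and d_loss = 2*log(2), matching the unit test below.
g, d = binary_adv_losses([0.0, 0.0], [0.0, 0.0])
```

With zero logits both modes yield generator losses of equal magnitude and opposite sign, which is exactly the relation `adv_losses_test.py` asserts.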
================================================
FILE: texar_repo/texar/losses/adv_losses_test.py
================================================
#
"""
Tests adversarial loss related functions.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar.losses.adv_losses import binary_adversarial_losses
class AdvLossesTest(tf.test.TestCase):
"""Tests adversarial losses.
"""
def test_binary_adversarial_losses(self):
"""Tests :meth:`~texar.losses.adv_losses.binary_adversarial_losse`.
"""
batch_size = 16
data_dim = 64
real_data = tf.zeros([batch_size, data_dim], dtype=tf.float32)
fake_data = tf.ones([batch_size, data_dim], dtype=tf.float32)
const_logits = tf.zeros([batch_size], dtype=tf.float32)
# Use a dumb discriminator that always outputs logits=0.
gen_loss, disc_loss = binary_adversarial_losses(
real_data, fake_data, lambda x: const_logits)
gen_loss_2, disc_loss_2 = binary_adversarial_losses(
real_data, fake_data, lambda x: const_logits, mode="min_fake")
with self.test_session() as sess:
gen_loss_, disc_loss_ = sess.run([gen_loss, disc_loss])
gen_loss_2_, disc_loss_2_ = sess.run([gen_loss_2, disc_loss_2])
self.assertAlmostEqual(gen_loss_, -gen_loss_2_)
self.assertAlmostEqual(disc_loss_, disc_loss_2_)
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/losses/entropy.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various entropies.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.losses.losses_utils import mask_and_reduce, reduce_dimensions
from texar.utils.shapes import get_rank
# pylint: disable=too-many-arguments
__all__ = [
"entropy_with_logits",
"sequence_entropy_with_logits"
]
def _get_entropy(logits):
probs = tf.nn.softmax(logits) + 1e-8
entropy = - probs * tf.log(probs)
entropy = tf.reduce_sum(entropy, -1)
return entropy
def entropy_with_logits(logits,
rank=None,
average_across_batch=True,
average_across_remaining=False,
sum_over_batch=False,
sum_over_remaining=True):
"""Shannon entropy given logits.
Args:
logits: Unscaled log probabilities of shape
`[batch_size, d_2, ..., d_{rank-1}, distribution_dim]`
and of dtype `float32` or `float64`.
The rank of the tensor is optionally specified by the argument
:attr:`rank`.
The tensor is considered as having `[batch_size, .., d_{rank-1}]`
elements, each of which has a distribution of length `d_rank`
(i.e., `distribution_dim`). So the last dimension is always
summed out to compute the entropy.
rank (int, optional): The rank of :attr:`logits`.
If `None` (default), `rank` is inferred automatically from
`logits`. If the inference fails, `rank` is
set to 2, i.e., assuming :attr:`logits` is of shape
`[batch_size, distribution_dim]`
average_across_batch (bool): If set, average the entropy across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
average_across_remaining (bool): If set, average the entropy across the
remaining dimensions. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time.
Used only when :attr:`logits` has rank >= 3.
sum_over_batch (bool): If set, sum the entropy across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_remaining (bool): If set, sum the entropy across the
remaining dimension. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time.
Used only when :attr:`logits` has rank >= 3.
Returns:
A Tensor containing the Shannon entropy. The dimensionality of the
Tensor depends on the configuration of reduction arguments. For
example, if both batch and remaining dimensions are reduced (by
either sum or average), the returned Tensor is a scalar Tensor.
"""
entropy = _get_entropy(logits)
if rank is None:
rank = get_rank(logits)
if rank is None:
rank = 2
rank -= 1 # reduced last dimension
# Reduces
if average_across_batch and sum_over_batch:
raise ValueError("Only one of `average_across_batch` and "
"`sum_over_batch` can be set.")
if average_across_remaining and sum_over_remaining:
raise ValueError("Only one of `average_across_remaining` and "
"`sum_over_remaining` can be set.")
sum_axes, average_axes = [], []
if sum_over_batch:
sum_axes.append(0)
if average_across_batch:
average_axes.append(0)
if sum_over_remaining and rank >= 2:
sum_axes += list(range(1, rank))
if average_across_remaining and rank >= 2:
average_axes += list(range(1, rank))
entropy = reduce_dimensions(
entropy, average_axes=average_axes, sum_axes=sum_axes)
return entropy
def sequence_entropy_with_logits(logits,
rank=None,
sequence_length=None,
average_across_batch=True,
average_across_timesteps=False,
average_across_remaining=False,
sum_over_batch=False,
sum_over_timesteps=True,
sum_over_remaining=True,
time_major=False):
"""Shannon entropy given logits.
Args:
logits: Unscaled log probabilities of shape
`[batch_size, max_time, d_3, ..., d_{rank-1}, distribution_dim]`
and of dtype `float32` or `float64`.
The rank of the tensor is optionally specified by the argument
:attr:`rank`.
The tensor is considered as having `[batch_size, .., d_{rank-1}]`
elements, each of which has a distribution of length `d_rank`
(i.e., `distribution_dim`). So the last dimension is always
summed out to compute the entropy.
The batch and time dimensions are exchanged if :attr:`time_major`
is `True`.
rank (int, optional): The rank of :attr:`logits`.
If `None` (default), `rank` is inferred automatically from
`logits`. If the inference fails, `rank` is
set to 3, i.e., assuming `logits` is of shape
`[batch_size, max_time, distribution_dim]`
sequence_length (optional): A Tensor of shape `[batch_size]`.
Time steps beyond the respective sequence lengths are masked out
and not counted into the entropy.
average_across_timesteps (bool): If set, average the entropy across
the time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
average_across_batch (bool): If set, average the entropy across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
average_across_remaining (bool): If set, average the entropy across the
remaining dimensions. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time.
Used only when :attr:`logits` has rank >= 4.
sum_over_timesteps (bool): If set, sum the entropy across the
time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
sum_over_batch (bool): If set, sum the entropy across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_remaining (bool): If set, sum the entropy across the
remaining dimension. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time.
Used only when :attr:`logits` has rank >= 4.
time_major (bool): The shape format of the inputs. If `True`,
:attr:`logits` must have shape `[max_time, batch_size, ...]`.
If `False` (default), it must have shape
`[batch_size, max_time, ...]`.
Returns:
A Tensor containing the Shannon entropy. The dimensionality of the
Tensor depends on the configuration of reduction arguments. For
example, if batch, time, and remaining dimensions are all reduced (by
either sum or average), the returned Tensor is a scalar Tensor.
"""
entropy = _get_entropy(logits)
if rank is None:
rank = get_rank(logits)
if rank is None:
rank = 3
rank -= 1 # reduced last dimension
entropy = mask_and_reduce(
entropy,
sequence_length,
rank=rank,
average_across_batch=average_across_batch,
average_across_timesteps=average_across_timesteps,
average_across_remaining=average_across_remaining,
sum_over_batch=sum_over_batch,
sum_over_timesteps=sum_over_timesteps,
sum_over_remaining=sum_over_remaining,
time_major=time_major)
return entropy
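The core computation in `_get_entropy` above is the Shannon entropy of the softmax distribution over the last dimension. It can be reproduced for a single logit vector in plain Python (a sketch with a numerically stable softmax, not the Tensor-based implementation):

```python
import math

def entropy_from_logits(logits):
    """Shannon entropy of softmax(logits) for one distribution
    (pure-Python analogue of _get_entropy, illustrative only)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    # H = -sum(p * log p); terms with p == 0 contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Uniform logits over K classes give the maximum entropy log(K):
h_uniform = entropy_from_logits([0.0, 0.0, 0.0, 0.0])  # log(4)
# A sharply peaked distribution has entropy near zero:
h_peaked = entropy_from_logits([100.0, 0.0])
```

The reduction arguments of `entropy_with_logits` then only decide how these per-distribution entropies are averaged or summed over the batch and remaining dimensions.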
================================================
FILE: texar_repo/texar/losses/losses_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various utilities for losses.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import rnn # pylint: disable=E0611
from texar.utils.shapes import mask_sequences
# pylint: disable=invalid-name, not-context-manager, protected-access,
# pylint: disable=too-many-arguments
__all__ = [
"mask_and_reduce",
"reduce_batch_time",
"reduce_dimensions"
]
def mask_and_reduce(sequence,
sequence_length,
rank=2,
average_across_batch=True,
average_across_timesteps=False,
average_across_remaining=False,
sum_over_batch=False,
sum_over_timesteps=True,
sum_over_remaining=True,
dtype=None,
time_major=False):
"""Masks out sequence entries that are beyond the respective sequence
lengths, and reduces (average or sum) away dimensions.
This is a combination of :func:`~texar.utils.shapes.mask_sequences`
and :func:`~texar.losses.losses_utils.reduce_batch_time`.
Args:
sequence: A Tensor of sequence values.
If `time_major=False` (default), this must be a Tensor of shape
`[batch_size, max_time, d_2, ..., d_rank]`, where the rank of
the Tensor is specified with :attr:`rank`.
The batch and time dimensions are exchanged if `time_major` is True.
sequence_length: A Tensor of shape `[batch_size]`. Time steps beyond
the respective sequence lengths will be made zero. If `None`,
no masking is performed.
rank (int): The rank of :attr:`sequence`. Must be >= 2. Default is 2,
i.e., `sequence` is a 2D Tensor consisting of batch and time
dimensions.
average_across_timesteps (bool): If set, average the sequence across
the time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
average_across_batch (bool): If set, average the sequence across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
average_across_remaining (bool): If set, average the sequence across the
remaining dimensions. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time.
sum_over_timesteps (bool): If set, sum the loss across the
time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_remaining (bool): If set, sum the loss across the
remaining dimension. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time.
time_major (bool): The shape format of the inputs. If `True`,
:attr:`sequence` must have shape `[max_time, batch_size, ...]`.
If `False` (default), `sequence` must have
shape `[batch_size, max_time, ...]`.
dtype (dtype): Type of :attr:`sequence`. If `None`, infer from
:attr:`sequence` automatically.
Returns:
A Tensor containing the masked and reduced sequence.
"""
if rank < 2:
raise ValueError('`rank` must be >= 2.')
if time_major:
sequence = rnn._transpose_batch_time(sequence)
if sequence_length is not None:
sequence = mask_sequences(sequence, sequence_length, dtype=dtype,
time_major=False, tensor_rank=rank)
if rank > 2:
if average_across_remaining and sum_over_remaining:
raise ValueError("Only one of `average_across_remaining` and "
"`sum_over_remaining` can be set.")
if average_across_remaining:
sequence = tf.reduce_mean(sequence, axis=np.arange(2, rank))
elif sum_over_remaining:
sequence = tf.reduce_sum(sequence, axis=np.arange(2, rank))
sequence = reduce_batch_time(sequence,
sequence_length,
average_across_batch,
average_across_timesteps,
sum_over_batch,
sum_over_timesteps)
reduce_time = average_across_timesteps or sum_over_timesteps
reduce_batch = average_across_batch or sum_over_batch
if not reduce_time and not reduce_batch and time_major:
sequence = rnn._transpose_batch_time(sequence)
return sequence
def reduce_batch_time(sequence,
sequence_length,
average_across_batch=True,
average_across_timesteps=False,
sum_over_batch=False,
sum_over_timesteps=True):
"""Average or sum over the respective dimensions of :attr:`sequence`, which
is of shape `[batch_size, max_time, ...]`.
Assumes :attr:`sequence` has been properly masked according to
:attr:`sequence_length`.
"""
if average_across_timesteps and sum_over_timesteps:
raise ValueError("Only one of `average_across_timesteps` and "
"`sum_over_timesteps` can be set.")
if average_across_batch and sum_over_batch:
raise ValueError("Only one of `average_across_batch` and "
"`sum_over_batch` can be set.")
if sum_over_timesteps:
sequence = tf.reduce_sum(sequence, axis=[1])
elif average_across_timesteps:
if sequence_length is None:
sequence = tf.reduce_mean(sequence, axis=[1])
else:
sequence = tf.reduce_sum(sequence, axis=[1])
if average_across_timesteps:
sequence = sequence / tf.to_float(sequence_length)
if sum_over_batch:
sequence = tf.reduce_sum(sequence, axis=[0])
elif average_across_batch:
sequence = tf.reduce_mean(sequence, axis=[0])
return sequence
def reduce_dimensions(tensor, average_axes=None, sum_axes=None, keepdims=None):
"""Average or sum over dimensions of :attr:`tensor`.
:attr:`average_axes` and :attr:`sum_axes` must be mutually exclusive. That
is, elements in `average_axes` must not be contained in
`sum_axes`, and vice versa.
Args:
tensor: A tensor to reduce.
average_axes (optional): A (list of) `int` that indicates the
dimensions to reduce by taking average.
sum_axes (optional): A (list of) `int` that indicates the
dimensions to reduce by taking sum.
keepdims (optional): If `True`, retains reduced dimensions with
length 1.
"""
reduced_axes = set()
if average_axes is not None:
if not isinstance(average_axes, (list, tuple)):
average_axes = [average_axes]
if len(average_axes) > 0:
tensor = tf.reduce_mean(tensor, axis=average_axes, keepdims=True)
reduced_axes.update(average_axes)
if sum_axes is not None:
if not isinstance(sum_axes, (list, tuple)):
sum_axes = [sum_axes]
if len(sum_axes) > 0:
tensor = tf.reduce_sum(tensor, axis=sum_axes, keepdims=True)
reduced_axes.update(sum_axes)
if average_axes is not None:
if len(reduced_axes) != len(average_axes) + len(sum_axes):
raise ValueError('`average_axes` and `sum_axes` must not have '
'overlapped elements.')
if not keepdims:
tensor = tf.squeeze(tensor, axis=list(reduced_axes))
return tensor
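The default reduction path of `mask_and_reduce` for rank-2 inputs (mask entries beyond each sequence length, sum over the time dimension, average over the batch) can be sketched with plain lists. This is a hypothetical helper covering only the default flags, not the Texar implementation:

```python
def masked_seq_loss(sequence, sequence_length):
    """Mask, sum over time, then average over the batch -- the default
    mask_and_reduce behaviour for [batch, time] inputs (sketch only)."""
    per_example = []
    for length, row in zip(sequence_length, sequence):
        # Entries at t >= length are masked to zero, i.e. dropped.
        per_example.append(sum(row[:length]))
    return sum(per_example) / len(per_example)

# The second example has only one valid step, so its 9.0 entries
# beyond the sequence length do not contribute.
loss = masked_seq_loss([[1.0, 1.0, 1.0], [2.0, 9.0, 9.0]], [3, 1])
# per-example sums: 3.0 and 2.0 -> batch average 2.5
```

Toggling the `average_across_*`/`sum_over_*` flags in the real function simply swaps which of these two reduction steps uses a mean and which uses a sum.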
================================================
FILE: texar_repo/texar/losses/mle_losses.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various losses
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.losses.losses_utils import mask_and_reduce, reduce_dimensions
from texar.utils import shapes
# pylint: disable=invalid-name, not-context-manager, protected-access,
# pylint: disable=too-many-arguments
__all__ = [
"sequence_softmax_cross_entropy",
"sequence_sparse_softmax_cross_entropy",
"sequence_sigmoid_cross_entropy",
"binary_sigmoid_cross_entropy",
"binary_sigmoid_cross_entropy_with_clas"
]
def sequence_softmax_cross_entropy(labels,
logits,
sequence_length,
average_across_batch=True,
average_across_timesteps=False,
sum_over_batch=False,
sum_over_timesteps=True,
time_major=False,
stop_gradient_to_label=False,
name=None):
"""Computes softmax cross entropy for each time step of sequence
predictions.
Args:
labels: Target class distributions.
- If :attr:`time_major` is `False` (default), this must be a\
Tensor of shape `[batch_size, max_time, num_classes]`.
- If `time_major` is `True`, this must be a Tensor of shape\
`[max_time, batch_size, num_classes]`.
Each row of `labels` should be a valid probability
distribution, otherwise, the computation of the gradient will be
incorrect.
logits: Unscaled log probabilities. This must have the shape of
`[max_time, batch_size, num_classes]` or
`[batch_size, max_time, num_classes]` according to
the value of `time_major`.
sequence_length: A Tensor of shape `[batch_size]`. Time steps beyond
the respective sequence lengths will have zero losses.
average_across_timesteps (bool): If set, average the loss across
the time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
average_across_batch (bool): If set, average the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_timesteps (bool): If set, sum the loss across the
time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
time_major (bool): The shape format of the inputs. If `True`,
:attr:`labels` and :attr:`logits` must have shape
`[max_time, batch_size, ...]`. If `False`
(default), they must have shape `[batch_size, max_time, ...]`.
stop_gradient_to_label (bool): If set, gradient propagation to
:attr:`labels` will be disabled.
name (str, optional): A name for the operation.
Returns:
A Tensor containing the loss, of rank 0, 1, or 2 depending on the
arguments :attr:`{average_across}/{sum_over}_{timesteps}/{batch}`.
For example:
- If :attr:`sum_over_timesteps` and :attr:`average_across_batch` \
are `True` (default), the return Tensor is of rank 0.
- If :attr:`average_across_batch` is `True` and other arguments are \
`False`, the return Tensor is of shape `[max_time]`.
"""
with tf.name_scope(name, "sequence_softmax_cross_entropy"):
if stop_gradient_to_label:
labels = tf.stop_gradient(labels)
losses = tf.nn.softmax_cross_entropy_with_logits_v2(
labels=labels, logits=logits)
losses = mask_and_reduce(
losses,
sequence_length,
rank=2,
average_across_batch=average_across_batch,
average_across_timesteps=average_across_timesteps,
sum_over_batch=sum_over_batch,
sum_over_timesteps=sum_over_timesteps,
time_major=time_major)
return losses
def sequence_sparse_softmax_cross_entropy(labels,
logits,
sequence_length,
average_across_batch=True,
average_across_timesteps=False,
sum_over_batch=False,
sum_over_timesteps=True,
time_major=False,
name=None):
"""Computes sparse softmax cross entropy for each time step of sequence
predictions.
Args:
labels: Target class indexes. I.e., classes are mutually exclusive
(each entry is in exactly one class).
- If :attr:`time_major` is `False` (default), this must be\
a Tensor of shape `[batch_size, max_time]`.
- If `time_major` is `True`, this must be a Tensor of shape\
`[max_time, batch_size].`
logits: Unscaled log probabilities. This must have the shape of
`[max_time, batch_size, num_classes]` or
`[batch_size, max_time, num_classes]` according to
the value of `time_major`.
sequence_length: A Tensor of shape `[batch_size]`. Time steps beyond
the respective sequence lengths will have zero losses.
average_across_timesteps (bool): If set, average the loss across
the time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
average_across_batch (bool): If set, average the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_timesteps (bool): If set, sum the loss across the
time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
time_major (bool): The shape format of the inputs. If `True`,
:attr:`labels` and :attr:`logits` must have shape
`[max_time, batch_size, ...]`. If `False`
(default), they must have shape `[batch_size, max_time, ...]`.
name (str, optional): A name for the operation.
Returns:
A Tensor containing the loss, of rank 0, 1, or 2 depending on the
arguments :attr:`{average_across}/{sum_over}_{timesteps}/{batch}`.
For example:
- If :attr:`sum_over_timesteps` and :attr:`average_across_batch` \
are `True` (default), the return Tensor is of rank 0.
- If :attr:`average_across_batch` is `True` and other arguments are \
`False`, the return Tensor is of shape `[max_time]`.
Example:
.. code-block:: python
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)
outputs, _, _ = decoder(
decoding_strategy='train_greedy',
inputs=embedder(data_batch['text_ids']),
sequence_length=data_batch['length']-1)
loss = sequence_sparse_softmax_cross_entropy(
labels=data_batch['text_ids'][:, 1:],
logits=outputs.logits,
sequence_length=data_batch['length']-1)
"""
with tf.name_scope(name, "sequence_sparse_softmax_cross_entropy"):
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
labels=labels, logits=logits)
losses = mask_and_reduce(
losses,
sequence_length,
rank=2,
average_across_batch=average_across_batch,
average_across_timesteps=average_across_timesteps,
sum_over_batch=sum_over_batch,
sum_over_timesteps=sum_over_timesteps,
time_major=time_major)
return losses
def sequence_sigmoid_cross_entropy(labels,
logits,
sequence_length,
average_across_batch=True,
average_across_timesteps=False,
average_across_classes=True,
sum_over_batch=False,
sum_over_timesteps=True,
sum_over_classes=False,
time_major=False,
stop_gradient_to_label=False,
name=None):
"""Computes sigmoid cross entropy for each time step of sequence
predictions.
Args:
labels: Target class distributions.
- If :attr:`time_major` is `False` (default), this must be a\
Tensor of shape `[batch_size, max_time(, num_classes)]`.
- If `time_major` is `True`, this must be a Tensor of shape\
`[max_time, batch_size(, num_classes)]`.
Each row of `labels` should be a valid probability
distribution, otherwise, the computation of the gradient will be
incorrect.
logits: Unscaled log probabilities having the same shape as with
:attr:`labels`.
sequence_length: A Tensor of shape `[batch_size]`. Time steps beyond
the respective sequence lengths will have zero losses.
average_across_timesteps (bool): If set, average the loss across
the time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
average_across_batch (bool): If set, average the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
average_across_classes (bool): If set, average the loss across the
class dimension (if exists). Must not set
`average_across_classes` and `sum_over_classes` at
the same time. Ignored if :attr:`logits` is a 2D Tensor.
sum_over_timesteps (bool): If set, sum the loss across the
time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_classes (bool): If set, sum the loss across the
class dimension. Must not set `average_across_classes`
and `sum_over_classes` at the same time. Ignored if
:attr:`logits` is a 2D Tensor.
time_major (bool): The shape format of the inputs. If `True`,
:attr:`labels` and :attr:`logits` must have shape
`[max_time, batch_size, ...]`. If `False`
(default), they must have shape `[batch_size, max_time, ...]`.
stop_gradient_to_label (bool): If set, gradient propagation to
:attr:`labels` will be disabled.
name (str, optional): A name for the operation.
Returns:
A Tensor containing the loss, of rank 0, 1, or 2 depending on the
arguments
:attr:`{average_across}/{sum_over}_{timesteps}/{batch}/{classes}`.
For example, if the class dimension does not exist, and
- If :attr:`sum_over_timesteps` and :attr:`average_across_batch` \
are `True` (default), the return Tensor is of rank 0.
- If :attr:`average_across_batch` is `True` and other arguments are \
`False`, the return Tensor is of shape `[max_time]`.
"""
with tf.name_scope(name, "sequence_sigmoid_cross_entropy"):
if stop_gradient_to_label:
labels = tf.stop_gradient(labels)
losses = tf.nn.sigmoid_cross_entropy_with_logits(
labels=labels, logits=logits)
rank = shapes.get_rank(logits) or shapes.get_rank(labels)
if rank is None:
raise ValueError(
'Cannot determine the rank of `logits` or `labels`.')
losses = mask_and_reduce(
losses,
sequence_length,
rank=rank,
average_across_batch=average_across_batch,
average_across_timesteps=average_across_timesteps,
average_across_remaining=average_across_classes,
sum_over_batch=sum_over_batch,
sum_over_timesteps=sum_over_timesteps,
sum_over_remaining=sum_over_classes,
time_major=time_major)
return losses
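# The reduction above follows `mask_and_reduce`: per-step losses are zeroed
# beyond each sequence length, then summed or averaged according to the flags.
# Below is a minimal, illustrative NumPy sketch (not the Texar implementation)
# of the default reduction -- sum over time, average over batch -- for 2D inputs:

```python
import numpy as np

def masked_sigmoid_ce(labels, logits, lengths):
    """Illustrative sketch: per-step sigmoid cross entropy, masked beyond
    each sequence length, summed over time and averaged over the batch
    (the defaults of `sequence_sigmoid_cross_entropy`)."""
    # Numerically stable sigmoid cross entropy:
    # max(x, 0) - x * z + log(1 + exp(-|x|))
    losses = (np.maximum(logits, 0) - logits * labels
              + np.log1p(np.exp(-np.abs(logits))))
    batch_size, max_time = logits.shape
    # Mask out time steps at or beyond each sequence's length.
    mask = np.arange(max_time)[None, :] < np.asarray(lengths)[:, None]
    losses = losses * mask
    return losses.sum(axis=1).mean()

labels = np.array([[1., 0., 1.], [0., 1., 1.]])
logits = np.zeros((2, 3))            # sigmoid(0) = 0.5 at every step
lengths = [2, 3]
# Each unmasked step contributes log(2); 2 + 3 = 5 steps over a batch of 2.
loss = masked_sigmoid_ce(labels, logits, lengths)  # 2.5 * log(2)
```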
def binary_sigmoid_cross_entropy(pos_logits=None,
neg_logits=None,
average_across_batch=True,
average_across_classes=True,
sum_over_batch=False,
sum_over_classes=False,
return_pos_neg_losses=False,
name=None):
"""Computes sigmoid cross entropy of binary predictions.
Args:
pos_logits: The logits of predicting positive on positive data. A
tensor of shape `[batch_size(, num_classes)]`.
neg_logits: The logits of predicting positive on negative data. A
tensor of shape `[batch_size(, num_classes)]`.
average_across_batch (bool): If set, average the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
average_across_classes (bool): If set, average the loss across the
class dimension (if exists). Must not set
`average_across_classes` and `sum_over_classes` at
the same time. Ignored if :attr:`logits` is a 1D Tensor.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_classes (bool): If set, sum the loss across the
class dimension. Must not set `average_across_classes`
and `sum_over_classes` at the same time. Ignored if
:attr:`logits` is a 1D Tensor.
return_pos_neg_losses (bool): If set, additionally returns the losses
on :attr:`pos_logits` and :attr:`neg_logits`, respectively.
name (str, optional): A name for the operation.
Returns:
By default, a Tensor containing the loss, of rank 0, 1, or 2 depending
on the arguments :attr:`{average_across}/{sum_over}_{batch}/{classes}`.
For example:
- If :attr:`sum_over_batch` and :attr:`average_across_classes` \
are `True` (default), the return Tensor is of rank 0.
- If arguments are `False`, the return Tensor is of shape \
`[batch_size(, num_classes)]`.
If :attr:`return_pos_neg_losses` is `True`, returns a tuple
`(loss, pos_loss, neg_loss)`, where `loss` is the loss above;
`pos_loss` is the loss on `pos_logits` only; and
`neg_loss` is the loss on `neg_logits` only. They have
`loss = pos_loss + neg_loss`.
"""
with tf.name_scope(name, "binary_sigmoid_cross_entropy"):
average_axes, sum_axes = [], []
average_axes += [0] if average_across_batch else []
average_axes += [1] if average_across_classes else []
sum_axes += [0] if sum_over_batch else []
sum_axes += [1] if sum_over_classes else []
pos_loss = 0
if pos_logits is not None:
pos_loss = tf.nn.sigmoid_cross_entropy_with_logits(
logits=pos_logits, labels=tf.ones_like(pos_logits))
pos_loss = reduce_dimensions(pos_loss, average_axes, sum_axes)
neg_loss = 0
if neg_logits is not None:
neg_loss = tf.nn.sigmoid_cross_entropy_with_logits(
logits=neg_logits, labels=tf.zeros_like(neg_logits))
neg_loss = reduce_dimensions(neg_loss, average_axes, sum_axes)
loss = pos_loss + neg_loss
if return_pos_neg_losses:
return loss, pos_loss, neg_loss
else:
return loss
def binary_sigmoid_cross_entropy_with_clas(clas_fn,
pos_inputs=None,
neg_inputs=None,
average_across_batch=True,
average_across_classes=True,
sum_over_batch=False,
sum_over_classes=False,
return_pos_neg_losses=False,
name=None):
"""Computes sigmoid cross entropy of binary classifier.
.. role:: python(code)
:language: python
Args:
clas_fn: A callable takes data (e.g., :attr:`pos_inputs` and
:attr:`neg_inputs`) and returns the logits of being positive. The
signature of `clas_fn` must be:
:python:`logits (, ...) = clas_fn(inputs)`.
The return value of `clas_fn` can be the logits, or
a tuple where the logits are the first element.
pos_inputs: The positive data fed into `clas_fn`.
neg_inputs: The negative data fed into `clas_fn`.
average_across_batch (bool): If set, average the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
average_across_classes (bool): If set, average the loss across the
class dimension (if exists). Must not set
`average_across_classes` and `sum_over_classes` at
the same time. Ignored if :attr:`logits` is a 1D Tensor.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
sum_over_classes (bool): If set, sum the loss across the
class dimension. Must not set `average_across_classes`
and `sum_over_classes` at the same time. Ignored if
:attr:`logits` is a 1D Tensor.
return_pos_neg_losses (bool): If set, additionally returns the losses
on :attr:`pos_logits` and :attr:`neg_logits`, respectively.
name (str, optional): A name for the operation.
Returns:
By default, a Tensor containing the loss, of rank 0, 1, or 2 depending
on the arguments :attr:`{average_across}/{sum_over}_{batch}/{classes}`.
For example:
- If :attr:`sum_over_batch` and :attr:`average_across_classes` \
are `True` (default), the return Tensor is of rank 0.
- If arguments are `False`, the return Tensor is of shape \
`[batch_size(, num_classes)]`.
If :attr:`return_pos_neg_losses` is `True`, returns a tuple
`(loss, pos_loss, neg_loss)`, where `loss` is the loss above;
`pos_loss` is the loss on `pos_logits` only; and
`neg_loss` is the loss on `neg_logits` only. They have
`loss = pos_loss + neg_loss`.
"""
pos_logits = None
if pos_inputs is not None:
pos_logits = clas_fn(pos_inputs)
if isinstance(pos_logits, (list, tuple)):
pos_logits = pos_logits[0]
neg_logits = None
if neg_inputs is not None:
neg_logits = clas_fn(neg_inputs)
if isinstance(neg_logits, (list, tuple)):
neg_logits = neg_logits[0]
return binary_sigmoid_cross_entropy(
pos_logits=pos_logits,
neg_logits=neg_logits,
average_across_batch=average_across_batch,
average_across_classes=average_across_classes,
sum_over_batch=sum_over_batch,
sum_over_classes=sum_over_classes,
return_pos_neg_losses=return_pos_neg_losses,
name=name)
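# As a sanity check on the `loss = pos_loss + neg_loss` identity documented
# above, here is a small standalone NumPy sketch (illustrative, not the Texar
# implementation) using the numerically stable sigmoid cross-entropy form:

```python
import numpy as np

def sigmoid_ce(logits, labels):
    # Stable form of -labels*log(sigmoid(x)) - (1-labels)*log(1-sigmoid(x)):
    # max(x, 0) - x * labels + log(1 + exp(-|x|))
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

pos_logits = np.array([2.0, -1.0])   # scores on positive (real) data
neg_logits = np.array([0.5, -3.0])   # scores on negative (fake) data

# Positive data is pushed toward label 1, negative data toward label 0
# (defaults: average across batch, no class dimension here).
pos_loss = sigmoid_ce(pos_logits, np.ones_like(pos_logits)).mean()
neg_loss = sigmoid_ce(neg_logits, np.zeros_like(neg_logits)).mean()
loss = pos_loss + neg_loss
```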
================================================
FILE: texar_repo/texar/losses/mle_losses_test.py
================================================
# -*- coding: utf-8 -*-
#
"""
Unit tests for mle losses.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=invalid-name
import numpy as np
import tensorflow as tf
import texar as tx
class MLELossesTest(tf.test.TestCase):
"""Tests mle losses.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._batch_size = 64
self._max_time = 16
self._num_classes = 100
self._labels = tf.ones([self._batch_size, self._max_time],
dtype=tf.int32)
one_hot_labels = tf.one_hot(
self._labels, self._num_classes, dtype=tf.float32)
self._one_hot_labels = tf.reshape(
one_hot_labels, [self._batch_size, self._max_time, -1])
self._logits = tf.random_uniform(
[self._batch_size, self._max_time, self._num_classes])
self._sequence_length = tf.random_uniform(
[self._batch_size], maxval=self._max_time, dtype=tf.int32)
def _test_sequence_loss(self, loss_fn, labels, logits, sequence_length):
with self.test_session() as sess:
loss = loss_fn(labels, logits, sequence_length)
rank = sess.run(tf.rank(loss))
self.assertEqual(rank, 0)
loss = loss_fn(
labels, logits, sequence_length, sum_over_timesteps=False)
rank = sess.run(tf.rank(loss))
self.assertEqual(rank, 1)
self.assertEqual(loss.shape, tf.TensorShape([self._max_time]))
loss = loss_fn(
labels, logits, sequence_length, sum_over_timesteps=False,
average_across_timesteps=True, average_across_batch=False)
rank = sess.run(tf.rank(loss))
self.assertEqual(rank, 1)
self.assertEqual(loss.shape, tf.TensorShape([self._batch_size]))
loss = loss_fn(
labels, logits, sequence_length, sum_over_timesteps=False,
average_across_batch=False)
rank = sess.run(tf.rank(loss))
self.assertEqual(rank, 2)
self.assertEqual(loss.shape,
tf.TensorShape([self._batch_size, self._max_time]))
sequence_length_time = tf.random_uniform(
[self._max_time], maxval=self._max_time, dtype=tf.int32)
loss = loss_fn(
labels, logits, sequence_length_time, sum_over_timesteps=False,
average_across_batch=False, time_major=True)
self.assertEqual(loss.shape,
tf.TensorShape([self._batch_size, self._max_time]))
def test_sequence_softmax_cross_entropy(self):
"""Tests `sequence_softmax_cross_entropy`
"""
self._test_sequence_loss(
tx.losses.sequence_softmax_cross_entropy,
self._one_hot_labels, self._logits, self._sequence_length)
def test_sequence_sparse_softmax_cross_entropy(self):
"""Tests `sequence_sparse_softmax_cross_entropy`
"""
self._test_sequence_loss(
tx.losses.sequence_sparse_softmax_cross_entropy,
self._labels, self._logits, self._sequence_length)
def test_sequence_sigmoid_cross_entropy(self):
"""Tests `texar.losses.sequence_sigmoid_cross_entropy`.
"""
self._test_sequence_loss(
tx.losses.sequence_sigmoid_cross_entropy,
self._one_hot_labels, self._logits, self._sequence_length)
self._test_sequence_loss(
tx.losses.sequence_sigmoid_cross_entropy,
self._one_hot_labels[:, :, 0],
self._logits[:, :, 0],
self._sequence_length)
labels = tf.placeholder(dtype=tf.int32, shape=None)
loss = tx.losses.sequence_sigmoid_cross_entropy(
logits=self._logits[:, :, 0],
labels=tf.to_float(labels),
sequence_length=self._sequence_length)
with self.test_session() as sess:
rank = sess.run(
tf.rank(loss),
feed_dict={labels: np.ones([self._batch_size, self._max_time])})
self.assertEqual(rank, 0)
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/losses/pg_losses.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various loss functions for policy gradients.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.losses.losses_utils import mask_and_reduce
from texar.utils.shapes import get_rank
# pylint: disable=too-many-arguments, protected-access
__all__ = [
"pg_loss_with_logits",
"pg_loss_with_log_probs"
]
def pg_loss_with_logits(actions,
logits,
advantages,
rank=None,
batched=False,
sequence_length=None,
average_across_batch=True,
average_across_timesteps=False,
average_across_remaining=False,
sum_over_batch=False,
sum_over_timesteps=True,
sum_over_remaining=True,
time_major=False):
"""Policy gradient loss with logits. Used for discrete actions.
`pg_loss = reduce( advantages * -log_prob( actions ) )`,
where `advantages` and `actions` do not back-propagate gradients.
All arguments except :attr:`logits` and :attr:`actions` are the same with
:func:`pg_loss_with_log_probs`.
Args:
actions: Tensor of shape
`[(batch_size,) max_time, d_3, ..., d_rank]` and of dtype
`int32` or `int64`.
The rank of the Tensor is specified with :attr:`rank`.
The batch dimension exists only if :attr:`batched` is `True`.
The batch and time dimensions
are exchanged, i.e., `[max_time, batch_size, ...]` if
:attr:`time_major` is `True`.
logits: Unscaled log probabilities of shape
`[(batch_size,) max_time, d_3, ..., d_{rank+1}]`
and dtype `float32` or `float64`.
The batch and time dimensions are exchanged if `time_major`
is `True`.
advantages: Tensor of shape
`[(batch_size,) max_time, d_3, ..., d_rank]` and
dtype `float32` or `float64`.
The batch and time dimensions are exchanged if `time_major`
is `True`.
rank (int, optional): The rank of :attr:`actions`.
If `None` (default), rank is automatically inferred from
`actions` or `advantages`. If the inference fails,
`rank` is set to 1 if :attr:`batched` is `False`,
and set to 2 if :attr:`batched` is `True`.
batched (bool): `True` if the inputs are batched.
sequence_length (optional): A Tensor of shape `[batch_size]`.
Time steps beyond the respective sequence lengths will have zero
losses. Used if :attr:`batched` is `True`.
average_across_timesteps (bool): If set, average the loss across
the time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
average_across_batch (bool): If set, average the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
Ignored if `batched` is `False`.
average_across_remaining (bool): If set, average the sequence across the
remaining dimensions. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time. Ignored if
no more dimensions other than the batch and time dimensions.
sum_over_timesteps (bool): If set, sum the loss across the
time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
Ignored if `batched` is `False`.
sum_over_remaining (bool): If set, sum the loss across the
remaining dimension. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time. Ignored if
no more dimensions other than the batch and time dimensions.
time_major (bool): The shape format of the inputs. If `True`,
:attr:`logits`, :attr:`actions` and :attr:`advantages` must
have shape `[max_time, batch_size, ...]`. If `False` (default),
they must have shape `[batch_size, max_time, ...]`.
Ignored if `batched` is `False`.
Returns:
A Tensor containing the loss to minimize, whose rank depends on the
reduce arguments. For example, the batch dimension is reduced if
either :attr:`average_across_batch` or :attr:`sum_over_batch` is
`True`, which decreases the rank of output tensor by 1.
"""
actions = tf.stop_gradient(actions)
neg_log_probs = tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=logits, labels=actions)
return pg_loss_with_log_probs(
log_probs=-neg_log_probs,
advantages=advantages,
rank=rank,
batched=batched,
sequence_length=sequence_length,
average_across_batch=average_across_batch,
average_across_timesteps=average_across_timesteps,
average_across_remaining=average_across_remaining,
sum_over_batch=sum_over_batch,
sum_over_timesteps=sum_over_timesteps,
sum_over_remaining=sum_over_remaining,
time_major=time_major)
def pg_loss_with_log_probs(log_probs,
advantages,
rank=None,
batched=False,
sequence_length=None,
average_across_batch=True,
average_across_timesteps=False,
average_across_remaining=False,
sum_over_batch=False,
sum_over_timesteps=True,
sum_over_remaining=True,
time_major=False):
"""Policy gradient loss with log probs of actions.
`pg_loss = reduce( advantages * -log_probs )`,
where `advantages` does not back-propagate gradients.
All arguments except :attr:`log_probs` are the same as
:func:`pg_loss_with_logits`.
Args:
log_probs: Log probabilities of shape
`[(batch_size,) max_time, ..., d_rank]` and dtype `float32`
or `float64`. The rank of the Tensor is specified
with :attr:`rank`.
The batch dimension exists only if :attr:`batched` is `True`.
The batch and time dimensions are exchanged, i.e.,
`[max_time, batch_size, ...]` if :attr:`time_major` is `True`.
advantages: Tensor of shape
`[(batch_size,) max_time, d_3, ..., d_rank]` and
dtype `float32` or `float64`.
The batch dimension exists only if `batched` is `True`.
The batch and time dimensions
are exchanged if `time_major` is `True`.
rank (int, optional): The rank of :attr:`log_probs`.
If `None` (default), rank is automatically inferred from
`log_probs` or `advantages`. If the inference fails,
`rank` is set to 1 if :attr:`batched` is `False`,
and set to 2 if :attr:`batched` is `True`.
batched (bool): `True` if the inputs are batched.
sequence_length (optional): A Tensor of shape `[batch_size]`.
Time steps beyond the respective sequence lengths will have zero
losses. Used if :attr:`batched` is `True`.
average_across_timesteps (bool): If set, average the loss across
the time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
average_across_batch (bool): If set, average the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
Ignored if `batched` is `False`.
average_across_remaining (bool): If set, average the sequence across the
remaining dimensions. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time. Ignored if
no more dimensions other than the batch and time dimensions.
sum_over_timesteps (bool): If set, sum the loss across the
time dimension. Must not set `average_across_timesteps`
and `sum_over_timesteps` at the same time.
sum_over_batch (bool): If set, sum the loss across the
batch dimension. Must not set `average_across_batch`
and `sum_over_batch` at the same time.
Ignored if `batched` is `False`.
sum_over_remaining (bool): If set, sum the loss across the
remaining dimension. Must not set `average_across_remaining`
and `sum_over_remaining` at the same time. Ignored if
no more dimensions other than the batch and time dimensions.
time_major (bool): The shape format of the inputs. If `True`,
:attr:`log_probs` and :attr:`advantages` must have shape
`[max_time, batch_size, ...]`. If `False` (default),
they must have shape `[batch_size, max_time, ...]`.
Ignored if :attr:`batched` is `False`.
Returns:
A Tensor containing the loss to minimize, whose rank depends on the
reduce arguments. For example, the batch dimension is reduced if
either :attr:`average_across_batch` or :attr:`sum_over_batch` is
`True`, which decreases the rank of output tensor by 1.
"""
advantages = tf.stop_gradient(advantages)
losses = -log_probs * advantages
if rank is None:
rank = get_rank(log_probs) or get_rank(advantages)
if rank is None:
rank = 2 if batched else 1
if batched:
losses = mask_and_reduce(
losses,
sequence_length,
rank=rank,
average_across_batch=average_across_batch,
average_across_timesteps=average_across_timesteps,
average_across_remaining=average_across_remaining,
sum_over_batch=sum_over_batch,
sum_over_timesteps=sum_over_timesteps,
sum_over_remaining=sum_over_remaining,
time_major=time_major)
elif rank > 1:
if average_across_remaining and sum_over_remaining:
raise ValueError("Only one of `average_across_remaining` and "
"`sum_over_remaining` can be set.")
if average_across_remaining:
losses = tf.reduce_mean(losses, axis=list(range(1, rank)))
elif sum_over_remaining:
losses = tf.reduce_sum(losses, axis=list(range(1, rank)))
if not batched:
if average_across_timesteps and sum_over_timesteps:
raise ValueError("Only one of `average_across_timesteps` and "
"`sum_over_timesteps` can be set.")
if average_across_timesteps:
losses = tf.reduce_mean(losses)
elif sum_over_timesteps:
losses = tf.reduce_sum(losses)
return losses
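# To make the default batched reduction concrete (`sum_over_timesteps=True`,
# `average_across_batch=True`), here is an illustrative NumPy version of
# `pg_loss = reduce(advantages * -log_probs)` with length masking. The
# stop-gradient behavior is noted only in a comment, since NumPy has no autograd:

```python
import numpy as np

# Log-probabilities of the sampled actions, shape [batch_size, max_time].
log_probs = np.log(np.array([[0.5, 0.25],
                             [0.1, 0.8]]))
# In the TF version, advantages are wrapped in tf.stop_gradient.
advantages = np.array([[1.0, 2.0],
                       [0.5, 1.0]])
sequence_length = np.array([2, 1])

# Zero out time steps at or beyond each sequence's length.
mask = np.arange(log_probs.shape[1])[None, :] < sequence_length[:, None]
losses = -log_probs * advantages * mask

# Default reduction: sum over time, then average over the batch.
pg_loss = losses.sum(axis=1).mean()
```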
================================================
FILE: texar_repo/texar/losses/rewards.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various reward related functions.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
from texar.utils.shapes import mask_sequences
# pylint: disable=invalid-name, too-many-arguments, no-member
__all__ = [
"discount_reward",
"_discount_reward_py_1d",
"_discount_reward_tensor_1d",
"_discount_reward_py_2d",
"_discount_reward_tensor_2d"
]
def discount_reward(reward,
sequence_length=None,
discount=1.,
normalize=False,
dtype=None,
tensor_rank=1):
"""Computes discounted reward.
:attr:`reward` and :attr:`sequence_length` can be either Tensors or python
arrays. If both are python array (or `None`), the return will be a python
array as well. Otherwise tf Tensors are returned.
Args:
reward: A Tensor or python array. Can be 1D with shape `[batch_size]`,
or 2D with shape `[batch_size, max_time]`.
sequence_length (optional): A Tensor or python array of shape
`[batch_size]`. Time steps beyond the respective sequence lengths
will be masked. Required if :attr:`reward` is 1D.
discount (float): A scalar. The discount factor.
normalize (bool): Whether to normalize the discounted reward, by
`(discounted_reward - mean) / std`. Here `mean` and `std` are
over all time steps and all samples in the batch.
dtype (dtype): Type of :attr:`reward`. If `None`, infer from
`reward` automatically.
tensor_rank (int): The number of dimensions of :attr:`reward`.
Default is 1, i.e., :attr:`reward` is a 1D Tensor consisting
of a batch dimension. Ignored if :attr:`reward`
and :attr:`sequence_length` are python arrays (or `None`).
Returns:
A 2D Tensor or python array of the discounted reward.
If :attr:`reward` and :attr:`sequence_length` are python
arrays (or `None`), the returned value is a python array as well.
Example:
.. code-block:: python
r = [2., 1.]
seq_length = [3, 2]
discounted_r = discount_reward(r, seq_length, discount=0.1)
# discounted_r == [[2. * 0.1^2, 2. * 0.1, 2.],
# [1. * 0.1, 1., 0.]]
r = [[3., 4., 5.], [6., 7., 0.]]
seq_length = [3, 2]
discounted_r = discount_reward(r, seq_length, discount=0.1)
# discounted_r == [[3. + 4.*0.1 + 5.*0.1^2, 4. + 5.*0.1, 5.],
# [6. + 7.*0.1, 7., 0.]]
"""
is_tensor = tf.contrib.framework.is_tensor
if is_tensor(reward) or is_tensor(sequence_length):
if tensor_rank == 1:
disc_reward = _discount_reward_tensor_1d(
reward, sequence_length, discount, dtype)
elif tensor_rank == 2:
disc_reward = _discount_reward_tensor_2d(
reward, sequence_length, discount, dtype)
else:
raise ValueError("`tensor_rank` can only be 1 or 2.")
if normalize:
mu, var = tf.nn.moments(disc_reward, axes=[0, 1], keep_dims=True)
disc_reward = (disc_reward - mu) / (tf.sqrt(var) + 1e-8)
else:
reward = np.array(reward)
tensor_rank = reward.ndim
if tensor_rank == 1:
disc_reward = _discount_reward_py_1d(
reward, sequence_length, discount, dtype)
elif tensor_rank == 2:
disc_reward = _discount_reward_py_2d(
reward, sequence_length, discount, dtype)
else:
raise ValueError("`reward` can only be 1D or 2D.")
if normalize:
mu = np.mean(disc_reward)
std = np.std(disc_reward)
disc_reward = (disc_reward - mu) / (std + 1e-8)
return disc_reward
def _discount_reward_py_1d(reward, sequence_length, discount=1., dtype=None):
if sequence_length is None:
raise ValueError('sequence_length must not be `None` for 1D reward.')
reward = np.array(reward)
sequence_length = np.array(sequence_length)
batch_size = reward.shape[0]
max_seq_length = np.max(sequence_length)
dtype = dtype or reward.dtype
if discount == 1.:
dmat = np.ones([batch_size, max_seq_length], dtype=dtype)
else:
steps = np.tile(np.arange(max_seq_length), [batch_size, 1])
mask = np.asarray(steps < (sequence_length-1)[:, None], dtype=dtype)
# Make each row = [discount, ..., discount, 1, ..., 1]
dmat = mask * discount + (1 - mask)
dmat = np.cumprod(dmat[:, ::-1], axis=1)[:, ::-1]
disc_reward = dmat * reward[:, None]
disc_reward = mask_sequences(disc_reward, sequence_length, dtype=dtype)
#mask = np.asarray(steps < sequence_length[:, None], dtype=dtype)
#disc_reward = mask * disc_reward
return disc_reward
def _discount_reward_tensor_1d(reward, sequence_length,
discount=1., dtype=None):
if sequence_length is None:
raise ValueError('sequence_length must not be `None` for 1D reward.')
batch_size = tf.shape(reward)[0]
max_seq_length = tf.reduce_max(sequence_length)
dtype = dtype or reward.dtype
if discount == 1.:
dmat = tf.ones(
tf.concat([[batch_size], [max_seq_length]], 0), dtype=dtype)
else:
mask = tf.sequence_mask(sequence_length, dtype=dtype)
mask = tf.concat([mask[:, 1:], tf.zeros_like(mask[:, -1:])], axis=1)
# Make each row = [discount, ..., discount, 1, ..., 1]
dmat = mask * discount + (1 - mask)
dmat = tf.cumprod(dmat, axis=1, reverse=True)
disc_reward = dmat * tf.expand_dims(reward, -1)
disc_reward = mask_sequences(
disc_reward, sequence_length, dtype=dtype, tensor_rank=2)
return disc_reward
def _discount_reward_py_2d(reward, sequence_length=None,
discount=1., dtype=None):
if sequence_length is not None:
reward = mask_sequences(reward, sequence_length, dtype=dtype)
dtype = dtype or reward.dtype
if discount == 1.:
disc_reward = np.cumsum(
reward[:, ::-1], axis=1, dtype=dtype)[:, ::-1]
else:
disc_reward = np.copy(reward)
for i in range(reward.shape[1]-2, -1, -1):
disc_reward[:, i] += disc_reward[:, i+1] * discount
return disc_reward
def _discount_reward_tensor_2d(reward, sequence_length=None,
discount=1., dtype=None):
if sequence_length is not None:
reward = mask_sequences(
reward, sequence_length, dtype=dtype, tensor_rank=2)
if discount == 1.:
disc_reward = tf.cumsum(reward, axis=1, reverse=True)
else:
# [max_time, batch_size]
rev_reward_T = tf.transpose(tf.reverse(reward, [1]), [1, 0])
rev_reward_T_cum = tf.scan(
fn=lambda acc, cur: cur + discount * acc,
elems=rev_reward_T,
initializer=tf.zeros_like(reward[:, 1]),
back_prop=False)
disc_reward = tf.reverse(
tf.transpose(rev_reward_T_cum, [1, 0]), [1])
return disc_reward
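# The 2D example in the `discount_reward` docstring can be checked with a short
# standalone NumPy reimplementation of the reward-to-go recurrence
# disc[t] = r[t] + discount * disc[t+1] used in `_discount_reward_py_2d`
# (illustrative sketch only):

```python
import numpy as np

def discount_2d(reward, lengths, discount):
    """Reward-to-go with length masking: disc[t] = r[t] + discount * disc[t+1]."""
    reward = np.array(reward, dtype=float)
    # Zero out rewards at or beyond each sequence's length.
    mask = np.arange(reward.shape[1])[None, :] < np.asarray(lengths)[:, None]
    reward = reward * mask
    disc = reward.copy()
    # Accumulate discounted future reward from right to left.
    for t in range(reward.shape[1] - 2, -1, -1):
        disc[:, t] += discount * disc[:, t + 1]
    return disc

r = [[3., 4., 5.], [6., 7., 0.]]
out = discount_2d(r, lengths=[3, 2], discount=0.1)
# out[0] == [3 + 4*0.1 + 5*0.01, 4 + 5*0.1, 5]
# out[1] == [6 + 7*0.1, 7, 0]
```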
================================================
FILE: texar_repo/texar/losses/rewards_test.py
================================================
"""
Unit tests for RL rewards.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=invalid-name, no-member
import numpy as np
import tensorflow as tf
from texar.losses.rewards import \
_discount_reward_tensor_2d, _discount_reward_tensor_1d, \
_discount_reward_py_1d, _discount_reward_py_2d, \
discount_reward
class RewardTest(tf.test.TestCase):
"""Tests reward related functions.
"""
def test_discount_reward(self):
"""Tests :func:`texar.losses.rewards.discount_reward`
"""
# 1D
reward = np.ones([2], dtype=np.float64)
sequence_length = [3, 5]
discounted_reward = discount_reward(
reward, sequence_length, discount=1.)
discounted_reward_n = discount_reward(
reward, sequence_length, discount=.1, normalize=True)
discounted_reward_ = discount_reward(
tf.constant(reward, dtype=tf.float64),
sequence_length, discount=1.)
discounted_reward_n_ = discount_reward(
tf.constant(reward, dtype=tf.float64),
sequence_length, discount=.1, normalize=True)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
r, r_n = sess.run([discounted_reward_, discounted_reward_n_])
np.testing.assert_array_almost_equal(
discounted_reward, r, decimal=6)
np.testing.assert_array_almost_equal(
discounted_reward_n, r_n, decimal=6)
# 2D
reward = np.ones([2, 10], dtype=np.float64)
sequence_length = [5, 10]
discounted_reward = discount_reward(
reward, sequence_length, discount=1.)
discounted_reward_n = discount_reward(
reward, sequence_length, discount=.1, normalize=True)
discounted_reward_ = discount_reward(
tf.constant(reward, dtype=tf.float64), sequence_length,
discount=1., tensor_rank=2)
discounted_reward_n_ = discount_reward(
tf.constant(reward, dtype=tf.float64), sequence_length,
discount=.1, tensor_rank=2, normalize=True)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
r, r_n = sess.run([discounted_reward_, discounted_reward_n_])
np.testing.assert_array_almost_equal(
discounted_reward, r, decimal=6)
np.testing.assert_array_almost_equal(
discounted_reward_n, r_n, decimal=6)
def test_discount_reward_py_1d(self):
"""Tests :func:`texar.losses.rewards._discount_reward_py_1d`
"""
reward = np.ones([2], dtype=np.float64)
sequence_length = [3, 5]
discounted_reward_1 = _discount_reward_py_1d(
reward, sequence_length, discount=1.)
discounted_reward_2 = _discount_reward_py_1d(
reward, sequence_length, discount=.1)
r = discounted_reward_1
for i in range(5):
if i < 3:
self.assertEqual(r[0, i], 1)
else:
self.assertEqual(r[0, i], 0)
self.assertEqual(r[1, i], 1)
r = discounted_reward_2
for i in range(5):
if i < 3:
self.assertAlmostEqual(r[0, i], 0.1**(2-i))
else:
self.assertAlmostEqual(r[0, i], 0)
self.assertAlmostEqual(r[1, i], 0.1**(4-i))
def test_discount_reward_tensor_1d(self):
"""Tests :func:`texar.losses.rewards._discount_reward_tensor_1d`
"""
reward = tf.ones([2], dtype=tf.float64)
sequence_length = [3, 5]
discounted_reward_1 = _discount_reward_tensor_1d(
reward, sequence_length, discount=1.)
discounted_reward_2 = _discount_reward_tensor_1d(
reward, sequence_length, discount=.1)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
r = sess.run(discounted_reward_1)
for i in range(5):
if i < 3:
self.assertEqual(r[0, i], 1)
else:
self.assertEqual(r[0, i], 0)
self.assertEqual(r[1, i], 1)
r = sess.run(discounted_reward_2)
for i in range(5):
if i < 3:
self.assertAlmostEqual(r[0, i], 0.1**(2-i))
else:
self.assertAlmostEqual(r[0, i], 0)
self.assertAlmostEqual(r[1, i], 0.1**(4-i))
def test_discount_reward_py_2d(self):
"""Tests :func:`texar.losses.rewards._discount_reward_py_2d`
"""
reward = np.ones([2, 10], dtype=np.float64)
sequence_length = [5, 10]
discounted_reward_1 = _discount_reward_py_2d(
reward, sequence_length, discount=1.)
discounted_reward_2 = _discount_reward_py_2d(
reward, sequence_length, discount=.1)
r = discounted_reward_1
for i in range(10):
if i < 5:
self.assertEqual(r[0, i], 5 - i)
else:
self.assertEqual(r[0, i], 0)
self.assertEqual(r[1, i], 10 - i)
r = discounted_reward_2
for i in range(10):
if i < 5:
self.assertEqual(r[0, i], int(11111./10**i) / 10**(4-i))
else:
self.assertEqual(r[0, i], 0)
self.assertEqual(r[1, i], int(1111111111./10**i) / 10**(9-i))
def test_discount_reward_tensor_2d(self):
"""Tests :func:`texar.losses.rewards._discount_reward_tensor_2d`
"""
reward = tf.ones([2, 10], dtype=tf.float64)
sequence_length = [5, 10]
discounted_reward_1 = _discount_reward_tensor_2d(
reward, sequence_length, discount=1.)
discounted_reward_2 = _discount_reward_tensor_2d(
reward, sequence_length, discount=.1)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
r = sess.run(discounted_reward_1)
for i in range(10):
if i < 5:
self.assertEqual(r[0, i], 5 - i)
else:
self.assertEqual(r[0, i], 0)
self.assertEqual(r[1, i], 10 - i)
r = sess.run(discounted_reward_2)
for i in range(10):
if i < 5:
self.assertEqual(r[0, i], int(11111./10**i) / 10**(4-i))
else:
self.assertEqual(r[0, i], 0)
self.assertEqual(r[1, i], int(1111111111./10**i) / 10**(9-i))
if __name__ == "__main__":
tf.test.main()
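The expected values asserted in the tests above follow from a right-to-left cumulative sum of discounted future rewards, with positions at or beyond each sequence's length zeroed out. A minimal pure-Python sketch of that computation (a hypothetical helper for illustration, not part of the Texar API):

```python
def discount_reward_2d(reward, sequence_length, discount=1.0):
    """Right-to-left cumulative sum of discounted future rewards.

    Positions at or beyond each sequence's length are zeroed,
    matching the masking behavior exercised by the tests above.
    """
    out = []
    for row, length in zip(reward, sequence_length):
        acc = 0.0
        rev = []
        # Walk from the last time step backwards, accumulating
        # reward[t] + discount * (discounted future reward).
        for t in range(len(row) - 1, -1, -1):
            r_t = row[t] if t < length else 0.0
            acc = r_t + discount * acc
            rev.append(acc if t < length else 0.0)
        out.append(rev[::-1])
    return out

r = discount_reward_2d([[1.0] * 10, [1.0] * 10], [5, 10], discount=1.0)
# With discount 1 and unit rewards, position i holds the count of
# remaining valid steps, e.g. r[0] starts [5, 4, 3, 2, 1, 0, ...].
```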
================================================
FILE: texar_repo/texar/losses/rl_losses.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various RL losses
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.losses.mle_losses import _mask_sequences
def reinforce_loss(sample_fn,
global_reward_fn,
local_reward_fn=None,
num_samples=1):
"""Computes REINFORCE loss with global and local rewards.
Args:
sample_fn: A callable that takes :attr:`num_samples` and returns
`(samples, probabilities, sequence_lengths)`, where:
`samples` is a Tensor of shape `[num_samples, max_sequence_length]`
containing the generated samples;
`probabilities` is a Tensor of shape
`[num_samples, max_sequence_length]` containing the probabilities of
generating each position of the samples. Probabilities beyond the
respective sequence lengths are ignored.
`sequence_lengths` is a Tensor of shape `[num_samples]` containing
the length of each sample.
global_reward_fn: A callable that takes `(samples, sequence_lengths)`
and returns a Tensor of shape `[num_samples]` containing the reward
of each of the samples.
local_reward_fn (optional): A callable that takes
`(samples, sequence_lengths)` and returns a Tensor of shape
`[num_samples, max_sequence_length]` containing the local reward
at each time step of samples.
num_samples (int scalar Tensor): the number of sequences to sample.
Returns:
A scalar Tensor of the REINFORCE loss.
"""
# shape = [batch, length]
sequences, probs, seq_lens = sample_fn(num_samples)
# tf.shape returns a 1-D tensor, which cannot be tuple-unpacked in graph mode
batch = tf.shape(sequences)[0]
rewards_local = tf.constant(0., dtype=probs.dtype, shape=probs.shape)
if local_reward_fn is not None:
rewards_local = local_reward_fn(sequences, seq_lens)
# shape = [batch, ]
rewards_global = global_reward_fn(sequences, seq_lens)
# add broadcast to rewards_global to match the shape of rewards_local
rewards = rewards_local + tf.reshape(rewards_global, [batch, 1])
eps = 1e-12
log_probs = _mask_sequences(tf.log(probs + eps), seq_lens)
loss = - tf.reduce_mean(
tf.reduce_sum(log_probs * rewards, axis=1)
/ tf.cast(seq_lens, log_probs.dtype))
return loss
def reinforce_loss_with_MCtree(sample_fn, # pylint: disable=invalid-name
global_reward_fn,
local_reward_fn=None,
num_samples=1):
"""Computes REINFORCE loss with Monte Carlo tree search.
Args:
sample_fn: A callable that takes :attr:`num_samples`, 'given_actions'
and returns `(samples, probabilities, sequence_lengths)`, where:
`samples` is a Tensor of shape `[num_samples, max_sequence_length]`
containing the generated samples;
`probabilities` is a Tensor of shape
`[num_samples, max_sequence_length]` containing the probabilities of
generating each position of the samples. Probabilities beyond the
respective sequence lengths are ignored.
`sequence_lengths` is a Tensor of shape `[num_samples]` containing
the length of each sample.
global_reward_fn: A callable that takes `(samples, sequence_lengths)`
and returns a Tensor of shape `[num_samples]` containing the reward
of each of the samples.
local_reward_fn (optional): A callable that takes
`(samples, sequence_lengths)` and returns a Tensor of shape
`[num_samples, max_sequence_length]` containing the local reward
at each time step of samples.
num_samples (int scalar Tensor): the number of sequences to sample.
Returns:
A scalar Tensor of the REINFORCE loss.
"""
raise NotImplementedError
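Stripped of TensorFlow, the quantity computed by `reinforce_loss` reduces to a few lines of arithmetic: mask the log-probabilities to each sequence's length, broadcast the global reward across time steps, and average the length-normalized sums. A hypothetical pure-Python sketch with toy numbers (not the Texar API):

```python
import math

def reinforce_loss_py(probs, seq_lens, rewards_global,
                      rewards_local=None, eps=1e-12):
    """loss = -mean_b( sum_t log p(b,t) * (local(b,t) + global(b)) / len(b) )"""
    batch = len(probs)
    max_len = len(probs[0])
    if rewards_local is None:
        rewards_local = [[0.0] * max_len for _ in range(batch)]
    total = 0.0
    for b in range(batch):
        s = 0.0
        for t in range(seq_lens[b]):  # masking: only positions within the length
            reward = rewards_local[b][t] + rewards_global[b]  # broadcast global
            s += math.log(probs[b][t] + eps) * reward
        total += s / seq_lens[b]
    return -total / batch

# One sampled sequence of length 2, each token with probability 0.5,
# global reward 1: the loss is log(2).
loss = reinforce_loss_py([[0.5, 0.5]], [2], [1.0])
```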
================================================
FILE: texar_repo/texar/models/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library models.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.models.model_base import *
from texar.models.seq2seq import *
================================================
FILE: texar_repo/texar/models/model_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for models.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar import HParams
# pylint: disable=too-many-arguments
__all__ = [
"ModelBase"
]
class ModelBase(object):
"""Base class inherited by all model classes.
A model class implements interfaces that are compatible with
:tf_main:`TF Estimator `. In particular,
:meth:`_build` implements the
:tf_main:`model_fn ` interface; and
:meth:`get_input_fn` is for the :attr:`input_fn` interface.
.. document private functions
.. automethod:: _build
"""
def __init__(self, hparams=None):
self._hparams = HParams(hparams, self.default_hparams(),
allow_new_hparam=True)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
"""
hparams = {
"name": "model"
}
return hparams
def __call__(self, features, labels, params, mode, config=None):
"""Used for the :tf_main:`model_fn `
argument when constructing
:tf_main:`tf.estimator.Estimator `.
"""
return self._build(features, labels, params, mode, config=config)
def _build(self, features, labels, params, mode, config=None):
"""Used for the :tf_main:`model_fn `
argument when constructing
:tf_main:`tf.estimator.Estimator `.
"""
raise NotImplementedError
def get_input_fn(self, *args, **kwargs):
"""Returns the :attr:`input_fn` function that constructs the input
data, used in :tf_main:`tf.estimator.Estimator `.
"""
raise NotImplementedError
@property
def hparams(self):
"""A :class:`~texar.HParams` instance. The hyperparameters
of the module.
"""
return self._hparams
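The `__call__`/`_build` split above exists so that a model instance can be passed directly as the `model_fn` argument of `tf.estimator.Estimator`. A framework-free sketch of the same pattern (hypothetical stand-in class, with a plain dict in place of `HParams`):

```python
class ToyModel:
    """Mimics the ModelBase contract: defaults merged with user hparams,
    and __call__ delegating to _build with the model_fn signature."""

    def __init__(self, hparams=None):
        merged = self.default_hparams()
        merged.update(hparams or {})  # user values override defaults
        self._hparams = merged

    @staticmethod
    def default_hparams():
        return {"name": "model", "hidden_dim": 16}

    def __call__(self, features, labels, params, mode, config=None):
        return self._build(features, labels, params, mode, config=config)

    def _build(self, features, labels, params, mode, config=None):
        # A real subclass would build the graph here and return an
        # EstimatorSpec; this toy just echoes what it received.
        return {"mode": mode, "n_features": len(features)}

m = ToyModel({"hidden_dim": 32})
```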
================================================
FILE: texar_repo/texar/models/seq2seq/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library seq2seq models.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.models.seq2seq.seq2seq_base import *
from texar.models.seq2seq.basic_seq2seq import *
================================================
FILE: texar_repo/texar/models/seq2seq/basic_seq2seq.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The basic seq2seq model without attention.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.models.seq2seq.seq2seq_base import Seq2seqBase
from texar.modules.decoders.beam_search_decode import beam_search_decode
from texar.utils import utils
from texar.utils.shapes import get_batch_size
# pylint: disable=protected-access, too-many-arguments, unused-argument
__all__ = [
"BasicSeq2seq"
]
class BasicSeq2seq(Seq2seqBase):
"""The basic seq2seq model (without attention).
Example:
.. code-block:: python
model = BasicSeq2seq(data_hparams, model_hparams)
exor = tx.run.Executor(
model=model,
data_hparams=data_hparams,
config=run_config)
exor.train_and_evaluate(
max_train_steps=10000,
eval_steps=100)
.. document private functions
.. automethod:: _build
"""
def __init__(self, data_hparams, hparams=None):
Seq2seqBase.__init__(self, data_hparams, hparams=hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
Same as :meth:`~texar.models.Seq2seqBase.default_hparams` of
:class:`~texar.models.Seq2seqBase`.
"""
hparams = Seq2seqBase.default_hparams()
hparams.update({
"name": "basic_seq2seq"
})
return hparams
def _build_decoder(self):
kwargs = {
"vocab_size": self._tgt_vocab.size,
"hparams": self._hparams.decoder_hparams.todict()
}
self._decoder = utils.check_or_get_instance(
self._hparams.decoder, kwargs,
["texar.modules", "texar.custom"])
def _get_predictions(self, decoder_results, features, labels, loss=None):
preds = {}
preds.update(features)
if labels is not None:
preds.update(labels)
preds.update(utils.flatten_dict({'decode': decoder_results}))
preds['decode.outputs.sample'] = self._tgt_vocab.map_ids_to_tokens(
preds['decode.outputs.sample_id'])
if loss is not None:
preds['loss'] = loss
return preds
def embed_source(self, features, labels, mode):
"""Embeds the inputs.
"""
return self._src_embedder(ids=features["source_text_ids"], mode=mode)
def embed_target(self, features, labels, mode):
"""Embeds the target inputs. Used in training.
"""
return self._tgt_embedder(ids=labels["target_text_ids"], mode=mode)
def encode(self, features, labels, mode):
"""Encodes the inputs.
"""
embedded_source = self.embed_source(features, labels, mode)
outputs, final_state = self._encoder(
embedded_source,
sequence_length=features["source_length"],
mode=mode)
return {'outputs': outputs, 'final_state': final_state}
def _connect(self, encoder_results, features, labels, mode):
"""Transforms encoder final state into decoder initial state.
"""
enc_state = encoder_results["final_state"]
possible_kwargs = {
"inputs": enc_state,
"batch_size": get_batch_size(enc_state)
}
outputs = utils.call_function_with_redundant_kwargs(
self._connector._build, possible_kwargs)
return outputs
def _decode_train(self, initial_state, encoder_results, features,
labels, mode):
return self._decoder(
initial_state=initial_state,
decoding_strategy=self._hparams.decoding_strategy_train,
inputs=self.embed_target(features, labels, mode),
sequence_length=labels['target_length']-1,
mode=mode)
def _decode_infer(self, initial_state, encoder_results, features,
labels, mode):
start_token = self._tgt_vocab.bos_token_id
start_tokens = tf.ones_like(features['source_length']) * start_token
max_l = self._decoder.hparams.max_decoding_length_infer
if self._hparams.beam_search_width > 1:
return beam_search_decode(
decoder_or_cell=self._decoder,
embedding=self._tgt_embedder.embedding,
start_tokens=start_tokens,
end_token=self._tgt_vocab.eos_token_id,
beam_width=self._hparams.beam_search_width,
initial_state=initial_state,
max_decoding_length=max_l)
else:
return self._decoder(
initial_state=initial_state,
decoding_strategy=self._hparams.decoding_strategy_infer,
embedding=self._tgt_embedder.embedding,
start_tokens=start_tokens,
end_token=self._tgt_vocab.eos_token_id,
mode=mode)
def decode(self, encoder_results, features, labels, mode):
"""Decodes.
"""
initial_state = self._connect(encoder_results, features, labels, mode)
if mode == tf.estimator.ModeKeys.PREDICT:
outputs, final_state, sequence_length = self._decode_infer(
initial_state, encoder_results, features, labels, mode)
else:
outputs, final_state, sequence_length = self._decode_train(
initial_state, encoder_results, features, labels, mode)
return {'outputs': outputs,
'final_state': final_state,
'sequence_length': sequence_length}
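The decoding path in `BasicSeq2seq` is chosen in two stages: `decode` branches on the Estimator mode, and `_decode_infer` additionally branches on `beam_search_width`. A small sketch of that selection logic (hypothetical helper, string labels standing in for the actual decoder calls):

```python
def pick_decode_path(mode, beam_search_width):
    """Mirror BasicSeq2seq's branching: beam search applies only at
    inference (PREDICT) time and only when beam_search_width > 1."""
    if mode != "predict":
        return "train_decode"          # teacher-forced training decode
    if beam_search_width > 1:
        return "beam_search"           # beam_search_decode(...)
    return "greedy_infer"              # decoding_strategy_infer
```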
================================================
FILE: texar_repo/texar/models/seq2seq/seq2seq_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for seq2seq models.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.models.model_base import ModelBase
from texar.losses.mle_losses import sequence_sparse_softmax_cross_entropy
from texar.data.data.paired_text_data import PairedTextData
from texar.core.optimization import get_train_op
from texar import HParams
from texar.utils import utils
from texar.utils.variables import collect_trainable_variables
# pylint: disable=too-many-instance-attributes, unused-argument,
# pylint: disable=too-many-arguments, no-self-use
__all__ = [
"Seq2seqBase"
]
class Seq2seqBase(ModelBase):
"""Base class inherited by all seq2seq model classes.
.. document private functions
.. automethod:: _build
"""
def __init__(self, data_hparams, hparams=None):
ModelBase.__init__(self, hparams)
self._data_hparams = HParams(data_hparams,
PairedTextData.default_hparams())
self._src_vocab = None
self._tgt_vocab = None
self._src_embedder = None
self._tgt_embedder = None
self._connector = None
self._encoder = None
self._decoder = None
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"source_embedder": "WordEmbedder",
"source_embedder_hparams": {},
"target_embedder": "WordEmbedder",
"target_embedder_hparams": {},
"embedder_share": True,
"embedder_hparams_share": True,
"encoder": "UnidirectionalRNNEncoder",
"encoder_hparams": {},
"decoder": "BasicRNNDecoder",
"decoder_hparams": {},
"decoding_strategy_train": "train_greedy",
"decoding_strategy_infer": "infer_greedy",
"beam_search_width": 0,
"connector": "MLPTransformConnector",
"connector_hparams": {},
"optimization": {},
"name": "seq2seq",
}
Here:
"source_embedder" : str or class or instance
Word embedder for source text. Can be a class, its name or module
path, or a class instance.
"source_embedder_hparams" : dict
Hyperparameters for constructing the source embedder. E.g.,
See :meth:`~texar.modules.WordEmbedder.default_hparams` for
hyperparameters of :class:`~texar.modules.WordEmbedder`. Ignored
if "source_embedder" is an instance.
"target_embedder", "target_embedder_hparams" :
Same as "source_embedder" and "source_embedder_hparams" but for
target text embedder.
"embedder_share" : bool
Whether to share the source and target embedder. If `True`,
source embedder will be used to embed target text.
"embedder_hparams_share" : bool
Whether to share the embedder configurations. If `True`,
target embedder will be created with "source_embedder_hparams".
But the two embedders have different sets of trainable variables.
"encoder", "encoder_hparams" :
Same as "source_embedder" and "source_embedder_hparams" but for
encoder.
"decoder", "decoder_hparams" :
Same as "source_embedder" and "source_embedder_hparams" but for
decoder.
"decoding_strategy_train" : str
The decoding strategy in training mode. See
:meth:`~texar.modules.RNNDecoderBase._build` for details.
"decoding_strategy_infer" : str
The decoding strategy in eval/inference mode.
"beam_search_width" : int
Beam width. If > 1, beam search is used in eval/inference mode.
"connector", "connector_hparams" :
The connector class and hyperparameters. A connector transforms
an encoder final state to a decoder initial state.
"optimization" : dict
Hyperparameters for optimizing the model. See
:func:`~texar.core.default_optimization_hparams` for details.
"name" : str
Name of the model.
"""
hparams = ModelBase.default_hparams()
hparams.update({
"name": "seq2seq",
"source_embedder": "WordEmbedder",
"source_embedder_hparams": {},
"target_embedder": "WordEmbedder",
"target_embedder_hparams": {},
"embedder_share": True,
"embedder_hparams_share": True,
"encoder": "UnidirectionalRNNEncoder",
"encoder_hparams": {},
"decoder": "BasicRNNDecoder",
"decoder_hparams": {},
"decoding_strategy_train": "train_greedy",
"decoding_strategy_infer": "infer_greedy",
"beam_search_width": 0,
"connector": "MLPTransformConnector",
"connector_hparams": {},
"optimization": {}
})
return hparams
def _build_vocab(self):
self._src_vocab, self._tgt_vocab = PairedTextData.make_vocab(
self._data_hparams.source_dataset,
self._data_hparams.target_dataset)
def _build_embedders(self):
kwargs = {
"vocab_size": self._src_vocab.size,
"hparams": self._hparams.source_embedder_hparams.todict()
}
self._src_embedder = utils.check_or_get_instance(
self._hparams.source_embedder, kwargs,
["texar.modules", "texar.custom"])
if self._hparams.embedder_share:
self._tgt_embedder = self._src_embedder
else:
kwargs = {
"vocab_size": self._tgt_vocab.size,
}
if self._hparams.embedder_hparams_share:
kwargs["hparams"] = \
self._hparams.source_embedder_hparams.todict()
else:
kwargs["hparams"] = \
self._hparams.target_embedder_hparams.todict()
self._tgt_embedder = utils.check_or_get_instance(
self._hparams.target_embedder, kwargs,
["texar.modules", "texar.custom"])
def _build_encoder(self):
kwargs = {
"hparams": self._hparams.encoder_hparams.todict()
}
self._encoder = utils.check_or_get_instance(
self._hparams.encoder, kwargs,
["texar.modules", "texar.custom"])
def _build_decoder(self):
raise NotImplementedError
def _build_connector(self):
kwargs = {
"output_size": self._decoder.state_size,
"hparams": self._hparams.connector_hparams.todict()
}
self._connector = utils.check_or_get_instance(
self._hparams.connector, kwargs,
["texar.modules", "texar.custom"])
def get_loss(self, decoder_results, features, labels):
"""Computes the training loss.
"""
return sequence_sparse_softmax_cross_entropy(
labels=labels['target_text_ids'][:, 1:],
logits=decoder_results['outputs'].logits,
sequence_length=decoder_results['sequence_length'])
def _get_predictions(self, decoder_results, features, labels, loss=None):
raise NotImplementedError
def _get_train_op(self, loss):
varlist = collect_trainable_variables(
[self._src_embedder, self._tgt_embedder, self._encoder,
self._connector, self._decoder])
return get_train_op(
loss, variables=varlist, hparams=self._hparams.optimization)
def _get_eval_metric_ops(self, decoder_results, features, labels):
return None
def embed_source(self, features, labels, mode):
"""Embeds the inputs.
"""
raise NotImplementedError
def embed_target(self, features, labels, mode):
"""Embeds the target inputs. Used in training.
"""
raise NotImplementedError
def encode(self, features, labels, mode):
"""Encodes the inputs.
"""
raise NotImplementedError
def _connect(self, encoder_results, features, labels, mode):
"""Transforms encoder final state into decoder initial state.
"""
raise NotImplementedError
def decode(self, encoder_results, features, labels, mode):
"""Decodes.
"""
raise NotImplementedError
def _build(self, features, labels, params, mode, config=None):
self._build_vocab()
self._build_embedders()
self._build_encoder()
self._build_decoder()
self._build_connector()
encoder_results = self.encode(features, labels, mode)
decoder_results = self.decode(encoder_results, features, labels, mode)
loss, train_op, preds, eval_metric_ops = None, None, None, None
if mode == tf.estimator.ModeKeys.PREDICT:
preds = self._get_predictions(decoder_results, features, labels)
else:
loss = self.get_loss(decoder_results, features, labels)
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = self._get_train_op(loss)
if mode == tf.estimator.ModeKeys.EVAL:
eval_metric_ops = self._get_eval_metric_ops(
decoder_results, features, labels)
preds = self._get_predictions(decoder_results, features, labels,
loss)
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=preds,
loss=loss,
train_op=train_op,
eval_metric_ops=eval_metric_ops)
def get_input_fn(self, mode, hparams=None): #pylint:disable=arguments-differ
"""Creates an input function `input_fn` that provides input data
for the model in an :tf_main:`Estimator `.
See, e.g., :tf_main:`tf.estimator.train_and_evaluate
`.
Args:
mode: One of members in
:tf_main:`tf.estimator.ModeKeys `.
hparams: A `dict` or an :class:`~texar.HParams` instance
containing the hyperparameters of
:class:`~texar.data.PairedTextData`. See
:meth:`~texar.data.PairedTextData.default_hparams` for the
structure and default values of the hyperparameters.
Returns:
An input function that returns a tuple `(features, labels)`
when called. `features` contains data fields that are related
to source text, and `labels` contains data fields related
to target text. See :class:`~texar.data.PairedTextData` for
all data fields.
"""
def _input_fn():
data = PairedTextData(hparams)
iterator = data.dataset.make_initializable_iterator()
tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS,
iterator.initializer)
batch = iterator.get_next()
features, labels = {}, {}
for key, value in batch.items():
if key.startswith('source_'):
features[key] = value
else:
labels[key] = value
return features, labels
return _input_fn
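`get_input_fn` above routes batch fields by name prefix: keys starting with `source_` become `features` and everything else becomes `labels`. That split can be isolated in a standalone sketch (a hypothetical helper mirroring the loop in `_input_fn`):

```python
def split_batch(batch):
    """Split a data batch into (features, labels) by the 'source_'
    prefix, as done inside Seq2seqBase.get_input_fn."""
    features = {k: v for k, v in batch.items() if k.startswith("source_")}
    labels = {k: v for k, v in batch.items() if not k.startswith("source_")}
    return features, labels

batch = {
    "source_text_ids": [[4, 5, 6]],
    "source_length": [3],
    "target_text_ids": [[4, 7, 2]],
    "target_length": [3],
}
features, labels = split_batch(batch)
```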
================================================
FILE: texar_repo/texar/module_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for modules.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import re
import tensorflow as tf
from texar.utils.exceptions import TexarError
from texar.hyperparams import HParams
__all__ = [
"ModuleBase"
]
class ModuleBase(object):
"""Base class inherited by modules that create Variables and are
configurable through hyperparameters.
A Texar module inheriting :class:`~texar.ModuleBase` has the following key
features:
- **Convenient variable re-use**: A module instance creates \
its own sets of variables, and automatically re-uses its variables on \
subsequent calls. Hence TF variable/name scope is \
transparent to users. For example:
.. code-block:: python
encoder = UnidirectionalRNNEncoder(hparams) # create instance
output_1 = encoder(inputs_1) # variables are created
output_2 = encoder(inputs_2) # variables are re-used
print(encoder.trainable_variables) # access trainable variables
# [ ... ]
- **Configurable through hyperparameters**: Each module defines \
allowed hyperparameters and default values. Hyperparameters not \
specified by users will take default values.
- **Callable**: As the above example, a module instance is "called" \
with input tensors and returns output tensors. Every call of a module \
will add ops to the Graph to perform the module's logic.
Args:
hparams (dict, optional): Hyperparameters of the module. See
:meth:`default_hparams` for the structure and default values.
.. document private functions
.. automethod:: _build
"""
def __init__(self, hparams=None):
self._hparams = HParams(hparams, self.default_hparams())
self._template = tf.make_template(self._hparams.name, self._build,
create_scope_now_=True)
self._unique_name = self.variable_scope.name.split("/")[-1]
self._trainable_variables = []
self._built = False
@staticmethod
def default_hparams():
"""Returns a `dict` of hyperparameters of the module with default
values. Used to replace the missing values of input `hparams`
during module construction.
.. code-block:: python
{
"name": "module"
}
"""
return {
"name": "module"
}
def _build(self, *args, **kwargs):
"""Subclass must implement this method to build the logic.
Args:
*args: Arguments.
**kwargs: Keyword arguments.
Returns:
Output Tensor(s).
"""
raise NotImplementedError
def __call__(self, *args, **kwargs):
"""Executes the module logic defined in _build method
Args:
*args: Arguments of _build method.
**kwargs: Keyword arguments of _build method.
Returns:
The output of _build method.
"""
return self._template(*args, **kwargs)
def _add_internal_trainable_variables(self): # pylint: disable=invalid-name
"""Collects trainable variables constructured internally in this module.
This is typically called at the end of `_build()` where all necessary
trainable variables have been constructed.
"""
scope_name = self.variable_scope.name
# Escape to handle possible "." characters in the name.
# Append a slash to the end to avoid searching scopes that have this
# scope name as a prefix.
scope_name = re.escape(scope_name) + "/"
internal_trainable_variables = tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope_name)
self._add_trainable_variable(internal_trainable_variables)
def _add_trainable_variable(self, variable):
"""Adds a trainable variable to the trainable variable list of the
module.
Args:
variable: a (list of) trainable variable(s) constructed either
internally in the module or constructed outside but used
inside the module.
"""
if isinstance(variable, (list, tuple)):
for var in variable:
self._add_trainable_variable(var)
else:
if variable not in self._trainable_variables:
self._trainable_variables.append(variable)
@property
def variable_scope(self):
"""The variable scope of the module.
"""
return self._template.variable_scope
@property
def name(self):
"""The uniquified name of the module.
"""
return self._unique_name
@property
def trainable_variables(self):
"""The list of trainable variables of the module.
"""
if not self._built:
raise TexarError(
"Attempting to access trainable_variables before module %s "
"was fully built. The module is built once it is called, "
"e.g., with `%s(...)`" % (self.name, self.name))
return self._trainable_variables
@property
def hparams(self):
"""An :class:`~texar.HParams` instance. The hyperparameters
of the module.
"""
return self._hparams
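The key behavior documented above, that a module creates its variables on the first call and transparently re-uses them on later calls (via `tf.make_template`), can be sketched without TensorFlow. A hypothetical toy module illustrating the contract:

```python
class ToyModule:
    """Sketch of the ModuleBase contract: parameters ('variables') are
    created lazily on the first call and re-used on subsequent calls."""

    def __init__(self, hparams=None):
        defaults = {"name": "module", "init": 0.0}
        defaults.update(hparams or {})  # missing values take defaults
        self._hparams = defaults
        self._variables = None          # created lazily, like make_template

    def __call__(self, x):
        if self._variables is None:     # first call: create variables
            self._variables = {"w": self._hparams["init"]}
        return x + self._variables["w"]  # later calls: re-use

m = ToyModule({"init": 2.0})
```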
================================================
FILE: texar_repo/texar/modules/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library modules.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.modules.networks import *
from texar.modules.embedders import *
from texar.modules.encoders import *
from texar.modules.decoders import *
from texar.modules.connectors import *
from texar.modules.classifiers import *
from texar.modules.policies import *
from texar.modules.qnets import *
from texar.modules.memory import *
================================================
FILE: texar_repo/texar/modules/classifiers/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library classifiers.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.modules.classifiers.conv_classifiers import *
from texar.modules.classifiers.rnn_classifiers import *
================================================
FILE: texar_repo/texar/modules/classifiers/classifier_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for classifiers.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar.module_base import ModuleBase
__all__ = [
"ClassifierBase"
]
class ClassifierBase(ModuleBase):
"""Base class inherited by all classifier classes.
"""
def __init__(self, hparams=None):
ModuleBase.__init__(self, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
"""
return {
"name": "classifier"
}
def _build(self, inputs, *args, **kwargs):
"""Classifies the inputs.
Args:
inputs: Inputs to the classifier.
*args: Other arguments.
**kwargs: Keyword arguments.
Returns:
Classification results.
"""
raise NotImplementedError
================================================
FILE: texar_repo/texar/modules/classifiers/conv_classifiers.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various classifier classes.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=not-context-manager, too-many-arguments, too-many-locals
import tensorflow as tf
from texar.utils.exceptions import TexarError
from texar.modules.classifiers.classifier_base import ClassifierBase
from texar.modules.encoders.conv_encoders import Conv1DEncoder
from texar.utils import utils
from texar.hyperparams import HParams
__all__ = [
"Conv1DClassifier"
]
class Conv1DClassifier(ClassifierBase):
"""Simple Conv-1D classifier.
This is a combination of the
:class:`~texar.modules.Conv1DEncoder` with a classification layer.
Args:
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
Example:
.. code-block:: python
clas = Conv1DClassifier(hparams={'num_classes': 10})
inputs = tf.random_uniform([64, 20, 256])
logits, pred = clas(inputs)
# logits == Tensor of shape [64, 10]
# pred == Tensor of shape [64]
.. document private functions
.. automethod:: _build
"""
def __init__(self, hparams=None):
ClassifierBase.__init__(self, hparams)
with tf.variable_scope(self.variable_scope):
encoder_hparams = utils.dict_fetch(
hparams, Conv1DEncoder.default_hparams())
self._encoder = Conv1DEncoder(hparams=encoder_hparams)
# Add an additional dense layer if needed
self._num_classes = self._hparams.num_classes
if self._num_classes > 0:
if self._hparams.num_dense_layers <= 0:
self._encoder.append_layer({"type": "Flatten"})
logit_kwargs = self._hparams.logit_layer_kwargs
if logit_kwargs is None:
logit_kwargs = {}
elif not isinstance(logit_kwargs, HParams):
raise ValueError(
"hparams['logit_layer_kwargs'] must be a dict.")
else:
logit_kwargs = logit_kwargs.todict()
logit_kwargs.update({"units": self._num_classes})
if 'name' not in logit_kwargs:
logit_kwargs['name'] = "logit_layer"
self._encoder.append_layer(
{"type": "Dense", "kwargs": logit_kwargs})
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
# (1) Same hyperparameters as in Conv1DEncoder
...
# (2) Additional hyperparameters
"num_classes": 2,
"logit_layer_kwargs": {
"use_bias": False
},
"name": "conv1d_classifier"
}
Here:
1. Same hyperparameters as in :class:`~texar.modules.Conv1DEncoder`.
See the :meth:`~texar.modules.Conv1DEncoder.default_hparams`.
An instance of Conv1DEncoder is created for feature extraction.
2. Additional hyperparameters:
"num_classes" : int
Number of classes:
- If **`> 0`**, an additional :tf_main:`Dense ` \
layer is appended to the encoder to compute the logits over \
classes.
- If **`<= 0`**, no dense layer is appended. The number of \
classes is assumed to be the final dense layer size of the \
encoder.
"logit_layer_kwargs" : dict
Keyword arguments for the logit Dense layer constructor,
except for argument "units" which is set to "num_classes".
Ignored if no extra logit layer is appended.
"name" : str
Name of the classifier.
"""
hparams = Conv1DEncoder.default_hparams()
hparams.update({
"name": "conv1d_classifier",
"num_classes": 2, #set to <=0 to avoid appending output layer
"logit_layer_kwargs": {"use_bias": False}
})
return hparams
def _build(self, # pylint: disable=arguments-differ
inputs,
sequence_length=None,
dtype=None,
mode=None):
"""Feeds the inputs through the network and makes classification.
The arguments are the same as in :class:`~texar.modules.Conv1DEncoder`.
The predictions of binary classification ("num_classes"=1) and
multi-way classification ("num_classes">1) are different, as explained
below.
Args:
inputs: The inputs to the network, which is a 3D tensor. See
:class:`~texar.modules.Conv1DEncoder` for more details.
sequence_length (optional): An int tensor of shape `[batch_size]`
containing the length of each element in :attr:`inputs`.
If given, time steps beyond the length will first be masked out
before feeding to the layers.
dtype (optional): Type of the inputs. If not provided, infers
from inputs automatically.
mode (optional): A tensor taking value in
:tf_main:`tf.estimator.ModeKeys `, including
`TRAIN`, `EVAL`, and `PREDICT`. If `None`,
:func:`texar.global_mode` is used.
Returns:
A tuple `(logits, pred)`, where
- **`logits`** is a Tensor of shape `[batch_size, num_classes]`\
for `num_classes` >1, and `[batch_size]` for `num_classes` =1 \
(i.e., binary classification).
- **`pred`** is the prediction, a Tensor of shape `[batch_size]` \
and type `tf.int64`. For binary classification, the standard \
sigmoid function is used for prediction, and the class labels are \
`{0, 1}`.
"""
logits = self._encoder(inputs, sequence_length, dtype, mode)
num_classes = self._hparams.num_classes
is_binary = num_classes == 1
is_binary = is_binary or (num_classes <= 0 and logits.shape[1] == 1)
if is_binary:
pred = tf.greater(logits, 0)
logits = tf.reshape(logits, [-1])
else:
pred = tf.argmax(logits, 1)
pred = tf.to_int64(tf.reshape(pred, [-1]))
self._built = True
return logits, pred
@property
def trainable_variables(self):
"""The list of trainable variables of the module.
"""
if not self._built:
raise TexarError(
"Attempting to access trainable_variables before module %s "
"was fully built. The module is built once it is called, "
"e.g., with `%s(...)`" % (self.name, self.name))
return self._encoder.trainable_variables
@property
def num_classes(self):
"""The number of classes.
"""
return self._num_classes
@property
def nn(self): # pylint: disable=invalid-name
"""The classifier neural network.
"""
return self._encoder
def has_layer(self, layer_name):
"""Returns `True` if the network with the name exists. Returns `False`
otherwise.
Args:
layer_name (str): Name of the layer.
"""
return self._encoder.has_layer(layer_name)
def layer_by_name(self, layer_name):
"""Returns the layer with the name. Returns 'None' if the layer name
does not exist.
Args:
layer_name (str): Name of the layer.
"""
return self._encoder.layer_by_name(layer_name)
@property
def layers_by_name(self):
"""A dictionary mapping layer names to the layers.
"""
return self._encoder.layers_by_name
@property
def layers(self):
"""A list of the layers.
"""
return self._encoder.layers
@property
def layer_names(self):
"""A list of uniquified layer names.
"""
return self._encoder.layer_names
def layer_outputs_by_name(self, layer_name):
"""Returns the output tensors of the layer with the specified name.
Returns `None` if the layer name does not exist.
Args:
layer_name (str): Name of the layer.
"""
return self._encoder.layer_outputs_by_name(layer_name)
@property
def layer_outputs(self):
"""A list containing output tensors of each layer.
"""
return self._encoder.layer_outputs
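The binary-vs-multiclass branching in `_build` above can be mirrored in a few lines of numpy. This is an illustrative sketch only, not Texar code; `predict` is a hypothetical helper standing in for the logits-to-prediction step:

```python
import numpy as np

def predict(logits, num_classes):
    """Numpy sketch of Conv1DClassifier._build's prediction rule."""
    # Binary iff num_classes == 1, or no logit layer was appended and the
    # encoder's final dense layer has a single unit.
    is_binary = num_classes == 1 or (num_classes <= 0 and logits.shape[1] == 1)
    if is_binary:
        # Threshold the single logit at 0 (sigmoid decision boundary).
        pred = (logits > 0).astype(np.int64).reshape(-1)
    else:
        # Multi-way: argmax over the class dimension.
        pred = np.argmax(logits, axis=1).astype(np.int64).reshape(-1)
    return pred

multi = predict(np.array([[0.1, 2.0], [3.0, -1.0]]), num_classes=2)
print(multi)  # [1 0]
binary = predict(np.array([[0.5], [-0.5]]), num_classes=1)
print(binary)  # [1 0]
```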
================================================
FILE: texar_repo/texar/modules/classifiers/conv_classifiers_test.py
================================================
#
"""
Unit tests for conv classifiers.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
import texar as tx
from texar.modules.classifiers.conv_classifiers import Conv1DClassifier
class Conv1DClassifierTest(tf.test.TestCase):
"""Tests :class:`~texar.modules.Conv1DClassifier` class.
"""
def test_classifier(self):
"""Tests classification.
"""
# case 1: default hparams
classifier = Conv1DClassifier()
self.assertEqual(len(classifier.layers), 5)
self.assertTrue(isinstance(classifier.layers[-1],
tf.layers.Dense))
inputs = tf.ones([64, 16, 300], tf.float32)
logits, pred = classifier(inputs)
self.assertEqual(logits.shape, [64, 2])
self.assertEqual(pred.shape, [64])
inputs = tf.placeholder(tf.float32, [64, None, 300])
logits, pred = classifier(inputs)
self.assertEqual(logits.shape, [64, 2])
self.assertEqual(pred.shape, [64])
# case 2
hparams = {
"num_classes": 10,
"logit_layer_kwargs": {"use_bias": False}
}
classifier = Conv1DClassifier(hparams=hparams)
inputs = tf.ones([64, 16, 300], tf.float32)
logits, pred = classifier(inputs)
self.assertEqual(logits.shape, [64, 10])
self.assertEqual(pred.shape, [64])
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/modules/classifiers/rnn_classifiers.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various RNN classifiers.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
from tensorflow.contrib.framework import nest
from texar.modules.classifiers.classifier_base import ClassifierBase
from texar.modules.encoders.rnn_encoders import \
UnidirectionalRNNEncoder, _forward_single_output_layer
from texar.core import layers
from texar.utils import utils, shapes
from texar.hyperparams import HParams
# pylint: disable=too-many-arguments, invalid-name, no-member,
# pylint: disable=too-many-branches, too-many-locals, too-many-statements
__all__ = [
"UnidirectionalRNNClassifier"
]
#def RNNClassifierBase(ClassifierBase):
# """Base class inherited by all RNN classifiers.
# """
#
# def __init__(self, hparams=None):
# ClassifierBase.__init__(self, hparams)
class UnidirectionalRNNClassifier(ClassifierBase):
"""One directional RNN classifier.
This is a combination of the
:class:`~texar.modules.UnidirectionalRNNEncoder` with a classification
layer. Both step-wise classification and sequence-level classification
are supported, specified in :attr:`hparams`.
Arguments are the same as in
:class:`~texar.modules.UnidirectionalRNNEncoder`.
Args:
cell: (RNNCell, optional) If not specified,
a cell is created as specified in :attr:`hparams["rnn_cell"]`.
cell_dropout_mode (optional): A Tensor taking value of
:tf_main:`tf.estimator.ModeKeys `, which
toggles dropout in the RNN cell (e.g., activates dropout in
TRAIN mode). If `None`, :func:`~texar.global_mode` is used.
Ignored if :attr:`cell` is given.
output_layer (optional): An instance of
:tf_main:`tf.layers.Layer `. Applies to the RNN cell
output of each step. If `None` (default), the output layer is
created as specified in :attr:`hparams["output_layer"]`.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
.. document private functions
.. automethod:: _build
"""
def __init__(self,
cell=None,
cell_dropout_mode=None,
output_layer=None,
hparams=None):
ClassifierBase.__init__(self, hparams)
with tf.variable_scope(self.variable_scope):
# Creates the underlying encoder
encoder_hparams = utils.dict_fetch(
hparams, UnidirectionalRNNEncoder.default_hparams())
if encoder_hparams is not None:
encoder_hparams['name'] = None
self._encoder = UnidirectionalRNNEncoder(
cell=cell,
cell_dropout_mode=cell_dropout_mode,
output_layer=output_layer,
hparams=encoder_hparams)
# Creates an additional classification layer if needed
self._num_classes = self._hparams.num_classes
if self._num_classes <= 0:
self._logit_layer = None
else:
logit_kwargs = self._hparams.logit_layer_kwargs
if logit_kwargs is None:
logit_kwargs = {}
elif not isinstance(logit_kwargs, HParams):
raise ValueError(
"hparams['logit_layer_kwargs'] must be a dict.")
else:
logit_kwargs = logit_kwargs.todict()
logit_kwargs.update({"units": self._num_classes})
if 'name' not in logit_kwargs:
logit_kwargs['name'] = "logit_layer"
layer_hparams = {"type": "Dense", "kwargs": logit_kwargs}
self._logit_layer = layers.get_layer(hparams=layer_hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
# (1) Same hyperparameters as in UnidirectionalRNNEncoder
...
# (2) Additional hyperparameters
"num_classes": 2,
"logit_layer_kwargs": None,
"clas_strategy": "final_time",
"max_seq_length": None,
"name": "unidirectional_rnn_classifier"
}
Here:
1. Same hyperparameters as in
:class:`~texar.modules.UnidirectionalRNNEncoder`.
See the :meth:`~texar.modules.UnidirectionalRNNEncoder.default_hparams`.
An instance of UnidirectionalRNNEncoder is created for feature
extraction.
2. Additional hyperparameters:
"num_classes" : int
Number of classes:
- If **`> 0`**, an additional :tf_main:`Dense ` \
layer is appended to the encoder to compute the logits over \
classes.
- If **`<= 0`**, no dense layer is appended. The number of \
classes is assumed to be the final dense layer size of the \
encoder.
"logit_layer_kwargs" : dict
Keyword arguments for the logit Dense layer constructor,
except for argument "units" which is set to "num_classes".
Ignored if no extra logit layer is appended.
"clas_strategy" : str
The classification strategy, one of:
- **"final_time"**: Sequence-level classification based on \
the output of the final time step. One sequence has one class.
- **"all_time"**: Sequence-level classification based on \
the output of all time steps. One sequence has one class.
- **"time_wise"**: Step-wise classification, i.e., make \
classification for each time step based on its output.
"max_seq_length" : int, optional
Maximum possible length of input sequences. Required if
"clas_strategy" is "all_time".
"name" : str
Name of the classifier.
"""
hparams = UnidirectionalRNNEncoder.default_hparams()
hparams.update({
"num_classes": 2,
"logit_layer_kwargs": None,
"clas_strategy": "final_time",
"max_seq_length": None,
"name": "unidirectional_rnn_classifier"
})
return hparams
def _build(self,
inputs,
sequence_length=None,
initial_state=None,
time_major=False,
mode=None,
**kwargs):
"""Feeds the inputs through the network and makes classification.
The arguments are the same as in
:class:`~texar.modules.UnidirectionalRNNEncoder`.
Args:
inputs: A 3D Tensor of shape `[batch_size, max_time, dim]`.
The first two dimensions
`batch_size` and `max_time` may be exchanged if
`time_major=True` is specified.
sequence_length (optional): A 1D int tensor of shape `[batch_size]`.
Sequence lengths
of the batch inputs. Used to copy-through state and zero-out
outputs when past a batch element's sequence length.
initial_state (optional): Initial state of the RNN.
time_major (bool): The shape format of the :attr:`inputs` and
:attr:`outputs` Tensors. If `True`, these tensors are of shape
`[max_time, batch_size, depth]`. If `False` (default),
these tensors are of shape `[batch_size, max_time, depth]`.
mode (optional): A tensor taking value in
:tf_main:`tf.estimator.ModeKeys `, including
`TRAIN`, `EVAL`, and `PREDICT`. Controls output layer dropout
if the output layer is specified with :attr:`hparams`.
If `None` (default), :func:`texar.global_mode()`
is used.
return_cell_output (bool): Whether to return the output of the RNN
cell. This is the results prior to the output layer.
**kwargs: Optional keyword arguments of
:tf_main:`tf.nn.dynamic_rnn `,
such as `swap_memory`, `dtype`, `parallel_iterations`, etc.
Returns:
A tuple `(logits, pred)`, containing the logits over classes and
the predictions, respectively.
- If "clas_strategy"=="final_time" or "all_time"
- If "num_classes"==1, `logits` and `pred` are of both \
shape `[batch_size]`
- If "num_classes">1, `logits` is of shape \
`[batch_size, num_classes]` and `pred` is of shape \
`[batch_size]`.
- If "clas_strategy"=="time_wise",
- If "num_classes"==1, `logits` and `pred` are of both \
shape `[batch_size, max_time]`
- If "num_classes">1, `logits` is of shape \
`[batch_size, max_time, num_classes]` and `pred` is of shape \
`[batch_size, max_time]`.
- If `time_major` is `True`, the batch and time dimensions are\
exchanged.
"""
enc_outputs, _, enc_output_size = self._encoder(
inputs=inputs,
sequence_length=sequence_length,
initial_state=initial_state,
time_major=time_major,
mode=mode,
return_output_size=True,
**kwargs)
# Flatten enc_outputs
enc_outputs_flat = nest.flatten(enc_outputs)
enc_output_size_flat = nest.flatten(enc_output_size)
enc_output_dims_flat = [np.prod(xs) for xs in enc_output_size_flat]
enc_outputs_flat = [shapes.flatten(x, 2, xs) for x, xs
in zip(enc_outputs_flat, enc_output_dims_flat)]
if len(enc_outputs_flat) == 1:
enc_outputs_flat = enc_outputs_flat[0]
else:
enc_outputs_flat = tf.concat(enc_outputs_flat, axis=2)
# Compute logits
stra = self._hparams.clas_strategy
if stra == 'time_wise':
logits = enc_outputs_flat
elif stra == 'final_time':
if time_major:
logits = enc_outputs_flat[-1, :, :]
else:
logits = enc_outputs_flat[:, -1, :]
elif stra == 'all_time':
if self._logit_layer is None:
raise ValueError(
'logit layer must not be `None` if '
'clas_strategy="all_time". Specify the logit layer by '
'either passing the layer in the constructor or '
'specifying the hparams.')
if self._hparams.max_seq_length is None:
raise ValueError(
'hparams.max_seq_length must not be `None` if '
'clas_strategy="all_time"')
else:
raise ValueError('Unknown classification strategy: {}'.format(stra))
if self._logit_layer is not None:
logit_input_dim = np.sum(enc_output_dims_flat)
if stra == 'time_wise':
logits, _ = _forward_single_output_layer(
logits, logit_input_dim, self._logit_layer)
elif stra == 'final_time':
logits = self._logit_layer(logits)
elif stra == 'all_time':
# Pad `enc_outputs_flat` to have max_seq_length before flatten
length_diff = self._hparams.max_seq_length - tf.shape(inputs)[1]
length_diff = tf.reshape(length_diff, [1, 1])
# Set `paddings = [[0, 0], [0, length_dif], [0, 0]]`
paddings = tf.pad(length_diff, paddings=[[1, 1], [1, 0]])
logit_input = tf.pad(enc_outputs_flat, paddings=paddings)
logit_input_dim *= self._hparams.max_seq_length
logit_input = tf.reshape(logit_input, [-1, logit_input_dim])
logits = self._logit_layer(logit_input)
# Compute predictions
num_classes = self._hparams.num_classes
is_binary = num_classes == 1
is_binary = is_binary or (num_classes <= 0 and logits.shape[-1] == 1)
if stra == 'time_wise':
if is_binary:
pred = tf.squeeze(tf.greater(logits, 0), -1)
logits = tf.squeeze(logits, -1)
else:
pred = tf.argmax(logits, axis=-1)
else:
if is_binary:
pred = tf.greater(logits, 0)
logits = tf.reshape(logits, [-1])
else:
pred = tf.argmax(logits, axis=-1)
pred = tf.reshape(pred, [-1])
pred = tf.to_int64(pred)
if not self._built:
self._add_internal_trainable_variables()
# Add trainable variables of `self._logit_layer`
# which may be constructed externally.
if self._logit_layer:
self._add_trainable_variable(
self._logit_layer.trainable_variables)
self._built = True
return logits, pred
@property
def num_classes(self):
"""The number of classes, specified in :attr:`hparams`.
"""
return self._hparams.num_classes
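The `all_time` branch in `_build` constructs `paddings = [[0, 0], [0, length_diff], [0, 0]]` by padding the `[1, 1]`-shaped `length_diff` tensor itself. A small numpy sketch of that trick (illustrative only, not part of Texar):

```python
import numpy as np

# Suppose max_seq_length = 8 and the batch's actual time dimension is 6,
# so length_diff = 2, held as a [1, 1] array (mirroring the tf.reshape).
length_diff = np.array([[2]])

# Padding length_diff with pad widths [[1, 1], [1, 0]] produces the
# paddings matrix [[0, 0], [0, 2], [0, 0]] used for the time axis.
paddings = np.pad(length_diff, [[1, 1], [1, 0]])
print(paddings.tolist())  # [[0, 0], [0, 2], [0, 0]]

# Applied to a [batch, time, dim] tensor, it pads time from 6 up to 8.
enc_outputs = np.ones((4, 6, 8))
padded = np.pad(enc_outputs, paddings)
print(padded.shape)  # (4, 8, 8)
```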
================================================
FILE: texar_repo/texar/modules/classifiers/rnn_classifiers_test.py
================================================
#
"""
Unit tests for RNN classifiers.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
from texar.modules.classifiers.rnn_classifiers import \
UnidirectionalRNNClassifier
# pylint: disable=too-many-locals, no-member
class UnidirectionalRNNClassifierTest(tf.test.TestCase):
"""Tests :class:`~texar.modules.UnidirectionalRNNClassifier` class.
"""
def test_trainable_variables(self):
"""Tests the functionality of automatically collecting trainable
variables.
"""
inputs = tf.placeholder(dtype=tf.float32, shape=[None, None, 100])
# case 1
clas = UnidirectionalRNNClassifier()
_, _ = clas(inputs)
self.assertEqual(len(clas.trainable_variables), 2+2)
# case 2
hparams = {
"output_layer": {"num_layers": 2},
"logit_layer_kwargs": {"use_bias": False}
}
clas = UnidirectionalRNNClassifier(hparams=hparams)
_, _ = clas(inputs)
self.assertEqual(len(clas.trainable_variables), 2+2+2+1)
_, _ = clas(inputs)
self.assertEqual(len(clas.trainable_variables), 2+2+2+1)
def test_encode(self):
"""Tests encoding.
"""
max_time = 8
batch_size = 16
emb_dim = 100
inputs = tf.random_uniform([batch_size, max_time, emb_dim],
maxval=1., dtype=tf.float32)
# case 1
clas = UnidirectionalRNNClassifier()
logits, pred = clas(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
logits_, pred_ = sess.run([logits, pred])
self.assertEqual(logits_.shape, (batch_size, clas.num_classes))
self.assertEqual(pred_.shape, (batch_size, ))
# case 2
hparams = {
"num_classes": 10,
"clas_strategy": "time_wise"
}
clas = UnidirectionalRNNClassifier(hparams=hparams)
logits, pred = clas(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
logits_, pred_ = sess.run([logits, pred])
self.assertEqual(logits_.shape,
(batch_size, max_time, clas.num_classes))
self.assertEqual(pred_.shape, (batch_size, max_time))
# case 3
hparams = {
"output_layer": {
"num_layers": 1,
"layer_size": 10
},
"num_classes": 0,
"clas_strategy": "time_wise"
}
clas = UnidirectionalRNNClassifier(hparams=hparams)
logits, pred = clas(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
logits_, pred_ = sess.run([logits, pred])
self.assertEqual(logits_.shape,
(batch_size, max_time, 10))
self.assertEqual(pred_.shape, (batch_size, max_time))
# case 4
hparams = {
"num_classes": 10,
"clas_strategy": "all_time",
"max_seq_length": max_time
}
inputs = tf.placeholder(tf.float32, shape=[batch_size, 6, emb_dim])
clas = UnidirectionalRNNClassifier(hparams=hparams)
logits, pred = clas(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
logits_, pred_ = sess.run(
[logits, pred],
feed_dict={inputs: np.random.randn(batch_size, 6, emb_dim)})
self.assertEqual(logits_.shape, (batch_size, clas.num_classes))
self.assertEqual(pred_.shape, (batch_size, ))
def test_binary(self):
"""Tests binary classification.
"""
max_time = 8
batch_size = 16
emb_dim = 100
inputs = tf.random_uniform([batch_size, max_time, emb_dim],
maxval=1., dtype=tf.float32)
# case 1 omitted
# case 2
hparams = {
"num_classes": 1,
"clas_strategy": "time_wise"
}
clas = UnidirectionalRNNClassifier(hparams=hparams)
logits, pred = clas(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
logits_, pred_ = sess.run([logits, pred])
self.assertEqual(logits_.shape, (batch_size, max_time))
self.assertEqual(pred_.shape, (batch_size, max_time))
# case 3
hparams = {
"output_layer": {
"num_layers": 1,
"layer_size": 10
},
"num_classes": 1,
"clas_strategy": "time_wise"
}
clas = UnidirectionalRNNClassifier(hparams=hparams)
logits, pred = clas(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
logits_, pred_ = sess.run([logits, pred])
self.assertEqual(logits_.shape, (batch_size, max_time))
self.assertEqual(pred_.shape, (batch_size, max_time))
# case 4
hparams = {
"num_classes": 1,
"clas_strategy": "all_time",
"max_seq_length": max_time
}
inputs = tf.placeholder(tf.float32, shape=[batch_size, 6, emb_dim])
clas = UnidirectionalRNNClassifier(hparams=hparams)
logits, pred = clas(inputs)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
logits_, pred_ = sess.run(
[logits, pred],
feed_dict={inputs: np.random.randn(batch_size, 6, emb_dim)})
self.assertEqual(logits_.shape, (batch_size, ))
self.assertEqual(pred_.shape, (batch_size, ))
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/modules/connectors/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library connectors.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.modules.connectors.connector_base import *
from texar.modules.connectors.connectors import *
================================================
FILE: texar_repo/texar/modules/connectors/connector_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for connectors that transform inputs into specified output shape.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar.module_base import ModuleBase
__all__ = [
"ConnectorBase"
]
class ConnectorBase(ModuleBase):
"""Base class inherited by all connector classes. A connector is to
transform inputs into outputs with any specified structure and shape.
For example, transforming the final state of an encoder to the initial
state of a decoder, and performing stochastic sampling in between as
in Variational Autoencoders (VAEs).
Args:
output_size: Size of output **excluding** the batch dimension. For
example, set `output_size` to `dim` to generate output of
shape `[batch_size, dim]`.
Can be an `int`, a tuple of `int`, a TensorShape, or a tuple of
TensorShapes.
For example, to transform inputs to have decoder state size, set
`output_size=decoder.state_size`.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
"""
def __init__(self, output_size, hparams=None):
ModuleBase.__init__(self, hparams)
self._output_size = output_size
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
"""
return {
"name": "connector"
}
def _build(self, *args, **kwargs):
"""Transforms inputs to outputs with specified shape.
"""
raise NotImplementedError
@property
def output_size(self):
"""The output size.
"""
return self._output_size
================================================
FILE: texar_repo/texar/modules/connectors/connectors.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various connectors.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
from tensorflow import distributions as tf_dstr
from tensorflow.python.util import nest # pylint: disable=E0611
from texar.modules.connectors.connector_base import ConnectorBase
from texar.core import layers
from texar.utils.utils import get_function, check_or_get_instance
# pylint: disable=too-many-locals, arguments-differ
# pylint: disable=too-many-arguments, invalid-name, no-member
__all__ = [
"ConstantConnector",
"ForwardConnector",
"MLPTransformConnector",
"ReparameterizedStochasticConnector",
"StochasticConnector",
#"ConcatConnector"
]
def _assert_same_size(outputs, output_size):
"""Check if outputs match output_size
Args:
outputs: A Tensor or a (nested) tuple of tensors
output_size: Can be an Integer, a TensorShape, or a (nested) tuple of
Integers or TensorShape.
"""
nest.assert_same_structure(outputs, output_size)
flat_output_size = nest.flatten(output_size)
flat_output = nest.flatten(outputs)
for (output, size) in zip(flat_output, flat_output_size):
if output[0].shape != tf.TensorShape(size):
raise ValueError(
"The output size does not match the required output_size")
def _get_tensor_depth(x):
"""Returns the size of a tensor excluding the first dimension
(typically the batch dimension).
Args:
x: A tensor.
"""
return np.prod(x.get_shape().as_list()[1:])
def _mlp_transform(inputs, output_size, activation_fn=tf.identity):
"""Transforms inputs through a fully-connected layer that creates the output
with specified size.
Args:
inputs: A Tensor of shape `[batch_size, ...]` (i.e., batch-major), or a
(nested) tuple of such elements. A Tensor or a (nested) tuple of
Tensors with shape `[max_time, batch_size, ...]` (i.e., time-major)
can be transposed to batch-major using
:func:`~texar.utils.transpose_batch_time` prior to this
function.
output_size: Can be an Integer, a TensorShape, or a (nested) tuple of
Integers or TensorShape.
activation_fn: Activation function applied to the output.
Returns:
If :attr:`output_size` is an Integer or a TensorShape, returns a Tensor
of shape `[batch_size x output_size]`. If :attr:`output_size` is a tuple
of Integers or TensorShape, returns a tuple having the same structure as
:attr:`output_size`, where each element Tensor has the same size as
defined in :attr:`output_size`.
"""
# Flatten inputs
flat_input = nest.flatten(inputs)
dims = [_get_tensor_depth(x) for x in flat_input]
flat_input = [tf.reshape(x, ([-1, d])) for x, d in zip(flat_input, dims)]
concat_input = tf.concat(flat_input, 1)
# Get output dimension
flat_output_size = nest.flatten(output_size)
if isinstance(flat_output_size[0], tf.TensorShape):
size_list = [0] * len(flat_output_size)
for (i, shape) in enumerate(flat_output_size):
size_list[i] = np.prod([dim.value for dim in shape])
else:
size_list = flat_output_size
sum_output_size = sum(size_list)
#fc_output = tf.contrib.layers.fully_connected(
# concat_input, sum_output_size, activation_fn=activation_fn)
fc_output = tf.layers.dense(
concat_input, sum_output_size, activation=activation_fn)
flat_output = tf.split(fc_output, size_list, axis=1)
if isinstance(flat_output_size[0], tf.TensorShape):
for (i, shape) in enumerate(flat_output_size):
flat_output[i] = tf.reshape(flat_output[i], [-1] + shape.as_list())
output = nest.pack_sequence_as(structure=output_size,
flat_sequence=flat_output)
return output
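As a rough standalone illustration (not Texar code) of the flatten → concatenate → dense → split → repack pipeline that `_mlp_transform` performs, here is a NumPy sketch; the dense weights are random stand-ins for the trained layer:

```python
import numpy as np

def mlp_transform_sketch(inputs, output_sizes, rng):
    # Flatten: reshape each input to [batch, -1] and concatenate on axis 1.
    batch = inputs[0].shape[0]
    flat = np.concatenate([x.reshape(batch, -1) for x in inputs], axis=1)
    # One dense layer producing sum(output_sizes) units (random weights here,
    # standing in for the trained tf.layers.dense kernel).
    total = sum(output_sizes)
    w = rng.standard_normal((flat.shape[1], total))
    fc = flat @ w
    # Split the dense output back into the requested per-element sizes.
    splits = np.split(fc, np.cumsum(output_sizes)[:-1], axis=1)
    return tuple(splits)

rng = np.random.default_rng(0)
inputs = [rng.standard_normal((4, 3, 2)), rng.standard_normal((4, 5))]
c, h = mlp_transform_sketch(inputs, (7, 7), rng)
# c and h each have shape (4, 7), e.g. matching an LSTMStateTuple(c=7, h=7).
```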
class ConstantConnector(ConnectorBase):
"""Creates a constant Tensor or (nested) tuple of Tensors that
contains a constant value.
Args:
output_size: Size of output **excluding** the batch dimension. For
example, set `output_size` to `dim` to generate output of
shape `[batch_size, dim]`.
Can be an `int`, a tuple of `int`, a TensorShape, or a tuple of
TensorShapes.
For example, to transform inputs to have decoder state size, set
`output_size=decoder.state_size`.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
This connector does not have trainable parameters.
See :meth:`_build` for the inputs and outputs of the connector.
Example:
.. code-block:: python
connector = ConstantConnector(cell.state_size)
zero_state = connector(batch_size=64, value=0.)
one_state = connector(batch_size=64, value=1.)
.. document private functions
.. automethod:: _build
"""
def __init__(self, output_size, hparams=None):
ConnectorBase.__init__(self, output_size, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"value": 0.,
"name": "constant_connector"
}
Here:
"value" : float
The constant scalar that the output tensor(s) has. Ignored if
`value` is given to :meth:`_build`.
"name" : str
Name of the connector.
"""
return {
"value": 0.,
"name": "constant_connector"
}
def _build(self, batch_size, value=None):
"""Creates output tensor(s) that has the given value.
Args:
batch_size: An `int` or `int` scalar Tensor, the batch size.
value (optional): A scalar, the value that
the output tensor(s) has. If `None`, "value" in :attr:`hparams`
is used.
Returns:
A (structure of) tensor whose structure is the same as
:attr:`output_size`, with value specified by
`value` or :attr:`hparams`.
"""
value_ = value
if value_ is None:
value_ = self.hparams.value
output = nest.map_structure(
lambda x: tf.constant(value_, shape=[batch_size, x]),
self._output_size)
self._built = True
return output
class ForwardConnector(ConnectorBase):
"""Transforms inputs to have specified structure.
Args:
output_size: Size of output **excluding** the batch dimension. For
example, set `output_size` to `dim` to generate output of
shape `[batch_size, dim]`.
Can be an `int`, a tuple of `int`, a TensorShape, or a tuple of
TensorShapes.
For example, to transform inputs to have decoder state size, set
`output_size=decoder.state_size`.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
This connector does not have trainable parameters.
See :meth:`_build` for the inputs and outputs of the connector.
The input to the connector must have the same structure as
:attr:`output_size`, or must have the same number of elements and be
re-packable into the structure of :attr:`output_size`. Note that if input
is or contains a `dict` instance, the keys will be sorted to pack in
deterministic order (See
:tf_main:`pack_sequence_as `
for more details).
Example:
.. code-block:: python
cell = LSTMCell(num_units=256)
# cell.state_size == LSTMStateTuple(c=256, h=256)
connector = ForwardConnector(cell.state_size)
output = connector([tensor_1, tensor_2])
# output == LSTMStateTuple(c=tensor_1, h=tensor_2)
.. document private functions
.. automethod:: _build
"""
def __init__(self, output_size, hparams=None):
ConnectorBase.__init__(self, output_size, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"name": "forward_connector"
}
Here:
"name" : str
Name of the connector.
"""
return {
"name": "forward_connector"
}
def _build(self, inputs):
"""Transforms inputs to have the same structure as
:attr:`output_size`. Values of the inputs are not changed.
:attr:`inputs` must either have the same structure as, or the same
number of elements as, :attr:`output_size`.
Args:
inputs: The input (structure of) tensor to pass forward.
Returns:
A (structure of) tensors that re-packs `inputs` to have
the specified structure of `output_size`.
"""
output = inputs
try:
nest.assert_same_structure(inputs, self._output_size)
except (ValueError, TypeError):
flat_input = nest.flatten(inputs)
output = nest.pack_sequence_as(
self._output_size, flat_input)
self._built = True
return output
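The re-packing that `ForwardConnector._build` does via `nest.pack_sequence_as` can be sketched in plain Python (the `LSTMStateTuple` here is a hypothetical stand-in for TensorFlow's, not Texar code): values pass through unchanged, only their packing changes.

```python
from collections import namedtuple

# Stand-in for tf.nn.rnn_cell.LSTMStateTuple.
LSTMStateTuple = namedtuple("LSTMStateTuple", ["c", "h"])

def forward_connect(inputs, structure_factory):
    # Values are forwarded as-is; only the structure changes,
    # mirroring nest.pack_sequence_as in ForwardConnector._build.
    return structure_factory(*inputs)

state = forward_connect(["tensor_1", "tensor_2"], LSTMStateTuple)
# state == LSTMStateTuple(c="tensor_1", h="tensor_2")
```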
class MLPTransformConnector(ConnectorBase):
"""Transforms inputs with an MLP layer and packs the results into the
specified structure and size.
Args:
output_size: Size of output **excluding** the batch dimension. For
example, set `output_size` to `dim` to generate output of
shape `[batch_size, dim]`.
Can be an `int`, a tuple of `int`, a TensorShape, or a tuple of
TensorShapes.
For example, to transform inputs to have decoder state size, set
`output_size=decoder.state_size`.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
See :meth:`_build` for the inputs and outputs of the connector.
The input to the connector can have arbitrary structure and size.
Example:
.. code-block:: python
cell = LSTMCell(num_units=256)
# cell.state_size == LSTMStateTuple(c=256, h=256)
connector = MLPTransformConnector(cell.state_size)
inputs = tf.zeros([64, 10])
output = connector(inputs)
# output == LSTMStateTuple(c=tensor_of_shape_(64, 256),
# h=tensor_of_shape_(64, 256))
.. code-block:: python
## Use to connect encoder and decoder with different state size
encoder = UnidirectionalRNNEncoder(...)
_, final_state = encoder(inputs=...)
decoder = BasicRNNDecoder(...)
connector = MLPTransformConnector(decoder.state_size)
_ = decoder(
initial_state=connector(final_state),
...)
.. document private functions
.. automethod:: _build
"""
def __init__(self, output_size, hparams=None):
ConnectorBase.__init__(self, output_size, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"activation_fn": "identity",
"name": "mlp_connector"
}
Here:
"activation_fn" : str or callable
The activation function applied to the outputs of the MLP
transformation layer. Can
be a function, or its name or module path.
"name" : str
Name of the connector.
"""
return {
"activation_fn": "identity",
"name": "mlp_connector"
}
def _build(self, inputs):
"""Transforms inputs with an MLP layer and packs the results to have
the same structure as specified by :attr:`output_size`.
Args:
inputs: Input (structure of) tensors to be transformed. Must be a
Tensor of shape `[batch_size, ...]` or a (nested) tuple of
such Tensors. That is, the first dimension of (each) tensor
must be the batch dimension.
Returns:
A Tensor or a (nested) tuple of Tensors of the same structure of
`output_size`.
"""
activation_fn = layers.get_activation_fn(self.hparams.activation_fn)
output = _mlp_transform(inputs, self._output_size, activation_fn)
if not self._built:
self._add_internal_trainable_variables()
self._built = True
return output
class ReparameterizedStochasticConnector(ConnectorBase):
"""Samples from a distribution with reparameterization trick, and
transforms samples into specified size.
Reparameterization allows gradients to be back-propagated through the
stochastic samples. Used in, e.g., Variational Autoencoders (VAEs).
Args:
output_size: Size of output **excluding** the batch dimension. For
example, set `output_size` to `dim` to generate output of
shape `[batch_size, dim]`.
Can be an `int`, a tuple of `int`, a TensorShape, or a tuple of
TensorShapes.
For example, to transform inputs to have decoder state size, set
`output_size=decoder.state_size`.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
Example:
.. code-block:: python
cell = LSTMCell(num_units=256)
# cell.state_size == LSTMStateTuple(c=256, h=256)
connector = ReparameterizedStochasticConnector(cell.state_size)
kwargs = {
'loc': tf.zeros([batch_size, 10]),
'scale_diag': tf.ones([batch_size, 10])
}
output, sample = connector(distribution_kwargs=kwargs)
# output == LSTMStateTuple(c=tensor_of_shape_(batch_size, 256),
# h=tensor_of_shape_(batch_size, 256))
# sample == Tensor([batch_size, 10])
kwargs = {
'loc': tf.zeros([10]),
'scale_diag': tf.ones([10])
}
output_, sample_ = connector(distribution_kwargs=kwargs,
num_samples=batch_size_)
# output_ == LSTMStateTuple(c=tensor_of_shape_(batch_size_, 256),
# h=tensor_of_shape_(batch_size_, 256))
# sample == Tensor([batch_size_, 10])
.. document private functions
.. automethod:: _build
"""
def __init__(self, output_size, hparams=None):
ConnectorBase.__init__(self, output_size, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"activation_fn": "identity",
"name": "reparameterized_stochastic_connector"
}
Here:
"activation_fn" : str
The activation function applied to the outputs of the MLP
transformation layer. Can
be a function, or its name or module path.
"name" : str
Name of the connector.
"""
return {
"activation_fn": "tensorflow.identity",
"name": "reparameterized_stochastic_connector"
}
def _build(self,
distribution='MultivariateNormalDiag',
distribution_kwargs=None,
transform=True,
num_samples=None):
"""Samples from a distribution and optionally performs transformation
with an MLP layer.
The distribution must be reparameterizable, i.e.,
`distribution.reparameterization_type = FULLY_REPARAMETERIZED`.
Args:
distribution: An instance of a subclass of
:tf_main:`TF Distribution `,
or :tf_hmpg:`tensorflow_probability Distribution `,
Can be a class, its name or module path, or a class instance.
distribution_kwargs (dict, optional): Keyword arguments for the
distribution constructor. Ignored if `distribution` is a
class instance.
transform (bool): Whether to perform MLP transformation of the
distribution samples. If `False`, the structure/shape of a
sample must match :attr:`output_size`.
num_samples (optional): An `int` or `int` Tensor. Number of samples
to generate. If not given, generate a single sample. Note
that if batch size has already been included in
`distribution`'s dimensionality, `num_samples` should be
left as `None`.
Returns:
A tuple (output, sample), where
- output: A Tensor or a (nested) tuple of Tensors with the same \
structure and size of :attr:`output_size`. The batch dimension \
equals :attr:`num_samples` if specified, or is determined by the \
distribution dimensionality.
- sample: The sample from the distribution, prior to transformation.
Raises:
ValueError: If distribution cannot be reparameterized.
ValueError: The output does not match :attr:`output_size`.
"""
dstr = check_or_get_instance(
distribution, distribution_kwargs,
["tensorflow.distributions", "tensorflow_probability.distributions",
"texar.custom"])
if dstr.reparameterization_type == tf_dstr.NOT_REPARAMETERIZED:
raise ValueError(
"Distribution is not reparameterized: %s" % dstr.name)
if num_samples:
sample = dstr.sample(num_samples)
else:
sample = dstr.sample()
#if dstr.event_shape == []:
# sample = tf.reshape(
# sample,
# sample.shape.concatenate(tf.TensorShape(1)))
# sample = tf.cast(sample, tf.float32)
if transform:
fn_modules = ['tensorflow', 'tensorflow.nn', 'texar.custom']
activation_fn = get_function(self.hparams.activation_fn, fn_modules)
output = _mlp_transform(sample, self._output_size, activation_fn)
_assert_same_size(output, self._output_size)
if not self._built:
self._add_internal_trainable_variables()
self._built = True
return output, sample
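The reparameterization trick this connector relies on can be sketched outside TensorFlow (illustrative NumPy, not Texar code): instead of sampling `z ~ N(loc, scale²)` directly, one samples `eps ~ N(0, I)` and computes `z = loc + scale * eps`, making the sample a deterministic, differentiable function of `loc` and `scale` so gradients can flow back through it.

```python
import numpy as np

def reparameterized_sample(loc, scale_diag, rng):
    # z = loc + scale * eps, eps ~ N(0, I): the randomness is isolated
    # in eps, so z is differentiable w.r.t. loc and scale_diag.
    eps = rng.standard_normal(loc.shape)
    return loc + scale_diag * eps

rng = np.random.default_rng(0)
loc = np.zeros((64, 10))       # e.g. a VAE posterior mean
scale = np.ones((64, 10))      # e.g. a VAE posterior scale
z = reparameterized_sample(loc, scale, rng)  # shape (64, 10)
```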
class StochasticConnector(ConnectorBase):
"""Samples from a distribution and transforms samples into specified size.
The connector is the same as
:class:`~texar.modules.ReparameterizedStochasticConnector`, except that
here reparameterization is disabled, and thus the gradients cannot be
back-propagated through the stochastic samples.
Args:
output_size: Size of output **excluding** the batch dimension. For
example, set `output_size` to `dim` to generate output of
shape `[batch_size, dim]`.
Can be an `int`, a tuple of `int`, a TensorShape, or a tuple of
TensorShapes.
For example, to transform inputs to have decoder state size, set
`output_size=decoder.state_size`.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
.. document private functions
.. automethod:: _build
"""
def __init__(self, output_size, hparams=None):
ConnectorBase.__init__(self, output_size, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"activation_fn": "identity",
"name": "stochastic_connector"
}
Here:
"activation_fn" : str
The activation function applied to the outputs of the MLP
transformation layer. Can
be a function, or its name or module path.
"name" : str
Name of the connector.
"""
return {
"activation_fn": "tensorflow.identity",
"name": "stochastic_connector"
}
def _build(self,
distribution='MultivariateNormalDiag',
distribution_kwargs=None,
transform=False,
num_samples=None):
"""Samples from a distribution and optionally performs transformation
with an MLP layer.
The inputs and outputs are the same as
:class:`~texar.modules.ReparameterizedStochasticConnector` except that
the distribution does not need to be reparameterizable, and gradients
cannot be back-propagated through the samples.
Args:
distribution: An instance of a subclass of
:tf_main:`TF Distribution `,
or :tf_hmpg:`tensorflow_probability Distribution `.
Can be a class, its name or module path, or a class instance.
distribution_kwargs (dict, optional): Keyword arguments for the
distribution constructor. Ignored if `distribution` is a
class instance.
transform (bool): Whether to perform MLP transformation of the
distribution samples. If `False`, the structure/shape of a
sample must match :attr:`output_size`.
num_samples (optional): An `int` or `int` Tensor. Number of samples
to generate. If not given, generate a single sample. Note
that if batch size has already been included in
`distribution`'s dimensionality, `num_samples` should be
left as `None`.
Returns:
A tuple (output, sample), where
- output: A Tensor or a (nested) tuple of Tensors with the same \
structure and size of :attr:`output_size`. The batch dimension \
equals :attr:`num_samples` if specified, or is determined by the \
distribution dimensionality.
- sample: The sample from the distribution, prior to transformation.
Raises:
ValueError: The output does not match :attr:`output_size`.
"""
dstr = check_or_get_instance(
distribution, distribution_kwargs,
["tensorflow.distributions", "tensorflow_probability.distributions",
"tensorflow.contrib.distributions", "texar.custom"])
if num_samples:
output = dstr.sample(num_samples)
else:
output = dstr.sample()
if dstr.event_shape == []:
output = tf.reshape(output,
output.shape.concatenate(tf.TensorShape(1)))
# Disable gradients through samples
output = tf.stop_gradient(output)
output = tf.cast(output, tf.float32)
if transform:
fn_modules = ['tensorflow', 'tensorflow.nn', 'texar.custom']
activation_fn = get_function(self.hparams.activation_fn, fn_modules)
output = _mlp_transform(output, self._output_size, activation_fn)
_assert_same_size(output, self._output_size)
if not self._built:
self._add_internal_trainable_variables()
self._built = True
return output
#class ConcatConnector(ConnectorBase):
# """Concatenates multiple connectors into one connector. Used in, e.g.,
# semi-supervised variational autoencoders, disentangled representation
# learning, and other models.
#
# Args:
# output_size: Size of output excluding the batch dimension (eg.
# :attr:`output_size = p` if :attr:`output.shape` is :attr:`[N, p]`).
# Can be an int, a tuple of int, a Tensorshape, or a tuple of
# TensorShapes.
# For example, to transform to decoder state size, set
# `output_size=decoder.cell.state_size`.
# hparams (dict): Hyperparameters of the connector.
# """
#
# def __init__(self, output_size, hparams=None):
# ConnectorBase.__init__(self, output_size, hparams)
#
# @staticmethod
# def default_hparams():
# """Returns a dictionary of hyperparameters with default values.
#
# Returns:
# .. code-block:: python
#
# {
# "activation_fn": "tensorflow.identity",
# "name": "concat_connector"
# }
#
# Here:
#
# "activation_fn" : (str or callable)
# The name or full path to the activation function applied to
# the outputs of the MLP layer. The activation functions can be:
#
# - Built-in activation functions defined in :mod:`tf` or \
# :mod:`tf.nn`, e.g., :tf_main:`identity `.
# - User-defined activation functions in `texar.custom`.
# - External activation functions. Must provide the full path, \
# e.g., "my_module.my_activation_fn".
#
# The default value is :attr:`"identity"`, i.e., the MLP
# transformation is linear.
#
# "name" : str
# Name of the connector.
#
# The default value is "concat_connector".
# """
# return {
# "activation_fn": "tensorflow.identity",
# "name": "concat_connector"
# }
#
# def _build(self, connector_inputs, transform=True):
# """Concatenate multiple input connectors
#
# Args:
# connector_inputs: a list of connector states
# transform (bool): If `True`, then the output are automatically
# transformed to match :attr:`output_size`.
#
# Returns:
# A Tensor or a (nested) tuple of Tensors of the same structure of
# the decoder state.
# """
# connector_inputs = [tf.cast(connector, tf.float32)
# for connector in connector_inputs]
# output = tf.concat(connector_inputs, axis=1)
#
# if transform:
# fn_modules = ['texar.custom', 'tensorflow', 'tensorflow.nn']
# activation_fn = get_function(self.hparams.activation_fn,
# fn_modules)
# output = _mlp_transform(output, self._output_size, activation_fn)
# _assert_same_size(output, self._output_size)
#
# self._add_internal_trainable_variables()
# self._built = True
#
# return output
================================================
FILE: texar_repo/texar/modules/connectors/connectors_test.py
================================================
"""
Unit tests for connectors.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from tensorflow_probability import distributions as tfpd
from tensorflow.python.util import nest # pylint: disable=E0611
from texar.core import layers
from texar.modules import ConstantConnector
from texar.modules import MLPTransformConnector
from texar.modules import ReparameterizedStochasticConnector
from texar.modules.connectors.connectors import _assert_same_size
# pylint: disable=too-many-locals, invalid-name
class TestConnectors(tf.test.TestCase):
"""Tests various connectors.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._batch_size = 100
self._decoder_cell = layers.get_rnn_cell(
layers.default_rnn_cell_hparams())
def test_constant_connector(self):
"""Tests the logic of
:class:`~texar.modules.connectors.ConstantConnector`.
"""
connector = ConstantConnector(self._decoder_cell.state_size)
decoder_initial_state_0 = connector(self._batch_size)
decoder_initial_state_1 = connector(self._batch_size, value=1.)
nest.assert_same_structure(decoder_initial_state_0,
self._decoder_cell.state_size)
nest.assert_same_structure(decoder_initial_state_1,
self._decoder_cell.state_size)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
s_0, s_1 = sess.run(
[decoder_initial_state_0, decoder_initial_state_1])
self.assertEqual(nest.flatten(s_0)[0][0, 0], 0.)
self.assertEqual(nest.flatten(s_1)[0][0, 0], 1.)
def test_forward_connector(self):
"""Tests the logic of
:class:`~texar.modules.connectors.ForwardConnector`.
"""
# TODO(zhiting)
pass
def test_mlp_transform_connector(self):
"""Tests the logic of
:class:`~texar.modules.connectors.MLPTransformConnector`.
"""
connector = MLPTransformConnector(self._decoder_cell.state_size)
output = connector(tf.zeros([5, 10]))
nest.assert_same_structure(output, self._decoder_cell.state_size)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
output_ = sess.run(output)
nest.assert_same_structure(output_, self._decoder_cell.state_size)
def test_reparameterized_stochastic_connector(self):
"""Tests the logic of
:class:`~texar.modules.ReparameterizedStochasticConnector`.
"""
state_size = (10, 10)
variable_size = 100
state_size_ts = (tf.TensorShape([10, 10]), tf.TensorShape([2, 3, 4]))
sample_num = 10
mu = tf.zeros([self._batch_size, variable_size])
var = tf.ones([self._batch_size, variable_size])
mu_vec = tf.zeros([variable_size])
var_vec = tf.ones([variable_size])
gauss_ds = tfpd.MultivariateNormalDiag(loc=mu, scale_diag=var)
gauss_ds_vec = tfpd.MultivariateNormalDiag(loc=mu_vec,
scale_diag=var_vec)
gauss_connector = ReparameterizedStochasticConnector(state_size)
gauss_connector_ts = ReparameterizedStochasticConnector(state_size_ts)
output_1, _ = gauss_connector(gauss_ds)
output_2, _ = gauss_connector(
distribution="MultivariateNormalDiag",
distribution_kwargs={"loc": mu, "scale_diag": var})
sample_ts, _ = gauss_connector_ts(gauss_ds)
# specify sample num
sample_test_num, _ = gauss_connector(
gauss_ds_vec, num_samples=sample_num)
# test when :attr:`transform` is False
#sample_test_no_transform = gauss_connector(gauss_ds, transform=False)
test_list = [output_1, output_2, sample_ts, sample_test_num]
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
out_list = sess.run(test_list)
out1 = out_list[0]
out2 = out_list[1]
out_ts = out_list[2]
out_test_num = out_list[3]
# check the same size
self.assertEqual(out_test_num[0].shape,
tf.TensorShape([sample_num, state_size[0]]))
self.assertEqual(out1[0].shape,
tf.TensorShape([self._batch_size, state_size[0]]))
self.assertEqual(out2[0].shape,
tf.TensorShape([self._batch_size, state_size[0]]))
_assert_same_size(out_ts, state_size_ts)
# sample_mu = np.mean(sample_outputs, axis=0)
# # pylint: disable=no-member
# sample_var = np.var(sample_outputs, axis=0)
## check if the value is approximated N(0, 1)
# for i in range(variable_size):
# self.assertAlmostEqual(0, sample_mu[i], delta=0.2)
# self.assertAlmostEqual(1, sample_var[i], delta=0.2)
#def test_concat_connector(self): # pylint: disable=too-many-locals
# """Tests the logic of
# :class:`~texar.modules.connectors.ConcatConnector`.
# """
# gauss_size = 5
# constant_size = 7
# variable_size = 13
# decoder_size1 = 16
# decoder_size2 = (16, 32)
# gauss_connector = StochasticConnector(gauss_size)
# categorical_connector = StochasticConnector(1)
# constant_connector = ConstantConnector(constant_size)
# concat_connector1 = ConcatConnector(decoder_size1)
# concat_connector2 = ConcatConnector(decoder_size2)
# # pylint: disable=invalid-name
# mu = tf.zeros([self._batch_size, gauss_size])
# var = tf.ones([self._batch_size, gauss_size])
# categorical_prob = tf.constant(
# [[0.1, 0.2, 0.7] for _ in xrange(self._batch_size)])
# categorical_ds = tfds.Categorical(probs = categorical_prob)
# gauss_ds = tfds.MultivariateNormalDiag(loc = mu, scale_diag = var)
# gauss_state = gauss_connector(gauss_ds)
# categorical_state = categorical_connector(categorical_ds)
# constant_state = constant_connector(self._batch_size, value=1.)
# with tf.Session() as debug_sess:
# debug_cater = debug_sess.run(categorical_state)
# state1 = concat_connector1(
# [gauss_state, categorical_state, constant_state])
# state2 = concat_connector2(
# [gauss_state, categorical_state, constant_state])
# with self.test_session() as sess:
# sess.run(tf.global_variables_initializer())
# [output1, output2] = sess.run([state1, state2])
# # check the same size
# self.assertEqual(output1.shape[1], decoder_size1)
# self.assertEqual(output2[1].shape[1], decoder_size2[1])
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/modules/decoders/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library decoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.modules.decoders.rnn_decoder_base import *
from texar.modules.decoders.rnn_decoders import *
from texar.modules.decoders.rnn_decoder_helpers import *
from texar.modules.decoders.transformer_decoders import *
from texar.modules.decoders.beam_search_decode import *
================================================
FILE: texar_repo/texar/modules/decoders/beam_search_decode.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Beam search decoding for RNN decoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from tensorflow.contrib.seq2seq import \
dynamic_decode, AttentionWrapperState, AttentionWrapper, \
BeamSearchDecoder, tile_batch
from texar.modules.decoders.rnn_decoder_base import RNNDecoderBase
# pylint: disable=too-many-arguments, protected-access, too-many-locals
# pylint: disable=invalid-name
__all__ = [
"beam_search_decode"
]
def _get_initial_state(initial_state,
tiled_initial_state,
cell,
batch_size,
beam_width,
dtype):
if tiled_initial_state is None:
if isinstance(initial_state, AttentionWrapperState):
raise ValueError(
'`initial_state` must not be an AttentionWrapperState. Use '
'a plain cell state instead, which will be wrapped into an '
'AttentionWrapperState automatically.')
if initial_state is None:
tiled_initial_state = cell.zero_state(batch_size * beam_width,
dtype)
else:
tiled_initial_state = tile_batch(initial_state,
multiplier=beam_width)
if isinstance(cell, AttentionWrapper) and \
not isinstance(tiled_initial_state, AttentionWrapperState):
zero_state = cell.zero_state(batch_size * beam_width, dtype)
tiled_initial_state = zero_state.clone(cell_state=tiled_initial_state)
return tiled_initial_state
def beam_search_decode(decoder_or_cell,
embedding,
start_tokens,
end_token,
beam_width,
initial_state=None,
tiled_initial_state=None,
output_layer=None,
length_penalty_weight=0.0,
max_decoding_length=None,
output_time_major=False,
**kwargs):
"""Performs beam search sampling decoding.
Args:
decoder_or_cell: An instance of
subclass of :class:`~texar.modules.RNNDecoderBase`,
or an instance of :tf_main:`RNNCell `. The
decoder or RNN cell to perform decoding.
embedding: A callable that takes a vector tensor of indexes (e.g.,
an instance of subclass of :class:`~texar.modules.EmbedderBase`),
or the :attr:`params` argument for
:tf_main:`tf.nn.embedding_lookup `.
start_tokens: `int32` vector shaped `[batch_size]`, the start tokens.
end_token: `int32` scalar, the token that marks end of decoding.
beam_width (int): Python integer, the number of beams.
initial_state (optional): Initial state of decoding. If `None`
(default), zero state is used.
The state must **not** be tiled with
:tf_main:`tile_batch `.
If you have an already-tiled initial state, use
:attr:`tiled_initial_state` instead.
In the case of attention RNN decoder, `initial_state` must
**not** be an :tf_main:`AttentionWrapperState
`. Instead, it must be a
state of the wrapped `RNNCell`, which state will be wrapped into
`AttentionWrapperState` automatically.
Ignored if :attr:`tiled_initial_state` is given.
tiled_initial_state (optional): Initial state that has been tiled
(typically with :tf_main:`tile_batch `)
so that the batch dimension has size `batch_size * beam_width`.
In the case of attention RNN decoder, this can be either a state
of the wrapped `RNNCell`, or an `AttentionWrapperState`.
If not given, :attr:`initial_state` is used.
output_layer (optional): A :tf_main:`Layer ` instance to
apply to the RNN output prior to storing the result or sampling. If
`None` and :attr:`decoder_or_cell` is a decoder, the decoder's
output layer will be used.
length_penalty_weight: Float weight to penalize length.
Disabled with `0.0` (default).
max_decoding_length (optional): An `int` scalar Tensor indicating the
maximum allowed number of decoding steps. If `None` (default),
decoding will continue until the end token is encountered.
output_time_major (bool): If `True`, outputs are returned as
time major tensors. If `False` (default), outputs are returned
as batch major tensors.
**kwargs: Other keyword arguments for :tf_main:`dynamic_decode
` except argument
`maximum_iterations` which is set to :attr:`max_decoding_length`.
Returns:
A tuple `(outputs, final_state, sequence_length)`, where
- outputs: An instance of :tf_main:`FinalBeamSearchDecoderOutput \
`.
- final_state: An instance of :tf_main:`BeamSearchDecoderState \
`.
- sequence_length: A Tensor of shape `[batch_size]` containing \
the lengths of samples.
Example:
.. code-block:: python
## Beam search with basic RNN decoder
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)
outputs, _, _, = beam_search_decode(
decoder_or_cell=decoder,
embedding=embedder,
start_tokens=[data.vocab.bos_token_id] * 100,
end_token=data.vocab.eos_token_id,
beam_width=5,
max_decoding_length=60)
sample_ids = sess.run(outputs.predicted_ids)
sample_text = tx.utils.map_ids_to_strs(sample_ids[:,:,0], data.vocab)
print(sample_text)
# [
# the first sequence sample .
# the second sequence sample .
# ...
# ]
.. code-block:: python
## Beam search with attention RNN decoder
# Encodes the source
enc_embedder = WordEmbedder(data.source_vocab.size, ...)
encoder = UnidirectionalRNNEncoder(...)
enc_outputs, enc_state = encoder(
inputs=enc_embedder(data_batch['source_text_ids']),
sequence_length=data_batch['source_length'])
# Decodes while attending to the source
dec_embedder = WordEmbedder(vocab_size=data.target_vocab.size, ...)
decoder = AttentionRNNDecoder(
memory=enc_outputs,
memory_sequence_length=data_batch['source_length'],
vocab_size=data.target_vocab.size)
# Beam search
outputs, _, _ = beam_search_decode(
decoder_or_cell=decoder,
embedding=dec_embedder,
start_tokens=[data.vocab.bos_token_id] * 100,
end_token=data.vocab.eos_token_id,
beam_width=5,
initial_state=enc_state,
max_decoding_length=60)
"""
if isinstance(decoder_or_cell, RNNDecoderBase):
cell = decoder_or_cell._get_beam_search_cell(beam_width=beam_width)
elif isinstance(decoder_or_cell, tf.contrib.rnn.RNNCell):
cell = decoder_or_cell
else:
raise ValueError("`decoder` must be an instance of a subclass of "
"either `RNNDecoderBase` or `RNNCell`.")
start_tokens = tf.convert_to_tensor(
start_tokens, dtype=tf.int32, name="start_tokens")
if start_tokens.get_shape().ndims != 1:
raise ValueError("`start_tokens` must be a vector")
batch_size = tf.size(start_tokens)
initial_state = _get_initial_state(
initial_state, tiled_initial_state, cell,
batch_size, beam_width, tf.float32)
if output_layer is None and isinstance(decoder_or_cell, RNNDecoderBase):
output_layer = decoder_or_cell.output_layer
def _decode():
beam_decoder = BeamSearchDecoder(
cell=cell,
embedding=embedding,
start_tokens=start_tokens,
end_token=end_token,
initial_state=initial_state,
beam_width=beam_width,
output_layer=output_layer,
length_penalty_weight=length_penalty_weight)
if 'maximum_iterations' in kwargs:
raise ValueError('Use `max_decoding_length` to set the maximum '
'allowed number of decoding steps.')
outputs, final_state, _ = dynamic_decode(
decoder=beam_decoder,
output_time_major=output_time_major,
maximum_iterations=max_decoding_length,
**kwargs)
return outputs, final_state, final_state.lengths
if isinstance(decoder_or_cell, RNNDecoderBase):
vs = decoder_or_cell.variable_scope
with tf.variable_scope(vs, reuse=tf.AUTO_REUSE):
return _decode()
else:
return _decode()
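For intuition, the search that `beam_search_decode` delegates to TF's `BeamSearchDecoder` can be sketched in plain Python. This is an illustrative toy, not Texar's implementation: the `step_fn` interface, token ids, and hypothesis-tuple layout here are invented for the example.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width, max_len):
    # Each hypothesis: (total log-prob, token sequence, finished?)
    beams = [(0.0, [start_token], False)]
    for _ in range(max_len):
        candidates = []
        for logp, seq, done in beams:
            if done:
                # Finished hypotheses are carried over unchanged.
                candidates.append((logp, seq, True))
                continue
            # step_fn maps a prefix to {next_token: log_prob}.
            for tok, lp in step_fn(seq).items():
                candidates.append((logp + lp, seq + [tok],
                                   tok == end_token))
        # Keep only the `beam_width` best-scoring hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(done for _, _, done in beams):
            break
    return beams[0][1]

# Toy model: after token 0 prefer 1, after 1 prefer the end token 2.
def toy_step(seq):
    if seq[-1] == 0:
        return {1: math.log(0.9), 2: math.log(0.1)}
    return {2: math.log(0.9), 1: math.log(0.1)}

best = beam_search(toy_step, start_token=0, end_token=2,
                   beam_width=2, max_len=5)
# best == [0, 1, 2]
```

Unlike this toy, the TF decoder batches all beams into tiled tensors (hence the `tiled_initial_state` handling above), but the expand/score/prune loop is the same idea.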
================================================
FILE: texar_repo/texar/modules/decoders/beam_search_decode_test.py
================================================
"""
Unit tests for beam search decoding.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
from tensorflow.contrib.seq2seq import dynamic_decode
from tensorflow.contrib.seq2seq import BeamSearchDecoder, tile_batch
import texar as tx
from texar.modules.decoders.beam_search_decode import beam_search_decode
from texar import context
# pylint: disable=no-member, too-many-instance-attributes, invalid-name
# pylint: disable=too-many-locals, too-many-arguments
class BeamSearchDecodeTest(tf.test.TestCase):
"""Tests
:func:`texar.modules.decoders.beam_search_decode.beam_search_decode`.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._vocab_size = 10
self._max_time = 16
self._batch_size = 8
self._emb_dim = 20
self._cell_dim = 256
self._attention_dim = self._cell_dim
self._beam_width = 11
self._inputs = tf.random_uniform(
[self._batch_size, self._max_time, self._emb_dim],
maxval=1., dtype=tf.float32)
self._embedding = tf.random_uniform(
[self._vocab_size, self._emb_dim], maxval=1., dtype=tf.float32)
self._encoder_output = tf.random_uniform(
[self._batch_size, self._max_time, 64])
def _test_beam_search(
self, decoder, initial_state=None, tiled_initial_state=None,
tf_initial_state=None, beam_width_1=1, initiated=False):
## Compare with tf built-in BeamSearchDecoder
outputs, final_state, _ = beam_search_decode(
decoder_or_cell=decoder,
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2,
beam_width=beam_width_1,
max_decoding_length=20)
self.assertIsInstance(
outputs, tf.contrib.seq2seq.FinalBeamSearchDecoderOutput)
self.assertIsInstance(
final_state, tf.contrib.seq2seq.BeamSearchDecoderState)
num_trainable_variables = len(tf.trainable_variables())
_ = decoder(
decoding_strategy='infer_greedy',
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2,
max_decoding_length=20)
self.assertEqual(num_trainable_variables, len(tf.trainable_variables()))
if tf_initial_state is None:
tf_initial_state = decoder.cell.zero_state(
self._batch_size * beam_width_1, tf.float32)
beam_decoder = BeamSearchDecoder(
cell=decoder.cell,
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2,
initial_state=tf_initial_state,
beam_width=beam_width_1,
output_layer=decoder.output_layer)
outputs_1, final_state_1, _ = dynamic_decode(
decoder=beam_decoder, maximum_iterations=20)
## Tests time major
outputs_2, _, _ = beam_search_decode(
decoder_or_cell=decoder,
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2,
beam_width=self._beam_width,
initial_state=initial_state,
tiled_initial_state=tiled_initial_state,
max_decoding_length=21)
outputs_3, _, _ = beam_search_decode(
decoder_or_cell=decoder,
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2,
beam_width=self._beam_width,
initial_state=initial_state,
tiled_initial_state=tiled_initial_state,
max_decoding_length=21,
output_time_major=True)
with self.test_session() as sess:
if not initiated:
sess.run(tf.global_variables_initializer())
outputs_, final_state_, outputs_1_, final_state_1_ = sess.run(
[outputs, final_state, outputs_1, final_state_1],
feed_dict={context.global_mode():
tf.estimator.ModeKeys.PREDICT})
np.testing.assert_array_equal(
outputs_.predicted_ids, outputs_1_.predicted_ids)
np.testing.assert_array_equal(
outputs_.beam_search_decoder_output.scores,
outputs_1_.beam_search_decoder_output.scores)
np.testing.assert_array_equal(
outputs_.beam_search_decoder_output.predicted_ids,
outputs_1_.beam_search_decoder_output.predicted_ids)
np.testing.assert_array_equal(
outputs_.beam_search_decoder_output.parent_ids,
outputs_1_.beam_search_decoder_output.parent_ids)
np.testing.assert_array_equal(
final_state_.log_probs, final_state_1_.log_probs)
np.testing.assert_array_equal(
final_state_.lengths, final_state_1_.lengths)
outputs_2_, outputs_3_ = sess.run(
[outputs_2, outputs_3],
feed_dict={context.global_mode():
tf.estimator.ModeKeys.PREDICT})
self.assertEqual(outputs_2_.predicted_ids.shape,
tuple([self._batch_size, 21, 11]))
self.assertEqual(outputs_3_.predicted_ids.shape,
tuple([21, self._batch_size, 11]))
def test_basic_rnn_decoder_beam_search(self):
"""Tests beam search with BasicRNNDecoder.
"""
hparams = {
"rnn_cell": {
"kwargs": {"num_units": self._cell_dim}
}
}
decoder = tx.modules.BasicRNNDecoder(
vocab_size=self._vocab_size,
hparams=hparams)
self._test_beam_search(decoder)
self._test_beam_search(
decoder, beam_width_1=self._beam_width, initiated=True)
def test_basic_rnn_decoder_given_initial_state(self):
"""Tests beam search with BasicRNNDecoder given initial state.
"""
hparams = {
"rnn_cell": {
"kwargs": {"num_units": self._cell_dim}
}
}
decoder = tx.modules.BasicRNNDecoder(
vocab_size=self._vocab_size,
hparams=hparams)
# (zhiting): The beam search decoder does not generate max-length
# samples if only one cell_state is created. Perhaps due to
# random seed or bugs?
cell_state = decoder.cell.zero_state(self._batch_size, tf.float32)
cell_state = decoder.cell.zero_state(self._batch_size, tf.float32)
self._test_beam_search(decoder, initial_state=cell_state)
tiled_cell_state = tile_batch(cell_state, multiplier=self._beam_width)
self._test_beam_search(
decoder, tiled_initial_state=tiled_cell_state, initiated=True)
def test_attention_decoder_beam_search(self):
"""Tests beam search with RNNAttentionDecoder.
"""
seq_length = np.random.randint(
self._max_time, size=[self._batch_size]) + 1
encoder_values_length = tf.constant(seq_length)
hparams = {
"attention": {
"kwargs": {"num_units": self._attention_dim}
},
"rnn_cell": {
"kwargs": {"num_units": self._cell_dim}
}
}
decoder = tx.modules.AttentionRNNDecoder(
vocab_size=self._vocab_size,
memory=self._encoder_output,
memory_sequence_length=encoder_values_length,
hparams=hparams)
self._test_beam_search(decoder)
def test_attention_decoder_given_initial_state(self):
"""Tests beam search with RNNAttentionDecoder given initial state.
"""
seq_length = np.random.randint(
self._max_time, size=[self._batch_size]) + 1
encoder_values_length = tf.constant(seq_length)
hparams = {
"attention": {
"kwargs": {"num_units": self._attention_dim}
},
"rnn_cell": {
"kwargs": {"num_units": self._cell_dim}
}
}
decoder = tx.modules.AttentionRNNDecoder(
vocab_size=self._vocab_size,
memory=self._encoder_output,
memory_sequence_length=encoder_values_length,
hparams=hparams)
state = decoder.cell.zero_state(self._batch_size, tf.float32)
cell_state = state.cell_state
self._test_beam_search(decoder, initial_state=cell_state)
tiled_cell_state = tile_batch(cell_state, multiplier=self._beam_width)
self._test_beam_search(
decoder, tiled_initial_state=tiled_cell_state, initiated=True)
if __name__ == "__main__":
tf.test.main()
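The `length_penalty_weight` argument exercised in the tests above is passed through to TF's `BeamSearchDecoder`, which applies the GNMT-style length normalization: a hypothesis's log-probability is divided by `((5 + length) / 6) ** weight`, so a weight of `0.0` (the documented default) disables the penalty. A quick sketch of that formula, for reference:

```python
def length_penalty(length, weight):
    # GNMT-style normalization as used by tf.contrib.seq2seq's beam
    # search: hypothesis scores are divided by this factor.
    return ((5.0 + length) / 6.0) ** weight

# weight = 0.0 disables the penalty: the factor is 1 for any length.
assert length_penalty(10, 0.0) == 1.0
# With weight > 0, dividing by a larger factor for longer sequences
# counteracts beam search's bias toward short hypotheses.
assert length_penalty(7, 1.0) == 2.0
```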
================================================
FILE: texar_repo/texar/modules/decoders/rnn_decoder_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for RNN decoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=too-many-arguments, no-name-in-module
# pylint: disable=too-many-branches, protected-access, too-many-locals
# pylint: disable=arguments-differ, unused-argument
import copy
import tensorflow as tf
from tensorflow.contrib.seq2seq import Decoder as TFDecoder
from tensorflow.contrib.seq2seq import dynamic_decode
from tensorflow.python.framework import tensor_shape
from tensorflow.python.util import nest
from texar.core import layers
from texar.utils import utils
from texar.utils.mode import is_train_mode, is_train_mode_py
from texar.module_base import ModuleBase
from texar.modules.decoders import rnn_decoder_helpers
__all__ = [
"RNNDecoderBase"
]
class RNNDecoderBase(ModuleBase, TFDecoder):
"""Base class inherited by all RNN decoder classes.
See :class:`~texar.modules.BasicRNNDecoder` for the arguments.
See :meth:`_build` for the inputs and outputs of RNN decoders in general.
.. document private functions
.. automethod:: _build
"""
def __init__(self,
cell=None,
vocab_size=None,
output_layer=None,
cell_dropout_mode=None,
hparams=None):
ModuleBase.__init__(self, hparams)
self._helper = None
self._initial_state = None
# Make rnn cell
with tf.variable_scope(self.variable_scope):
if cell is not None:
self._cell = cell
else:
self._cell = layers.get_rnn_cell(
self._hparams.rnn_cell, cell_dropout_mode)
self._beam_search_cell = None
# Make the output layer
self._vocab_size = vocab_size
self._output_layer = output_layer
if output_layer is None:
if self._vocab_size is None:
raise ValueError(
"Either `output_layer` or `vocab_size` must be provided. "
"Set `output_layer=tf.identity` if no output layer is "
"wanted.")
with tf.variable_scope(self.variable_scope):
self._output_layer = tf.layers.Dense(units=self._vocab_size)
elif output_layer is not tf.identity:
if not isinstance(output_layer, tf.layers.Layer):
raise ValueError(
"`output_layer` must be either `tf.identity` or "
"an instance of `tf.layers.Layer`.")
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
The hyperparameters are the same as in
:meth:`~texar.modules.BasicRNNDecoder.default_hparams` of
:class:`~texar.modules.BasicRNNDecoder`, except that the default
"name" here is "rnn_decoder".
"""
return {
"rnn_cell": layers.default_rnn_cell_hparams(),
"helper_train": rnn_decoder_helpers.default_helper_train_hparams(),
"helper_infer": rnn_decoder_helpers.default_helper_infer_hparams(),
"max_decoding_length_train": None,
"max_decoding_length_infer": None,
"name": "rnn_decoder"
}
def _build(self,
decoding_strategy="train_greedy",
initial_state=None,
inputs=None,
sequence_length=None,
embedding=None,
start_tokens=None,
end_token=None,
softmax_temperature=None,
max_decoding_length=None,
impute_finished=False,
output_time_major=False,
input_time_major=False,
helper=None,
mode=None,
**kwargs):
"""Performs decoding. This is a shared interface for both
:class:`~texar.modules.BasicRNNDecoder` and
:class:`~texar.modules.AttentionRNNDecoder`.
The function provides **3 ways** to specify the
decoding method, with varying flexibility:
1. The :attr:`decoding_strategy` argument: A string taking value of:
- **"train_greedy"**: decoding in teacher-forcing fashion \
(i.e., feeding \
`ground truth` to decode the next step), and each sample is \
obtained by taking the `argmax` of the RNN output logits. \
Arguments :attr:`(inputs, sequence_length, input_time_major)` \
are required for this strategy, and argument :attr:`embedding` \
is optional.
- **"infer_greedy"**: decoding in inference fashion (i.e., feeding \
the `generated` sample to decode the next step), and each sample\
is obtained by taking the `argmax` of the RNN output logits.\
Arguments :attr:`(embedding, start_tokens, end_token)` are \
required for this strategy, and argument \
:attr:`max_decoding_length` is optional.
- **"infer_sample"**: decoding in inference fashion, and each
sample is obtained by `random sampling` from the RNN output
distribution. Arguments \
:attr:`(embedding, start_tokens, end_token)` are \
required for this strategy, and argument \
:attr:`max_decoding_length` is optional.
This argument is used only when argument :attr:`helper` is `None`.
Example:
.. code-block:: python
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)
# Teacher-forcing decoding
outputs_1, _, _ = decoder(
decoding_strategy='train_greedy',
inputs=embedder(data_batch['text_ids']),
sequence_length=data_batch['length']-1)
# Random sample decoding. Gets 100 sequence samples
outputs_2, _, sequence_length = decoder(
decoding_strategy='infer_sample',
start_tokens=[data.vocab.bos_token_id]*100,
end_token=data.vocab.eos_token_id,
embedding=embedder,
max_decoding_length=60)
2. The :attr:`helper` argument: An instance of subclass of \
:tf_main:`tf.contrib.seq2seq.Helper `. This \
provides a superset of the decoding strategies above, for example:
- :tf_main:`TrainingHelper
` corresponding to the \
"train_greedy" strategy.
- :tf_main:`ScheduledEmbeddingTrainingHelper
` and \
:tf_main:`ScheduledOutputTrainingHelper
` for scheduled \
sampling.
- :class:`~texar.modules.SoftmaxEmbeddingHelper` and \
:class:`~texar.modules.GumbelSoftmaxEmbeddingHelper` for \
soft decoding and gradient backpropagation.
This method gives maximal flexibility for configuring the decoding\
strategy.
Example:
.. code-block:: python
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)
# Teacher-forcing decoding, same as above with
# `decoding_strategy='train_greedy'`
helper_1 = tf.contrib.seq2seq.TrainingHelper(
inputs=embedder(data_batch['text_ids']),
sequence_length=data_batch['length']-1)
outputs_1, _, _ = decoder(helper=helper_1)
# Gumbel-softmax decoding
helper_2 = GumbelSoftmaxEmbeddingHelper(
embedding=embedder,
start_tokens=[data.vocab.bos_token_id]*100,
end_token=data.vocab.eos_token_id,
tau=0.1)
outputs_2, _, sequence_length = decoder(
max_decoding_length=60, helper=helper_2)
3. :attr:`hparams["helper_train"]` and :attr:`hparams["helper_infer"]`:\
Specifying the helper through hyperparameters. Train and infer \
strategy is toggled based on :attr:`mode`. Appropriate arguments \
(e.g., :attr:`inputs`, :attr:`start_tokens`, etc) are selected to \
construct the helper. Additional arguments for helper constructor \
can be provided either through :attr:`**kwargs`, or through \
:attr:`hparams["helper_train/infer"]["kwargs"]`.
This method is used only when both :attr:`decoding_strategy` and \
:attr:`helper` are `None`.
Example:
.. code-block:: python
h = {
"helper_infer": {
"type": "GumbelSoftmaxEmbeddingHelper",
"kwargs": { "tau": 0.1 }
}
}
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size, hparams=h)
# Gumbel-softmax decoding
output, _, _ = decoder(
decoding_strategy=None, # Set to None explicitly
embedding=embedder,
start_tokens=[data.vocab.bos_token_id]*100,
end_token=data.vocab.eos_token_id,
max_decoding_length=60,
mode=tf.estimator.ModeKeys.PREDICT)
# PREDICT mode also shuts down dropout
Args:
decoding_strategy (str): A string specifying the decoding
strategy. Different arguments are required based on the
strategy.
Ignored if :attr:`helper` is given.
initial_state (optional): Initial state of decoding.
If `None` (default), zero state is used.
inputs (optional): Input tensors for teacher forcing decoding.
Used when `decoding_strategy` is set to "train_greedy", or
when `hparams`-configured helper is used.
- If :attr:`embedding` is `None`, `inputs` is directly \
fed to the decoder. E.g., in `"train_greedy"` strategy, \
`inputs` must be a 3D Tensor of shape \
`[batch_size, max_time, emb_dim]` (or \
`[max_time, batch_size, emb_dim]` if `input_time_major`==True).
- If `embedding` is given, `inputs` is used as indexes \
to look up embeddings that are fed to the decoder. \
E.g., if `embedding` is an instance of \
:class:`~texar.modules.WordEmbedder`, \
then :attr:`inputs` is usually a 2D int Tensor \
`[batch_size, max_time]` (or \
`[max_time, batch_size]` if `input_time_major`==True) \
containing the token indexes.
sequence_length (optional): A 1D int Tensor containing the
sequence length of :attr:`inputs`.
Used when `decoding_strategy="train_greedy"` or
`hparams`-configured helper is used.
embedding (optional): A callable that returns embedding vectors
of `inputs` (e.g., an instance of subclass of
:class:`~texar.modules.EmbedderBase`), or the `params`
argument of
:tf_main:`tf.nn.embedding_lookup `.
If provided, `inputs` (if used) will be passed to
`embedding` to fetch the embedding vectors of the inputs.
Required when `decoding_strategy="infer_greedy"`
or `"infer_sample"`; optional when
`decoding_strategy="train_greedy"`.
start_tokens (optional): An int Tensor of shape `[batch_size]`,
the start tokens.
Used when `decoding_strategy="infer_greedy"` or
`"infer_sample"`, or when `hparams`-configured
helper is used.
When used with the Texar data module, where `batch_size` varies
with the data batches, this can be set as
`start_tokens=tf.ones_like(batch['length'])*bos_token_id`.
end_token (optional): An int 0D Tensor, the token that marks end
of decoding.
Used when `decoding_strategy="infer_greedy"` or
`"infer_sample"`, or when `hparams`-configured
helper is used.
softmax_temperature (optional): A float 0D Tensor, value to divide
the logits by before computing the softmax. Larger values
(above 1.0) result in more random samples. Must be > 0. If `None`,
1.0 is used.
Used when `decoding_strategy="infer_sample"`.
max_decoding_length: An int scalar Tensor indicating the maximum
allowed number of decoding steps. If `None` (default), either
`hparams["max_decoding_length_train"]` or
`hparams["max_decoding_length_infer"]` is used
according to :attr:`mode`.
impute_finished (bool): If `True`, then states for batch
entries which are marked as finished get copied through and
the corresponding outputs get zeroed out. This causes some
slowdown at each time step, but ensures that the final state
and outputs have the correct values and that backprop ignores
time steps that were marked as finished.
output_time_major (bool): If `True`, outputs are returned as
time major tensors. If `False` (default), outputs are returned
as batch major tensors.
input_time_major (optional): Whether the :attr:`inputs` tensor is
time major.
Used when `decoding_strategy="train_greedy"` or
`hparams`-configured helper is used.
helper (optional): An instance of
:tf_main:`Helper `
that defines the decoding strategy. If given,
`decoding_strategy`
and helper configs in :attr:`hparams` are ignored.
mode (str, optional): A string taking value in
:tf_main:`tf.estimator.ModeKeys `. If
`TRAIN`, training related hyperparameters are used (e.g.,
`hparams['max_decoding_length_train']`), otherwise,
inference related hyperparameters are used (e.g.,
`hparams['max_decoding_length_infer']`).
If `None` (default), `TRAIN` mode is used.
**kwargs: Other keyword arguments for constructing helpers
defined by `hparams["helper_train"]` or
`hparams["helper_infer"]`.
Returns:
`(outputs, final_state, sequence_lengths)`, where
- **`outputs`**: an object containing the decoder output on all \
time steps.
- **`final_state`**: is the cell state of the final time step.
- **`sequence_lengths`**: is an int Tensor of shape `[batch_size]` \
containing the length of each sample.
"""
# Helper
if helper is not None:
pass
elif decoding_strategy is not None:
if decoding_strategy == "train_greedy":
helper = rnn_decoder_helpers._get_training_helper(
inputs, sequence_length, embedding, input_time_major)
elif decoding_strategy == "infer_greedy":
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
embedding, start_tokens, end_token)
elif decoding_strategy == "infer_sample":
helper = tf.contrib.seq2seq.SampleEmbeddingHelper(
embedding, start_tokens, end_token, softmax_temperature)
else:
raise ValueError(
"Unknown decoding strategy: {}".format(decoding_strategy))
else:
if is_train_mode_py(mode):
kwargs_ = copy.copy(self._hparams.helper_train.kwargs.todict())
helper_type = self._hparams.helper_train.type
else:
kwargs_ = copy.copy(self._hparams.helper_infer.kwargs.todict())
helper_type = self._hparams.helper_infer.type
kwargs_.update({
"inputs": inputs,
"sequence_length": sequence_length,
"time_major": input_time_major,
"embedding": embedding,
"start_tokens": start_tokens,
"end_token": end_token,
"softmax_temperature": softmax_temperature})
kwargs_.update(kwargs)
helper = rnn_decoder_helpers.get_helper(helper_type, **kwargs_)
self._helper = helper
# Initial state
if initial_state is not None:
self._initial_state = initial_state
else:
self._initial_state = self.zero_state(
batch_size=self.batch_size, dtype=tf.float32)
# Maximum decoding length
max_l = max_decoding_length
if max_l is None:
max_l_train = self._hparams.max_decoding_length_train
if max_l_train is None:
max_l_train = utils.MAX_SEQ_LENGTH
max_l_infer = self._hparams.max_decoding_length_infer
if max_l_infer is None:
max_l_infer = utils.MAX_SEQ_LENGTH
max_l = tf.cond(is_train_mode(mode),
lambda: max_l_train, lambda: max_l_infer)
# Decode
outputs, final_state, sequence_lengths = dynamic_decode(
decoder=self, impute_finished=impute_finished,
maximum_iterations=max_l, output_time_major=output_time_major)
if not self._built:
self._add_internal_trainable_variables()
# Add trainable variables of `self._cell` which may be
# constructed externally.
self._add_trainable_variable(
layers.get_rnn_cell_trainable_variables(self._cell))
if isinstance(self._output_layer, tf.layers.Layer):
self._add_trainable_variable(
self._output_layer.trainable_variables)
# Add trainable variables of `self._beam_search_rnn_cell` which
# may already be constructed and used.
if self._beam_search_cell is not None:
self._add_trainable_variable(
self._beam_search_cell.trainable_variables)
self._built = True
return outputs, final_state, sequence_lengths
def _get_beam_search_cell(self, **kwargs):
self._beam_search_cell = self._cell
return self._cell
def _rnn_output_size(self):
size = self._cell.output_size
if self._output_layer is tf.identity:
return size
else:
# To use layer's compute_output_shape, we need to convert the
# RNNCell's output_size entries into shapes with an unknown
# batch size. We then pass this through the layer's
# compute_output_shape and read off all but the first (batch)
# dimensions to get the output size of the rnn with the layer
# applied to the top.
output_shape_with_unknown_batch = nest.map_structure(
lambda s: tensor_shape.TensorShape([None]).concatenate(s),
size)
layer_output_shape = self._output_layer.compute_output_shape(
output_shape_with_unknown_batch)
return nest.map_structure(lambda s: s[1:], layer_output_shape)
@property
def batch_size(self):
return self._helper.batch_size
@property
def output_size(self):
"""Output size of one step.
"""
raise NotImplementedError
@property
def output_dtype(self):
"""Types of output of one step.
"""
raise NotImplementedError
def initialize(self, name=None):
# Inherits from TFDecoder
# All RNN decoder classes must implement this
raise NotImplementedError
def step(self, time, inputs, state, name=None):
# Inherits from TFDecoder
# All RNN decoder classes must implement this
raise NotImplementedError
def finalize(self, outputs, final_state, sequence_lengths):
# Inherits from TFDecoder
# All RNN decoder classes must implement this
raise NotImplementedError
@property
def cell(self):
"""The RNN cell.
"""
return self._cell
def zero_state(self, batch_size, dtype):
"""Zero state of the RNN cell.
Equivalent to :attr:`decoder.cell.zero_state`.
"""
return self._cell.zero_state(
batch_size=batch_size, dtype=dtype)
@property
def state_size(self):
"""The state size of decoder cell.
Equivalent to :attr:`decoder.cell.state_size`.
"""
return self.cell.state_size
@property
def vocab_size(self):
"""The vocab size.
"""
return self._vocab_size
@property
def output_layer(self):
"""The output layer.
"""
return self._output_layer
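The `softmax_temperature` argument documented in `_build` above divides the logits before the softmax, so values above 1.0 flatten the output distribution and make `"infer_sample"` decoding more random, while values below 1.0 sharpen it. A minimal, library-free illustration (the toy logits are invented for the example):

```python
import math

def tempered_softmax(logits, temperature=1.0):
    # Divide the logits by the temperature before normalizing, as
    # SampleEmbeddingHelper does with `softmax_temperature`.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = tempered_softmax([2.0, 1.0, 0.0], temperature=0.5)
flat = tempered_softmax([2.0, 1.0, 0.0], temperature=5.0)
# Higher temperature -> flatter distribution over the vocabulary,
# so random sampling strays from the argmax more often.
assert max(sharp) > max(flat)
```

The `tau` argument of `SoftmaxEmbeddingHelper` below plays the same role for soft decoding.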
================================================
FILE: texar_repo/texar/modules/decoders/rnn_decoder_helpers.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various helper classes and utilities for RNN decoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from tensorflow.contrib.seq2seq import TrainingHelper as TFTrainingHelper
from tensorflow.contrib.seq2seq import Helper as TFHelper
from tensorflow.contrib.distributions import RelaxedOneHotCategorical \
as GumbelSoftmax
from texar.modules.embedders.embedder_base import EmbedderBase
from texar.utils import utils
# pylint: disable=not-context-manager, too-many-arguments
# pylint: disable=too-many-instance-attributes
__all__ = [
"default_helper_train_hparams",
"default_helper_infer_hparams",
"get_helper",
"_get_training_helper",
"GumbelSoftmaxEmbeddingHelper",
"SoftmaxEmbeddingHelper",
]
def default_helper_train_hparams():
"""Returns default hyperparameters of an RNN decoder helper in the training
phase.
See also :meth:`~texar.modules.decoders.rnn_decoder_helpers.get_helper`
for information of the hyperparameters.
Returns:
dict: A dictionary with following structure and values:
.. code-block:: python
{
# The `helper_type` argument for `get_helper`, i.e., the name
# or full path to the helper class.
"type": "TrainingHelper",
# The `**kwargs` argument for `get_helper`, i.e., additional
# keyword arguments for constructing the helper.
"kwargs": {}
}
"""
return {
"type": "TrainingHelper",
"kwargs": {}
}
def default_helper_infer_hparams():
"""Returns default hyperparameters of an RNN decoder helper in the inference
phase.
See also :meth:`~texar.modules.decoders.rnn_decoder_helpers.get_helper`
for information of the hyperparameters.
Returns:
dict: A dictionary with following structure and values:
.. code-block:: python
{
# The `helper_type` argument for `get_helper`, i.e., the name
# or full path to the helper class.
"type": "SampleEmbeddingHelper",
# The `**kwargs` argument for `get_helper`, i.e., additional
# keyword arguments for constructing the helper.
"kwargs": {}
}
"""
return {
"type": "SampleEmbeddingHelper",
"kwargs": {}
}
def get_helper(helper_type,
inputs=None,
sequence_length=None,
embedding=None,
start_tokens=None,
end_token=None,
**kwargs):
"""Creates a Helper instance.
Args:
helper_type: A :tf_main:`Helper ` class, its
name or module path, or a class instance. If a class instance
is given, it is returned directly.
inputs (optional): Inputs to the RNN decoder, e.g., ground truth
tokens for teacher forcing decoding.
sequence_length (optional): A 1D int Tensor containing the
sequence length of :attr:`inputs`.
embedding (optional): A callable that takes a vector tensor of
indexes (e.g., an instance of subclass of
:class:`~texar.modules.EmbedderBase`), or the `params` argument
for `embedding_lookup` (e.g., the embedding Tensor).
start_tokens (optional): An int Tensor of shape `[batch_size]`,
the start tokens.
end_token (optional): An int 0D Tensor, the token that marks end
of decoding.
**kwargs: Additional keyword arguments for constructing the helper.
Returns:
A helper instance.
"""
module_paths = [
'texar.modules.decoders.rnn_decoder_helpers',
'tensorflow.contrib.seq2seq',
'texar.custom']
class_kwargs = {"inputs": inputs,
"sequence_length": sequence_length,
"embedding": embedding,
"start_tokens": start_tokens,
"end_token": end_token}
class_kwargs.update(kwargs)
return utils.check_or_get_instance_with_redundant_kwargs(
helper_type, class_kwargs, module_paths)
def _get_training_helper( #pylint: disable=invalid-name
inputs, sequence_length, embedding=None, time_major=False, name=None):
"""Returns an instance of :tf_main:`TrainingHelper
` given embeddings.
Args:
inputs: If :attr:`embedding` is given, this is sequences of input
token indexes. If :attr:`embedding` is `None`, this is passed to
TrainingHelper directly.
sequence_length (1D Tensor): Lengths of input token sequences.
embedding (optional): The `params` argument of
:tf_main:`tf.nn.embedding_lookup
` (e.g., the embedding Tensor); or a callable
that takes a vector of integer indexes and returns respective
embedding (e.g., an instance of subclass of
:class:`~texar.modules.EmbedderBase`).
time_major (bool): Whether the tensors in `inputs` are time major.
If `False` (default), they are assumed to be batch major.
name (str, optional): Name scope for any created operations.
Returns:
An instance of TrainingHelper.
Raises:
ValueError: if `sequence_length` is not a 1D tensor.
"""
if embedding is None:
return TFTrainingHelper(inputs=inputs,
sequence_length=sequence_length,
time_major=time_major,
name=name)
with tf.name_scope(name, "TrainingHelper", [embedding, inputs]):
if callable(embedding):
embedding_fn = embedding
else:
embedding_fn = (
lambda ids: tf.nn.embedding_lookup(embedding, ids))
emb_inputs = embedding_fn(inputs)
helper = TFTrainingHelper(inputs=emb_inputs,
sequence_length=sequence_length,
time_major=time_major,
name=name)
return helper
class SoftmaxEmbeddingHelper(TFHelper):
"""A helper that feeds softmax probabilities over vocabulary
to the next step.
Uses the softmax probability vector to pass through word embeddings to
get the next input (i.e., a mixed word embedding).
A subclass of
:tf_main:`Helper `.
Used as a helper to :class:`~texar.modules.RNNDecoderBase` :meth:`_build`
in inference mode.
Args:
embedding: An embedding argument (:attr:`params`) for
:tf_main:`tf.nn.embedding_lookup `, or an
instance of subclass of :class:`texar.modules.EmbedderBase`.
Note that other callables are not acceptable here.
start_tokens: An int tensor shaped `[batch_size]`. The
start tokens.
end_token: An int scalar tensor. The token that marks end of
decoding.
tau: A float scalar tensor, the softmax temperature.
stop_gradient (bool): Whether to stop the gradient backpropagation
when feeding softmax vector to the next step.
use_finish (bool): Whether to stop decoding once `end_token` is
generated. If `False`, decoding will continue until
`max_decoding_length` of the decoder is reached.
"""
def __init__(self, embedding, start_tokens, end_token, tau,
stop_gradient=False, use_finish=True):
if isinstance(embedding, EmbedderBase):
embedding = embedding.embedding
if callable(embedding):
raise ValueError("`embedding` must be an embedding tensor or an "
"instance of subclass of `EmbedderBase`.")
else:
self._embedding = embedding
self._embedding_fn = (
lambda ids: tf.nn.embedding_lookup(embedding, ids))
self._start_tokens = tf.convert_to_tensor(
start_tokens, dtype=tf.int32, name="start_tokens")
self._end_token = tf.convert_to_tensor(
end_token, dtype=tf.int32, name="end_token")
self._start_inputs = self._embedding_fn(self._start_tokens)
self._batch_size = tf.size(self._start_tokens)
self._tau = tau
self._stop_gradient = stop_gradient
self._use_finish = use_finish
@property
def batch_size(self):
return self._batch_size
@property
def sample_ids_dtype(self):
return tf.float32
@property
def sample_ids_shape(self):
return self._embedding.get_shape()[:1]
def initialize(self, name=None):
finished = tf.tile([False], [self._batch_size])
return (finished, self._start_inputs)
def sample(self, time, outputs, state, name=None):
"""Returns `sample_id` which is softmax distributions over vocabulary
with temperature `tau`. Shape = `[batch_size, vocab_size]`
"""
sample_ids = tf.nn.softmax(outputs / self._tau)
return sample_ids
def next_inputs(self, time, outputs, state, sample_ids, name=None):
if self._use_finish:
hard_ids = tf.argmax(sample_ids, axis=-1, output_type=tf.int32)
finished = tf.equal(hard_ids, self._end_token)
else:
finished = tf.tile([False], [self._batch_size])
if self._stop_gradient:
sample_ids = tf.stop_gradient(sample_ids)
next_inputs = tf.matmul(sample_ids, self._embedding)
return (finished, next_inputs, state)
class GumbelSoftmaxEmbeddingHelper(SoftmaxEmbeddingHelper):
"""A helper that feeds gumbel softmax sample to the next step.
Uses the gumbel softmax vector to pass through word embeddings to
get the next input (i.e., a mixed word embedding).
A subclass of
:tf_main:`Helper `.
Used as a helper to :class:`~texar.modules.RNNDecoderBase` :meth:`_build`
in inference mode.
Same as :class:`~texar.modules.SoftmaxEmbeddingHelper` except that here
gumbel softmax (instead of softmax) is used.
Args:
embedding: An embedding argument (:attr:`params`) for
:tf_main:`tf.nn.embedding_lookup `, or an
instance of subclass of :class:`texar.modules.EmbedderBase`.
Note that other callables are not acceptable here.
start_tokens: An int tensor shaped `[batch_size]`. The
start tokens.
end_token: An int scalar tensor. The token that marks end of
decoding.
tau: A float scalar tensor, the softmax temperature.
straight_through (bool): Whether to use straight through gradient
between time steps. If `True`, a single token with highest
probability (i.e., greedy sample) is fed to the next step and
gradient is computed using straight through. If `False` (default),
the soft gumbel-softmax distribution is fed to the next step.
stop_gradient (bool): Whether to stop the gradient backpropagation
when feeding softmax vector to the next step.
use_finish (bool): Whether to stop decoding once `end_token` is
generated. If `False`, decoding will continue until
`max_decoding_length` of the decoder is reached.
"""
def __init__(self, embedding, start_tokens, end_token, tau,
straight_through=False, stop_gradient=False, use_finish=True):
super(GumbelSoftmaxEmbeddingHelper, self).__init__(
embedding, start_tokens, end_token, tau, stop_gradient, use_finish)
self._straight_through = straight_through
def sample(self, time, outputs, state, name=None):
"""Returns `sample_id` of shape `[batch_size, vocab_size]`. If
`straight_through` is False, this is gumbel softmax distributions over
vocabulary with temperature `tau`. If `straight_through` is True,
this is one-hot vectors of the greedy samples.
"""
sample_ids = GumbelSoftmax(self._tau, logits=outputs).sample()
if self._straight_through:
size = tf.shape(sample_ids)[-1]
sample_ids_hard = tf.cast(
tf.one_hot(tf.argmax(sample_ids, -1), size), sample_ids.dtype)
sample_ids = tf.stop_gradient(sample_ids_hard - sample_ids) \
+ sample_ids
return sample_ids
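The soft-embedding and straight-through mechanics implemented by the two helpers above can be sketched outside TensorFlow with plain NumPy. This is an illustrative sketch, not the library's implementation: the embedding matrix, logits, and shapes are hypothetical, and autodiff behavior (gradients flowing through the soft sample only) is described in comments rather than exercised.

```python
import numpy as np

def softmax(x, tau=1.0):
    # Temperature-scaled softmax over the last axis.
    z = (x - x.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gumbel_softmax_sample(logits, tau, rng):
    # Add Gumbel(0, 1) noise to the logits, then apply a tempered softmax.
    u = rng.uniform(low=np.finfo(float).tiny, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    return softmax(logits + g, tau)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5))      # [batch_size, vocab_size] (hypothetical)
embedding = rng.normal(size=(5, 8))   # [vocab_size, emb_dim] (hypothetical)

soft = gumbel_softmax_sample(logits, tau=0.5, rng=rng)

# Straight-through: the forward value is the one-hot greedy sample, but in a
# framework with autodiff, gradients would flow through `soft` only (the
# `hard - soft` term is wrapped in stop_gradient in the TF code above).
hard = np.eye(soft.shape[-1])[soft.argmax(-1)]
st = (hard - soft) + soft             # equals `hard` in the forward pass

# The next-step input is a mixed word embedding: sample_ids @ embedding,
# mirroring `tf.matmul(sample_ids, self._embedding)` in next_inputs().
next_inputs = st @ embedding          # [batch_size, emb_dim]
```

With `straight_through=False`, `st` would simply be `soft`, so the next input is a convex mixture of all word embeddings rather than a single row.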
================================================
FILE: texar_repo/texar/modules/decoders/rnn_decoders.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various RNN decoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=no-name-in-module, too-many-arguments, too-many-locals
# pylint: disable=not-context-manager, protected-access, invalid-name
import collections
import copy
import tensorflow as tf
from tensorflow.contrib.seq2seq import AttentionWrapper
from tensorflow.python.util import nest
from tensorflow.contrib.seq2seq import tile_batch
from texar.modules.decoders.rnn_decoder_base import RNNDecoderBase
from texar.utils import utils
__all__ = [
"BasicRNNDecoderOutput",
"AttentionRNNDecoderOutput",
"BasicRNNDecoder",
"AttentionRNNDecoder"
]
class BasicRNNDecoderOutput(
collections.namedtuple("BasicRNNDecoderOutput",
("logits", "sample_id", "cell_output"))):
"""The outputs of basic RNN decoder that include both RNN outputs and
sampled ids at each step. This is also used to store results of all the
steps after decoding the whole sequence.
Attributes:
logits: The outputs of RNN (at each step/of all steps) by applying the
output layer on cell outputs. E.g., in
:class:`~texar.modules.BasicRNNDecoder` with default
hyperparameters, this is a Tensor of
shape `[batch_size, max_time, vocab_size]` after decoding the
whole sequence.
sample_id: The sampled results (at each step/of all steps). E.g., in
BasicRNNDecoder with decoding strategy of train_greedy,
this is a Tensor
of shape `[batch_size, max_time]` containing the sampled token
indexes of all steps.
cell_output: The output of RNN cell (at each step/of all steps).
This is the results prior to the output layer. E.g., in
BasicRNNDecoder with default
hyperparameters, this is a Tensor of
shape `[batch_size, max_time, cell_output_size]` after decoding
the whole sequence.
"""
pass
class AttentionRNNDecoderOutput(
collections.namedtuple(
"AttentionRNNDecoderOutput",
["logits", "sample_id", "cell_output",
"attention_scores", "attention_context"])):
"""The outputs of attention RNN decoders that additionally include
attention results.
Attributes:
logits: The outputs of RNN (at each step/of all steps) by applying the
output layer on cell outputs. E.g., in
:class:`~texar.modules.AttentionRNNDecoder`, this is a Tensor of
shape `[batch_size, max_time, vocab_size]` after decoding.
sample_id: The sampled results (at each step/of all steps). E.g., in
:class:`~texar.modules.AttentionRNNDecoder` with decoding strategy
of train_greedy, this
is a Tensor of shape `[batch_size, max_time]` containing the
sampled token indexes of all steps.
cell_output: The output of RNN cell (at each step/of all steps).
This is the results prior to the output layer. E.g., in
AttentionRNNDecoder with default
hyperparameters, this is a Tensor of
shape `[batch_size, max_time, cell_output_size]` after decoding
the whole sequence.
attention_scores: A single or tuple of `Tensor`(s) containing the
alignments emitted (at the previous time step/of all time steps)
for each attention mechanism.
attention_context: The attention emitted (at the previous time step/of
all time steps).
"""
pass
class BasicRNNDecoder(RNNDecoderBase):
"""Basic RNN decoder.
Args:
cell (RNNCell, optional): An instance of
:tf_main:`RNNCell `. If `None`
(default), a cell is created as specified in
:attr:`hparams`.
cell_dropout_mode (optional): A Tensor taking value of
:tf_main:`tf.estimator.ModeKeys `, which
toggles dropout in the RNN cell (e.g., activates dropout in
TRAIN mode). If `None`, :func:`~texar.global_mode` is used.
Ignored if :attr:`cell` is given.
vocab_size (int, optional): Vocabulary size. Required if
:attr:`output_layer` is `None`.
output_layer (optional): An instance of
:tf_main:`tf.layers.Layer `, or
:tf_main:`tf.identity `. Apply to the RNN cell
output to get logits. If `None`, a dense layer
is used with output dimension set to :attr:`vocab_size`.
Set `output_layer=tf.identity` if you do not want to have an
output layer after the RNN cell outputs.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
See :meth:`~texar.modules.RNNDecoderBase._build` for the inputs and outputs
of the decoder. The decoder returns
`(outputs, final_state, sequence_lengths)`, where `outputs` is an instance
of :class:`~texar.modules.BasicRNNDecoderOutput`.
Example:
.. code-block:: python
embedder = WordEmbedder(vocab_size=data.vocab.size)
decoder = BasicRNNDecoder(vocab_size=data.vocab.size)
# Training loss
outputs, _, _ = decoder(
decoding_strategy='train_greedy',
inputs=embedder(data_batch['text_ids']),
sequence_length=data_batch['length']-1)
loss = tx.losses.sequence_sparse_softmax_cross_entropy(
labels=data_batch['text_ids'][:, 1:],
logits=outputs.logits,
sequence_length=data_batch['length']-1)
# Inference sample
outputs, _, _ = decoder(
decoding_strategy='infer_sample',
start_tokens=[data.vocab.bos_token_id]*100,
end_token=data.vocab.eos_token_id,
embedding=embedder,
max_decoding_length=60,
mode=tf.estimator.ModeKeys.PREDICT)
sample_id = sess.run(outputs.sample_id)
sample_text = tx.utils.map_ids_to_strs(sample_id, data.vocab)
print(sample_text)
# [
# the first sequence sample .
# the second sequence sample .
# ...
# ]
"""
def __init__(self,
cell=None,
cell_dropout_mode=None,
vocab_size=None,
output_layer=None,
hparams=None):
RNNDecoderBase.__init__(
self, cell, vocab_size, output_layer, cell_dropout_mode, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"rnn_cell": default_rnn_cell_hparams(),
"max_decoding_length_train": None,
"max_decoding_length_infer": None,
"helper_train": {
"type": "TrainingHelper",
"kwargs": {}
},
"helper_infer": {
"type": "SampleEmbeddingHelper",
"kwargs": {}
},
"name": "basic_rnn_decoder"
}
Here:
"rnn_cell" : dict
A dictionary of RNN cell hyperparameters. Ignored if
:attr:`cell` is given to the decoder constructor.
The default value is defined in
:meth:`~texar.core.layers.default_rnn_cell_hparams`.
"max_decoding_length_train": int or None
Maximum allowed number of decoding steps in training mode.
If `None` (default), decoding is
performed until fully done, e.g., encountering the EOS token.
Ignored if `max_decoding_length` is given when calling
the decoder.
"max_decoding_length_infer" : int or None
Same as "max_decoding_length_train" but for inference mode.
"helper_train" : dict
The hyperparameters of the helper used in training.
"type" can be a helper class, its name or module path, or a
helper instance. If a class name is given, the class must be
from module :tf_main:`tf.contrib.seq2seq `,
:mod:`texar.modules`, or :mod:`texar.custom`. This is used
only when both `decoding_strategy` and `helper` arguments are
`None` when calling the decoder. See
:meth:`~texar.modules.RNNDecoderBase._build` for more details.
"helper_infer": dict
Same as "helper_train" but during inference mode.
"name" : str
Name of the decoder.
The default value is "basic_rnn_decoder".
"""
hparams = RNNDecoderBase.default_hparams()
hparams["name"] = "basic_rnn_decoder"
return hparams
def initialize(self, name=None):
return self._helper.initialize() + (self._initial_state,)
def step(self, time, inputs, state, name=None):
cell_outputs, cell_state = self._cell(inputs, state)
logits = self._output_layer(cell_outputs)
sample_ids = self._helper.sample(
time=time, outputs=logits, state=cell_state)
(finished, next_inputs, next_state) = self._helper.next_inputs(
time=time,
outputs=logits,
state=cell_state,
sample_ids=sample_ids)
outputs = BasicRNNDecoderOutput(logits, sample_ids, cell_outputs)
return (outputs, next_state, next_inputs, finished)
def finalize(self, outputs, final_state, sequence_lengths):
return outputs, final_state
@property
def output_size(self):
"""Output size of one step.
"""
return BasicRNNDecoderOutput(
logits=self._rnn_output_size(),
sample_id=self._helper.sample_ids_shape,
cell_output=self._cell.output_size)
@property
def output_dtype(self):
"""Types of output of one step.
"""
# Assume the dtype of the cell is the output_size structure
# containing the input_state's first component's dtype.
# Return that structure and the sample_ids_dtype from the helper.
dtype = nest.flatten(self._initial_state)[0].dtype
return BasicRNNDecoderOutput(
logits=nest.map_structure(lambda _: dtype, self._rnn_output_size()),
sample_id=self._helper.sample_ids_dtype,
cell_output=nest.map_structure(
lambda _: dtype, self._cell.output_size))
class AttentionRNNDecoder(RNNDecoderBase):
"""RNN decoder with attention mechanism.
Args:
memory: The memory to query, e.g., the output of an RNN encoder. This
tensor should be shaped `[batch_size, max_time, dim]`.
memory_sequence_length (optional): A tensor of shape `[batch_size]`
containing the sequence lengths for the batch
entries in memory. If provided, the memory tensor rows are masked
with zeros for values past the respective sequence lengths.
cell (RNNCell, optional): An instance of `RNNCell`. If `None`, a cell
is created as specified in :attr:`hparams`.
cell_dropout_mode (optional): A Tensor taking value of
:tf_main:`tf.estimator.ModeKeys `, which
toggles dropout in the RNN cell (e.g., activates dropout in
TRAIN mode). If `None`, :func:`~texar.global_mode` is used.
Ignored if :attr:`cell` is given.
vocab_size (int, optional): Vocabulary size. Required if
:attr:`output_layer` is `None`.
output_layer (optional): An instance of
:tf_main:`tf.layers.Layer `, or
:tf_main:`tf.identity `. Apply to the RNN cell
output to get logits. If `None`, a dense layer
is used with output dimension set to :attr:`vocab_size`.
Set `output_layer=tf.identity` if you do not want to have an
output layer after the RNN cell outputs.
cell_input_fn (callable, optional): A callable that produces RNN cell
inputs. If `None` (default), the default is used:
`lambda inputs, attention: tf.concat([inputs, attention], -1)`,
which concatenates regular RNN cell inputs with the attention.
hparams (dict, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
See :meth:`~texar.modules.RNNDecoderBase._build` for the inputs and outputs
of the decoder. The decoder returns
`(outputs, final_state, sequence_lengths)`, where `outputs` is an instance
of :class:`~texar.modules.AttentionRNNDecoderOutput`.
Example:
.. code-block:: python
# Encodes the source
enc_embedder = WordEmbedder(data.source_vocab.size, ...)
encoder = UnidirectionalRNNEncoder(...)
enc_outputs, _ = encoder(
inputs=enc_embedder(data_batch['source_text_ids']),
sequence_length=data_batch['source_length'])
# Decodes while attending to the source
dec_embedder = WordEmbedder(vocab_size=data.target_vocab.size, ...)
decoder = AttentionRNNDecoder(
memory=enc_outputs,
memory_sequence_length=data_batch['source_length'],
vocab_size=data.target_vocab.size)
outputs, _, _ = decoder(
decoding_strategy='train_greedy',
inputs=dec_embedder(data_batch['target_text_ids']),
sequence_length=data_batch['target_length']-1)
"""
def __init__(self,
memory,
memory_sequence_length=None,
cell=None,
cell_dropout_mode=None,
vocab_size=None,
output_layer=None,
#attention_layer=None, # TODO(zhiting): only valid for tf>=1.0
cell_input_fn=None,
hparams=None):
RNNDecoderBase.__init__(
self, cell, vocab_size, output_layer, cell_dropout_mode, hparams)
attn_hparams = self._hparams['attention']
attn_kwargs = attn_hparams['kwargs'].todict()
# Parse the 'probability_fn' argument
if 'probability_fn' in attn_kwargs:
prob_fn = attn_kwargs['probability_fn']
if prob_fn is not None and not callable(prob_fn):
prob_fn = utils.get_function(
prob_fn,
['tensorflow.nn', 'tensorflow.contrib.sparsemax',
'tensorflow.contrib.seq2seq'])
attn_kwargs['probability_fn'] = prob_fn
attn_kwargs.update({
"memory_sequence_length": memory_sequence_length,
"memory": memory})
self._attn_kwargs = attn_kwargs
attn_modules = ['tensorflow.contrib.seq2seq', 'texar.custom']
# Use variable_scope to ensure all trainable variables created in
# the attention mechanism are collected
with tf.variable_scope(self.variable_scope):
attention_mechanism = utils.check_or_get_instance(
attn_hparams["type"], attn_kwargs, attn_modules,
classtype=tf.contrib.seq2seq.AttentionMechanism)
self._attn_cell_kwargs = {
"attention_layer_size": attn_hparams["attention_layer_size"],
"alignment_history": attn_hparams["alignment_history"],
"output_attention": attn_hparams["output_attention"],
}
self._cell_input_fn = cell_input_fn
# Use variable_scope to ensure all trainable variables created in
# AttentionWrapper are collected
with tf.variable_scope(self.variable_scope):
#if attention_layer is not None:
# self._attn_cell_kwargs["attention_layer_size"] = None
attn_cell = AttentionWrapper(
self._cell,
attention_mechanism,
cell_input_fn=self._cell_input_fn,
#attention_layer=attention_layer,
**self._attn_cell_kwargs)
self._cell = attn_cell
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values:
Common hyperparameters are the same as in
:meth:`~texar.modules.BasicRNNDecoder.default_hparams`.
Additional hyperparameters are for attention mechanism
Additional hyperparameters are for attention mechanism
configuration.
.. code-block:: python
{
"attention": {
"type": "LuongAttention",
"kwargs": {
"num_units": 256,
},
"attention_layer_size": None,
"alignment_history": False,
"output_attention": True,
},
# The following hyperparameters are the same as with
# `BasicRNNDecoder`
"rnn_cell": default_rnn_cell_hparams(),
"max_decoding_length_train": None,
"max_decoding_length_infer": None,
"helper_train": {
"type": "TrainingHelper",
"kwargs": {}
},
"helper_infer": {
"type": "SampleEmbeddingHelper",
"kwargs": {}
},
"name": "attention_rnn_decoder"
}
Here:
"attention" : dict
Attention hyperparameters, including:
"type" : str or class or instance
The attention type. Can be an attention class, its name or
module path, or a class instance. The class must be a subclass
of :tf_main:`TF AttentionMechanism
`. If class name is
given, the class must be from modules
:tf_main:`tf.contrib.seq2seq ` or
:mod:`texar.custom`.
Example:
.. code-block:: python
# class name
"type": "LuongAttention"
"type": "BahdanauAttention"
# module path
"type": "tf.contrib.seq2seq.BahdanauMonotonicAttention"
"type": "my_module.MyAttentionMechanismClass"
# class
"type": tf.contrib.seq2seq.LuongMonotonicAttention
# instance
"type": LuongAttention(...)
"kwargs" : dict
keyword arguments for the attention class constructor.
Arguments :attr:`memory` and
:attr:`memory_sequence_length` should **not** be
specified here because they are given to the decoder
constructor. Ignored if "type" is an attention class
instance.
Example:
.. code-block:: python
"type": "LuongAttention",
"kwargs": {
"num_units": 256,
"probability_fn": tf.nn.softmax
}
Here "probability_fn" can also be set to the string name
or module path to a probability function.
"attention_layer_size" : int or None
The depth of the attention (output) layer. The context and
cell output are fed into the attention layer to generate
attention at each time step.
If `None` (default), use the context as attention at each
time step.
"alignment_history": bool
whether to store alignment history from all time steps
in the final output state. (Stored as a time major
`TensorArray` on which you must call `stack()`.)
"output_attention": bool
If `True` (default), the output at each time step is
the attention value. This is the behavior of Luong-style
attention mechanisms. If `False`, the output at each
time step is the output of `cell`. This is the
behavior of Bahdanau-style attention mechanisms.
In both cases, the `attention` tensor is propagated to
the next time step via the state and is used there.
This flag only controls whether the attention mechanism
is propagated up to the next cell in an RNN stack or to
the top RNN output.
"""
hparams = RNNDecoderBase.default_hparams()
hparams["name"] = "attention_rnn_decoder"
hparams["attention"] = {
"type": "LuongAttention",
"kwargs": {
"num_units": 256,
},
"attention_layer_size": None,
"alignment_history": False,
"output_attention": True,
}
return hparams
# pylint: disable=arguments-differ
def _get_beam_search_cell(self, beam_width):
"""Returns the RNN cell for beam search decoding.
"""
with tf.variable_scope(self.variable_scope, reuse=True):
attn_kwargs = copy.copy(self._attn_kwargs)
memory = attn_kwargs['memory']
attn_kwargs['memory'] = tile_batch(memory, multiplier=beam_width)
memory_seq_length = attn_kwargs['memory_sequence_length']
if memory_seq_length is not None:
attn_kwargs['memory_sequence_length'] = tile_batch(
memory_seq_length, beam_width)
attn_modules = ['tensorflow.contrib.seq2seq', 'texar.custom']
bs_attention_mechanism = utils.check_or_get_instance(
self._hparams.attention.type, attn_kwargs, attn_modules,
classtype=tf.contrib.seq2seq.AttentionMechanism)
bs_attn_cell = AttentionWrapper(
self._cell._cell,
bs_attention_mechanism,
cell_input_fn=self._cell_input_fn,
**self._attn_cell_kwargs)
self._beam_search_cell = bs_attn_cell
return bs_attn_cell
def initialize(self, name=None):
helper_init = self._helper.initialize()
flat_initial_state = nest.flatten(self._initial_state)
dtype = flat_initial_state[0].dtype
initial_state = self._cell.zero_state(
batch_size=tf.shape(flat_initial_state[0])[0], dtype=dtype)
initial_state = initial_state.clone(cell_state=self._initial_state)
return [helper_init[0], helper_init[1], initial_state]
def step(self, time, inputs, state, name=None):
wrapper_outputs, wrapper_state = self._cell(inputs, state)
# Essentially the same as in BasicRNNDecoder.step()
logits = self._output_layer(wrapper_outputs)
sample_ids = self._helper.sample(
time=time, outputs=logits, state=wrapper_state)
(finished, next_inputs, next_state) = self._helper.next_inputs(
time=time,
outputs=logits,
state=wrapper_state,
sample_ids=sample_ids)
attention_scores = wrapper_state.alignments
attention_context = wrapper_state.attention
outputs = AttentionRNNDecoderOutput(
logits, sample_ids, wrapper_outputs,
attention_scores, attention_context)
return (outputs, next_state, next_inputs, finished)
def finalize(self, outputs, final_state, sequence_lengths):
return outputs, final_state
def _alignments_size(self):
# Reimplementation of the alignments_size of each of
# AttentionWrapper.attention_mechanisms. The original implementation
# of `_BaseAttentionMechanism._alignments_size`:
#
# self._alignments_size = (self._keys.shape[1].value or
# array_ops.shape(self._keys)[1])
#
# can be `None` when the sequence length of encoder outputs is a
# priori unknown.
alignments_size = []
for am in self._cell._attention_mechanisms:
az = (am._keys.shape[1].value or tf.shape(am._keys)[1:-1])
alignments_size.append(az)
return self._cell._item_or_tuple(alignments_size)
@property
def output_size(self):
return AttentionRNNDecoderOutput(
logits=self._rnn_output_size(),
sample_id=self._helper.sample_ids_shape,
cell_output=self._cell.output_size,
attention_scores=self._alignments_size(),
attention_context=self._cell.state_size.attention)
@property
def output_dtype(self):
"""Types of output of one step.
"""
# Assume the dtype of the cell is the output_size structure
# containing the input_state's first component's dtype.
# Return that structure and the sample_ids_dtype from the helper.
dtype = nest.flatten(self._initial_state)[0].dtype
return AttentionRNNDecoderOutput(
logits=nest.map_structure(lambda _: dtype, self._rnn_output_size()),
sample_id=self._helper.sample_ids_dtype,
cell_output=nest.map_structure(
lambda _: dtype, self._cell.output_size),
attention_scores=nest.map_structure(
lambda _: dtype, self._alignments_size()),
attention_context=nest.map_structure(
lambda _: dtype, self._cell.state_size.attention))
def zero_state(self, batch_size, dtype):
"""Returns zero state of the basic cell.
Equivalent to :attr:`decoder.cell._cell.zero_state`.
"""
return self._cell._cell.zero_state(batch_size=batch_size, dtype=dtype)
def wrapper_zero_state(self, batch_size, dtype):
"""Returns zero state of the attention-wrapped cell.
Equivalent to :attr:`decoder.cell.zero_state`.
"""
return self._cell.zero_state(batch_size=batch_size, dtype=dtype)
@property
def state_size(self):
"""The state size of the basic cell.
Equivalent to :attr:`decoder.cell._cell.state_size`.
"""
return self._cell._cell.state_size
@property
def wrapper_state_size(self):
"""The state size of the attention-wrapped cell.
Equivalent to :attr:`decoder.cell.state_size`.
"""
return self._cell.state_size
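The default attention hyperparameters above use `LuongAttention` with `num_units: 256`, and the decoder output tuple exposes the resulting `attention_scores` (alignments) and `attention_context`. As a rough NumPy sketch of Luong-style (multiplicative) attention for one decoding step (not the TF implementation; all shapes and the projection matrix here are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, max_time, dim, num_units = 4, 7, 16, 8

memory = rng.normal(size=(batch, max_time, dim))  # encoder outputs
W_mem = rng.normal(size=(dim, num_units))         # memory (key) projection
keys = memory @ W_mem                             # [batch, max_time, num_units]

# Decoder cell output for the current step, assumed already num_units-dim.
query = rng.normal(size=(batch, num_units))

# Luong scores: one dot product per memory position.
scores = np.einsum('bu,btu->bt', query, keys)     # [batch, max_time]

# Normalized alignments -> `attention_scores` in AttentionRNNDecoderOutput.
alignments = softmax(scores)

# Weighted sum of memory -> `attention_context` in the output tuple.
context = np.einsum('bt,btd->bd', alignments, memory)  # [batch, dim]
```

With `output_attention: True` (the default, Luong-style), the attention value rather than the raw cell output becomes the step output; `memory_sequence_length`, when given, would additionally mask `scores` beyond each sequence's length before the softmax.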
================================================
FILE: texar_repo/texar/modules/decoders/rnn_decoders_test.py
================================================
"""
Unit tests for RNN decoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import tensorflow as tf
from texar.modules.decoders.rnn_decoders import BasicRNNDecoderOutput
from texar.modules.decoders.rnn_decoders import BasicRNNDecoder
from texar.modules.decoders.rnn_decoders import AttentionRNNDecoderOutput
from texar.modules.decoders.rnn_decoders import AttentionRNNDecoder
from texar.modules.decoders.rnn_decoder_helpers import get_helper
from texar import context
# pylint: disable=no-member, too-many-locals, too-many-instance-attributes
# pylint: disable=too-many-arguments, protected-access
class BasicRNNDecoderTest(tf.test.TestCase):
"""Tests :class:`~texar.modules.decoders.rnn_decoders.BasicRNNDecoder`.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._vocab_size = 4
self._max_time = 8
self._batch_size = 16
self._emb_dim = 20
self._inputs = tf.random_uniform(
[self._batch_size, self._max_time, self._emb_dim],
maxval=1., dtype=tf.float32)
self._embedding = tf.random_uniform(
[self._vocab_size, self._emb_dim], maxval=1., dtype=tf.float32)
def _test_outputs(self, decoder, outputs, final_state, sequence_lengths,
test_mode=False):
# 4 trainable variables: cell-kernel, cell-bias,
# fc-layer-weights, fc-layer-bias
self.assertEqual(len(decoder.trainable_variables), 4)
cell_dim = decoder.hparams.rnn_cell.kwargs.num_units
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_, final_state_, sequence_lengths_ = sess.run(
[outputs, final_state, sequence_lengths],
feed_dict={context.global_mode(): tf.estimator.ModeKeys.TRAIN})
self.assertIsInstance(outputs_, BasicRNNDecoderOutput)
if not test_mode:
self.assertEqual(
outputs_.logits.shape,
(self._batch_size, self._max_time, self._vocab_size))
self.assertEqual(
outputs_.sample_id.shape,
(self._batch_size, self._max_time))
np.testing.assert_array_equal(
sequence_lengths_, [self._max_time]*self._batch_size)
self.assertEqual(final_state_[0].shape,
(self._batch_size, cell_dim))
def test_decode_train(self):
"""Tests decoding in training mode.
"""
output_layer = tf.layers.Dense(self._vocab_size)
decoder = BasicRNNDecoder(vocab_size=self._vocab_size,
output_layer=output_layer)
helper_train = get_helper(
decoder.hparams.helper_train.type,
inputs=self._inputs,
sequence_length=[self._max_time]*self._batch_size,
**decoder.hparams.helper_train.kwargs.todict())
outputs, final_state, sequence_lengths = decoder(helper=helper_train)
self._test_outputs(decoder, outputs, final_state, sequence_lengths)
outputs, final_state, sequence_lengths = decoder(
inputs=self._inputs,
sequence_length=[self._max_time]*self._batch_size)
self._test_outputs(decoder, outputs, final_state, sequence_lengths)
outputs, final_state, sequence_lengths = decoder(
decoding_strategy=None,
inputs=self._inputs,
sequence_length=[self._max_time]*self._batch_size)
self._test_outputs(decoder, outputs, final_state, sequence_lengths)
outputs, final_state, sequence_lengths = decoder(
decoding_strategy=None,
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2,
mode=tf.estimator.ModeKeys.EVAL)
self._test_outputs(decoder, outputs, final_state, sequence_lengths,
test_mode=True)
def test_decode_train_with_tf(self):
"""Compares decoding results with TF built-in decoder.
"""
_inputs_placeholder = tf.placeholder(
tf.int32, [self._batch_size, self._max_time], name="inputs")
_embedding_placeholder = tf.placeholder(
tf.float32, [self._vocab_size, self._emb_dim], name="emb")
inputs = tf.nn.embedding_lookup(_embedding_placeholder,
_inputs_placeholder)
output_layer = tf.layers.Dense(self._vocab_size)
decoder = BasicRNNDecoder(vocab_size=self._vocab_size,
output_layer=output_layer)
helper_train = get_helper(
decoder.hparams.helper_train.type,
inputs=inputs,
sequence_length=[self._max_time]*self._batch_size,
**decoder.hparams.helper_train.kwargs.todict())
outputs, final_state, sequence_lengths = decoder(helper=helper_train)
tf_helper = tf.contrib.seq2seq.TrainingHelper(
inputs, [self._max_time]*self._batch_size)
tf_decoder = tf.contrib.seq2seq.BasicDecoder(
decoder.cell,
tf_helper,
decoder.cell.zero_state(self._batch_size, tf.float32),
output_layer=output_layer)
tf_outputs, tf_final_state, tf_sequence_lengths = \
tf.contrib.seq2seq.dynamic_decode(tf_decoder)
cell_dim = decoder.hparams.rnn_cell.kwargs.num_units
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
inputs_ = np.random.randint(
self._vocab_size, size=(self._batch_size, self._max_time),
dtype=np.int32)
embedding_ = np.random.randn(self._vocab_size, self._emb_dim)
outputs_, final_state_, sequence_lengths_ = sess.run(
[outputs, final_state, sequence_lengths],
feed_dict={context.global_mode(): tf.estimator.ModeKeys.TRAIN,
_inputs_placeholder: inputs_,
_embedding_placeholder: embedding_})
self.assertEqual(final_state_[0].shape,
(self._batch_size, cell_dim))
tf_outputs_, tf_final_state_, tf_sequence_lengths_ = sess.run(
[tf_outputs, tf_final_state, tf_sequence_lengths],
feed_dict={context.global_mode(): tf.estimator.ModeKeys.TRAIN,
_inputs_placeholder: inputs_,
_embedding_placeholder: embedding_})
np.testing.assert_array_equal(outputs_.logits,
tf_outputs_.rnn_output)
np.testing.assert_array_equal(outputs_.sample_id,
tf_outputs_.sample_id)
np.testing.assert_array_equal(final_state_.c, tf_final_state_.c)
np.testing.assert_array_equal(final_state_.h, tf_final_state_.h)
np.testing.assert_array_equal(sequence_lengths_,
tf_sequence_lengths_)
def test_decode_infer(self):
"""Tests decoding in inference mode.
"""
output_layer = tf.layers.Dense(self._vocab_size)
decoder = BasicRNNDecoder(vocab_size=self._vocab_size,
output_layer=output_layer)
helper_infer = get_helper(
decoder.hparams.helper_infer.type,
embedding=self._embedding,
start_tokens=[self._vocab_size-2]*self._batch_size,
end_token=self._vocab_size-1,
**decoder.hparams.helper_train.kwargs.todict())
outputs, final_state, sequence_lengths = decoder(helper=helper_infer)
# 4 trainable variables: embedding, cell-kernel, cell-bias,
# fc-layer-weights, fc-layer-bias
self.assertEqual(len(decoder.trainable_variables), 4)
cell_dim = decoder.hparams.rnn_cell.kwargs.num_units
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_, final_state_, sequence_lengths_ = sess.run(
[outputs, final_state, sequence_lengths],
feed_dict={context.global_mode():
tf.estimator.ModeKeys.PREDICT})
self.assertIsInstance(outputs_, BasicRNNDecoderOutput)
max_length = max(sequence_lengths_)
self.assertEqual(
outputs_.logits.shape,
(self._batch_size, max_length, self._vocab_size))
self.assertEqual(
outputs_.sample_id.shape, (self._batch_size, max_length))
self.assertEqual(final_state_[0].shape,
(self._batch_size, cell_dim))
class AttentionRNNDecoderTest(tf.test.TestCase):
"""Tests :class:`~texar.modules.decoders.rnn_decoders.AttentionRNNDecoder`.
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._vocab_size = 10
self._max_time = 16
self._batch_size = 8
self._emb_dim = 20
self._attention_dim = 256
self._inputs = tf.random_uniform(
[self._batch_size, self._max_time, self._emb_dim],
maxval=1., dtype=tf.float32)
self._embedding = tf.random_uniform(
[self._vocab_size, self._emb_dim], maxval=1., dtype=tf.float32)
self._encoder_output = tf.random_uniform(
[self._batch_size, self._max_time, 64])
def test_decode_train(self):
"""Tests decoding in training mode.
"""
seq_length = np.random.randint(
self._max_time, size=[self._batch_size]) + 1
encoder_values_length = tf.constant(seq_length)
hparams = {
"attention": {
"kwargs": {
"num_units": self._attention_dim,
# Note: to use sparsemax in TF-CPU, it looks like
# `memory_sequence_length` must equal max_time.
#"probability_fn": "sparsemax"
}
}
}
decoder = AttentionRNNDecoder(
memory=self._encoder_output,
memory_sequence_length=encoder_values_length,
vocab_size=self._vocab_size,
hparams=hparams)
helper_train = get_helper(
decoder.hparams.helper_train.type,
inputs=self._inputs,
sequence_length=[self._max_time]*self._batch_size,
**decoder.hparams.helper_train.kwargs.todict())
outputs, final_state, sequence_lengths = decoder(helper=helper_train)
# 4+1 trainable variables: cell-kernel, cell-bias,
# fc-weight, fc-bias, and
# memory_layer: For LuongAttention, we only transform the memory layer;
# thus num_units *must* match the expected query depth.
self.assertEqual(len(decoder.trainable_variables), 5)
cell_dim = decoder.hparams.rnn_cell.kwargs.num_units
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_, final_state_, sequence_lengths_ = sess.run(
[outputs, final_state, sequence_lengths],
feed_dict={context.global_mode(): tf.estimator.ModeKeys.TRAIN})
self.assertIsInstance(outputs_, AttentionRNNDecoderOutput)
self.assertEqual(
outputs_.logits.shape,
(self._batch_size, self._max_time, self._vocab_size))
self.assertEqual(
outputs_.sample_id.shape, (self._batch_size, self._max_time))
self.assertEqual(final_state_.cell_state[0].shape,
(self._batch_size, cell_dim))
np.testing.assert_array_equal(
sequence_lengths_, [self._max_time]*self._batch_size)
def test_decode_infer(self):
"""Tests decoding in inference mode.
"""
seq_length = np.random.randint(
self._max_time, size=[self._batch_size]) + 1
encoder_values_length = tf.constant(seq_length)
hparams = {
"attention": {
"kwargs": {
"num_units": 256,
}
}
}
decoder = AttentionRNNDecoder(
vocab_size=self._vocab_size,
memory=self._encoder_output,
memory_sequence_length=encoder_values_length,
hparams=hparams)
helper_infer = get_helper(
decoder.hparams.helper_infer.type,
embedding=self._embedding,
start_tokens=[1]*self._batch_size,
end_token=2,
**decoder.hparams.helper_train.kwargs.todict())
outputs, final_state, sequence_lengths = decoder(helper=helper_infer)
# 4+1 trainable variables: cell-kernel, cell-bias,
# fc-weight, fc-bias, and
# memory_layer: For LuongAttention, we only transform the memory layer;
# thus num_units *must* match the expected query depth.
self.assertEqual(len(decoder.trainable_variables), 5)
cell_dim = decoder.hparams.rnn_cell.kwargs.num_units
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_, final_state_, sequence_lengths_ = sess.run(
[outputs, final_state, sequence_lengths],
feed_dict={context.global_mode():
tf.estimator.ModeKeys.PREDICT})
self.assertIsInstance(outputs_, AttentionRNNDecoderOutput)
max_length = max(sequence_lengths_)
self.assertEqual(
outputs_.logits.shape,
(self._batch_size, max_length, self._vocab_size))
self.assertEqual(
outputs_.sample_id.shape, (self._batch_size, max_length))
self.assertEqual(final_state_.cell_state[0].shape,
(self._batch_size, cell_dim))
def test_beam_search_cell(self):
"""Tests :meth:`texar.modules.AttentionRNNDecoder._get_beam_search_cell`
"""
seq_length = np.random.randint(
self._max_time, size=[self._batch_size]) + 1
encoder_values_length = tf.constant(seq_length)
hparams = {
"attention": {
"kwargs": {
"num_units": self._attention_dim,
"probability_fn": "sparsemax"
}
}
}
decoder = AttentionRNNDecoder(
memory=self._encoder_output,
memory_sequence_length=encoder_values_length,
vocab_size=self._vocab_size,
hparams=hparams)
helper_train = get_helper(
decoder.hparams.helper_train.type,
inputs=self._inputs,
sequence_length=[self._max_time]*self._batch_size,
**decoder.hparams.helper_train.kwargs.todict())
_, _, _ = decoder(helper=helper_train)
## 4+1 trainable variables: cell-kernel, cell-bias,
## fc-weight, fc-bias, and
## memory_layer: For LuongAttention, we only transform the memory layer;
## thus num_units *must* match the expected query depth.
self.assertEqual(len(decoder.trainable_variables), 5)
beam_width = 3
beam_cell = decoder._get_beam_search_cell(beam_width)
cell_input = tf.random_uniform([self._batch_size * beam_width,
self._emb_dim])
cell_state = beam_cell.zero_state(self._batch_size * beam_width,
tf.float32)
_ = beam_cell(cell_input, cell_state)
# Test if beam_cell is sharing variables with decoder cell.
self.assertEqual(len(beam_cell.trainable_variables), 0)
if __name__ == "__main__":
tf.test.main()
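The helper-based tests above contrast training-mode decoding (a `TrainingHelper` feeds the ground-truth input at every step) with inference-mode decoding (a `GreedyEmbeddingHelper` feeds back the argmax of the previous step until `end_token` is produced). The following is a framework-free NumPy sketch of that inference loop, with a hypothetical `toy_step` standing in for the RNN cell plus output layer; it is illustrative only, not Texar code.

```python
import numpy as np

def toy_step(token_id, vocab_size=5):
    """Hypothetical decoder step: logits favor (token_id + 1) % vocab_size."""
    logits = np.zeros(vocab_size)
    logits[(token_id + 1) % vocab_size] = 1.0
    return logits

def greedy_decode(start_token, end_token, max_len):
    """Greedy inference loop: feed back the argmax of the previous step,
    stopping once end_token is emitted (what GreedyEmbeddingHelper does)."""
    ids, cur = [], start_token
    for _ in range(max_len):
        cur = int(np.argmax(toy_step(cur)))
        ids.append(cur)
        if cur == end_token:
            break
    return ids

print(greedy_decode(start_token=0, end_token=3, max_len=10))  # [1, 2, 3]
```

In training mode the loop body would instead read the next input from the ground-truth sequence, which is why teacher-forced decoding always runs exactly `sequence_length` steps.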
================================================
FILE: texar_repo/texar/modules/decoders/transformer_decoders.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Transformer decoder.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=no-name-in-module, too-many-arguments, too-many-locals
# pylint: disable=invalid-name
import collections
import tensorflow as tf
from tensorflow.python.util import nest
from texar.core import layers
from texar.module_base import ModuleBase
from texar.modules.networks.networks import FeedForwardNetwork
from texar.modules.embedders.position_embedders import SinusoidsPositionEmbedder
from texar.modules.encoders.transformer_encoders import \
default_transformer_poswise_net_hparams
from texar.modules.encoders.multihead_attention import \
MultiheadAttentionEncoder
from texar.utils import beam_search
from texar.utils.shapes import shape_list, mask_sequences
from texar.utils import transformer_attentions as attn
from texar.utils.mode import is_train_mode
__all__ = [
"TransformerDecoderOutput",
"TransformerDecoder"
]
class TransformerDecoderOutput(
collections.namedtuple("TransformerDecoderOutput",
("logits", "sample_id"))):
"""The output of :class:`TransformerDecoder`.
Attributes:
logits: A float Tensor of shape
`[batch_size, max_time, vocab_size]` containing the logits.
sample_id: An int Tensor of shape `[batch_size, max_time]`
containing the sampled token indexes.
"""
class TransformerDecoder(ModuleBase):
"""Transformer decoder that applies multi-head attention for
sequence decoding.
Stacks `~texar.modules.encoders.MultiheadAttentionEncoder` modules for
self-attention and encoder-decoder attention, together with
`~texar.modules.FeedForwardNetwork` layers and residual connections.
The passed `embedding` variable is used as the parameters of the
transform layer from decoder outputs to logits.
Args:
embedding: A Tensor of shape `[vocab_size, dim]` containing the
word embedding. The Tensor is used as the decoder output layer.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
.. document private functions
.. automethod:: _build
"""
def __init__(self, embedding, hparams=None):
ModuleBase.__init__(self, hparams)
with tf.variable_scope(self.variable_scope):
if self._hparams.initializer:
tf.get_variable_scope().set_initializer(
layers.get_initializer(self._hparams.initializer))
self.position_embedder = \
SinusoidsPositionEmbedder(
self._hparams.position_embedder_hparams)
self._embedding = embedding
self._vocab_size = self._embedding.get_shape().as_list()[0]
self.output_layer = \
self._build_output_layer(shape_list(self._embedding)[-1])
self.multihead_attentions = {
'self_att': [],
'encdec_att': []
}
self.poswise_networks = []
for i in range(self._hparams.num_blocks):
layer_name = 'layer_{}'.format(i)
with tf.variable_scope(layer_name):
with tf.variable_scope("self_attention"):
multihead_attention = MultiheadAttentionEncoder(
self._hparams.multihead_attention)
self.multihead_attentions['self_att'].append(
multihead_attention)
# pylint: disable=protected-access
if self._hparams.dim != \
multihead_attention._hparams.output_dim:
raise ValueError('The output dimension of '
'MultiheadAttentionEncoder should be equal '
'to the dim of TransformerDecoder')
with tf.variable_scope('encdec_attention'):
multihead_attention = MultiheadAttentionEncoder(
self._hparams.multihead_attention)
self.multihead_attentions['encdec_att'].append(
multihead_attention)
if self._hparams.dim != \
multihead_attention._hparams.output_dim:
raise ValueError('The output dimension of '
'MultiheadAttentionEncoder should be equal '
'to the dim of TransformerDecoder')
poswise_network = FeedForwardNetwork(
hparams=self._hparams['poswise_feedforward'])
if self._hparams.dim != \
poswise_network._hparams.layers[-1]['kwargs']['units']:
raise ValueError('The output dimension of '
'FeedForwardNetwork should be equal '
'to the dim of TransformerDecoder')
self.poswise_networks.append(poswise_network)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
# Same as in TransformerEncoder
"num_blocks": 6,
"dim": 512,
"position_embedder_hparams": None,
"embedding_dropout": 0.1,
"residual_dropout": 0.1,
"poswise_feedforward": default_transformer_poswise_net_hparams,
"multihead_attention": {
"num_units": 512,
"num_heads": 8,
},
"initializer": None,
# Additional for TransformerDecoder
"embedding_tie": True,
"output_layer_bias": False,
"max_decoding_length": 1e10,
"name": "transformer_decoder"
}
Here:
"num_blocks" : int
Number of stacked blocks.
"dim" : int
Hidden dimension of the decoder.
"position_embedder_hparams" : dict, optional
Hyperparameters of a
:class:`~texar.modules.SinusoidsPositionEmbedder` as position
embedder. If `None`, the
:meth:`~texar.modules.SinusoidsPositionEmbedder.default_hparams`
is used.
"embedding_dropout": float
Dropout rate of the input word and position embeddings.
"residual_dropout" : float
Dropout rate of the residual connections.
"poswise_feedforward" : dict,
Hyperparameters for a feed-forward network used in residual
connections.
Make sure the dimension of the output tensor is equal to `dim`.
See :func:`~texar.modules.default_transformer_poswise_net_hparams`
for details.
"multihead_attention": dict,
Hyperparameters for the multihead attention strategy.
Make sure the `output_dim` in this module is equal to `dim`.
See :func:`~texar.modules.encoders.MultiheadAttentionEncoder.default_hparams`
for details.
"initializer" : dict, optional
Hyperparameters of the default initializer that initializes
variables created in this module.
See :func:`~texar.core.get_initializer` for details.
"embedding_tie" : bool
Whether to use the word embedding matrix as the output layer
that computes logits. If `False`, an additional dense layer
is created.
"output_layer_bias" : bool
Whether to use bias to the output layer.
"max_decoding_length" : int
The maximum allowed number of decoding steps.
Set to a very large number to avoid the length constraint.
Ignored if :attr:`max_decoding_length` is given in :meth:`_build`
or "train_greedy" decoding is used.
"name" : str
Name of the module.
"""
return {
"num_blocks": 6,
"initializer": None,
"position_embedder_hparams": None,
"embedding_tie": True,
"output_layer_bias": False,
"max_decoding_length": 1e10,
"embedding_dropout": 0.1,
"residual_dropout": 0.1,
"poswise_feedforward": default_transformer_poswise_net_hparams(),
'multihead_attention': {
'num_units': 512,
'dropout_rate': 0.1,
'output_dim': 512,
'num_heads': 8,
},
"dim": 512,
"name": "transformer_decoder",
}
def _prepare_tokens_to_embeds(self, tokens):
""" a callable function to transform tokens into embeddings."""
token_emb = tf.nn.embedding_lookup(self._embedding, tokens)
return token_emb
def _symbols_to_logits_fn(self, embedding_fn, max_length):
"""Returns a function that accepts the decoded tokens and related
decoding status, and returns the logits of next token.
"""
positions = tf.expand_dims(tf.range(max_length, dtype=tf.int32), 0)
timing_signal = self.position_embedder(positions)
# Uncomment the lines below to prevent the model from decoding a
# given token (here, token id 3):
#biases = np.ones([1, self._vocab_size])
#biases[0][3] = -np.inf
def _impl(ids, step, cache):
"""The function is called in dynamic decoding.
`ids` should be next_id of shape `[batch_size, decoded_length]`.
The returned logits are of shape `[batch_size, vocab_size]`.
"""
ids = ids[:, -1:]
inputs = embedding_fn(ids)
# Multiply embedding by sqrt of its dimension
inputs *= self._embedding.shape.as_list()[-1]**0.5
inputs += timing_signal[:, step:step+1]
outputs = self._self_attention_stack(
inputs,
memory=cache['memory'],
cache=cache,
)
logits = self.output_layer(outputs)
logits = tf.squeeze(logits, axis=[1])
#logits = tf.multiply(logits, biases)
return logits, cache
return _impl
def _build(self, # pylint: disable=arguments-differ
memory,
memory_sequence_length=None,
memory_attention_bias=None,
inputs=None,
sequence_length=None,
decoding_strategy='train_greedy',
beam_width=1,
alpha=0,
start_tokens=None,
end_token=None,
max_decoding_length=None,
mode=None):
"""Performs decoding.
The decoder supports 4 decoding strategies. For the first 3 strategies,
set :attr:`decoding_strategy` to the respective string.
- **"train_greedy"**: decoding in teacher-forcing fashion \
(i.e., feeding \
ground truth to decode the next step), and for each step sample \
is obtained by taking the `argmax` of logits. \
Argument :attr:`inputs` is required for this strategy. \
:attr:`sequence_length` is optional.
- **"infer_greedy"**: decoding in inference fashion (i.e., feeding \
`generated` sample to decode the next step), and for each
step sample is obtained by taking the `argmax` of logits.\
Arguments :attr:`(start_tokens, end_token)` are \
required for this strategy, and argument \
:attr:`max_decoding_length` is optional.
- **"infer_sample"**: decoding in inference fashion, and for each step\
sample is obtained by `random sampling` from the logits.
Arguments :attr:`(start_tokens, end_token)` are \
required for this strategy, and argument \
:attr:`max_decoding_length` is optional.
- **Beam Search**: set :attr:`beam_width` to > 1 to use beam search \
decoding.\
Arguments :attr:`(start_tokens, end_token)` are \
required, and argument \
:attr:`max_decoding_length` is optional.
Args:
memory: The memory to attend, e.g., the output of an RNN encoder.
A Tensor of shape `[batch_size, memory_max_time, dim]`.
memory_sequence_length (optional): A Tensor of shape `[batch_size]`
containing the sequence lengths for the batch entries in
memory. Used to create an attention bias if
:attr:`memory_attention_bias` is not given. Ignored if
`memory_attention_bias` is provided.
memory_attention_bias (optional): A Tensor of shape
`[batch_size, num_heads, memory_max_time, dim]`.
An attention bias typically sets the value of a padding
position to a large negative value for masking. If not given,
:attr:`memory_sequence_length` is used to automatically
create an attention bias.
inputs (optional): Input tensor for teacher forcing decoding, of
shape `[batch_size, target_max_time, emb_dim]` containing the
target sequence word embeddings.
Used when :attr:`decoding_strategy` is set to "train_greedy".
sequence_length (optional): A Tensor of shape `[batch_size]`,
containing the sequence length of :attr:`inputs`.
Tokens beyond the respective sequence length are masked out.
Used when :attr:`decoding_strategy` is set to
"train_greedy".
decoding_strategy (str): A string specifying the decoding
strategy, including "train_greedy", "infer_greedy",
"infer_sample".
Different arguments are required based on the
strategy. See above for details. Ignored if
:attr:`beam_width` > 1.
beam_width (int): Set to > 1 to use beam search.
alpha (float): Length penalty coefficient.
Refer to https://arxiv.org/abs/1609.08144
for more details.
start_tokens (optional): An int Tensor of shape `[batch_size]`,
containing the start tokens.
Used when `decoding_strategy` = "infer_greedy" or
"infer_sample", or `beam_width` > 1.
end_token (optional): An int 0D Tensor, the token that marks end
of decoding.
Used when `decoding_strategy` = "infer_greedy" or
"infer_sample", or `beam_width` > 1.
max_decoding_length (optional): An int scalar Tensor indicating
the maximum allowed number of decoding steps.
If `None` (default), use "max_decoding_length" defined in
:attr:`hparams`. Ignored in "train_greedy" decoding.
mode (optional): A tensor taking value in
:tf_main:`tf.estimator.ModeKeys`, including
`TRAIN`, `EVAL`, and `PREDICT`. Controls dropout mode.
If `None` (default), :func:`texar.global_mode`
is used.
Returns:
- For **"train_greedy"** decoding, returns an instance of \
:class:`~texar.modules.TransformerDecoderOutput` which contains\
`sample_id` and `logits`.
- For **"infer_greedy"** and **"infer_sample"** decoding, returns\
a tuple `(outputs, sequence_lengths)`, where `outputs` is an \
instance of :class:`~texar.modules.TransformerDecoderOutput` as\
in "train_greedy", and `sequence_lengths` is a Tensor of shape\
`[batch_size]` containing the length of each sample.
- For **beam_search** decoding, returns a `dict` containing keys\
"sample_id" and "log_prob".
- **"sample_id"** is an int Tensor of shape \
`[batch_size, max_time, beam_width]` containing generated\
token indexes. `sample_id[:,:,0]` is the highest-probable \
sample.
- **"log_porb"** is a float Tensor of shape \
`[batch_size, beam_width]` containing the log probability \
of each sequence sample.
"""
if memory_attention_bias is None:
if memory_sequence_length is None:
raise ValueError(
"`memory_sequence_length` is required if "
"`memory_attention_bias` is not given.")
#enc_padding = 1 - mask_sequences(tf.ones_like(memory),
# memory_sequence_length,
# tensor_rank=3)[:, :, 0]
enc_padding = 1 - tf.sequence_mask(
memory_sequence_length, tf.shape(memory)[1], dtype=tf.float32)
memory_attention_bias = attn.attention_bias_ignore_padding(
enc_padding)
if beam_width <= 1 and decoding_strategy == 'train_greedy':
if sequence_length is not None:
inputs = mask_sequences(inputs, sequence_length, tensor_rank=3)
decoder_self_attention_bias = (
attn.attention_bias_lower_triangle(
shape_list(inputs)[1]))
target_inputs = inputs * self._hparams.dim**0.5
_, lengths, _ = shape_list(target_inputs)
positions = tf.expand_dims(tf.range(lengths, dtype=tf.int32), 0)
pos_embeds = self.position_embedder(positions)
inputs = target_inputs + pos_embeds
decoder_output = self._self_attention_stack(
inputs,
memory,
decoder_self_attention_bias=decoder_self_attention_bias,
memory_attention_bias=memory_attention_bias,
cache=None,
mode=mode)
logits = self.output_layer(decoder_output)
preds = tf.to_int32(tf.argmax(logits, axis=-1))
output = TransformerDecoderOutput(
logits=logits,
sample_id=preds
)
rets = output
else: # Inference decoding
if max_decoding_length is None:
max_decoding_length = self._hparams.max_decoding_length
if beam_width <= 1:
logits, preds, sequence_length = self._infer_decoding(
self._prepare_tokens_to_embeds,
start_tokens,
end_token,
decode_length=max_decoding_length,
memory=memory,
memory_attention_bias=memory_attention_bias,
decoding_strategy=decoding_strategy,
)
output = TransformerDecoderOutput(
logits=logits,
sample_id=preds)
rets = output, sequence_length
else:
# The output format is different when running beam search
sample_id, log_prob = self._beam_decode(
self._prepare_tokens_to_embeds,
start_tokens,
end_token,
beam_width=beam_width,
alpha=alpha,
decode_length=max_decoding_length,
memory=memory,
memory_attention_bias=memory_attention_bias,
)
predictions = {
'sample_id': sample_id,
'log_prob': log_prob
}
rets = predictions
if not self._built:
self._add_internal_trainable_variables()
self._built = True
return rets
def _self_attention_stack(self,
inputs,
memory,
decoder_self_attention_bias=None,
memory_attention_bias=None,
cache=None,
mode=None):
"""Stacked multihead attention module.
"""
inputs = tf.layers.dropout(inputs,
rate=self._hparams.embedding_dropout,
training=is_train_mode(mode))
if cache is not None:
memory_attention_bias = \
cache['memory_attention_bias']
else:
assert decoder_self_attention_bias is not None
x = inputs
for i in range(self._hparams.num_blocks):
layer_name = 'layer_{}'.format(i)
layer_cache = cache[layer_name] if cache is not None else None
with tf.variable_scope(layer_name):
with tf.variable_scope("self_attention"):
multihead_attention = \
self.multihead_attentions['self_att'][i]
selfatt_output = multihead_attention(
queries=layers.layer_normalize(x),
memory=None,
memory_attention_bias=decoder_self_attention_bias,
cache=layer_cache,
mode=mode,
)
x = x + tf.layers.dropout(
selfatt_output,
rate=self._hparams.residual_dropout,
training=is_train_mode(mode),
)
if memory is not None:
with tf.variable_scope('encdec_attention'):
multihead_attention = \
self.multihead_attentions['encdec_att'][i]
encdec_output = multihead_attention(
queries=layers.layer_normalize(x),
memory=memory,
memory_attention_bias=memory_attention_bias,
mode=mode,
)
x = x + tf.layers.dropout(
encdec_output,
rate=self._hparams.residual_dropout,
training=is_train_mode(mode))
poswise_network = self.poswise_networks[i]
with tf.variable_scope('past_poswise_ln'):
sub_output = tf.layers.dropout(
poswise_network(layers.layer_normalize(x)),
rate=self._hparams.residual_dropout,
training=is_train_mode(mode),
)
x = x + sub_output
return layers.layer_normalize(x)
def _build_output_layer(self, dim):
if self._hparams.embedding_tie:
if self._hparams.output_layer_bias:
with tf.variable_scope(self.variable_scope):
affine_bias = tf.get_variable(
'affine_bias', [self._vocab_size])
else:
affine_bias = None
def _outputs_to_logits(outputs):
shape = shape_list(outputs)
outputs = tf.reshape(outputs, [-1, dim])
logits = tf.matmul(outputs, self._embedding, transpose_b=True)
if affine_bias is not None:
logits += affine_bias
logits = tf.reshape(logits, shape[:-1] + [self._vocab_size])
return logits
return _outputs_to_logits
else:
layer = tf.layers.Dense(
self._vocab_size,
use_bias=self._hparams.output_layer_bias)
layer.build([None, dim])
return layer
def _init_cache(self, memory, memory_attention_bias):
cache = {
'memory': memory,
'memory_attention_bias': memory_attention_bias,
}
batch_size = tf.shape(memory)[0]
depth = self._hparams.multihead_attention.num_units
for l in range(self._hparams.num_blocks):
cache['layer_{}'.format(l)] = {
'self_keys': tf.zeros([batch_size, 0, depth]),
'self_values': tf.zeros([batch_size, 0, depth]),
'memory_keys': tf.zeros([batch_size, 0, depth]),
'memory_values': tf.zeros([batch_size, 0, depth]),
}
return cache
def _infer_decoding(self,
embedding_fn,
start_tokens,
end_token,
decode_length,
memory,
memory_attention_bias,
decoding_strategy):
"""Performs "infer_greedy" or "infer_sample" decoding.
"""
batch_size = tf.shape(start_tokens)[0]
finished = tf.fill([batch_size], False)
seq_length = tf.zeros([batch_size], dtype=tf.int32)
step = tf.constant(0)
decoded_ids = tf.zeros([batch_size, 0], dtype=tf.int32)
logits_list = tf.zeros([batch_size, 0, self._vocab_size],
dtype=tf.float32)
next_id = tf.expand_dims(start_tokens, 1)
cache = self._init_cache(memory, memory_attention_bias)
symbols_to_logits_fn = self._symbols_to_logits_fn(
embedding_fn,
max_length=decode_length+1
)
def _body(step, finished, next_id, decoded_ids, cache, logits_list,
seq_length):
logits, cache = symbols_to_logits_fn(next_id, step, cache)
if decoding_strategy == 'infer_greedy':
next_id = tf.argmax(logits, -1, output_type=tf.int32)
elif decoding_strategy == 'infer_sample':
sample_id_sampler = tf.distributions.Categorical(logits=logits)
next_id = sample_id_sampler.sample()
cur_finished = tf.equal(next_id, end_token)
update_len = tf.logical_and(
tf.logical_not(finished),
cur_finished)
seq_length = tf.where(
update_len,
tf.fill(tf.shape(seq_length), step+1),
seq_length)
next_id = tf.expand_dims(next_id, axis=1)
finished |= cur_finished
# Keep the shape as [batch_size, seq_len]
logits = tf.expand_dims(logits, axis=1)
logits_list = tf.concat([logits_list, logits], axis=1)
decoded_ids = tf.concat([decoded_ids, next_id], axis=1)
return step+1, finished, next_id, decoded_ids, cache, \
logits_list, seq_length
def _not_finished(i, finished, *_):
return (i < decode_length) & tf.logical_not(tf.reduce_all(finished))
_, _, _, decoded_ids, _, logits_list, seq_length = tf.while_loop(
_not_finished,
_body,
loop_vars=(step, finished, next_id, decoded_ids, cache, logits_list,
seq_length),
shape_invariants=(
tf.TensorShape([]),
tf.TensorShape([None]),
tf.TensorShape([None, None]),
tf.TensorShape([None, None]),
nest.map_structure(beam_search.get_state_shape_invariants,
cache),
tf.TensorShape([None, None, None]),
tf.TensorShape([None])
)
)
return logits_list, decoded_ids, seq_length
def _beam_decode(self,
embedding_fn,
start_tokens,
end_token,
memory,
memory_attention_bias,
decode_length=256,
beam_width=5,
alpha=0.6):
cache = self._init_cache(memory, memory_attention_bias)
symbols_to_logits_fn = self._symbols_to_logits_fn(
embedding_fn,
max_length=decode_length+1)
outputs, log_prob = beam_search.beam_search(
symbols_to_logits_fn,
start_tokens,
beam_width,
decode_length,
self._vocab_size,
alpha,
states=cache,
eos_id=end_token)
# Ignore the start token at the first position.
outputs = outputs[:, :, 1:]
# shape = [batch_size, seq_length, beam_width]
outputs = tf.transpose(outputs, [0, 2, 1])
return (outputs, log_prob)
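The `tf.while_loop` in `_infer_decoding` above does three pieces of bookkeeping per step: take the argmax (or a sample) of the logits, record the step at which each batch element first emits `end_token`, and stop once every sequence is finished. That logic can be mirrored in plain NumPy; the `fake_logits` function below is a hypothetical stand-in for `symbols_to_logits_fn` that finishes sequence `b` at step `b + 1` so lengths differ across the batch.

```python
import numpy as np

def fake_logits(next_id, step, vocab_size=6, end_token=2):
    """Hypothetical stand-in for symbols_to_logits_fn: emit end_token
    at a batch-dependent step so sequences finish at different lengths."""
    batch = next_id.shape[0]
    logits = np.zeros((batch, vocab_size))
    for b in range(batch):
        tok = end_token if step >= b + 1 else 3 + b % 2
        logits[b, tok] = 1.0
    return logits

def infer_greedy(batch, end_token=2, decode_length=8):
    """Mirror of the _body/_not_finished loop in _infer_decoding."""
    finished = np.zeros(batch, dtype=bool)
    seq_length = np.zeros(batch, dtype=np.int32)
    next_id = np.zeros(batch, dtype=np.int32)
    decoded, step = [], 0
    while step < decode_length and not finished.all():
        logits = fake_logits(next_id, step)
        next_id = np.argmax(logits, axis=-1)
        cur_finished = next_id == end_token
        # Record the length only the first time a sequence finishes.
        seq_length = np.where(~finished & cur_finished, step + 1, seq_length)
        finished |= cur_finished
        decoded.append(next_id)
        step += 1
    return np.stack(decoded, axis=1), seq_length

ids, lengths = infer_greedy(batch=3)  # lengths -> [2, 3, 4]
```

Note that, as in the TensorFlow version, already-finished sequences keep stepping until the whole batch is done; `seq_length` is what downstream code should trust, not the padded width of `decoded`.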
================================================
FILE: texar_repo/texar/modules/decoders/transformer_decoders_test.py
================================================
#
"""
Unit tests for the Transformer decoder.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
from texar.modules.decoders.transformer_decoders import TransformerDecoder
from texar.modules.decoders.transformer_decoders import TransformerDecoderOutput
# pylint: disable=too-many-instance-attributes
class TransformerDecoderTest(tf.test.TestCase):
"""Tests :class:`~texar.modules.TransformerDecoder`
"""
def setUp(self):
tf.test.TestCase.setUp(self)
self._vocab_size = 15
self._batch_size = 6
self._max_time = 10
self._emb_dim = 512
self._max_decode_len = 32
self._inputs = tf.random_uniform(
[self._batch_size, self._max_time, self._emb_dim],
maxval=1, dtype=tf.float32)
self._memory = tf.random_uniform(
[self._batch_size, self._max_time, self._emb_dim],
maxval=1, dtype=tf.float32)
self._memory_sequence_length = tf.random_uniform(
[self._batch_size], maxval=self._max_time, dtype=tf.int32)
self._embedding = tf.random_uniform(
[self._vocab_size, self._emb_dim], maxval=1, dtype=tf.float32)
self._start_tokens = tf.fill([self._batch_size], 1)
self.max_decoding_length = self._max_time
def test_train(self):
"""Tests train_greedy
"""
decoder = TransformerDecoder(embedding=self._embedding)
# 6 blocks
# -self multihead_attention: 4 dense without bias + 2 layer norm vars
# -encdec multihead_attention: 4 dense without bias + 2 layer norm vars
# -poswise_network: Dense with bias, Dense with bias + 2 layer norm vars
# 2 layer norm vars
outputs = decoder(memory=self._memory,
memory_sequence_length=self._memory_sequence_length,
memory_attention_bias=None,
inputs=self._inputs,
decoding_strategy='train_greedy',
mode=tf.estimator.ModeKeys.TRAIN)
self.assertEqual(len(decoder.trainable_variables), 110)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_ = sess.run(outputs)
self.assertIsInstance(outputs_, TransformerDecoderOutput)
def test_infer_greedy(self):
"""Tests train_greedy
"""
decoder = TransformerDecoder(embedding=self._embedding)
outputs, length = decoder(
memory=self._memory,
memory_sequence_length=self._memory_sequence_length,
memory_attention_bias=None,
inputs=None,
decoding_strategy='infer_greedy',
beam_width=1,
start_tokens=self._start_tokens,
end_token=2,
max_decoding_length=self._max_decode_len,
mode=tf.estimator.ModeKeys.PREDICT)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_ = sess.run(outputs)
self.assertIsInstance(outputs_, TransformerDecoderOutput)
def test_infer_sample(self):
"""Tests infer_sample
"""
decoder = TransformerDecoder(embedding=self._embedding)
outputs, length = decoder(
memory=self._memory,
memory_sequence_length=self._memory_sequence_length,
memory_attention_bias=None,
inputs=None,
decoding_strategy='infer_sample',
beam_width=1,
start_tokens=self._start_tokens,
end_token=2,
max_decoding_length=self._max_decode_len,
mode=tf.estimator.ModeKeys.PREDICT)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_ = sess.run(outputs)
self.assertIsInstance(outputs_, TransformerDecoderOutput)
def test_beam_search(self):
"""Tests beam_search
"""
decoder = TransformerDecoder(embedding=self._embedding)
outputs = decoder(
memory=self._memory,
memory_sequence_length=self._memory_sequence_length,
memory_attention_bias=None,
inputs=None,
beam_width=5,
start_tokens=self._start_tokens,
end_token=2,
max_decoding_length=self._max_decode_len,
mode=tf.estimator.ModeKeys.PREDICT)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_ = sess.run(outputs)
self.assertEqual(outputs_['log_prob'].shape,
(self._batch_size, 5))
self.assertEqual(outputs_['sample_id'].shape,
(self._batch_size, self._max_decode_len, 5))
if __name__ == "__main__":
tf.test.main()
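The memory attention bias that `TransformerDecoder._build` constructs (via `tf.sequence_mask` and `attention_bias_ignore_padding`) amounts to adding a large negative number to attention scores at padded memory positions, so that softmax assigns them near-zero weight. A NumPy sketch of that idea, under the usual `[batch, heads, query_time, memory_time]` broadcasting convention:

```python
import numpy as np

def attention_bias_ignore_padding(lengths, max_time, neg=-1e9):
    """Bias of shape [batch, 1, 1, max_time]: 0 at valid positions,
    a large negative value at padding, so softmax(scores + bias)
    gives padded positions ~zero attention weight."""
    mask = np.arange(max_time)[None, :] < np.asarray(lengths)[:, None]
    padding = 1.0 - mask.astype(np.float32)
    return (padding * neg)[:, None, None, :]

bias = attention_bias_ignore_padding([2, 3], max_time=4)
scores = np.zeros((2, 1, 1, 4)) + bias       # uniform scores + bias
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
# weights[0, 0, 0] ~ [0.5, 0.5, 0, 0]; weights[1, 0, 0] ~ [1/3, 1/3, 1/3, 0]
```

The decoder's causal mask (`attention_bias_lower_triangle`) is built the same way, except the masked-out positions are future query steps rather than memory padding.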
================================================
FILE: texar_repo/texar/modules/embedders/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library embedders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.modules.embedders.embedder_base import *
from texar.modules.embedders.embedders import *
from texar.modules.embedders.position_embedders import *
================================================
FILE: texar_repo/texar/modules/embedders/embedder_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The base embedder class.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.module_base import ModuleBase
from texar.modules.embedders import embedder_utils
# pylint: disable=invalid-name
__all__ = [
"EmbedderBase"
]
class EmbedderBase(ModuleBase):
"""The base embedder class that all embedder classes inherit.
Args:
num_embeds (int, optional): The number of embedding elements, e.g.,
the vocabulary size of a word embedder.
hparams (dict or HParams, optional): Embedder hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
"""
def __init__(self, num_embeds=None, hparams=None):
ModuleBase.__init__(self, hparams)
self._num_embeds = num_embeds
# pylint: disable=attribute-defined-outside-init
def _init_parameterized_embedding(self, init_value, num_embeds, hparams):
self._embedding = embedder_utils.get_embedding(
hparams, init_value, num_embeds, self.variable_scope)
if hparams.trainable:
self._add_trainable_variable(self._embedding)
self._num_embeds = self._embedding.get_shape().as_list()[0]
self._dim = self._embedding.get_shape().as_list()[1:]
self._dim_rank = len(self._dim)
if self._dim_rank == 1:
self._dim = self._dim[0]
def _get_dropout_layer(self, hparams, ids_rank=None, dropout_input=None,
dropout_strategy=None):
"""Creates dropout layer according to dropout strategy.
Called in :meth:`_build()`.
"""
dropout_layer = None
st = dropout_strategy
st = hparams.dropout_strategy if st is None else st
if hparams.dropout_rate > 0.:
if st == 'element':
noise_shape = None
elif st == 'item':
noise_shape = tf.concat([tf.shape(dropout_input)[:ids_rank],
tf.ones([self._dim_rank], tf.int32)],
axis=0)
elif st == 'item_type':
noise_shape = [None] + [1] * self._dim_rank
else:
raise ValueError('Unknown dropout strategy: {}'.format(st))
dropout_layer = tf.layers.Dropout(
rate=hparams.dropout_rate, noise_shape=noise_shape)
return dropout_layer
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"name": "embedder"
}
"""
return {
"name": "embedder"
}
def _build(self, *args, **kwargs):
raise NotImplementedError
@property
def num_embeds(self):
"""The number of embedding elements.
"""
return self._num_embeds
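The `noise_shape` logic in `_get_dropout_layer` above determines which entries share a dropout fate. This can be sketched with plain numpy (a toy illustration with assumed sizes, not the library's TF implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time, dim, vocab = 2, 4, 3, 5
emb = rng.normal(size=(batch, time, dim))  # embedded ids: ids_rank=2, dim_rank=1

# 'element': an independent mask per scalar element (noise_shape=None)
element_mask = rng.random(emb.shape) > 0.5

# 'item': one mask value per (batch, time) slot, broadcast over the embedding
# dim, mirroring noise_shape = shape(input)[:ids_rank] + [1] * dim_rank
item_mask = rng.random((batch, time, 1)) > 0.5

# 'item_type': one mask value per vocabulary row, applied to the embedding
# table itself before lookup; noise_shape = [None] + [1] * dim_rank
table_mask = rng.random((vocab, 1)) > 0.5

dropped_item = emb * item_mask
# Under 'item', all dim entries of one (batch, time) slot drop together
assert all(
    (dropped_item[b, t] == 0).all() or (dropped_item[b, t] != 0).all()
    for b in range(batch) for t in range(time)
)
```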
================================================
FILE: texar_repo/texar/modules/embedders/embedder_utils.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utils of embedder.
"""
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import tensorflow as tf
from texar.hyperparams import HParams
from texar.core import layers
__all__ = [
"default_embedding_hparams",
"get_embedding",
"soft_embedding_lookup"
]
def default_embedding_hparams():
"""Returns a `dict` of hyperparameters and default values of an embedder.
See :meth:`~texar.modules.WordEmbedder.default_hparams` for details.
.. code-block:: python
{
"name": "embedding",
"dim": 100,
"initializer": None,
"regularizer": {
"type": "L1L2",
"kwargs": {
"l1": 0.,
"l2": 0.
}
},
"dropout_rate": 0.,
"dropout_strategy": 'element',
"trainable": True,
}
Here:
"name" : str
Name of the embedding variable.
"dim" : int or list
Embedding dimension. Can be a list of integers to yield embeddings
with dimensionality > 1.
"initializer" : dict or None
Hyperparameters of the initializer for the embedding values. An
example is as
.. code-block:: python
{
"type": "random_uniform_initializer",
"kwargs": {
"minval": -0.1,
"maxval": 0.1,
"seed": None
}
}
which corresponds to :tf_main:`tf.random_uniform_initializer
`, and includes:
"type" : str or initializer instance
Name, full path, or instance of the initializer class; Or name
or full path to a function that returns the initializer class.
The class or function can be
- Built-in initializer defined in \
:tf_main:`tf.initializers `, e.g., \
:tf_main:`random_uniform ` \
(a.k.a :class:`tf.random_uniform_initializer`), or \
in :mod:`tf`, e.g., :tf_main:`glorot_uniform_initializer \
`, or in \
:tf_main:`tf.keras.initializers `.
- User-defined initializer in :mod:`texar.custom`.
- External initializer. Must provide the full path, \
e.g., :attr:`"my_module.MyInitializer"`, or the instance.
"kwargs" : dict
A dictionary of arguments for constructor of the
initializer class or for the function. An initializer is
created by `initializer = initializer_class_or_fn(**kwargs)`
where :attr:`initializer_class_or_fn` is specified in
:attr:`"type"`.
Ignored if :attr:`"type"` is an initializer instance.
"regularizer" : dict
Hyperparameters of the regularizer for the embedding values. The
regularizer must be an instance of
the base :tf_main:`Regularizer `
class. The hyperparameters include:
"type" : str or Regularizer instance
Name, full path, or instance of the regularizer class. The
class can be
- Built-in regularizer defined in
:tf_main:`tf.keras.regularizers `, e.g.,
:tf_main:`L1L2 `.
- User-defined regularizer in :mod:`texar.custom`. The
regularizer class should inherit the base class
:tf_main:`Regularizer `.
- External regularizer. Must provide the full path, \
e.g., :attr:`"my_module.MyRegularizer"`, or the instance.
"kwargs" : dict
A dictionary of arguments for constructor of the
regularizer class. A regularizer is created by
calling `regularizer_class(**kwargs)` where
:attr:`regularizer_class` is specified in :attr:`"type"`.
Ignored if :attr:`"type"` is a Regularizer instance.
The default value corresponds to
:tf_main:`L1L2 ` with `(l1=0, l2=0)`,
which disables regularization.
"dropout_rate" : float
The dropout rate between 0 and 1. E.g., `dropout_rate=0.1` would
drop out 10% of the embedding.
"dropout_strategy" : str
The dropout strategy. Can be one of the following
- 'element': The regular strategy that drops individual elements \
in the embedding vectors.
- 'item': Drops individual items (e.g., words) entirely. E.g., for \
the word sequence 'the simpler the better', the strategy can \
yield '_ simpler the better', where the first `the` is dropped.
- 'item_type': Drops item types (e.g., word types). E.g., for the \
above sequence, the strategy can yield '_ simpler _ better', \
where the word type 'the' is dropped. The dropout will never \
yield '_ simpler the better' as in the 'item' strategy.
"trainable" : bool
Whether the embedding is trainable.
"""
return {
"name": "embedding",
"dim": 100,
"initializer": None,
"regularizer": layers.default_regularizer_hparams(),
"dropout_rate": 0.,
"dropout_strategy": 'element',
"trainable": True,
"@no_typecheck": ["dim"]
}
def get_embedding(hparams=None,
init_value=None,
num_embeds=None,
variable_scope='Embedding'):
"""Creates embedding variable if not exists.
Args:
hparams (dict or HParams, optional): Embedding hyperparameters. Missing
hyperparameters are set to default values. See
:func:`~texar.modules.default_embedding_hparams`
for all hyperparameters and default values.
If :attr:`init_value` is given, :attr:`hparams["initializer"]`,
and :attr:`hparams["dim"]` are ignored.
init_value (Tensor or numpy array, optional): Initial values of the
embedding variable. If not given, embedding is initialized as
specified in :attr:`hparams["initializer"]`.
num_embeds (int, optional): The number of embedding items
(e.g., vocabulary size). Required if :attr:`init_value` is
not provided.
variable_scope (str or VariableScope, optional): Variable scope of
the embedding variable.
Returns:
Variable or Tensor: A 2D `Variable` or `Tensor` of the same shape with
:attr:`init_value` or of the shape
:attr:`[num_embeds, hparams["dim"]]`.
"""
with tf.variable_scope(variable_scope):
if hparams is None or isinstance(hparams, dict):
hparams = HParams(hparams, default_embedding_hparams())
regularizer = layers.get_regularizer(hparams["regularizer"])
if init_value is None:
initializer = layers.get_initializer(hparams["initializer"])
dim = hparams["dim"]
if not isinstance(hparams["dim"], (list, tuple)):
dim = [dim]
embedding = tf.get_variable(name='w',
shape=[num_embeds] + dim,
initializer=initializer,
regularizer=regularizer,
trainable=hparams["trainable"])
else:
embedding = tf.get_variable(name='w',
initializer=tf.to_float(init_value),
regularizer=regularizer,
trainable=hparams["trainable"])
return embedding
def soft_embedding_lookup(embedding, soft_ids):
"""Transforms soft ids (e.g., probability distribution over ids) into
embeddings, by mixing the embedding vectors with the soft weights.
Args:
embedding: A Tensor of shape `[num_classes] + embedding-dim` containing
the embedding vectors. Embedding can have dimensionality > 1, i.e.,
:attr:`embedding` can be of shape
`[num_classes, emb_dim_1, emb_dim_2, ...]`
soft_ids: A Tensor of weights (probabilities) used to mix the
embedding vectors.
Returns:
A Tensor of shape `shape(soft_ids)[:-1] + shape(embedding)[1:]`. For
example, if `shape(soft_ids) = [batch_size, max_time, vocab_size]`
and `shape(embedding) = [vocab_size, emb_dim]`, then the return tensor
has shape `[batch_size, max_time, emb_dim]`.
Example::
decoder_outputs, ... = decoder(...)
soft_seq_emb = soft_embedding_lookup(
embedding, tf.nn.softmax(decoder_outputs.logits))
"""
return tf.tensordot(tf.to_float(soft_ids), embedding, [-1, 0])
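The tensordot in `soft_embedding_lookup` contracts the last axis of `soft_ids` with the first axis of `embedding`. A numpy sketch of the same contraction, with assumed toy sizes, shows that a one-hot distribution reduces to an ordinary lookup and a uniform distribution yields the mean embedding:

```python
import numpy as np

vocab_size, emb_dim = 5, 3
embedding = np.arange(vocab_size * emb_dim, dtype=float).reshape(vocab_size, emb_dim)

def soft_lookup(embedding, soft_ids):
    # Mix embedding rows by the soft weights, as tf.tensordot(..., [-1, 0])
    return np.tensordot(soft_ids, embedding, axes=([-1], [0]))

# One-hot soft ids behave exactly like hard ids
soft_ids = np.zeros((2, 1, vocab_size))
soft_ids[0, 0, 3] = 1.0
soft_ids[1, 0, 1] = 1.0
out = soft_lookup(embedding, soft_ids)           # shape [2, 1, emb_dim]
assert np.allclose(out[0, 0], embedding[3])
assert np.allclose(out[1, 0], embedding[1])

# A uniform distribution yields the mean embedding vector
uniform = np.full((1, vocab_size), 1.0 / vocab_size)
assert np.allclose(soft_lookup(embedding, uniform), embedding.mean(axis=0))
```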
================================================
FILE: texar_repo/texar/modules/embedders/embedder_utils_test.py
================================================
#
"""
Unit tests for embedder utils.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=no-member
import tensorflow as tf
from texar.modules.embedders import embedder_utils
class GetEmbeddingTest(tf.test.TestCase):
"""Tests embedding creator.
"""
def test_get_embedding(self):
"""Tests :func:`~texar.modules.embedder.embedder_utils.get_embedding`.
"""
vocab_size = 100
emb = embedder_utils.get_embedding(num_embeds=vocab_size)
self.assertEqual(emb.shape[0].value, vocab_size)
self.assertEqual(emb.shape[1].value,
embedder_utils.default_embedding_hparams()["dim"])
hparams = {
"initializer": {
"type": tf.random_uniform_initializer(minval=-0.1, maxval=0.1)
},
"regularizer": {
"type": tf.keras.regularizers.L1L2(0.1, 0.1)
}
}
emb = embedder_utils.get_embedding(
hparams=hparams, num_embeds=vocab_size,
variable_scope='embedding_2')
self.assertEqual(emb.shape[0].value, vocab_size)
self.assertEqual(emb.shape[1].value,
embedder_utils.default_embedding_hparams()["dim"])
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/modules/embedders/embedders.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various embedders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from texar.modules.embedders.embedder_base import EmbedderBase
from texar.modules.embedders import embedder_utils
from texar.utils.mode import is_train_mode
from texar.utils.shapes import get_rank
__all__ = [
"WordEmbedder"
]
class WordEmbedder(EmbedderBase):
"""Simple word embedder that maps indexes into embeddings. The indexes
can be soft (e.g., distributions over vocabulary).
Either :attr:`init_value` or :attr:`vocab_size` is required. If both are
given, there must be `init_value.shape[0]==vocab_size`.
Args:
init_value (optional): A `Tensor` or numpy array that contains the
initial value of embeddings. It is typically of shape
`[vocab_size] + embedding-dim`. Embedding can have dimensionality
> 1.
If `None`, embedding is initialized as specified in
:attr:`hparams["initializer"]`. Otherwise, the
:attr:`"initializer"` and :attr:`"dim"`
hyperparameters in :attr:`hparams` are ignored.
vocab_size (int, optional): The vocabulary size. Required if
:attr:`init_value` is not given.
hparams (dict, optional): Embedder hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
See :meth:`_build` for the inputs and outputs of the embedder.
Example:
.. code-block:: python
ids = tf.random_uniform(shape=[32, 10], maxval=10, dtype=tf.int64)
soft_ids = tf.random_uniform(shape=[32, 10, 100])
embedder = WordEmbedder(vocab_size=100, hparams={'dim': 256})
ids_emb = embedder(ids=ids) # shape: [32, 10, 256]
soft_ids_emb = embedder(soft_ids=soft_ids) # shape: [32, 10, 256]
.. code-block:: python
## Use with Texar data module
hparams={
'dataset': {
'embedding_init': {'file': 'word2vec.txt'}
...
},
}
data = MonoTextData(data_params)
iterator = DataIterator(data)
batch = iterator.get_next()
# Use data vocab size
embedder_1 = WordEmbedder(vocab_size=data.vocab.size)
emb_1 = embedder_1(batch['text_ids'])
# Use pre-trained embedding
embedder_2 = WordEmbedder(init_value=data.embedding_init_value)
emb_2 = embedder_2(batch['text_ids'])
.. document private functions
.. automethod:: _build
"""
def __init__(self, init_value=None, vocab_size=None, hparams=None):
EmbedderBase.__init__(self, hparams=hparams)
if init_value is None and vocab_size is None:
raise ValueError(
"Either `init_value` or `vocab_size` is required.")
self._init_parameterized_embedding(init_value, vocab_size,
self._hparams)
self._vocab_size = vocab_size
if vocab_size is None:
self._vocab_size = self._num_embeds
if self._vocab_size != self._num_embeds:
raise ValueError(
'vocab_size must equal init_value.shape[0]. '
'Got %d and %d' % (self._vocab_size, self._num_embeds))
self._built = True
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"dim": 100,
"dropout_rate": 0,
"dropout_strategy": 'element',
"trainable": True,
"initializer": {
"type": "random_uniform_initializer",
"kwargs": {
"minval": -0.1,
"maxval": 0.1,
"seed": None
}
},
"regularizer": {
"type": "L1L2",
"kwargs": {
"l1": 0.,
"l2": 0.
}
},
"name": "word_embedder",
}
Here:
"dim" : int or list
Embedding dimension. Can be a list of integers to yield embeddings
with dimensionality > 1.
Ignored if :attr:`init_value` is given to the embedder constructor.
"dropout_rate" : float
The dropout rate between 0 and 1. E.g., `dropout_rate=0.1` would
drop out 10% of the embedding. Set to 0 to disable dropout.
"dropout_strategy" : str
The dropout strategy. Can be one of the following
- :attr:`"element"`: The regular strategy that drops individual \
elements of embedding vectors.
- :attr:`"item"`: Drops individual items (e.g., words) entirely. \
E.g., for \
the word sequence 'the simpler the better', the strategy can \
yield '_ simpler the better', where the first `the` is dropped.
- :attr:`"item_type"`: Drops item types (e.g., word types). \
E.g., for the \
above sequence, the strategy can yield '_ simpler _ better', \
where the word type 'the' is dropped. The dropout will never \
yield '_ simpler the better' as in the 'item' strategy.
"trainable" : bool
Whether the embedding is trainable.
"initializer" : dict or None
Hyperparameters of the initializer for embedding values. See
:func:`~texar.core.get_initializer` for the details. Ignored if
:attr:`init_value` is given to the embedder constructor.
"regularizer" : dict
Hyperparameters of the regularizer for embedding values. See
:func:`~texar.core.get_regularizer` for the details.
"name" : str
Name of the embedding variable.
"""
hparams = embedder_utils.default_embedding_hparams()
hparams["name"] = "word_embedder"
return hparams
def _build(self, ids=None, soft_ids=None, mode=None, **kwargs):
"""Embeds (soft) ids.
Either :attr:`ids` or :attr:`soft_ids` must be given, and they
must not be given at the same time.
Args:
ids (optional): An integer tensor containing the ids to embed.
soft_ids (optional): A tensor of weights (probabilities) used to
mix the embedding vectors.
mode (optional): A tensor taking value in
:tf_main:`tf.estimator.ModeKeys `, including
`TRAIN`, `EVAL`, and `PREDICT`. If `None`, dropout is
controlled by :func:`texar.global_mode`.
kwargs: Additional keyword arguments for
:tf_main:`tf.nn.embedding_lookup ` besides
:attr:`params` and :attr:`ids`.
Returns:
If :attr:`ids` is given, returns a Tensor of shape
`shape(ids) + embedding-dim`. For example,
if `shape(ids) = [batch_size, max_time]`
and `shape(embedding) = [vocab_size, emb_dim]`, then the return
tensor has shape `[batch_size, max_time, emb_dim]`.
If :attr:`soft_ids` is given, returns a Tensor of shape
`shape(soft_ids)[:-1] + embedding-dim`. For example,
if `shape(soft_ids) = [batch_size, max_time, vocab_size]`
and `shape(embedding) = [vocab_size, emb_dim]`, then the return
tensor has shape `[batch_size, max_time, emb_dim]`.
"""
if ids is not None:
if soft_ids is not None:
raise ValueError(
'Must not specify `ids` and `soft_ids` at the same time.')
ids_rank = get_rank(ids)
elif soft_ids is not None:
ids_rank = get_rank(soft_ids) - 1
else:
raise ValueError('Either `ids` or `soft_ids` must be given.')
embedding = self._embedding
is_training = is_train_mode(mode)
if self._hparams.dropout_strategy == 'item_type':
dropout_layer = self._get_dropout_layer(self._hparams)
if dropout_layer:
embedding = dropout_layer.apply(inputs=embedding,
training=is_training)
if ids is not None:
outputs = tf.nn.embedding_lookup(embedding, ids, **kwargs)
else:
outputs = embedder_utils.soft_embedding_lookup(embedding, soft_ids)
if self._hparams.dropout_strategy != 'item_type':
dropout_layer = self._get_dropout_layer(
self._hparams, ids_rank=ids_rank, dropout_input=outputs)
if dropout_layer:
outputs = dropout_layer.apply(
inputs=outputs, training=is_training)
return outputs
@property
def embedding(self):
"""The embedding tensor, of shape `[vocab_size] + dim`.
"""
return self._embedding
@property
def dim(self):
"""The embedding dimension.
"""
return self._dim
@property
def vocab_size(self):
"""The vocabulary size.
"""
return self._vocab_size
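The output-shape rule documented in `WordEmbedder._build` (`shape(ids) + embedding-dim`, and `shape(soft_ids)[:-1] + embedding-dim` for soft ids) also holds for embeddings with dimensionality > 1. A numpy sketch with assumed toy sizes:

```python
import numpy as np

vocab_size = 10
emb_dim = (4, 2)  # dimensionality > 1, like hparams {"dim": [4, 2]}
table = np.random.default_rng(1).normal(size=(vocab_size,) + emb_dim)

ids = np.array([[1, 2, 3], [4, 5, 6]])  # shape [2, 3]
out = table[ids]                        # fancy indexing ~ tf.nn.embedding_lookup
assert out.shape == (2, 3, 4, 2)        # shape(ids) + embedding-dim

# For soft ids, the distribution axis is contracted away, which is why
# _build computes ids_rank = get_rank(soft_ids) - 1
soft_ids = np.random.default_rng(2).random((2, 3, vocab_size))
soft_out = np.tensordot(soft_ids, table, axes=([-1], [0]))
assert soft_out.shape == (2, 3, 4, 2)   # shape(soft_ids)[:-1] + embedding-dim
```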
================================================
FILE: texar_repo/texar/modules/embedders/embedders_test.py
================================================
#
"""
Unit tests for embedders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
# pylint: disable=no-member
import numpy as np
import tensorflow as tf
from texar.modules.embedders.embedders import WordEmbedder
from texar.modules.embedders.position_embedders import PositionEmbedder
from texar.context import global_mode
class EmbedderTest(tf.test.TestCase):
"""Tests parameterized embedder.
"""
def _test_word_embedder(self, hparams):
"""Tests :class:`texar.modules.WordEmbedder`.
"""
embedder = WordEmbedder(
vocab_size=100, hparams=hparams)
inputs = tf.ones([64, 16], dtype=tf.int32)
outputs = embedder(inputs)
inputs_soft = tf.ones([64, 16, embedder.vocab_size], dtype=tf.float32)
outputs_soft = embedder(soft_ids=inputs_soft)
emb_dim = embedder.dim
if not isinstance(emb_dim, (list, tuple)):
emb_dim = [emb_dim]
hparams_dim = hparams["dim"]
if not isinstance(hparams["dim"], (list, tuple)):
hparams_dim = [hparams["dim"]]
self.assertEqual(outputs.shape, [64, 16] + emb_dim)
self.assertEqual(outputs_soft.shape, [64, 16] + emb_dim)
self.assertEqual(emb_dim, hparams_dim)
self.assertEqual(embedder.vocab_size, 100)
self.assertEqual(len(embedder.trainable_variables), 1)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_, outputs_soft_ = sess.run(
[outputs, outputs_soft],
feed_dict={global_mode(): tf.estimator.ModeKeys.TRAIN})
self.assertEqual(outputs_.shape, (64, 16) + tuple(emb_dim))
self.assertEqual(outputs_soft_.shape, (64, 16) + tuple(emb_dim))
# Tests unknown input shapes
inputs = tf.placeholder(dtype=tf.int64, shape=[None, None])
outputs = embedder(inputs)
self.assertEqual(len(outputs.get_shape()), 2 + len(hparams_dim))
inputs_soft = tf.placeholder(dtype=tf.int64, shape=[None, None, None])
outputs_soft = embedder(soft_ids=inputs_soft)
self.assertEqual(len(outputs_soft.get_shape()), 2 + len(hparams_dim))
def _test_position_embedder(self, hparams):
"""Tests :class:`texar.modules.PositionEmbedder`.
"""
pos_size = 100
embedder = PositionEmbedder(
position_size=pos_size, hparams=hparams)
inputs = tf.ones([64, 16], dtype=tf.int32)
outputs = embedder(inputs)
emb_dim = embedder.dim
if not isinstance(emb_dim, (list, tuple)):
emb_dim = [emb_dim]
hparams_dim = hparams["dim"]
if not isinstance(hparams["dim"], (list, tuple)):
hparams_dim = [hparams["dim"]]
self.assertEqual(outputs.shape, [64, 16] + emb_dim)
self.assertEqual(emb_dim, hparams_dim)
self.assertEqual(embedder.position_size, 100)
self.assertEqual(len(embedder.trainable_variables), 1)
seq_length = tf.random_uniform([64], maxval=pos_size, dtype=tf.int32)
outputs = embedder(sequence_length=seq_length)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_, max_seq_length = sess.run(
[outputs, tf.reduce_max(seq_length)],
feed_dict={global_mode(): tf.estimator.ModeKeys.TRAIN})
self.assertEqual(outputs_.shape,
(64, max_seq_length) + tuple(emb_dim))
def test_embedder(self):
"""Tests various embedders.
"""
# no dropout
hparams = {"dim": 1024, "dropout_rate": 0}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024], "dropout_rate": 0}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024, 10], "dropout_rate": 0}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
# dropout with default strategy
hparams = {"dim": 1024, "dropout_rate": 0.3}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024], "dropout_rate": 0.3}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024, 10], "dropout_rate": 0.3}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
# dropout with different strategies
hparams = {"dim": 1024, "dropout_rate": 0.3,
"dropout_strategy": "item"}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024], "dropout_rate": 0.3,
"dropout_strategy": "item"}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024, 10], "dropout_rate": 0.3,
"dropout_strategy": "item"}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": 1024, "dropout_rate": 0.3,
"dropout_strategy": "item_type"}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024], "dropout_rate": 0.3,
"dropout_strategy": "item_type"}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
hparams = {"dim": [1024, 10], "dropout_rate": 0.3,
"dropout_strategy": "item_type"}
self._test_word_embedder(hparams)
self._test_position_embedder(hparams)
def test_embedder_multi_calls(self):
"""Tests embedders called by multiple times.
"""
hparams = {"dim": 1024, "dropout_rate": 0.3,
"dropout_strategy": "item"}
embedder = WordEmbedder(
vocab_size=100, hparams=hparams)
inputs = tf.ones([64, 16], dtype=tf.int32)
outputs = embedder(inputs)
emb_dim = embedder.dim
if not isinstance(emb_dim, (list, tuple)):
emb_dim = [emb_dim]
self.assertEqual(outputs.shape, [64, 16] + emb_dim)
# Call with inputs in a different shape
inputs = tf.ones([64, 10, 20], dtype=tf.int32)
outputs = embedder(inputs)
emb_dim = embedder.dim
if not isinstance(emb_dim, (list, tuple)):
emb_dim = [emb_dim]
self.assertEqual(outputs.shape, [64, 10, 20] + emb_dim)
def test_word_embedder_soft_ids(self):
"""Tests the correctness of using soft ids.
"""
init_value = np.expand_dims(np.arange(5), 1)
embedder = WordEmbedder(init_value=init_value)
ids = np.array([3])
soft_ids = np.array([[0, 0, 0, 1, 0]])
outputs = embedder(ids=ids)
soft_outputs = embedder(soft_ids=soft_ids)
with self.test_session() as sess:
sess.run(tf.global_variables_initializer())
outputs_, soft_outputs_ = sess.run([outputs, soft_outputs])
self.assertEqual(outputs_, soft_outputs_)
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/modules/embedders/position_embedders.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various position embedders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import tensorflow as tf
from texar.modules.embedders.embedder_base import EmbedderBase
from texar.modules.embedders import embedder_utils
from texar.utils.mode import is_train_mode
from texar.utils.shapes import mask_sequences
# pylint: disable=arguments-differ, invalid-name
__all__ = [
"PositionEmbedder",
"SinusoidsPositionEmbedder",
]
class PositionEmbedder(EmbedderBase):
"""Simple position embedder that maps position indexes into embeddings
via lookup.
Either :attr:`init_value` or :attr:`position_size` is required. If both are
given, there must be `init_value.shape[0]==position_size`.
Args:
init_value (optional): A `Tensor` or numpy array that contains the
initial value of embeddings. It is typically of shape
`[position_size, embedding dim]`
If `None`, embedding is initialized as specified in
:attr:`hparams["initializer"]`. Otherwise, the
:attr:`"initializer"` and :attr:`"dim"`
hyperparameters in :attr:`hparams` are ignored.
position_size (int, optional): The number of possible positions, e.g.,
the maximum sequence length. Required if :attr:`init_value` is
not given.
hparams (dict, optional): Embedder hyperparameters. If it is not
specified, the default hyperparameter setting is used. See
:attr:`default_hparams` for the structure and default values.
.. document private functions
.. automethod:: _build
"""
def __init__(self, init_value=None, position_size=None, hparams=None):
EmbedderBase.__init__(self, hparams=hparams)
if init_value is None and position_size is None:
raise ValueError(
"Either `init_value` or `position_size` is required.")
self._init_parameterized_embedding(init_value, position_size,
self._hparams)
self._position_size = position_size
if position_size is None:
self._position_size = self._num_embeds
if self._position_size != self._num_embeds:
raise ValueError(
'position_size must equal init_value.shape[0]. '
'Got %d and %d' % (self._position_size, self._num_embeds))
self._built = True
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. code-block:: python
{
"dim": 100,
"initializer": {
"type": "random_uniform_initializer",
"kwargs": {
"minval": -0.1,
"maxval": 0.1,
"seed": None
}
},
"regularizer": {
"type": "L1L2",
"kwargs": {
"l1": 0.,
"l2": 0.
}
},
"dropout_rate": 0,
"trainable": True,
"name": "position_embedder"
}
The hyperparameters have the same meaning as those in
:meth:`texar.modules.WordEmbedder.default_hparams`.
"""
hparams = embedder_utils.default_embedding_hparams()
hparams["name"] = "position_embedder"
return hparams
def _build(self, positions=None, sequence_length=None, mode=None, **kwargs):
"""Embeds the positions.
Either :attr:`positions` or :attr:`sequence_length` is required:
- If both are given, :attr:`sequence_length` is used to mask out \
embeddings of those time steps beyond the respective sequence \
lengths.
- If only :attr:`sequence_length` is given, then positions \
from `0` to `sequence_length-1` are embedded.
Args:
positions (optional): An integer tensor containing the position
ids to embed.
sequence_length (optional): An integer tensor of shape
`[batch_size]`. Time steps beyond
the respective sequence lengths will have zero-valued
embeddings.
mode (optional): A tensor taking value in
:tf_main:`tf.estimator.ModeKeys `, including
`TRAIN`, `EVAL`, and `PREDICT`. If `None`, dropout will be
controlled by :func:`texar.global_mode`.
kwargs: Additional keyword arguments for
:tf_main:`tf.nn.embedding_lookup ` besides
:attr:`params` and :attr:`ids`.
Returns:
A `Tensor` of shape `shape(inputs) + embedding dimension`.
"""
# Gets embedder inputs
inputs = positions
if positions is None:
if sequence_length is None:
raise ValueError(
'Either `positions` or `sequence_length` is required.')
max_length = tf.reduce_max(sequence_length)
single_inputs = tf.range(start=0, limit=max_length, dtype=tf.int32)
# Expands `single_inputs` to have shape [batch_size, max_length]
expander = tf.expand_dims(tf.ones_like(sequence_length), -1)
inputs = expander * tf.expand_dims(single_inputs, 0)
ids_rank = len(inputs.shape.dims)
embedding = self._embedding
is_training = is_train_mode(mode)
# Gets dropout strategy
st = self._hparams.dropout_strategy
if positions is None and st == 'item':
# If `inputs` is based on `sequence_length`, then dropout
# strategies 'item' and 'item_type' have the same effect, we
# use 'item_type' to avoid unknown noise_shape in the 'item'
# strategy
st = 'item_type'
# Dropouts as 'item_type' before embedding
if st == 'item_type':
dropout_layer = self._get_dropout_layer(
self._hparams, dropout_strategy=st)
if dropout_layer:
embedding = dropout_layer.apply(inputs=embedding,
training=is_training)
# Embeds
outputs = tf.nn.embedding_lookup(embedding, inputs, **kwargs)
# Dropouts as 'item' or 'elements' after embedding
if st != 'item_type':
dropout_layer = self._get_dropout_layer(
self._hparams, ids_rank=ids_rank, dropout_input=outputs,
dropout_strategy=st)
if dropout_layer:
outputs = dropout_layer.apply(inputs=outputs,
training=is_training)
# Optionally masks
if sequence_length is not None:
outputs = mask_sequences(
outputs, sequence_length,
tensor_rank=len(inputs.shape.dims) + self._dim_rank)
return outputs
@property
def embedding(self):
"""The embedding tensor.
"""
return self._embedding
@property
def dim(self):
"""The embedding dimension.
"""
return self._dim
@property
def position_size(self):
"""The position size, i.e., maximum number of positions.
"""
return self._position_size
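The expansion of `sequence_length` into position ids in `_build` above (tf.range tiled across the batch, then masked) can be sketched framework-free. This is plain Python for illustration only; the helper names `positions_from_lengths` and `mask_positions` are hypothetical, not part of Texar:

```python
def positions_from_lengths(sequence_length):
    """Mirror the `_build` logic: positions 0..max_length-1 tiled over the batch."""
    max_length = max(sequence_length)
    single = list(range(max_length))             # like tf.range(0, max_length)
    # like expander * tf.expand_dims(single_inputs, 0) -> [batch_size, max_length]
    return [single[:] for _ in sequence_length]

def mask_positions(values, sequence_length):
    """Zero out time steps beyond each sequence's length (like mask_sequences)."""
    return [[v if t < length else 0 for t, v in enumerate(row)]
            for row, length in zip(values, sequence_length)]
```

For a batch with lengths `[2, 3]`, every row gets positions `[0, 1, 2]`, and masking then zeroes the third step of the first sequence.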
class SinusoidsPositionEmbedder(EmbedderBase):
"""Sinusoid position embedder that maps position indexes into embeddings
via sinusoid calculation. This module does not have trainable parameters.
Used in, e.g., :class:`~texar.modules.TransformerEncoder`.
Each channel of the input Tensor is incremented by a sinusoid of a
different frequency and phase.
This allows attention to learn to use absolute and relative positions.
Timing signals should be added to some precursors of both the query
and the memory inputs to attention.
The use of relative position is possible because sin(x+y) and
cos(x+y) can be expressed in terms of y, sin(x) and cos(x).
In particular, we use a geometric sequence of timescales starting with
min_timescale and ending with max_timescale. The number of different
timescales is equal to dim / 2. For each timescale, we
generate the two sinusoidal signals sin(timestep/timescale) and
cos(timestep/timescale). All of these sinusoids are concatenated in
the dim dimension.
.. document private functions
.. automethod:: _build
"""
def __init__(self, hparams=None):
EmbedderBase.__init__(self, hparams=hparams)
def default_hparams(self):
"""Returns a dictionary of hyperparameters with default values
We use a geometric sequence of timescales starting with
min_timescale and ending with max_timescale. The number of different
timescales is equal to dim/2.
.. code-block:: python
{
'min_timescale': 1.0,
'max_timescale': 10000.0,
'dim': 512,
'name': 'sinusoid_position_embedder',
}
"""
hparams = {
'min_timescale': 1.0,
'max_timescale': 1.0e4,
'dim': 512,
'name': 'sinusoid_position_embedder',
}
return hparams
def _build(self, positions):
"""Embeds.
Args:
positions: An integer tensor of shape `[1, position_size]`
containing the position ids to embed.
Returns:
A `Tensor` of shape `[1, position_size, dim]`.
"""
dim = self._hparams.dim
position = tf.to_float(tf.squeeze(positions, axis=0))
position_size = tf.shape(position)[0]
num_timescales = dim // 2
min_timescale = self._hparams.min_timescale
max_timescale = self._hparams.max_timescale
log_timescale_increment = (
math.log(float(max_timescale) / float(min_timescale)) /
(tf.to_float(num_timescales) - 1))
inv_timescales = min_timescale * tf.exp(
tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
scaled_time = tf.expand_dims(position, 1) \
* tf.expand_dims(inv_timescales, 0)
signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
signal = tf.pad(signal, [[0, 0], [0, tf.mod(dim, 2)]])
signal = tf.reshape(signal, [1, position_size, dim])
return signal
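The geometric-timescale construction implemented above can be re-derived in pure Python. This is a minimal sketch for illustration (the function name `sinusoid_signal` is not part of Texar); it mirrors the docstring: a geometric sequence of `dim // 2` timescales, with `sin` and `cos` components concatenated along the feature dimension and zero-padding if `dim` is odd:

```python
import math

def sinusoid_signal(position_size, dim, min_timescale=1.0, max_timescale=1.0e4):
    """Return a [position_size][dim] list of sinusoid position signals."""
    num_timescales = dim // 2
    # Geometric sequence of timescales from min_timescale to max_timescale
    log_increment = math.log(max_timescale / min_timescale) / (num_timescales - 1)
    inv_timescales = [min_timescale * math.exp(-i * log_increment)
                      for i in range(num_timescales)]
    signal = []
    for pos in range(position_size):
        scaled = [pos * inv for inv in inv_timescales]
        # sin components first, then cos, concatenated along the dim axis
        row = [math.sin(s) for s in scaled] + [math.cos(s) for s in scaled]
        row += [0.0] * (dim % 2)  # pad with zeros if dim is odd
        signal.append(row)
    return signal
```

At position 0 every `sin` component is 0 and every `cos` component is 1, which is why adding this signal leaves a distinctive, position-dependent pattern in each channel.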
================================================
FILE: texar_repo/texar/modules/encoders/__init__.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Modules of texar library encoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
from texar.modules.encoders.encoder_base import *
from texar.modules.encoders.rnn_encoders import *
from texar.modules.encoders.hierarchical_encoders import *
from texar.modules.encoders.transformer_encoders import *
from texar.modules.encoders.multihead_attention import *
from texar.modules.encoders.conv_encoders import *
================================================
FILE: texar_repo/texar/modules/encoders/conv_encoders.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various convolutional network encoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar.modules.encoders.encoder_base import EncoderBase
from texar.modules.networks.conv_networks import Conv1DNetwork
__all__ = [
"Conv1DEncoder"
]
class Conv1DEncoder(Conv1DNetwork, EncoderBase):
"""Simple Conv-1D encoder which consists of a sequence of conv layers
followed with a sequence of dense layers.
Wraps :class:`~texar.modules.Conv1DNetwork` to be a subclass of
:class:`~texar.modules.EncoderBase`. Has exactly the same functionality
as :class:`~texar.modules.Conv1DNetwork`.
"""
def __init__(self, hparams=None): # pylint: disable=super-init-not-called
Conv1DNetwork.__init__(self, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
The same as :meth:`~texar.modules.Conv1DNetwork.default_hparams`
of :class:`~texar.modules.Conv1DNetwork`, except that the default name
is 'conv_encoder'.
"""
hparams = Conv1DNetwork.default_hparams()
hparams['name'] = 'conv_encoder'
return hparams
================================================
FILE: texar_repo/texar/modules/encoders/conv_encoders_test.py
================================================
#
"""
Unit tests for conv encoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import tensorflow as tf
import texar as tx
from texar.modules.encoders.conv_encoders import Conv1DEncoder
class Conv1DEncoderTest(tf.test.TestCase):
"""Tests :class:`~texar.modules.Conv1DEncoder` class.
"""
def test_encode(self):
"""Tests encode.
"""
encoder_1 = Conv1DEncoder()
self.assertEqual(len(encoder_1.layers), 4)
self.assertTrue(isinstance(encoder_1.layer_by_name("conv_pool_1"),
tx.core.MergeLayer))
for layer in encoder_1.layers[0].layers:
self.assertTrue(isinstance(layer, tx.core.SequentialLayer))
inputs_1 = tf.ones([64, 16, 300], tf.float32)
outputs_1 = encoder_1(inputs_1)
self.assertEqual(outputs_1.shape, [64, 128])
hparams = {
# Conv layers
"num_conv_layers": 2,
"filters": 128,
"kernel_size": [[3, 4, 5], 4],
"other_conv_kwargs": {"padding": "same"},
# Pooling layers
"pooling": "AveragePooling",
"pool_size": 2,
"pool_strides": 1,
# Dense layers
"num_dense_layers": 3,
"dense_size": [128, 128, 10],
"dense_activation": "relu",
"other_dense_kwargs": {"use_bias": False},
# Dropout
"dropout_conv": [0, 1, 2],
"dropout_dense": 2
}
encoder_2 = Conv1DEncoder(hparams)
# nlayers = nconv-pool + nconv + npool + ndense + ndropout + flatten
self.assertEqual(len(encoder_2.layers), 1+1+1+3+4+1)
self.assertTrue(isinstance(encoder_2.layer_by_name("conv_pool_1"),
tx.core.MergeLayer))
for layer in encoder_2.layers[1].layers:
self.assertTrue(isinstance(layer, tx.core.SequentialLayer))
inputs_2 = tf.ones([64, 16, 300], tf.float32)
outputs_2 = encoder_2(inputs_2)
self.assertEqual(outputs_2.shape, [64, 10])
def test_unknown_seq_length(self):
"""Tests use of pooling layer when the seq_length dimension of inputs
is `None`.
"""
encoder_1 = Conv1DEncoder()
inputs_1 = tf.placeholder(tf.float32, [64, None, 300])
outputs_1 = encoder_1(inputs_1)
self.assertEqual(outputs_1.shape, [64, 128])
hparams = {
# Conv layers
"num_conv_layers": 2,
"filters": 128,
"kernel_size": [[3, 4, 5], 4],
# Pooling layers
"pooling": "AveragePooling",
"pool_size": [2, None],
# Dense layers
"num_dense_layers": 1,
"dense_size": 10,
}
encoder = Conv1DEncoder(hparams)
# nlayers = nconv-pool + nconv + npool + ndense + ndropout + flatten
self.assertEqual(len(encoder.layers), 1+1+1+1+1+1)
self.assertTrue(isinstance(encoder.layer_by_name('pool_2'),
tx.core.AverageReducePooling1D))
inputs = tf.placeholder(tf.float32, [64, None, 300])
outputs = encoder(inputs)
self.assertEqual(outputs.shape, [64, 10])
hparams_2 = {
# Conv layers
"num_conv_layers": 1,
"filters": 128,
"kernel_size": 4,
"other_conv_kwargs": {'data_format': 'channels_first'},
# Pooling layers
"pooling": "MaxPooling",
"other_pool_kwargs": {'data_format': 'channels_first'},
# Dense layers
"num_dense_layers": 1,
"dense_size": 10,
}
encoder_2 = Conv1DEncoder(hparams_2)
inputs_2 = tf.placeholder(tf.float32, [64, 300, None])
outputs_2 = encoder_2(inputs_2)
self.assertEqual(outputs_2.shape, [64, 10])
if __name__ == "__main__":
tf.test.main()
================================================
FILE: texar_repo/texar/modules/encoders/encoder_base.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base class for encoders.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from texar.module_base import ModuleBase
__all__ = [
"EncoderBase"
]
class EncoderBase(ModuleBase):
"""Base class inherited by all encoder classes.
"""
def __init__(self, hparams=None):
ModuleBase.__init__(self, hparams)
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
"""
return {
"name": "encoder"
}
def _build(self, inputs, *args, **kwargs):
"""Encodes the inputs.
Args:
inputs: Inputs to the encoder.
*args: Other arguments.
**kwargs: Keyword arguments.
Returns:
Encoding results.
"""
raise NotImplementedError
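The contract defined by `EncoderBase` can be illustrated with a framework-free toy subclass. `IdentityEncoder` is a hypothetical stand-in (not a Texar class, and without the real `ModuleBase` machinery): a subclass extends `default_hparams()` with its own name and overrides `_build()` to map inputs to encoding results:

```python
class IdentityEncoder:
    """Toy illustration of the EncoderBase contract (not a real Texar module)."""

    @staticmethod
    def default_hparams():
        # Start from the base defaults and override the subclass-specific name,
        # the same pattern Conv1DEncoder uses below.
        hparams = {"name": "encoder"}
        hparams["name"] = "identity_encoder"
        return hparams

    def _build(self, inputs, *args, **kwargs):
        # The simplest possible "encoding": pass the inputs through unchanged.
        return inputs
```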
================================================
FILE: texar_repo/texar/modules/encoders/hierarchical_encoders.py
================================================
# Copyright 2018 The Texar Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Various encoders that encode data with hierarchical structure.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import tensorflow as tf
from tensorflow.contrib.rnn import LSTMStateTuple
from tensorflow.python.util import nest # pylint: disable=E0611
from texar.modules.encoders.encoder_base import EncoderBase
from texar.utils import utils
# pylint: disable=invalid-name, too-many-arguments, too-many-locals
__all__ = [
"HierarchicalRNNEncoder"
]
class HierarchicalRNNEncoder(EncoderBase):
"""A hierarchical encoder that stacks basic RNN encoders into two layers.
Can be used to encode long, structured sequences, e.g. paragraphs, dialog
history, etc.
Args:
encoder_major (optional): An instance of subclass of
:class:`~texar.modules.RNNEncoderBase`
The high-level encoder taking final
states from low-level encoder as its
inputs. If not specified, an encoder
is created as specified in
:attr:`hparams["encoder_major"]`.
encoder_minor (optional): An instance of subclass of
:class:`~texar.modules.RNNEncoderBase`
The low-level encoder. If not
specified, an encoder is created as specified
in :attr:`hparams["encoder_minor"]`.
hparams (dict or HParams, optional): Hyperparameters. Missing
hyperparameters will be set to default values. See
:meth:`default_hparams` for the hyperparameter structure and
default values.
See :meth:`_build` for the inputs and outputs of the encoder.
.. document private functions
.. automethod:: _build
"""
def __init__(self, encoder_major=None, encoder_minor=None,
hparams=None):
EncoderBase.__init__(self, hparams)
encoder_major_hparams = utils.get_instance_kwargs(
None, self._hparams.encoder_major_hparams)
encoder_minor_hparams = utils.get_instance_kwargs(
None, self._hparams.encoder_minor_hparams)
if encoder_major is not None:
self._encoder_major = encoder_major
else:
with tf.variable_scope(self.variable_scope.name):
with tf.variable_scope('encoder_major'):
self._encoder_major = utils.check_or_get_instance(
self._hparams.encoder_major_type,
encoder_major_hparams,
['texar.modules.encoders', 'texar.custom'])
if encoder_minor is not None:
self._encoder_minor = encoder_minor
elif self._hparams.config_share:
with tf.variable_scope(self.variable_scope.name):
with tf.variable_scope('encoder_minor'):
self._encoder_minor = utils.check_or_get_instance(
self._hparams.encoder_major_type,
encoder_major_hparams,
['texar.modules.encoders', 'texar.custom'])
else:
with tf.variable_scope(self.variable_scope.name):
with tf.variable_scope('encoder_minor'):
self._encoder_minor = utils.check_or_get_instance(
self._hparams.encoder_minor_type,
encoder_minor_hparams,
['texar.modules.encoders', 'texar.custom'])
@staticmethod
def default_hparams():
"""Returns a dictionary of hyperparameters with default values.
.. role:: python(code)
:language: python
.. code-block:: python
{
"encoder_major_type": "UnidirectionalRNNEncoder",
"encoder_major_hparams": {},
"encoder_minor_type": "UnidirectionalRNNEncoder",
"encoder_minor_hparams": {},
"config_share": False,
"name": "hierarchical_encoder_wrapper"
}
Here:
"encoder_major_type" : str or class or instance
The high-level encoder. Can be a RNN encoder class, its name or
module path, or a class instance.
Ignored if `encoder_major` is given to the encoder constructor.
"encoder_major_hparams" : dict
The hyperparameters for the high-level encoder. The high-level
encoder is created with
:python:`encoder_class(hparams=encoder_major_hparams)`.
Ignored if `encoder_major` is given to the encoder constructor,
or if "encoder_major_type" is an encoder instance.
"encoder_minor_type" : str or class or instance
The low-level encoder. Can be a RNN encoder class, its name or
module path, or a class instance.
Ignored if `encoder_minor` is given to the encoder constructor,
or if "config_share" is True.
"encoder_minor_hparams" : dict
The hyperparameters for the low-level encoder. The low-level
encoder is created with
:python:`encoder_class(hparams=encoder_minor_hparams)`.
Ignored if `encoder_minor` is given to the encoder constructor,
or if "config_share" is True,
or if "encoder_minor_type" is an encoder instance.
"config_share":
Whether to use encoder_major's hyperparameters
to construct encoder_minor.
"name":
Name of the encoder.
"""
hparams = {
"name": "hierarchical_encoder",
"encoder_major_type": "UnidirectionalRNNEncoder",
"encoder_major_hparams": {},
"encoder_minor_type": "UnidirectionalRNNEncoder",
"encoder_minor_hparams": {},
"config_share": False,
"@no_typecheck": [
'encoder_major_hparams',
'encoder_minor_hparams'
]
}
hparams.update(EncoderBase.default_hparams())
return hparams
def _build(self,
inputs,
order='btu',
medium=None,
sequence_length_major=None,
sequence_length_minor=None,
**kwargs):
"""Encodes the inputs.
Args:
inputs: A 4-D tensor of shape `[B, T, U, dim]`, where
- B: batch_size
- T: the max length of high-level sequences. E.g., the max \
number of utterances in dialog history.
- U: the max length of low-level sequences. E.g., the max \
length of each utterance in dialog history.
- dim: embedding dimension
The order of first three dimensions can be changed
according to :attr:`order`.
order: A 3-char string containing 'b', 't', and 'u',
that specifies the order of inputs dimensions above.
The following four values are accepted:
- **'btu'**: None of the encoders are time-major.
- **'utb'**: Both encoders are time-major.
- **'tbu'**: The major encoder is time-major.
- **'ubt'**: The minor encoder is time-major.
medium (optional): A list of callables that subsequently process the
final states of minor encoder and obtain the inputs
for the major encoder.
If not specified, :meth:`flatten` is used for processing
the minor's final states.
sequence_length_major (optional): The `sequence_length` argument
sent to major encoder. This is a 1-D Tensor of shape
`[B]`.
sequence_length_minor (optional): The `sequence_length` argument
sent to minor encoder. It can be either a 1-D Tensor of shape
`[B*T]`, or a 2-D Tensor of shape `[B, T]` or `[T, B]`
according to :attr:`order`.
**kwargs: Other keyword arguments for the major and minor encoders,
such as `initial_state`, etc.
Note that `sequence_length`, and `time_major`
must not be included here.
`time_major` is derived from :attr:`order` automatically.
By default, arguments will be sent to both major and minor
encoders. To specify which encoder an argument should be sent
to, add '_minor'/'_major' as its suffix.
Note that `initial_state_minor` must have a batch dimension
of size `B*T`. If you have an initial state of batch dimension
= `T`, use :meth:`tile_initial_state_minor` to tile it
according to `order`.
Returns:
A tuple `(outputs, final_state)` by the major encoder.
See the return values of the `_build()` method of the respective
encoder class for details.
"""
def _kwargs_split(kwargs):
kwargs_minor, kwargs_major = {}, {}
for k, v in kwargs.items():
if len(k) >= 6 and k[-6:] == '_minor':
kwargs_minor[k[:-6]] = v
if len(k) >= 6 and k[-6:] == '_major':
kwargs_major[k[:-6]] = v
return kwargs_minor, kwargs_major
kwargs_minor, kwargs_major = _kwargs_split(kwargs)
if sequence_length_minor is not None:
sequence_length_minor = tf.reshape(sequence_length_minor, [-1])
kwargs_minor['sequence_length'] = sequence_length_minor
kwargs_major['sequence_length'] = sequence_length_major
expand, shape = self._get_flatten_order(
order, kwargs_minor, kwargs_major, tf.shape(inputs))
inputs = tf.reshape(inputs, shape + [inputs.shape[3]])
_, states_minor = self._encoder_minor(inputs, **kwargs_minor)
self.states_minor_before_medium = states_minor
if medium is None:
states_minor = self.flatten(states_minor)
else:
if not isinstance(medium, collections.Sequence):
medium = [medium]
for fn in medium:
if isinstance(fn, str) and fn == 'flatten':
states_minor = self.flatten(states_minor)
else:
states_minor = fn(states_minor)
self.states_minor_after_medium = states_minor
states_minor = tf.reshape(
states_minor, tf.concat([expand, tf.shape(states_minor)[1:]], 0))
outputs_major, states_major = self._encoder_major(states_minor,
**kwargs_major)
# Add trainable variables of `self._cell` which may be constructed
# externally
if not self._built:
self._add_trainable_variable(
self._encoder_minor.trainable_variables)
self._add_trainable_variable(
self._encoder_major.trainable_variables)
self._built = True
return outputs_major, states_major
@staticmethod
def tile_initial_state_minor(initial_state, order, inputs_shape):
"""Tiles an initial state to be used for encoder minor.
The batch dimension of :attr:`initial_state` must equal `T`. The
state will be copied for `B` times and used to start encoding each
low-level sequence. For example, the first utterance in each dialog
history in the batch will have the same initial state.
Args:
initial_state: Initial state with the batch dimension of size `T`.
order (str): The dimension order of inputs. Must be the same as
used in :meth:`_build`.
inputs_shape: Shape of `inputs` for :meth:`_build`. Can usually
be obtained with `tf.shape(inputs)`.
Returns:
A tiled initial state with batch dimension of size `B*T`.
"""
def _nest_tile(t, multiplier):
return nest.map_structure(lambda x: tf.tile(x, multiplier), t)
if order == 'btu':
return _nest_tile(initial_state, inputs_shape[0])
elif order == 'ubt':
return _nest_tile(initial_state, inputs_shape[1])
elif order == 'utb':
return tf.contrib.seq2seq.tile_batch(initial_state, inputs_shape[2])
elif order == 'tbu':
return tf.contrib.seq2seq.tile_batch(initial_state, inputs_shape[1])
else:
raise ValueError('Unknown order: {}'.format(order))
@staticmethod
def _get_flatten_order(order, kwargs_minor, kwargs_major, shape):
if order == 'btu':
kwargs_minor.setdefault('time_major', False)
kwargs_major.setdefault('time_major', False)
expand = shape[0:2]
shape = [shape[0] * shape[1], shape[2]]
elif order == 'utb':
kwargs_minor.setdefault('time_major', True)
kwargs_major.setdefault('time_major', True)
expand = shape[1:3]
shape = [shape[0], shape[1] * shape[2]]
elif order == 'tbu':
kwargs_minor.setdefault('time_major', False)
kwargs_major.setdefault('time_major', True)
expand = shape[0:2]
shape = [shape[0] * shape[1], shape[2]]
elif order == 'ubt':
kwargs_minor.setdefault('time_major', True)
kwargs_major.setdefault('time_major', False)
expand = shape[1:3]
shape = [shape[0], shape[1] * shape[2]]
else:
raise ValueError('Unknown order: {}'.format(order))
return expand, shape
@staticmethod
def flatten(x):
"""Flattens a cell state by concatenating a sequence of cell
states along the last dimension. If the cell states are
:tf_main:`LSTMStateTuple