Repository: rohithreddy024/Text-Summarizer-Pytorch
Branch: master
Commit: 5f1ae3d1926b
Files: 13
Total size: 80.8 KB

Directory structure:
gitextract_dlevmxnx/

├── README.md
├── beam_search.py
├── data/
│   ├── saved_models/
│   │   └── .gitignore
│   └── unfinished/
│       └── .gitignore
├── data_util/
│   ├── batcher.py
│   ├── config.py
│   └── data.py
├── eval.py
├── make_data_files.py
├── model.py
├── train.py
├── train_util.py
└── training_log.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Text-Summarizer-Pytorch
Combining [A Deep Reinforced Model for Abstractive Summarization](https://arxiv.org/pdf/1705.04304.pdf) and [Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/pdf/1704.04368.pdf)

## Model Description
* LSTM based Sequence-to-Sequence model for Abstractive Summarization
* Pointer mechanism for handling Out of Vocabulary (OOV) words [See et al. (2017)](https://arxiv.org/pdf/1704.04368.pdf)
* Intra-temporal and Intra-decoder attention for handling repeated words [Paulus et al. (2018)](https://arxiv.org/pdf/1705.04304.pdf)
* Self-critic policy gradient training along with MLE training [Paulus et al. (2018)](https://arxiv.org/pdf/1705.04304.pdf)

## Prerequisites
* Pytorch
* Tensorflow
* Python 2 & 3
* [rouge](https://github.com/pltrdy/rouge) 

## Data
* Download train and valid pairs (article, title) of OpenNMT provided Gigaword dataset from [here](https://github.com/harvardnlp/sent-summary)
* Copy files ```train.article.txt```, ```train.title.txt```, ```valid.article.filter.txt```and ```valid.title.filter.txt``` to ```data/unfinished``` folder
* Files are already preprcessed

## Creating ```.bin``` files and vocab file
* The model accepts data in the form of ```.bin``` files.
* To convert ```.txt``` file into ```.bin``` file and chunk them further, run (requires Python 2 & Tensorflow):
```
python make_data_files.py
```
* You will find the data in ```data/chunked``` folder and vocab file in ```data``` folder

## Training
* As suggested in [Paulus et al. (2018)](https://arxiv.org/pdf/1705.04304.pdf), first pretrain the seq-to-seq model using MLE (with Python 3):
```
python train.py --train_mle=yes --train_rl=no --mle_weight=1.0
```
* Next, find the best saved model on validation data by running (with Python 3):
```
python eval.py --task=validate --start_from=0005000.tar
```
* After finding the best model (lets say ```0100000.tar```) with high rouge-l f score, load it and run (with Python 3):
```
python train.py --train_mle=yes --train_rl=yes --mle_weight=0.25 --load_model=0100000.tar --new_lr=0.0001
```
for MLE + RL training (or)
```
python train.py --train_mle=no --train_rl=yes --mle_weight=0.0 --load_model=0100000.tar --new_lr=0.0001
```
for RL training

## Validation
* To perform validation of RL training, run (with Python 3):
```
python eval.py --task=validate --start_from=0100000.tar
```
## Testing
* After finding the best model of RL training (lets say ```0200000.tar```), evaluate it on test data & get all rouge metrics by running (with Python 3):
```
python eval.py --task=test --load_model=0200000.tar
```

## Results
* Rouge scores obtained by using best MLE trained model on test set:  
scores: {  
```'rouge-1':``` {'f': 0.4412018559893622, 'p': 0.4814799494024485, 'r': 0.4232331027817015},  
```'rouge-2':``` {'f': 0.23238981595683728, 'p': 0.2531296070596062, 'r': 0.22407861554997008},  
```'rouge-l':``` {'f': 0.40477682528278364, 'p': 0.4584684491434479, 'r': 0.40351107200202596}  
}

* Rouge scores obtained by using best MLE + RL trained model on test set:  
scores: {  
```'rouge-1':``` {'f': 0.4499047033247696, 'p': 0.4853756369556345, 'r': 0.43544461386607497},  
```'rouge-2':``` {'f': 0.24037014314625643, 'p': 0.25903387205387235, 'r': 0.23362662645146298},  
```'rouge-l':``` {'f': 0.41320241732946406, 'p': 0.4616655167980162, 'r': 0.4144419466382236}  
}

* Training log file is included in the repository

# Examples
```article:``` russia 's lower house of parliament was scheduled friday to debate an appeal to the prime minister that challenged the right of u.s.-funded radio liberty to operate in russia following its introduction of broadcasts targeting chechnya .  
```ref:``` russia 's lower house of parliament mulls challenge to radio liberty  
```dec:``` russian parliament to debate on banning radio liberty  

```article:``` continued dialogue with the democratic people 's republic of korea is important although australia 's plan to open its embassy in pyongyang has been shelved because of the crisis over the dprk 's nuclear weapons program , australian foreign minister alexander downer said on friday .  
```ref:``` dialogue with dprk important says australian foreign minister  
```dec:``` australian fm says dialogue with dprk important  

```article:``` water levels in the zambezi river are rising due to heavy rains in its catchment area , prompting zimbabwe 's civil protection unit -lrb- cpu -rrb- to issue a flood alert for people living in the zambezi valley , the herald reported on friday .  
```ref:``` floods loom in zambezi valley  
```dec:``` water levels rising in zambezi river  

```article:``` tens of thousands of people have fled samarra , about ## miles north of baghdad , in recent weeks , expecting a showdown between u.s. troops and heavily armed groups within the city , according to u.s. and iraqi sources .  
```ref:``` thousands flee samarra fearing battle  
```dec:``` tens of thousands flee samarra expecting showdown with u.s. troops  

```article:``` the #### tung blossom festival will kick off saturday with a fun-filled ceremony at the west lake resort in the northern taiwan county of miaoli , a hakka stronghold , the council of hakka affairs -lrb- cha -rrb- announced tuesday .  
```ref:``` #### tung blossom festival to kick off saturday  
```dec:``` #### tung blossom festival to kick off in miaoli  

## References
* [pytorch implementation of "Get To The Point: Summarization with Pointer-Generator Networks"](https://github.com/atulkum/pointer_summarizer)


================================================
FILE: beam_search.py
================================================
import numpy as np
import torch as T
from data_util import config, data
from train_util import get_cuda


class Beam(object):
    def __init__(self, start_id, end_id, unk_id, hidden_state, context):
        h,c = hidden_state                                              #(n_hid,)
        self.tokens = T.LongTensor(config.beam_size,1).fill_(start_id)  #(beam, t) after t time steps
        self.scores = T.FloatTensor(config.beam_size,1).fill_(-30)      #beam,1; Initial score of beams = -30
        self.tokens, self.scores = get_cuda(self.tokens), get_cuda(self.scores)
        self.scores[0][0] = 0                                           #At time step t=0, all beams should extend from a single beam. So, I am giving high initial score to 1st beam
        self.hid_h = h.unsqueeze(0).repeat(config.beam_size, 1)         #beam, n_hid
        self.hid_c = c.unsqueeze(0).repeat(config.beam_size, 1)         #beam, n_hid
        self.context = context.unsqueeze(0).repeat(config.beam_size, 1) #beam, 2*n_hid
        self.sum_temporal_srcs = None
        self.prev_s = None
        self.done = False
        self.end_id = end_id
        self.unk_id = unk_id

    def get_current_state(self):
        tokens = self.tokens[:,-1].clone()
        for i in range(len(tokens)):
            if tokens[i].item() >= config.vocab_size:
                tokens[i] = self.unk_id
        return tokens


    def advance(self, prob_dist, hidden_state, context, sum_temporal_srcs, prev_s):
        '''Perform beam search: Considering the probabilites of given n_beam x n_extended_vocab words, select first n_beam words that give high total scores
        :param prob_dist: (beam, n_extended_vocab)
        :param hidden_state: Tuple of (beam, n_hid) tensors
        :param context:   (beam, 2*n_hidden)
        :param sum_temporal_srcs:   (beam, n_seq)
        :param prev_s:  (beam, t, n_hid)
        '''
        n_extended_vocab = prob_dist.size(1)
        h, c = hidden_state
        log_probs = T.log(prob_dist+config.eps)                         #beam, n_extended_vocab

        scores = log_probs + self.scores                                #beam, n_extended_vocab
        scores = scores.view(-1,1)                                      #beam*n_extended_vocab, 1
        best_scores, best_scores_id = T.topk(input=scores, k=config.beam_size, dim=0)   #will be sorted in descending order of scores
        self.scores = best_scores                                       #(beam,1); sorted
        beams_order = best_scores_id.squeeze(1)/n_extended_vocab        #(beam,); sorted
        best_words = best_scores_id%n_extended_vocab                    #(beam,1); sorted
        self.hid_h = h[beams_order]                                     #(beam, n_hid); sorted
        self.hid_c = c[beams_order]                                     #(beam, n_hid); sorted
        self.context = context[beams_order]
        if sum_temporal_srcs is not None:
            self.sum_temporal_srcs = sum_temporal_srcs[beams_order]     #(beam, n_seq); sorted
        if prev_s is not None:
            self.prev_s = prev_s[beams_order]                           #(beam, t, n_hid); sorted
        self.tokens = self.tokens[beams_order]                          #(beam, t); sorted
        self.tokens = T.cat([self.tokens, best_words], dim=1)           #(beam, t+1); sorted

        #End condition is when top-of-beam is EOS.
        if best_words[0][0] == self.end_id:
            self.done = True

    def get_best(self):
        best_token = self.tokens[0].cpu().numpy().tolist()              #Since beams are always in sorted (descending) order, 1st beam is the best beam
        try:
            end_idx = best_token.index(self.end_id)
        except ValueError:
            end_idx = len(best_token)
        best_token = best_token[1:end_idx]
        return best_token

    def get_all(self):
        all_tokens = []
        for i in range(len(self.tokens)):
            all_tokens.append(self.tokens[i].cpu().numpy())
        return all_tokens


def beam_search(enc_hid, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, model, start_id, end_id, unk_id):

    batch_size = len(enc_hid[0])
    beam_idx = T.LongTensor(list(range(batch_size)))
    beams = [Beam(start_id, end_id, unk_id, (enc_hid[0][i], enc_hid[1][i]), ct_e[i]) for i in range(batch_size)]   #For each example in batch, create Beam object
    n_rem = batch_size                                                  #Index of beams that are active, i.e: didn't generate [STOP] yet
    sum_temporal_srcs = None                                            #Number of examples in batch that didn't generate [STOP] yet
    prev_s = None

    for t in range(config.max_dec_steps):
        x_t = T.stack(
            [beam.get_current_state() for beam in beams if beam.done == False]      #remaining(rem),beam
        ).contiguous().view(-1)                                                     #(rem*beam,)
        x_t = model.embeds(x_t)                                                 #rem*beam, n_emb

        dec_h = T.stack(
            [beam.hid_h for beam in beams if beam.done == False]                    #rem*beam,n_hid
        ).contiguous().view(-1,config.hidden_dim)
        dec_c = T.stack(
            [beam.hid_c for beam in beams if beam.done == False]                    #rem,beam,n_hid
        ).contiguous().view(-1,config.hidden_dim)                                   #rem*beam,n_hid

        ct_e = T.stack(
            [beam.context for beam in beams if beam.done == False]                  #rem,beam,n_hid
        ).contiguous().view(-1,2*config.hidden_dim)                                 #rem,beam,n_hid

        if sum_temporal_srcs is not None:
            sum_temporal_srcs = T.stack(
                [beam.sum_temporal_srcs for beam in beams if beam.done == False]
            ).contiguous().view(-1, enc_out.size(1))                                #rem*beam, n_seq

        if prev_s is not None:
            prev_s = T.stack(
                [beam.prev_s for beam in beams if beam.done == False]
            ).contiguous().view(-1, t, config.hidden_dim)                           #rem*beam, t-1, n_hid


        s_t = (dec_h, dec_c)
        enc_out_beam = enc_out[beam_idx].view(n_rem,-1).repeat(1, config.beam_size).view(-1, enc_out.size(1), enc_out.size(2))
        enc_pad_mask_beam = enc_padding_mask[beam_idx].repeat(1, config.beam_size).view(-1, enc_padding_mask.size(1))

        extra_zeros_beam = None
        if extra_zeros is not None:
            extra_zeros_beam = extra_zeros[beam_idx].repeat(1, config.beam_size).view(-1, extra_zeros.size(1))
        enc_extend_vocab_beam = enc_batch_extend_vocab[beam_idx].repeat(1, config.beam_size).view(-1, enc_batch_extend_vocab.size(1))

        final_dist, (dec_h, dec_c), ct_e, sum_temporal_srcs, prev_s = model.decoder(x_t, s_t, enc_out_beam, enc_pad_mask_beam, ct_e, extra_zeros_beam, enc_extend_vocab_beam, sum_temporal_srcs, prev_s)              #final_dist: rem*beam, n_extended_vocab

        final_dist = final_dist.view(n_rem, config.beam_size, -1)                   #final_dist: rem, beam, n_extended_vocab
        dec_h = dec_h.view(n_rem, config.beam_size, -1)                             #rem, beam, n_hid
        dec_c = dec_c.view(n_rem, config.beam_size, -1)                             #rem, beam, n_hid
        ct_e = ct_e.view(n_rem, config.beam_size, -1)                             #rem, beam, 2*n_hid

        if sum_temporal_srcs is not None:
            sum_temporal_srcs = sum_temporal_srcs.view(n_rem, config.beam_size, -1) #rem, beam, n_seq

        if prev_s is not None:
            prev_s = prev_s.view(n_rem, config.beam_size, -1, config.hidden_dim)    #rem, beam, t

        # For all the active beams, perform beam search
        active = []         #indices of active beams after beam search

        for i in range(n_rem):
            b = beam_idx[i].item()
            beam = beams[b]
            if beam.done:
                continue

            sum_temporal_srcs_i = prev_s_i = None
            if sum_temporal_srcs is not None:
                sum_temporal_srcs_i = sum_temporal_srcs[i]                              #beam, n_seq
            if prev_s is not None:
                prev_s_i = prev_s[i]                                                #beam, t, n_hid
            beam.advance(final_dist[i], (dec_h[i], dec_c[i]), ct_e[i], sum_temporal_srcs_i, prev_s_i)
            if beam.done == False:
                active.append(b)

        if len(active) == 0:
            break

        beam_idx = T.LongTensor(active)
        n_rem = len(beam_idx)

    predicted_words = []
    for beam in beams:
        predicted_words.append(beam.get_best())

    return predicted_words


================================================
FILE: data/saved_models/.gitignore
================================================


================================================
FILE: data/unfinished/.gitignore
================================================


================================================
FILE: data_util/batcher.py
================================================
#Most of this file is copied form https://github.com/abisee/pointer-generator/blob/master/batcher.py

import queue as Queue
import time
from random import shuffle
from threading import Thread

import numpy as np
import tensorflow as tf

from . import config
from . import data

import random
random.seed(1234)


class Example(object):

  def __init__(self, article, abstract_sentences, vocab):
    # Get ids of special tokens
    start_decoding = vocab.word2id(data.START_DECODING)
    stop_decoding = vocab.word2id(data.STOP_DECODING)

    # Process the article
    article_words = article.split()
    if len(article_words) > config.max_enc_steps:
      article_words = article_words[:config.max_enc_steps]
    self.enc_len = len(article_words) # store the length after truncation but before padding
    self.enc_input = [vocab.word2id(w) for w in article_words] # list of word ids; OOVs are represented by the id for UNK token

    # Process the abstract
    abstract = ' '.join(abstract_sentences) # string
    abstract_words = abstract.split() # list of strings
    abs_ids = [vocab.word2id(w) for w in abstract_words] # list of word ids; OOVs are represented by the id for UNK token

    # Get the decoder input sequence and target sequence
    self.dec_input, _ = self.get_dec_inp_targ_seqs(abs_ids, config.max_dec_steps, start_decoding, stop_decoding)
    self.dec_len = len(self.dec_input)

    # If using pointer-generator mode, we need to store some extra info
    # Store a version of the enc_input where in-article OOVs are represented by their temporary OOV id; also store the in-article OOVs words themselves
    self.enc_input_extend_vocab, self.article_oovs = data.article2ids(article_words, vocab)

    # Get a verison of the reference summary where in-article OOVs are represented by their temporary article OOV id
    abs_ids_extend_vocab = data.abstract2ids(abstract_words, vocab, self.article_oovs)

    # Get decoder target sequence
    _, self.target = self.get_dec_inp_targ_seqs(abs_ids_extend_vocab, config.max_dec_steps, start_decoding, stop_decoding)

    # Store the original strings
    self.original_article = article
    self.original_abstract = abstract
    self.original_abstract_sents = abstract_sentences


  def get_dec_inp_targ_seqs(self, sequence, max_len, start_id, stop_id):
    inp = [start_id] + sequence[:]
    target = sequence[:]
    if len(inp) > max_len: # truncate
      inp = inp[:max_len]
      target = target[:max_len] # no end_token
    else: # no truncation
      target.append(stop_id) # end token
    assert len(inp) == len(target)
    return inp, target


  def pad_decoder_inp_targ(self, max_len, pad_id):
    while len(self.dec_input) < max_len:
      self.dec_input.append(pad_id)
    while len(self.target) < max_len:
      self.target.append(pad_id)


  def pad_encoder_input(self, max_len, pad_id):
    while len(self.enc_input) < max_len:
      self.enc_input.append(pad_id)
    while len(self.enc_input_extend_vocab) < max_len:
      self.enc_input_extend_vocab.append(pad_id)


class Batch(object):
  def __init__(self, example_list, vocab, batch_size):
    self.batch_size = batch_size
    self.pad_id = vocab.word2id(data.PAD_TOKEN) # id of the PAD token used to pad sequences
    self.init_encoder_seq(example_list) # initialize the input to the encoder
    self.init_decoder_seq(example_list) # initialize the input and targets for the decoder
    self.store_orig_strings(example_list) # store the original strings


  def init_encoder_seq(self, example_list):
    # Determine the maximum length of the encoder input sequence in this batch
    max_enc_seq_len = max([ex.enc_len for ex in example_list])

    # Pad the encoder input sequences up to the length of the longest sequence
    for ex in example_list:
      ex.pad_encoder_input(max_enc_seq_len, self.pad_id)

    # Initialize the numpy arrays
    # Note: our enc_batch can have different length (second dimension) for each batch because we use dynamic_rnn for the encoder.
    self.enc_batch = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.int32)
    self.enc_lens = np.zeros((self.batch_size), dtype=np.int32)
    self.enc_padding_mask = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.float32)

    # Fill in the numpy arrays
    for i, ex in enumerate(example_list):
      self.enc_batch[i, :] = ex.enc_input[:]
      self.enc_lens[i] = ex.enc_len
      for j in range(ex.enc_len):
        self.enc_padding_mask[i][j] = 1

    # For pointer-generator mode, need to store some extra info
    # Determine the max number of in-article OOVs in this batch
    self.max_art_oovs = max([len(ex.article_oovs) for ex in example_list])
    # Store the in-article OOVs themselves
    self.art_oovs = [ex.article_oovs for ex in example_list]
    # Store the version of the enc_batch that uses the article OOV ids
    self.enc_batch_extend_vocab = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.int32)
    for i, ex in enumerate(example_list):
      self.enc_batch_extend_vocab[i, :] = ex.enc_input_extend_vocab[:]

  def init_decoder_seq(self, example_list):
    # Pad the inputs and targets
    for ex in example_list:
      ex.pad_decoder_inp_targ(config.max_dec_steps, self.pad_id)

    # Initialize the numpy arrays.
    self.dec_batch = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.int32)
    self.target_batch = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.int32)
    # self.dec_padding_mask = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.float32)
    self.dec_lens = np.zeros((self.batch_size), dtype=np.int32)

    # Fill in the numpy arrays
    for i, ex in enumerate(example_list):
      self.dec_batch[i, :] = ex.dec_input[:]
      self.target_batch[i, :] = ex.target[:]
      self.dec_lens[i] = ex.dec_len
      # for j in range(ex.dec_len):
      #   self.dec_padding_mask[i][j] = 1

  def store_orig_strings(self, example_list):
    self.original_articles = [ex.original_article for ex in example_list] # list of lists
    self.original_abstracts = [ex.original_abstract for ex in example_list] # list of lists
    self.original_abstracts_sents = [ex.original_abstract_sents for ex in example_list] # list of list of lists


class Batcher(object):
  BATCH_QUEUE_MAX = 1000 # max number of batches the batch_queue can hold

  def __init__(self, data_path, vocab, mode, batch_size, single_pass):
    self._data_path = data_path
    self._vocab = vocab
    self._single_pass = single_pass
    self.mode = mode
    self.batch_size = batch_size
    # Initialize a queue of Batches waiting to be used, and a queue of Examples waiting to be batched
    self._batch_queue = Queue.Queue(self.BATCH_QUEUE_MAX)
    self._example_queue = Queue.Queue(self.BATCH_QUEUE_MAX * self.batch_size)

    # Different settings depending on whether we're in single_pass mode or not
    if single_pass:
      self._num_example_q_threads = 1 # just one thread, so we read through the dataset just once
      self._num_batch_q_threads = 1  # just one thread to batch examples
      self._bucketing_cache_size = 1 # only load one batch's worth of examples before bucketing; this essentially means no bucketing
      self._finished_reading = False # this will tell us when we're finished reading the dataset
    else:
      self._num_example_q_threads = 1 #16 # num threads to fill example queue
      self._num_batch_q_threads = 1 #4  # num threads to fill batch queue
      self._bucketing_cache_size = 1 #100 # how many batches-worth of examples to load into cache before bucketing

    # Start the threads that load the queues
    self._example_q_threads = []
    for _ in range(self._num_example_q_threads):
      self._example_q_threads.append(Thread(target=self.fill_example_queue))
      self._example_q_threads[-1].daemon = True
      self._example_q_threads[-1].start()
    self._batch_q_threads = []
    for _ in range(self._num_batch_q_threads):
      self._batch_q_threads.append(Thread(target=self.fill_batch_queue))
      self._batch_q_threads[-1].daemon = True
      self._batch_q_threads[-1].start()

    # Start a thread that watches the other threads and restarts them if they're dead
    if not single_pass: # We don't want a watcher in single_pass mode because the threads shouldn't run forever
      self._watch_thread = Thread(target=self.watch_threads)
      self._watch_thread.daemon = True
      self._watch_thread.start()

  def next_batch(self):
    # If the batch queue is empty, print a warning
    if self._batch_queue.qsize() == 0:
      # tf.logging.warning('Bucket input queue is empty when calling next_batch. Bucket queue size: %i, Input queue size: %i', self._batch_queue.qsize(), self._example_queue.qsize())
      if self._single_pass and self._finished_reading:
        tf.logging.info("Finished reading dataset in single_pass mode.")
        return None

    batch = self._batch_queue.get() # get the next Batch
    return batch

  def fill_example_queue(self):
    input_gen = self.text_generator(data.example_generator(self._data_path, self._single_pass))

    while True:
      try:
        (article, abstract) = next(input_gen) # read the next example from file. article and abstract are both strings.
      except StopIteration: # if there are no more examples:
        tf.logging.info("The example generator for this example queue filling thread has exhausted data.")
        if self._single_pass:
          tf.logging.info("single_pass mode is on, so we've finished reading dataset. This thread is stopping.")
          self._finished_reading = True
          break
        else:
          raise Exception("single_pass mode is off but the example generator is out of data; error.")

      # abstract_sentences = [sent.strip() for sent in data.abstract2sents(abstract)] # Use the <s> and </s> tags in abstract to get a list of sentences.
      abstract_sentences = [abstract.strip()]
      example = Example(article, abstract_sentences, self._vocab) # Process into an Example.
      self._example_queue.put(example) # place the Example in the example queue.

  def fill_batch_queue(self):
    while True:
      if self.mode == 'decode':
        # beam search decode mode single example repeated in the batch
        ex = self._example_queue.get()
        b = [ex for _ in range(self.batch_size)]
        self._batch_queue.put(Batch(b, self._vocab, self.batch_size))
      else:
        # Get bucketing_cache_size-many batches of Examples into a list, then sort
        inputs = []
        for _ in range(self.batch_size * self._bucketing_cache_size):
          inputs.append(self._example_queue.get())
        inputs = sorted(inputs, key=lambda inp: inp.enc_len, reverse=True) # sort by length of encoder sequence

        # Group the sorted Examples into batches, optionally shuffle the batches, and place in the batch queue.
        batches = []
        for i in range(0, len(inputs), self.batch_size):
          batches.append(inputs[i:i + self.batch_size])
        if not self._single_pass:
          shuffle(batches)
        for b in batches:  # each b is a list of Example objects
          self._batch_queue.put(Batch(b, self._vocab, self.batch_size))

  def watch_threads(self):
    while True:
      tf.logging.info(
        'Bucket queue size: %i, Input queue size: %i',
        self._batch_queue.qsize(), self._example_queue.qsize())

      time.sleep(60)
      for idx,t in enumerate(self._example_q_threads):
        if not t.is_alive(): # if the thread is dead
          tf.logging.error('Found example queue thread dead. Restarting.')
          new_t = Thread(target=self.fill_example_queue)
          self._example_q_threads[idx] = new_t
          new_t.daemon = True
          new_t.start()
      for idx,t in enumerate(self._batch_q_threads):
        if not t.is_alive(): # if the thread is dead
          tf.logging.error('Found batch queue thread dead. Restarting.')
          new_t = Thread(target=self.fill_batch_queue)
          self._batch_q_threads[idx] = new_t
          new_t.daemon = True
          new_t.start()


  def text_generator(self, example_generator):
    while True:
      e = next(example_generator) # e is a tf.Example
      try:
        article_text = e.features.feature['article'].bytes_list.value[0] # the article text was saved under the key 'article' in the data files
        abstract_text = e.features.feature['abstract'].bytes_list.value[0] # the abstract text was saved under the key 'abstract' in the data files
        article_text = article_text.decode()
        abstract_text = abstract_text.decode()
      except ValueError:
        tf.logging.error('Failed to get article or abstract from example')
        continue
      if len(article_text)==0: # See https://github.com/abisee/pointer-generator/issues/1
        #tf.logging.warning('Found an example with empty article text. Skipping it.')
        continue
      else:
        yield (article_text, abstract_text)


================================================
FILE: data_util/config.py
================================================
train_data_path = 	"data/chunked/train/train_*"
valid_data_path = 	"data/chunked/valid/valid_*"
test_data_path = 	"data/chunked/test/test_*"
vocab_path = 		"data/vocab"


# Hyperparameters
hidden_dim = 512
emb_dim = 256
batch_size = 200
max_enc_steps = 55		#99% of the articles are within length 55
max_dec_steps = 15		#99% of the titles are within length 15
beam_size = 4
min_dec_steps= 3
vocab_size = 50000

lr = 0.001
rand_unif_init_mag = 0.02
trunc_norm_init_std = 1e-4

eps = 1e-12
max_iterations = 500000


save_model_path = "data/saved_models"

intra_encoder = True
intra_decoder = True

================================================
FILE: data_util/data.py
================================================
#Most of this file is copied form https://github.com/abisee/pointer-generator/blob/master/data.py

import glob
import random
import struct
import csv
from tensorflow.core.example import example_pb2

# <s> and </s> are used in the data files to segment the abstracts into sentences. They don't receive vocab ids.
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'

PAD_TOKEN = '[PAD]' # This has a vocab id, which is used to pad the encoder input, decoder input and target sequence
UNKNOWN_TOKEN = '[UNK]' # This has a vocab id, which is used to represent out-of-vocabulary words
START_DECODING = '[START]' # This has a vocab id, which is used at the start of every decoder input sequence
STOP_DECODING = '[STOP]' # This has a vocab id, which is used at the end of untruncated target sequences

# Note: none of <s>, </s>, [PAD], [UNK], [START], [STOP] should appear in the vocab file.


class Vocab(object):

  def __init__(self, vocab_file, max_size):
    self._word_to_id = {}
    self._id_to_word = {}
    self._count = 0 # keeps track of total number of words in the Vocab

    # [UNK], [PAD], [START] and [STOP] get the ids 0,1,2,3.
    for w in [UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]:
      self._word_to_id[w] = self._count
      self._id_to_word[self._count] = w
      self._count += 1

    # Read the vocab file and add words up to max_size
    with open(vocab_file, 'r') as vocab_f:
      for line in vocab_f:
        pieces = line.split()
        if len(pieces) != 2:
          # print ('Warning: incorrectly formatted line in vocabulary file: %s\n' % line)
          continue
        w = pieces[0]
        if w in [SENTENCE_START, SENTENCE_END, UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]:
          raise Exception('<s>, </s>, [UNK], [PAD], [START] and [STOP] shouldn\'t be in the vocab file, but %s is' % w)
        if w in self._word_to_id:
          raise Exception('Duplicated word in vocabulary file: %s' % w)
        self._word_to_id[w] = self._count
        self._id_to_word[self._count] = w
        self._count += 1
        if max_size != 0 and self._count >= max_size:
          # print ("max_size of vocab was specified as %i; we now have %i words. Stopping reading." % (max_size, self._count))
          break

    # print ("Finished constructing vocabulary of %i total words. Last word added: %s" % (self._count, self._id_to_word[self._count-1]))

  def word2id(self, word):
    if word not in self._word_to_id:
      return self._word_to_id[UNKNOWN_TOKEN]
    return self._word_to_id[word]

  def id2word(self, word_id):
    if word_id not in self._id_to_word:
      raise ValueError('Id not found in vocab: %d' % word_id)
    return self._id_to_word[word_id]

  def size(self):
    return self._count

  def write_metadata(self, fpath):
    print ("Writing word embedding metadata file to %s..." % (fpath))
    with open(fpath, "w") as f:
      fieldnames = ['word']
      writer = csv.DictWriter(f, delimiter="\t", fieldnames=fieldnames)
      for i in xrange(self.size()):
        writer.writerow({"word": self._id_to_word[i]})


def example_generator(data_path, single_pass):
  while True:
    filelist = glob.glob(data_path) # get the list of datafiles
    assert filelist, ('Error: Empty filelist at %s' % data_path) # check filelist isn't empty
    if single_pass:
      filelist = sorted(filelist)
    else:
      random.shuffle(filelist)
    for f in filelist:
      reader = open(f, 'rb')
      while True:
        len_bytes = reader.read(8)
        if not len_bytes: break # finished reading this file
        str_len = struct.unpack('q', len_bytes)[0]
        example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
        yield example_pb2.Example.FromString(example_str)
    if single_pass:
      # print ("example_generator completed reading all datafiles. No more data.")
      break


def article2ids(article_words, vocab):
  ids = []
  oovs = []
  unk_id = vocab.word2id(UNKNOWN_TOKEN)
  for w in article_words:
    i = vocab.word2id(w)
    if i == unk_id: # If w is OOV
      if w not in oovs: # Add to list of OOVs
        oovs.append(w)
      oov_num = oovs.index(w) # This is 0 for the first article OOV, 1 for the second article OOV...
      ids.append(vocab.size() + oov_num) # This is e.g. 50000 for the first article OOV, 50001 for the second...
    else:
      ids.append(i)
  return ids, oovs


def abstract2ids(abstract_words, vocab, article_oovs):
  ids = []
  unk_id = vocab.word2id(UNKNOWN_TOKEN)
  for w in abstract_words:
    i = vocab.word2id(w)
    if i == unk_id: # If w is an OOV word
      if w in article_oovs: # If w is an in-article OOV
        vocab_idx = vocab.size() + article_oovs.index(w) # Map to its temporary article OOV number
        ids.append(vocab_idx)
      else: # If w is an out-of-article OOV
        ids.append(unk_id) # Map to the UNK token id
    else:
      ids.append(i)
  return ids


def outputids2words(id_list, vocab, article_oovs):
  words = []
  for i in id_list:
    try:
      w = vocab.id2word(i) # might be [UNK]
    except ValueError as e: # w is OOV
      assert article_oovs is not None, "Error: model produced a word ID that isn't in the vocabulary. This should not happen in baseline (no pointer-generator) mode"
      article_oov_idx = i - vocab.size()
      try:
        w = article_oovs[article_oov_idx]
      except ValueError as e: # i doesn't correspond to an article oov
        raise ValueError('Error: model produced word ID %i which corresponds to article OOV %i but this example only has %i article OOVs' % (i, article_oov_idx, len(article_oovs)))
    words.append(w)
  return words


def abstract2sents(abstract):
  cur = 0
  sents = []
  while True:
    try:
      start_p = abstract.index(SENTENCE_START, cur)
      end_p = abstract.index(SENTENCE_END, start_p + 1)
      cur = end_p + len(SENTENCE_END)
      sents.append(abstract[start_p+len(SENTENCE_START):end_p])
    except ValueError as e: # no more sentences
      return sents


def show_art_oovs(article, vocab):
  unk_token = vocab.word2id(UNKNOWN_TOKEN)
  words = article.split(' ')
  words = [("__%s__" % w) if vocab.word2id(w)==unk_token else w for w in words]
  out_str = ' '.join(words)
  return out_str


def show_abs_oovs(abstract, vocab, article_oovs):
  unk_token = vocab.word2id(UNKNOWN_TOKEN)
  words = abstract.split(' ')
  new_words = []
  for w in words:
    if vocab.word2id(w) == unk_token: # w is oov
      if article_oovs is None: # baseline mode
        new_words.append("__%s__" % w)
      else: # pointer-generator mode
        if w in article_oovs:
          new_words.append("__%s__" % w)
        else:
          new_words.append("!!__%s__!!" % w)
    else: # w is in-vocab word
      new_words.append(w)
  out_str = ' '.join(new_words)
  return out_str


================================================
FILE: eval.py
================================================
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import time

import torch as T
import torch.nn as nn
import torch.nn.functional as F
from model import Model

from data_util import config, data
from data_util.batcher import Batcher
from data_util.data import Vocab
from train_util import *
from beam_search import *
from rouge import Rouge
import argparse

def get_cuda(tensor):
    if T.cuda.is_available():
        tensor = tensor.cuda()
    return tensor

class Evaluate(object):
    def __init__(self, data_path, opt, batch_size = config.batch_size):
        self.vocab = Vocab(config.vocab_path, config.vocab_size)
        self.batcher = Batcher(data_path, self.vocab, mode='eval',
                               batch_size=batch_size, single_pass=True)
        self.opt = opt
        time.sleep(5)

    def setup_valid(self):
        self.model = Model()
        self.model = get_cuda(self.model)
        checkpoint = T.load(os.path.join(config.save_model_path, self.opt.load_model))
        self.model.load_state_dict(checkpoint["model_dict"])


    def print_original_predicted(self, decoded_sents, ref_sents, article_sents, loadfile):
        filename = "test_"+loadfile.split(".")[0]+".txt"
    
        with open(os.path.join("data",filename), "w") as f:
            for i in range(len(decoded_sents)):
                f.write("article: "+article_sents[i] + "\n")
                f.write("ref: " + ref_sents[i] + "\n")
                f.write("dec: " + decoded_sents[i] + "\n\n")

    def evaluate_batch(self, print_sents = False):

        self.setup_valid()
        batch = self.batcher.next_batch()
        start_id = self.vocab.word2id(data.START_DECODING)
        end_id = self.vocab.word2id(data.STOP_DECODING)
        unk_id = self.vocab.word2id(data.UNKNOWN_TOKEN)
        decoded_sents = []
        ref_sents = []
        article_sents = []
        rouge = Rouge()
        while batch is not None:
            enc_batch, enc_lens, enc_padding_mask, enc_batch_extend_vocab, extra_zeros, ct_e = get_enc_data(batch)

            with T.autograd.no_grad():
                enc_batch = self.model.embeds(enc_batch)
                enc_out, enc_hidden = self.model.encoder(enc_batch, enc_lens)

            #-----------------------Summarization----------------------------------------------------
            with T.autograd.no_grad():
                pred_ids = beam_search(enc_hidden, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, self.model, start_id, end_id, unk_id)

            for i in range(len(pred_ids)):
                decoded_words = data.outputids2words(pred_ids[i], self.vocab, batch.art_oovs[i])
                if len(decoded_words) < 2:
                    decoded_words = "xxx"
                else:
                    decoded_words = " ".join(decoded_words)
                decoded_sents.append(decoded_words)
                abstract = batch.original_abstracts[i]
                article = batch.original_articles[i]
                ref_sents.append(abstract)
                article_sents.append(article)

            batch = self.batcher.next_batch()

        load_file = self.opt.load_model

        if print_sents:
            self.print_original_predicted(decoded_sents, ref_sents, article_sents, load_file)

        scores = rouge.get_scores(decoded_sents, ref_sents, avg = True)
        if self.opt.task == "test":
            print(load_file, "scores:", scores)
        else:
            rouge_l = scores["rouge-l"]["f"]
            print(load_file, "rouge_l:", "%.4f" % rouge_l)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", type=str, default="validate", choices=["validate","test"])
    parser.add_argument("--start_from", type=str, default="0020000.tar")
    parser.add_argument("--load_model", type=str, default=None)
    opt = parser.parse_args()

    if opt.task == "validate":
        saved_models = os.listdir(config.save_model_path)
        saved_models.sort()
        file_idx = saved_models.index(opt.start_from)
        saved_models = saved_models[file_idx:]
        for f in saved_models:
            opt.load_model = f
            eval_processor = Evaluate(config.valid_data_path, opt)
            eval_processor.evaluate_batch()
    else:   #test
        eval_processor = Evaluate(config.test_data_path, opt)
        eval_processor.evaluate_batch()


================================================
FILE: make_data_files.py
================================================
import os
import shutil
import collections
import tqdm
from tensorflow.core.example import example_pb2
import struct
import random
import shutil

finished_path = "data/finished"
unfinished_path = "data/unfinished"
chunk_path = "data/chunked"

vocab_path = "data/vocab"
VOCAB_SIZE = 200000

CHUNK_SIZE = 15000 # num examples per chunk, for the chunked data
train_bin_path = os.path.join(finished_path, "train.bin")
valid_bin_path = os.path.join(finished_path, "valid.bin")

def make_folder(folder_path):
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)

def delete_folder(folder_path):
    if os.path.exists(folder_path):
        shutil.rmtree(folder_path)

def shuffle_text_data(unshuffled_art, unshuffled_abs, shuffled_art, shuffled_abs):
    article_itr = open(os.path.join(unfinished_path, unshuffled_art), "r")
    abstract_itr = open(os.path.join(unfinished_path, unshuffled_abs), "r")
    list_of_pairs = []
    for article in article_itr:
        article = article.strip()
        abstract = next(abstract_itr).strip()
        list_of_pairs.append((article, abstract))
    article_itr.close()
    abstract_itr.close()
    random.shuffle(list_of_pairs)
    article_itr = open(os.path.join(unfinished_path, shuffled_art), "w")
    abstract_itr = open(os.path.join(unfinished_path, shuffled_abs), "w")
    for pair in list_of_pairs:
        article_itr.write(pair[0]+"\n")
        abstract_itr.write(pair[1]+"\n")
    article_itr.close()
    abstract_itr.close()

def write_to_bin(article_path, abstract_path, out_file, vocab_counter = None):

    with open(out_file, 'wb') as writer:

        article_itr = open(article_path, 'r')
        abstract_itr = open(abstract_path, 'r')
        for article in tqdm.tqdm(article_itr):
            article = article.strip()
            abstract = next(abstract_itr).strip()

            tf_example = example_pb2.Example()
            tf_example.features.feature['article'].bytes_list.value.extend([article])
            tf_example.features.feature['abstract'].bytes_list.value.extend([abstract])
            tf_example_str = tf_example.SerializeToString()
            str_len = len(tf_example_str)
            writer.write(struct.pack('q', str_len))
            writer.write(struct.pack('%ds' % str_len, tf_example_str))

            if vocab_counter is not None:
                art_tokens = article.split(' ')
                abs_tokens = abstract.split(' ')
                # abs_tokens = [t for t in abs_tokens if
                #               t not in [SENTENCE_START, SENTENCE_END]]  # remove these tags from vocab
                tokens = art_tokens + abs_tokens
                tokens = [t.strip() for t in tokens]  # strip
                tokens = [t for t in tokens if t != ""]  # remove empty
                vocab_counter.update(tokens)

    if vocab_counter is not None:
        with open(vocab_path, 'w') as writer:
            for word, count in vocab_counter.most_common(VOCAB_SIZE):
                writer.write(word + ' ' + str(count) + '\n')


def creating_finished_data():
    make_folder(finished_path)

    vocab_counter = collections.Counter()

    write_to_bin(os.path.join(unfinished_path, "train.art.shuf.txt"), os.path.join(unfinished_path, "train.abs.shuf.txt"), train_bin_path, vocab_counter)
    write_to_bin(os.path.join(unfinished_path, "valid.art.shuf.txt"), os.path.join(unfinished_path, "valid.abs.shuf.txt"), valid_bin_path)


def chunk_file(set_name, chunks_dir, bin_file):
    make_folder(chunks_dir)
    reader = open(bin_file, "rb")
    chunk = 0
    finished = False
    while not finished:
        chunk_fname = os.path.join(chunks_dir, '%s_%04d.bin' % (set_name, chunk)) # new chunk
        with open(chunk_fname, 'wb') as writer:
            for _ in range(CHUNK_SIZE):
                len_bytes = reader.read(8)
                if not len_bytes:
                    finished = True
                    break
                str_len = struct.unpack('q', len_bytes)[0]
                example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                writer.write(struct.pack('q', str_len))
                writer.write(struct.pack('%ds' % str_len, example_str))
            chunk += 1


if __name__ == "__main__":
    shuffle_text_data("train.article.txt", "train.title.txt", "train.art.shuf.txt", "train.abs.shuf.txt")
    shuffle_text_data("valid.article.filter.txt", "valid.title.filter.txt", "valid.art.shuf.txt", "valid.abs.shuf.txt")
    print("Completed shuffling train & valid text files")
    delete_folder(finished_path)
    creating_finished_data()        #create bin files
    print("Completed creating bin file for train & valid")
    delete_folder(chunk_path)
    chunk_file("train", os.path.join(chunk_path, "train"), train_bin_path)
    chunk_file("valid", os.path.join(chunk_path, "main_valid"), valid_bin_path)
    print("Completed chunking main bin files into smaller ones")
    #Performing rouge evaluation on 1.9 lakh sentences takes lot of time. So, create mini validation set & test set by borrowing 15k samples each from these 1.9 lakh sentences
    make_folder(os.path.join(chunk_path, "valid"))
    make_folder(os.path.join(chunk_path, "test"))
    bin_chunks = os.listdir(os.path.join(chunk_path, "main_valid"))
    bin_chunks.sort()
    samples = random.sample(set(bin_chunks[:-1]), 2)      #Exclude last bin file; contains only 9k sentences
    valid_chunk, test_chunk = samples[0], samples[1]
    shutil.copyfile(os.path.join(chunk_path, "main_valid", valid_chunk), os.path.join(chunk_path, "valid", "valid_00.bin"))
    shutil.copyfile(os.path.join(chunk_path, "main_valid", test_chunk), os.path.join(chunk_path, "test", "test_00.bin"))

    # delete_folder(finished)
    # delete_folder(os.path.join(chunk_path, "main_valid"))


================================================
FILE: model.py
================================================
import torch as T
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from data_util import config
import torch.nn.functional as F
from train_util import get_cuda

def init_lstm_wt(lstm):
    for name, _ in lstm.named_parameters():
        if 'weight' in name:
            wt = getattr(lstm, name)
            wt.data.uniform_(-config.rand_unif_init_mag, config.rand_unif_init_mag)
        elif 'bias' in name:
            # set forget bias to 1
            bias = getattr(lstm, name)
            n = bias.size(0)
            start, end = n // 4, n // 2
            bias.data.fill_(0.)
            bias.data[start:end].fill_(1.)

def init_linear_wt(linear):
    linear.weight.data.normal_(std=config.trunc_norm_init_std)
    if linear.bias is not None:
        linear.bias.data.normal_(std=config.trunc_norm_init_std)

def init_wt_normal(wt):
    wt.data.normal_(std=config.trunc_norm_init_std)


class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()

        self.lstm = nn.LSTM(config.emb_dim, config.hidden_dim, num_layers=1, batch_first=True, bidirectional=True)
        init_lstm_wt(self.lstm)

        self.reduce_h = nn.Linear(config.hidden_dim * 2, config.hidden_dim)
        init_linear_wt(self.reduce_h)
        self.reduce_c = nn.Linear(config.hidden_dim * 2, config.hidden_dim)
        init_linear_wt(self.reduce_c)

    def forward(self, x, seq_lens):
        packed = pack_padded_sequence(x, seq_lens, batch_first=True)
        enc_out, enc_hid = self.lstm(packed)
        enc_out,_ = pad_packed_sequence(enc_out, batch_first=True)
        enc_out = enc_out.contiguous()                              #bs, n_seq, 2*n_hid
        h, c = enc_hid                                              #shape of h: 2, bs, n_hid
        h = T.cat(list(h), dim=1)                                   #bs, 2*n_hid
        c = T.cat(list(c), dim=1)
        h_reduced = F.relu(self.reduce_h(h))                        #bs,n_hid
        c_reduced = F.relu(self.reduce_c(c))
        return enc_out, (h_reduced, c_reduced)


class encoder_attention(nn.Module):

    def __init__(self):
        super(encoder_attention, self).__init__()
        self.W_h = nn.Linear(config.hidden_dim * 2, config.hidden_dim * 2, bias=False)
        self.W_s = nn.Linear(config.hidden_dim * 2, config.hidden_dim * 2)
        self.v = nn.Linear(config.hidden_dim * 2, 1, bias=False)


    def forward(self, st_hat, h, enc_padding_mask, sum_temporal_srcs):
        ''' Perform attention over encoder hidden states
        :param st_hat: decoder hidden state at current time step
        :param h: encoder hidden states
        :param enc_padding_mask:
        :param sum_temporal_srcs: if using intra-temporal attention, contains summation of attention weights from previous decoder time steps
        '''

        # Standard attention technique (eq 1 in https://arxiv.org/pdf/1704.04368.pdf)
        et = self.W_h(h)                        # bs,n_seq,2*n_hid
        dec_fea = self.W_s(st_hat).unsqueeze(1) # bs,1,2*n_hid
        et = et + dec_fea
        et = T.tanh(et)                         # bs,n_seq,2*n_hid
        et = self.v(et).squeeze(2)              # bs,n_seq

        # intra-temporal attention     (eq 3 in https://arxiv.org/pdf/1705.04304.pdf)
        if config.intra_encoder:
            exp_et = T.exp(et)
            if sum_temporal_srcs is None:
                et1 = exp_et
                sum_temporal_srcs  = get_cuda(T.FloatTensor(et.size()).fill_(1e-10)) + exp_et
            else:
                et1 = exp_et/sum_temporal_srcs  #bs, n_seq
                sum_temporal_srcs = sum_temporal_srcs + exp_et
        else:
            et1 = F.softmax(et, dim=1)

        # assign 0 probability for padded elements
        at = et1 * enc_padding_mask
        normalization_factor = at.sum(1, keepdim=True)
        at = at / normalization_factor

        at = at.unsqueeze(1)                    #bs,1,n_seq
        # Compute encoder context vector
        ct_e = T.bmm(at, h)                     #bs, 1, 2*n_hid
        ct_e = ct_e.squeeze(1)
        at = at.squeeze(1)

        return ct_e, at, sum_temporal_srcs

class decoder_attention(nn.Module):
    def __init__(self):
        super(decoder_attention, self).__init__()
        if config.intra_decoder:
            self.W_prev = nn.Linear(config.hidden_dim, config.hidden_dim, bias=False)
            self.W_s = nn.Linear(config.hidden_dim, config.hidden_dim)
            self.v = nn.Linear(config.hidden_dim, 1, bias=False)

    def forward(self, s_t, prev_s):
        '''Perform intra_decoder attention
        Args
        :param s_t: hidden state of decoder at current time step
        :param prev_s: If intra_decoder attention, contains list of previous decoder hidden states
        '''
        if config.intra_decoder is False:
            ct_d = get_cuda(T.zeros(s_t.size()))
        elif prev_s is None:
            ct_d = get_cuda(T.zeros(s_t.size()))
            prev_s = s_t.unsqueeze(1)               #bs, 1, n_hid
        else:
            # Standard attention technique (eq 1 in https://arxiv.org/pdf/1704.04368.pdf)
            et = self.W_prev(prev_s)                # bs,t-1,n_hid
            dec_fea = self.W_s(s_t).unsqueeze(1)    # bs,1,n_hid
            et = et + dec_fea
            et = T.tanh(et)                         # bs,t-1,n_hid
            et = self.v(et).squeeze(2)              # bs,t-1
            # intra-decoder attention     (eq 7 & 8 in https://arxiv.org/pdf/1705.04304.pdf)
            at = F.softmax(et, dim=1).unsqueeze(1)  #bs, 1, t-1
            ct_d = T.bmm(at, prev_s).squeeze(1)     #bs, n_hid
            prev_s = T.cat([prev_s, s_t.unsqueeze(1)], dim=1)    #bs, t, n_hid

        return ct_d, prev_s


class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        self.enc_attention = encoder_attention()
        self.dec_attention = decoder_attention()
        self.x_context = nn.Linear(config.hidden_dim*2 + config.emb_dim, config.emb_dim)

        self.lstm = nn.LSTMCell(config.emb_dim, config.hidden_dim)
        init_lstm_wt(self.lstm)

        self.p_gen_linear = nn.Linear(config.hidden_dim * 5 + config.emb_dim, 1)

        #p_vocab
        self.V = nn.Linear(config.hidden_dim*4, config.hidden_dim)
        self.V1 = nn.Linear(config.hidden_dim, config.vocab_size)
        init_linear_wt(self.V1)

    def forward(self, x_t, s_t, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, sum_temporal_srcs, prev_s):
        x = self.x_context(T.cat([x_t, ct_e], dim=1))
        s_t = self.lstm(x, s_t)

        dec_h, dec_c = s_t
        st_hat = T.cat([dec_h, dec_c], dim=1)
        ct_e, attn_dist, sum_temporal_srcs = self.enc_attention(st_hat, enc_out, enc_padding_mask, sum_temporal_srcs)

        ct_d, prev_s = self.dec_attention(dec_h, prev_s)        #intra-decoder attention

        p_gen = T.cat([ct_e, ct_d, st_hat, x], 1)
        p_gen = self.p_gen_linear(p_gen)            # bs,1
        p_gen = T.sigmoid(p_gen)                    # bs,1

        out = T.cat([dec_h, ct_e, ct_d], dim=1)     # bs, 4*n_hid
        out = self.V(out)                           # bs,n_hid
        out = self.V1(out)                          # bs, n_vocab
        vocab_dist = F.softmax(out, dim=1)
        vocab_dist = p_gen * vocab_dist
        attn_dist_ = (1 - p_gen) * attn_dist

        # pointer mechanism (as suggested in eq 9 https://arxiv.org/pdf/1704.04368.pdf)
        if extra_zeros is not None:
            vocab_dist = T.cat([vocab_dist, extra_zeros], dim=1)
        final_dist = vocab_dist.scatter_add(1, enc_batch_extend_vocab, attn_dist_)

        return final_dist, s_t, ct_e, sum_temporal_srcs, prev_s


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.embeds = nn.Embedding(config.vocab_size, config.emb_dim)
        init_wt_normal(self.embeds.weight)

        self.encoder = get_cuda(self.encoder)
        self.decoder = get_cuda(self.decoder)
        self.embeds = get_cuda(self.embeds)


================================================
FILE: train.py
================================================
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"    #Set cuda device

import time

import torch as T
import torch.nn as nn
import torch.nn.functional as F
from model import Model

from data_util import config, data
from data_util.batcher import Batcher
from data_util.data import Vocab
from train_util import *
from torch.distributions import Categorical
from rouge import Rouge
from numpy import random
import argparse

random.seed(123)
T.manual_seed(123)
if T.cuda.is_available():
    T.cuda.manual_seed_all(123)

class Train(object):
    def __init__(self, opt):
        self.vocab = Vocab(config.vocab_path, config.vocab_size)
        self.batcher = Batcher(config.train_data_path, self.vocab, mode='train',
                               batch_size=config.batch_size, single_pass=False)
        self.opt = opt
        self.start_id = self.vocab.word2id(data.START_DECODING)
        self.end_id = self.vocab.word2id(data.STOP_DECODING)
        self.pad_id = self.vocab.word2id(data.PAD_TOKEN)
        self.unk_id = self.vocab.word2id(data.UNKNOWN_TOKEN)
        time.sleep(5)

    def save_model(self, iter):
        save_path = config.save_model_path + "/%07d.tar" % iter
        T.save({
            "iter": iter + 1,
            "model_dict": self.model.state_dict(),
            "trainer_dict": self.trainer.state_dict()
        }, save_path)

    def setup_train(self):
        self.model = Model()
        self.model = get_cuda(self.model)
        self.trainer = T.optim.Adam(self.model.parameters(), lr=config.lr)
        start_iter = 0
        if self.opt.load_model is not None:
            load_model_path = os.path.join(config.save_model_path, self.opt.load_model)
            checkpoint = T.load(load_model_path)
            start_iter = checkpoint["iter"]
            self.model.load_state_dict(checkpoint["model_dict"])
            self.trainer.load_state_dict(checkpoint["trainer_dict"])
            print("Loaded model at " + load_model_path)
        if self.opt.new_lr is not None:
            self.trainer = T.optim.Adam(self.model.parameters(), lr=self.opt.new_lr)
        return start_iter

    def train_batch_MLE(self, enc_out, enc_hidden, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, batch):
        ''' Calculate Negative Log Likelihood Loss for the given batch. In order to reduce exposure bias,
                pass the previous generated token as input with a probability of 0.25 instead of ground truth label
        Args:
        :param enc_out: Outputs of the encoder for all time steps (batch_size, length_input_sequence, 2*hidden_size)
        :param enc_hidden: Tuple containing final hidden state & cell state of encoder. Shape of h & c: (batch_size, hidden_size)
        :param enc_padding_mask: Mask for encoder input; Tensor of size (batch_size, length_input_sequence) with values of 0 for pad tokens & 1 for others
        :param ct_e: encoder context vector for time_step=0 (eq 5 in https://arxiv.org/pdf/1705.04304.pdf)
        :param extra_zeros: Tensor used to extend vocab distribution for pointer mechanism
        :param enc_batch_extend_vocab: Input batch that stores OOV ids
        :param batch: batch object
        '''
        dec_batch, max_dec_len, dec_lens, target_batch = get_dec_data(batch)                        #Get input and target batchs for training decoder
        step_losses = []
        s_t = (enc_hidden[0], enc_hidden[1])                                                        #Decoder hidden states
        x_t = get_cuda(T.LongTensor(len(enc_out)).fill_(self.start_id))                             #Input to the decoder
        prev_s = None                                                                               #Used for intra-decoder attention (section 2.2 in https://arxiv.org/pdf/1705.04304.pdf)
        sum_temporal_srcs = None                                                                    #Used for intra-temporal attention (section 2.1 in https://arxiv.org/pdf/1705.04304.pdf)
        for t in range(min(max_dec_len, config.max_dec_steps)):
            use_gound_truth = get_cuda((T.rand(len(enc_out)) > 0.25)).long()                        #Probabilities indicating whether to use ground truth labels instead of previous decoded tokens
            x_t = use_gound_truth * dec_batch[:, t] + (1 - use_gound_truth) * x_t                   #Select decoder input based on use_ground_truth probabilities
            x_t = self.model.embeds(x_t)
            final_dist, s_t, ct_e, sum_temporal_srcs, prev_s = self.model.decoder(x_t, s_t, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, sum_temporal_srcs, prev_s)
            target = target_batch[:, t]
            log_probs = T.log(final_dist + config.eps)
            step_loss = F.nll_loss(log_probs, target, reduction="none", ignore_index=self.pad_id)
            step_losses.append(step_loss)
            x_t = T.multinomial(final_dist, 1).squeeze()                                            #Sample words from final distribution which can be used as input in next time step
            is_oov = (x_t >= config.vocab_size).long()                                              #Mask indicating whether sampled word is OOV
            x_t = (1 - is_oov) * x_t.detach() + (is_oov) * self.unk_id                              #Replace OOVs with [UNK] token

        losses = T.sum(T.stack(step_losses, 1), 1)                                                  #unnormalized losses for each example in the batch; (batch_size)
        batch_avg_loss = losses / dec_lens                                                          #Normalized losses; (batch_size)
        mle_loss = T.mean(batch_avg_loss)                                                           #Average batch loss
        return mle_loss

    def train_batch_RL(self, enc_out, enc_hidden, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, article_oovs, greedy):
        '''Generate sentences from decoder entirely using sampled tokens as input. These sentences are used for ROUGE evaluation
        Args
        :param enc_out: Outputs of the encoder for all time steps (batch_size, length_input_sequence, 2*hidden_size)
        :param enc_hidden: Tuple containing final hidden state & cell state of encoder. Shape of h & c: (batch_size, hidden_size)
        :param enc_padding_mask: Mask for encoder input; Tensor of size (batch_size, length_input_sequence) with values of 0 for pad tokens & 1 for others
        :param ct_e: encoder context vector for time_step=0 (eq 5 in https://arxiv.org/pdf/1705.04304.pdf)
        :param extra_zeros: Tensor used to extend vocab distribution for pointer mechanism
        :param enc_batch_extend_vocab: Input batch that stores OOV ids
        :param article_oovs: Batch containing list of OOVs in each example
        :param greedy: If true, performs greedy based sampling, else performs multinomial sampling
        Returns:
        :decoded_strs: List of decoded sentences
        :log_probs: Log probabilities of sampled words
        '''
        s_t = enc_hidden                                                                            #Decoder hidden states
        x_t = get_cuda(T.LongTensor(len(enc_out)).fill_(self.start_id))                             #Input to the decoder
        prev_s = None                                                                               #Used for intra-decoder attention (section 2.2 in https://arxiv.org/pdf/1705.04304.pdf)
        sum_temporal_srcs = None                                                                    #Used for intra-temporal attention (section 2.1 in https://arxiv.org/pdf/1705.04304.pdf)
        inds = []                                                                                   #Stores sampled indices for each time step
        decoder_padding_mask = []                                                                   #Stores padding masks of generated samples
        log_probs = []                                                                              #Stores log probabilites of generated samples
        mask = get_cuda(T.LongTensor(len(enc_out)).fill_(1))                                        #Values that indicate whether [STOP] token has already been encountered; 1 => Not encountered, 0 otherwise

        for t in range(config.max_dec_steps):
            x_t = self.model.embeds(x_t)
            probs, s_t, ct_e, sum_temporal_srcs, prev_s = self.model.decoder(x_t, s_t, enc_out, enc_padding_mask, ct_e, extra_zeros, enc_batch_extend_vocab, sum_temporal_srcs, prev_s)
            if greedy is False:
                multi_dist = Categorical(probs)
                x_t = multi_dist.sample()                                                           #perform multinomial sampling
                log_prob = multi_dist.log_prob(x_t)
                log_probs.append(log_prob)
            else:
                _, x_t = T.max(probs, dim=1)                                                        #perform greedy sampling
            x_t = x_t.detach()
            inds.append(x_t)
            mask_t = get_cuda(T.zeros(len(enc_out)))                                                #Padding mask of batch for current time step
            mask_t[mask == 1] = 1                                                                   #If [STOP] is not encountered till previous time step, mask_t = 1 else mask_t = 0
            mask[(mask == 1) + (x_t == self.end_id) == 2] = 0                                       #If [STOP] is not encountered till previous time step and current word is [STOP], make mask = 0
            decoder_padding_mask.append(mask_t)
            is_oov = (x_t>=config.vocab_size).long()                                                #Mask indicating whether sampled word is OOV
            x_t = (1-is_oov)*x_t + (is_oov)*self.unk_id                                             #Replace OOVs with [UNK] token

        inds = T.stack(inds, dim=1)
        decoder_padding_mask = T.stack(decoder_padding_mask, dim=1)
        if greedy is False:                                                                         #If multinomial based sampling, compute log probabilites of sampled words
            log_probs = T.stack(log_probs, dim=1)
            log_probs = log_probs * decoder_padding_mask                                            #Not considering sampled words with padding mask = 0
            lens = T.sum(decoder_padding_mask, dim=1)                                               #Length of sampled sentence
            log_probs = T.sum(log_probs, dim=1) / lens  # (bs,)                                     #compute normalizied log probability of a sentence
        decoded_strs = []
        for i in range(len(enc_out)):
            id_list = inds[i].cpu().numpy()
            oovs = article_oovs[i]
            S = data.outputids2words(id_list, self.vocab, oovs)                                     #Generate sentence corresponding to sampled words
            try:
                end_idx = S.index(data.STOP_DECODING)
                S = S[:end_idx]
            except ValueError:
                S = S
            if len(S) < 2:                                                                           #If length of sentence is less than 2 words, replace it with "xxx"; Avoids setences like "." which throws error while calculating ROUGE
                S = ["xxx"]
            S = " ".join(S)
            decoded_strs.append(S)

        return decoded_strs, log_probs

    def reward_function(self, decoded_sents, original_sents):
        rouge = Rouge()
        try:
            scores = rouge.get_scores(decoded_sents, original_sents)
        except Exception:
            print("Rouge failed for multi sentence evaluation.. Finding exact pair")
            scores = []
            for i in range(len(decoded_sents)):
                try:
                    score = rouge.get_scores(decoded_sents[i], original_sents[i])
                except Exception:
                    print("Error occured at:")
                    print("decoded_sents:", decoded_sents[i])
                    print("original_sents:", original_sents[i])
                    score = [{"rouge-l":{"f":0.0}}]
                scores.append(score[0])
        rouge_l_f1 = [score["rouge-l"]["f"] for score in scores]
        rouge_l_f1 = get_cuda(T.FloatTensor(rouge_l_f1))
        return rouge_l_f1

    # def write_to_file(self, decoded, max, original, sample_r, baseline_r, iter):
    #     with open("temp.txt", "w") as f:
    #         f.write("iter:"+str(iter)+"\n")
    #         for i in range(len(original)):
    #             f.write("dec: "+decoded[i]+"\n")
    #             f.write("max: "+max[i]+"\n")
    #             f.write("org: "+original[i]+"\n")
    #             f.write("Sample_R: %.4f, Baseline_R: %.4f\n\n"%(sample_r[i].item(), baseline_r[i].item()))


    def train_one_batch(self, batch, iter):
        enc_batch, enc_lens, enc_padding_mask, enc_batch_extend_vocab, extra_zeros, context = get_enc_data(batch)

        enc_batch = self.model.embeds(enc_batch)                                                    #Get embeddings for encoder input
        enc_out, enc_hidden = self.model.encoder(enc_batch, enc_lens)

        # -------------------------------Summarization-----------------------
        if self.opt.train_mle == "yes":                                                             #perform MLE training
            mle_loss = self.train_batch_MLE(enc_out, enc_hidden, enc_padding_mask, context, extra_zeros, enc_batch_extend_vocab, batch)
        else:
            mle_loss = get_cuda(T.FloatTensor([0]))
        # --------------RL training-----------------------------------------------------
        if self.opt.train_rl == "yes":                                                              #perform reinforcement learning training
            # multinomial sampling
            sample_sents, RL_log_probs = self.train_batch_RL(enc_out, enc_hidden, enc_padding_mask, context, extra_zeros, enc_batch_extend_vocab, batch.art_oovs, greedy=False)
            with T.autograd.no_grad():
                # greedy sampling
                greedy_sents, _ = self.train_batch_RL(enc_out, enc_hidden, enc_padding_mask, context, extra_zeros, enc_batch_extend_vocab, batch.art_oovs, greedy=True)

            sample_reward = self.reward_function(sample_sents, batch.original_abstracts)
            baseline_reward = self.reward_function(greedy_sents, batch.original_abstracts)
            # if iter%200 == 0:
            #     self.write_to_file(sample_sents, greedy_sents, batch.original_abstracts, sample_reward, baseline_reward, iter)
            rl_loss = -(sample_reward - baseline_reward) * RL_log_probs                             #Self-critic policy gradient training (eq 15 in https://arxiv.org/pdf/1705.04304.pdf)
            rl_loss = T.mean(rl_loss)

            batch_reward = T.mean(sample_reward).item()
        else:
            rl_loss = get_cuda(T.FloatTensor([0]))
            batch_reward = 0

    # ------------------------------------------------------------------------------------
        self.trainer.zero_grad()
        (self.opt.mle_weight * mle_loss + self.opt.rl_weight * rl_loss).backward()
        self.trainer.step()

        return mle_loss.item(), batch_reward

    def trainIters(self):
        iter = self.setup_train()
        count = mle_total = r_total = 0
        while iter <= config.max_iterations:
            batch = self.batcher.next_batch()
            try:
                mle_loss, r = self.train_one_batch(batch, iter)
            except KeyboardInterrupt:
                print("-------------------Keyboard Interrupt------------------")
                exit(0)

            mle_total += mle_loss
            r_total += r
            count += 1
            iter += 1

            if iter % 1000 == 0:
                mle_avg = mle_total / count
                r_avg = r_total / count
                print("iter:", iter, "mle_loss:", "%.3f" % mle_avg, "reward:", "%.4f" % r_avg)
                count = mle_total = r_total = 0

            if iter % 5000 == 0:
                self.save_model(iter)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_mle', type=str, default="yes")
    parser.add_argument('--train_rl', type=str, default="no")
    parser.add_argument('--mle_weight', type=float, default=1.0)
    parser.add_argument('--load_model', type=str, default=None)
    parser.add_argument('--new_lr', type=float, default=None)
    opt = parser.parse_args()
    opt.rl_weight = 1 - opt.mle_weight
    print("Training mle: %s, Training rl: %s, mle weight: %.2f, rl weight: %.2f"%(opt.train_mle, opt.train_rl, opt.mle_weight, opt.rl_weight))
    print("intra_encoder:", config.intra_encoder, "intra_decoder:", config.intra_decoder)

    train_processor = Train(opt)
    train_processor.trainIters()


================================================
FILE: train_util.py
================================================
import numpy as np
import torch as T
from data_util import config

def get_cuda(tensor):
    if T.cuda.is_available():
        tensor = tensor.cuda()
    return tensor

def get_enc_data(batch):
    batch_size = len(batch.enc_lens)
    enc_batch = T.from_numpy(batch.enc_batch).long()
    enc_padding_mask = T.from_numpy(batch.enc_padding_mask).float()

    enc_lens = batch.enc_lens

    ct_e = T.zeros(batch_size, 2*config.hidden_dim)

    enc_batch = get_cuda(enc_batch)
    enc_padding_mask = get_cuda(enc_padding_mask)

    ct_e = get_cuda(ct_e)

    enc_batch_extend_vocab = None
    if batch.enc_batch_extend_vocab is not None:
        enc_batch_extend_vocab = T.from_numpy(batch.enc_batch_extend_vocab).long()
        enc_batch_extend_vocab = get_cuda(enc_batch_extend_vocab)

    extra_zeros = None
    if batch.max_art_oovs > 0:
        extra_zeros = T.zeros(batch_size, batch.max_art_oovs)
        extra_zeros = get_cuda(extra_zeros)


    return enc_batch, enc_lens, enc_padding_mask, enc_batch_extend_vocab, extra_zeros, ct_e

def get_dec_data(batch):
    dec_batch = T.from_numpy(batch.dec_batch).long()
    dec_lens = batch.dec_lens
    max_dec_len = np.max(dec_lens)
    dec_lens = T.from_numpy(batch.dec_lens).float()

    target_batch = T.from_numpy(batch.target_batch).long()

    dec_batch = get_cuda(dec_batch)
    dec_lens = get_cuda(dec_lens)
    target_batch = get_cuda(target_batch)

    return dec_batch, max_dec_len, dec_lens, target_batch


================================================
FILE: training_log.txt
================================================
--------MLE Training------------

$ python train.py
Training mle: yes, Training rl: no, mle weight: 1.00, rl weight: 0.00
intra_encoder: True intra_decoder: True
iter: 1000 mle_loss: 4.652 reward: 0.0000
iter: 2000 mle_loss: 3.942 reward: 0.0000
iter: 3000 mle_loss: 3.699 reward: 0.0000
iter: 4000 mle_loss: 3.555 reward: 0.0000
iter: 5000 mle_loss: 3.447 reward: 0.0000
iter: 6000 mle_loss: 3.378 reward: 0.0000
iter: 7000 mle_loss: 3.321 reward: 0.0000
iter: 8000 mle_loss: 3.282 reward: 0.0000
iter: 9000 mle_loss: 3.242 reward: 0.0000
iter: 10000 mle_loss: 3.206 reward: 0.0000
iter: 11000 mle_loss: 3.183 reward: 0.0000
iter: 12000 mle_loss: 3.154 reward: 0.0000
iter: 13000 mle_loss: 3.137 reward: 0.0000
iter: 14000 mle_loss: 3.122 reward: 0.0000
iter: 15000 mle_loss: 3.081 reward: 0.0000
iter: 16000 mle_loss: 3.026 reward: 0.0000
iter: 17000 mle_loss: 3.014 reward: 0.0000
iter: 18000 mle_loss: 2.999 reward: 0.0000
iter: 19000 mle_loss: 2.992 reward: 0.0000
iter: 20000 mle_loss: 2.989 reward: 0.0000
iter: 21000 mle_loss: 2.971 reward: 0.0000
iter: 22000 mle_loss: 2.983 reward: 0.0000
iter: 23000 mle_loss: 2.966 reward: 0.0000
iter: 24000 mle_loss: 2.957 reward: 0.0000
iter: 25000 mle_loss: 2.946 reward: 0.0000
iter: 26000 mle_loss: 2.942 reward: 0.0000
iter: 27000 mle_loss: 2.941 reward: 0.0000
iter: 28000 mle_loss: 2.930 reward: 0.0000
iter: 29000 mle_loss: 2.923 reward: 0.0000
iter: 30000 mle_loss: 2.906 reward: 0.0000
iter: 31000 mle_loss: 2.818 reward: 0.0000
iter: 32000 mle_loss: 2.809 reward: 0.0000
iter: 33000 mle_loss: 2.822 reward: 0.0000
iter: 34000 mle_loss: 2.807 reward: 0.0000
iter: 35000 mle_loss: 2.833 reward: 0.0000
iter: 36000 mle_loss: 2.815 reward: 0.0000
iter: 37000 mle_loss: 2.829 reward: 0.0000
iter: 38000 mle_loss: 2.830 reward: 0.0000
iter: 39000 mle_loss: 2.822 reward: 0.0000
iter: 40000 mle_loss: 2.833 reward: 0.0000
iter: 41000 mle_loss: 2.817 reward: 0.0000
iter: 42000 mle_loss: 2.815 reward: 0.0000
iter: 43000 mle_loss: 2.816 reward: 0.0000
iter: 44000 mle_loss: 2.812 reward: 0.0000
iter: 45000 mle_loss: 2.757 reward: 0.0000
iter: 46000 mle_loss: 2.698 reward: 0.0000
iter: 47000 mle_loss: 2.701 reward: 0.0000
iter: 48000 mle_loss: 2.710 reward: 0.0000
iter: 49000 mle_loss: 2.728 reward: 0.0000
iter: 50000 mle_loss: 2.711 reward: 0.0000
iter: 51000 mle_loss: 2.718 reward: 0.0000
iter: 52000 mle_loss: 2.728 reward: 0.0000
iter: 53000 mle_loss: 2.725 reward: 0.0000
iter: 54000 mle_loss: 2.722 reward: 0.0000
iter: 55000 mle_loss: 2.728 reward: 0.0000
iter: 56000 mle_loss: 2.729 reward: 0.0000
iter: 57000 mle_loss: 2.731 reward: 0.0000
iter: 58000 mle_loss: 2.741 reward: 0.0000
iter: 59000 mle_loss: 2.731 reward: 0.0000
iter: 60000 mle_loss: 2.645 reward: 0.0000
iter: 61000 mle_loss: 2.600 reward: 0.0000
iter: 62000 mle_loss: 2.600 reward: 0.0000
iter: 63000 mle_loss: 2.612 reward: 0.0000
iter: 64000 mle_loss: 2.626 reward: 0.0000
iter: 65000 mle_loss: 2.637 reward: 0.0000
iter: 66000 mle_loss: 2.641 reward: 0.0000
iter: 67000 mle_loss: 2.652 reward: 0.0000
iter: 68000 mle_loss: 2.651 reward: 0.0000
iter: 69000 mle_loss: 2.643 reward: 0.0000
iter: 70000 mle_loss: 2.661 reward: 0.0000
iter: 71000 mle_loss: 2.668 reward: 0.0000
iter: 72000 mle_loss: 2.668 reward: 0.0000
iter: 73000 mle_loss: 2.679 reward: 0.0000
iter: 74000 mle_loss: 2.670 reward: 0.0000
iter: 75000 mle_loss: 2.567 reward: 0.0000
iter: 76000 mle_loss: 2.524 reward: 0.0000
iter: 77000 mle_loss: 2.549 reward: 0.0000
iter: 78000 mle_loss: 2.535 reward: 0.0000
iter: 79000 mle_loss: 2.552 reward: 0.0000
iter: 80000 mle_loss: 2.568 reward: 0.0000
iter: 81000 mle_loss: 2.581 reward: 0.0000
iter: 82000 mle_loss: 2.595 reward: 0.0000
iter: 83000 mle_loss: 2.600 reward: 0.0000
iter: 84000 mle_loss: 2.595 reward: 0.0000
iter: 85000 mle_loss: 2.593 reward: 0.0000
iter: 86000 mle_loss: 2.615 reward: 0.0000
iter: 87000 mle_loss: 2.608 reward: 0.0000
iter: 88000 mle_loss: 2.604 reward: 0.0000
iter: 89000 mle_loss: 2.618 reward: 0.0000
iter: 90000 mle_loss: 2.483 reward: 0.0000
iter: 91000 mle_loss: 2.483 reward: 0.0000
iter: 92000 mle_loss: 2.479 reward: 0.0000
iter: 93000 mle_loss: 2.490 reward: 0.0000
iter: 94000 mle_loss: 2.520 reward: 0.0000
iter: 95000 mle_loss: 2.527 reward: 0.0000
iter: 96000 mle_loss: 2.525 reward: 0.0000
iter: 97000 mle_loss: 2.532 reward: 0.0000
iter: 98000 mle_loss: 2.546 reward: 0.0000
iter: 99000 mle_loss: 2.537 reward: 0.0000
iter: 100000 mle_loss: 2.546 reward: 0.0000
iter: 101000 mle_loss: 2.551 reward: 0.0000
iter: 102000 mle_loss: 2.562 reward: 0.0000
iter: 103000 mle_loss: 2.566 reward: 0.0000
iter: 104000 mle_loss: 2.577 reward: 0.0000
iter: 105000 mle_loss: 2.370 reward: 0.0000
iter: 106000 mle_loss: 2.433 reward: 0.0000
iter: 107000 mle_loss: 2.435 reward: 0.0000
iter: 108000 mle_loss: 2.454 reward: 0.0000
iter: 109000 mle_loss: 2.461 reward: 0.0000
iter: 110000 mle_loss: 2.479 reward: 0.0000
iter: 111000 mle_loss: 2.486 reward: 0.0000
iter: 112000 mle_loss: 2.499 reward: 0.0000
iter: 113000 mle_loss: 2.503 reward: 0.0000
iter: 114000 mle_loss: 2.503 reward: 0.0000
iter: 115000 mle_loss: 2.518 reward: 0.0000
iter: 116000 mle_loss: 2.515 reward: 0.0000
iter: 117000 mle_loss: 2.523 reward: 0.0000
iter: 118000 mle_loss: 2.532 reward: 0.0000
iter: 119000 mle_loss: 2.511 reward: 0.0000
iter: 120000 mle_loss: 2.373 reward: 0.0000
iter: 121000 mle_loss: 2.386 reward: 0.0000
iter: 122000 mle_loss: 2.386 reward: 0.0000
iter: 123000 mle_loss: 2.419 reward: 0.0000
iter: 124000 mle_loss: 2.419 reward: 0.0000
iter: 125000 mle_loss: 2.440 reward: 0.0000
iter: 126000 mle_loss: 2.455 reward: 0.0000
iter: 127000 mle_loss: 2.463 reward: 0.0000
iter: 128000 mle_loss: 2.472 reward: 0.0000
iter: 129000 mle_loss: 2.474 reward: 0.0000
iter: 130000 mle_loss: 2.479 reward: 0.0000
iter: 131000 mle_loss: 2.487 reward: 0.0000
iter: 132000 mle_loss: 2.486 reward: 0.0000
iter: 133000 mle_loss: 2.488 reward: 0.0000
iter: 134000 mle_loss: 2.423 reward: 0.0000
iter: 135000 mle_loss: 2.300 reward: 0.0000
iter: 136000 mle_loss: 2.368 reward: 0.0000
iter: 137000 mle_loss: 2.381 reward: 0.0000
iter: 138000 mle_loss: 2.367 reward: 0.0000
iter: 139000 mle_loss: 2.408 reward: 0.0000
iter: 140000 mle_loss: 2.404 reward: 0.0000
iter: 141000 mle_loss: 2.412 reward: 0.0000
iter: 142000 mle_loss: 2.439 reward: 0.0000
iter: 143000 mle_loss: 2.433 reward: 0.0000
iter: 144000 mle_loss: 2.448 reward: 0.0000
iter: 145000 mle_loss: 2.445 reward: 0.0000
iter: 146000 mle_loss: 2.462 reward: 0.0000
iter: 147000 mle_loss: 2.456 reward: 0.0000
iter: 148000 mle_loss: 2.468 reward: 0.0000
iter: 149000 mle_loss: 2.399 reward: 0.0000
iter: 150000 mle_loss: 2.308 reward: 0.0000
iter: 151000 mle_loss: 2.330 reward: 0.0000
iter: 152000 mle_loss: 2.371 reward: 0.0000
iter: 153000 mle_loss: 2.368 reward: 0.0000
iter: 154000 mle_loss: 2.363 reward: 0.0000
iter: 155000 mle_loss: 2.378 reward: 0.0000
iter: 156000 mle_loss: 2.398 reward: 0.0000
iter: 157000 mle_loss: 2.405 reward: 0.0000
iter: 158000 mle_loss: 2.408 reward: 0.0000


-------------MLE Validation---------------

$ python eval.py --task=validate --start_from=0005000.tar
0005000.tar rouge_l: 0.3818
0010000.tar rouge_l: 0.3921
0015000.tar rouge_l: 0.3988
0020000.tar rouge_l: 0.4030
0025000.tar rouge_l: 0.4047
0030000.tar rouge_l: 0.4037
0035000.tar rouge_l: 0.4063
0040000.tar rouge_l: 0.4078
0045000.tar rouge_l: 0.4088
0050000.tar rouge_l: 0.4077
0055000.tar rouge_l: 0.4075
0060000.tar rouge_l: 0.4079
0065000.tar rouge_l: 0.4114		#best
0070000.tar rouge_l: 0.4074
0075000.tar rouge_l: 0.4080
0080000.tar rouge_l: 0.4090
0085000.tar rouge_l: 0.4060
0090000.tar rouge_l: 0.4079
0095000.tar rouge_l: 0.4086
0100000.tar rouge_l: 0.4076
0105000.tar rouge_l: 0.4053
0110000.tar rouge_l: 0.4062
0115000.tar rouge_l: 0.4056
0120000.tar rouge_l: 0.4022
0125000.tar rouge_l: 0.4042
0130000.tar rouge_l: 0.4067
0135000.tar rouge_l: 0.4012
0140000.tar rouge_l: 0.4046
0145000.tar rouge_l: 0.4026
0150000.tar rouge_l: 0.4026
0155000.tar rouge_l: 0.4018

-----------------MLE + RL Training--------------------

$ python train.py --train_mle=yes --train_rl=yes --mle_weight=0.25 --load_model=0065000.tar --new_lr=0.0001
Training mle: yes, Training rl: yes, mle weight: 0.25, rl weight: 0.75
intra_encoder: True intra_decoder: True
Loaded model at data/saved_models/0065000.tar
iter: 66000 mle_loss: 2.555 reward: 0.3088
iter: 67000 mle_loss: 2.570 reward: 0.3097
iter: 68000 mle_loss: 2.496 reward: 0.3177
iter: 69000 mle_loss: 2.568 reward: 0.3101
iter: 70000 mle_loss: 2.437 reward: 0.3231
iter: 71000 mle_loss: 2.474 reward: 0.3209
iter: 72000 mle_loss: 2.471 reward: 0.3204
iter: 73000 mle_loss: 2.474 reward: 0.3204
iter: 74000 mle_loss: 2.451 reward: 0.3226
iter: 75000 mle_loss: 2.477 reward: 0.3204
iter: 76000 mle_loss: 2.470 reward: 0.3204
iter: 77000 mle_loss: 2.503 reward: 0.3182
iter: 78000 mle_loss: 2.523 reward: 0.3148
iter: 79000 mle_loss: 2.385 reward: 0.3286
iter: 80000 mle_loss: 2.488 reward: 0.3200
iter: 81000 mle_loss: 2.396 reward: 0.3271
iter: 82000 mle_loss: 2.459 reward: 0.3215
iter: 83000 mle_loss: 2.371 reward: 0.3301
iter: 84000 mle_loss: 2.433 reward: 0.3253
iter: 85000 mle_loss: 2.475 reward: 0.3207
iter: 86000 mle_loss: 2.504 reward: 0.3178
iter: 87000 mle_loss: 2.441 reward: 0.3241
iter: 88000 mle_loss: 2.424 reward: 0.3266
iter: 89000 mle_loss: 2.399 reward: 0.3285
iter: 90000 mle_loss: 2.405 reward: 0.3274
iter: 91000 mle_loss: 2.425 reward: 0.3262
iter: 92000 mle_loss: 2.424 reward: 0.3264
iter: 93000 mle_loss: 2.433 reward: 0.3252
iter: 94000 mle_loss: 2.414 reward: 0.3278
iter: 95000 mle_loss: 2.444 reward: 0.3241
iter: 96000 mle_loss: 2.395 reward: 0.3288
iter: 97000 mle_loss: 2.425 reward: 0.3256
iter: 98000 mle_loss: 2.378 reward: 0.3305
iter: 99000 mle_loss: 2.415 reward: 0.3268
iter: 100000 mle_loss: 2.412 reward: 0.3277
iter: 101000 mle_loss: 2.387 reward: 0.3296
iter: 102000 mle_loss: 2.370 reward: 0.3316
iter: 103000 mle_loss: 2.420 reward: 0.3268
iter: 104000 mle_loss: 2.408 reward: 0.3285
iter: 105000 mle_loss: 2.415 reward: 0.3276
iter: 106000 mle_loss: 2.401 reward: 0.3295
iter: 107000 mle_loss: 2.467 reward: 0.3233


----------------------MLE + RL Validation--------------------------

$ python eval.py --task=validate --start_from=0070000.tar
0070000.tar rouge_l: 0.4169
0075000.tar rouge_l: 0.4174
0080000.tar rouge_l: 0.4184
0085000.tar rouge_l: 0.4186		#best
0090000.tar rouge_l: 0.4165
0095000.tar rouge_l: 0.4173
0100000.tar rouge_l: 0.4164
0105000.tar rouge_l: 0.4163

----------------------MLE Testing------------------------------------

$ python eval.py --task=test --load_model=0065000.tar
0065000.tar scores: {'rouge-1': {'f': 0.4412018559893622, 'p': 0.4814799494024485, 'r': 0.4232331027817015}, 'rouge-2': {'f': 0.23238981595683728, 'p': 0.2531296070596062, 'r': 0.22407861554997008}, 'rouge-l': {'f': 0.40477682528278364, 'p': 0.4584684491434479, 'r': 0.40351107200202596}}

----------------------MLE + RL Testing-------------------------------

$ python eval.py --task=test --load_model=0085000.tar
0085000.tar scores: {'rouge-1': {'f': 0.4499047033247696, 'p': 0.4853756369556345, 'r': 0.43544461386607497}, 'rouge-2': {'f': 0.24037014314625643, 'p': 0.25903387205387235, 'r': 0.23362662645146298}, 'rouge-l': {'f': 0.41320241732946406, 'p': 0.4616655167980162, 'r': 0.4144419466382236}}