Repository: memray/seq2seq-keyphrase Branch: master Commit: e8660727a4f1 Files: 51 Total size: 693.0 KB Directory structure: gitextract_qitx157j/ ├── .gitignore ├── LICENSE ├── README.md ├── emolga/ │ ├── __init__.py │ ├── basic/ │ │ ├── __init__.py │ │ ├── activations.py │ │ ├── initializations.py │ │ ├── objectives.py │ │ └── optimizers.py │ ├── dataset/ │ │ ├── __init__.py │ │ └── build_dataset.py │ ├── layers/ │ │ ├── __init__.py │ │ ├── attention.py │ │ ├── core.py │ │ ├── embeddings.py │ │ ├── gridlstm.py │ │ ├── ntm_minibatch.py │ │ └── recurrent.py │ ├── models/ │ │ ├── __init__.py │ │ ├── core.py │ │ ├── covc_encdec.py │ │ ├── encdec.py │ │ ├── ntm_encdec.py │ │ ├── pointers.py │ │ └── variational.py │ └── utils/ │ ├── __init__.py │ ├── generic_utils.py │ ├── io_utils.py │ ├── np_utils.py │ ├── test_utils.py │ └── theano_utils.py └── keyphrase/ ├── __init__.py ├── baseline/ │ ├── evaluate.py │ └── export_dataset.py ├── config.py ├── dataset/ │ ├── __init__.py │ ├── dataset_utils.py │ ├── inspec/ │ │ ├── __init__.py │ │ ├── inspec_export_json.py │ │ └── key_convert_maui.py │ ├── json_count.py │ ├── keyphrase_dataset.py │ ├── keyphrase_test_dataset.py │ ├── keyphrase_train_dataset.py │ └── million-paper/ │ ├── clean_export_json.py │ └── preprocess.py ├── keyphrase_copynet.py ├── keyphrase_utils.py └── util/ ├── __init__.py ├── gpu-test.py └── stanford-pos-tagger.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # added by memray /.idea/ /Experiment/ /dataset/ /stanford-postagger/ .DS_Store # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written 
by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *,cover # Translations *.mo *.pot # Django stuff: *.log # Sphinx documentation docs/_build/ # PyBuilder target/ ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2016 Rui Meng Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # seq2seq-keyphrase ### Note: this repository has been deprecated. Please move to our latest code/data/model release for keyphrase generation at [https://github.com/memray/OpenNMT-kpg-release](https://github.com/memray/OpenNMT-kpg-release). Thank you. 
# emolga/basic/activations.py
# Activation functions for emolga layers: Theano expressions in, expressions out.
import theano.tensor as T


def softmax(x):
    """Softmax over the last axis; accepts inputs of any rank."""
    flat = x.reshape((-1, x.shape[-1]))
    return T.nnet.softmax(flat).reshape(x.shape)


def vector_softmax(x):
    """Softmax of a single 1-D vector."""
    return T.nnet.softmax(x.reshape((1, x.shape[0])))[0]


def time_distributed_softmax(x):
    """Deprecated alias of :func:`softmax`."""
    import warnings
    warnings.warn("time_distributed_softmax is deprecated. Just use softmax!",
                  DeprecationWarning)
    return softmax(x)


def softplus(x):
    return T.nnet.softplus(x)


def relu(x):
    return T.nnet.relu(x)


def tanh(x):
    return T.tanh(x)


def sigmoid(x):
    return T.nnet.sigmoid(x)


def hard_sigmoid(x):
    return T.nnet.hard_sigmoid(x)


def linear(x):
    """Identity activation: returns its argument unchanged, whatever its type."""
    return x


def maxout2(x):
    """Maxout with pool size 2 over the trailing feature axis (ndim 1 to 3)."""
    shape = x.shape
    pool = T.cast(2, 'int32')
    if x.ndim == 1:
        x = x.reshape([T.cast(shape[0] / 2, 'int32'), pool]).max(1)
    elif x.ndim == 2:
        x = x.reshape([shape[0], T.cast(shape[1] / 2, 'int32'), pool]).max(2)
    elif x.ndim == 3:
        x = x.reshape([shape[0], shape[1], T.cast(shape[2] / 2, 'int32'), pool]).max(3)
    return x


from emolga.utils.generic_utils import get_from_module


def get(identifier):
    """Resolve an activation by name via the shared module lookup."""
    return get_from_module(identifier, globals(), 'activation function')
# emolga/basic/initializations.py -- weight initializers.
# Each initializer returns a Theano shared variable via sharedX /
# shared_zeros / shared_ones (provided by emolga.utils.theano_utils).
import numpy as np


def get_fans(shape):
    """Return ``(fan_in, fan_out)`` for a weight-tensor shape.

    An ``int`` shape ``n`` is treated as ``(1, n)``.  For tensors with more
    than two axes, fan_in is the product of the trailing axes and fan_out the
    leading axis.  Both values are plain ints (``np.prod`` used to leak a
    numpy scalar, and a float ``1.0`` for 1-D shapes).
    """
    if isinstance(shape, int):
        shape = (1, shape)
    if len(shape) == 2:
        fan_in, fan_out = shape[0], shape[1]
    else:
        fan_in = int(np.prod(shape[1:]))  # empty product -> 1
        fan_out = shape[0]
    return fan_in, fan_out


def uniform(shape, scale=0.1):
    """Uniform init on [-scale, scale]."""
    return sharedX(np.random.uniform(low=-scale, high=scale, size=shape))


def normal(shape, scale=0.05):
    """Gaussian init, zero mean, std ``scale``."""
    return sharedX(np.random.randn(*shape) * scale)


def lecun_uniform(shape):
    """LeCun 98, Efficient Backprop.
    http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
    """
    fan_in, fan_out = get_fans(shape)
    scale = np.sqrt(3. / fan_in)
    return uniform(shape, scale)


def glorot_normal(shape):
    """Glorot & Bengio, AISTATS 2010."""
    fan_in, fan_out = get_fans(shape)
    s = np.sqrt(2. / (fan_in + fan_out))
    return normal(shape, s)


def glorot_uniform(shape):
    fan_in, fan_out = get_fans(shape)
    s = np.sqrt(6. / (fan_in + fan_out))
    return uniform(shape, s)


def he_normal(shape):
    """He et al., http://arxiv.org/abs/1502.01852"""
    fan_in, fan_out = get_fans(shape)
    s = np.sqrt(2. / fan_in)
    return normal(shape, s)


def he_uniform(shape):
    fan_in, fan_out = get_fans(shape)
    s = np.sqrt(6. / fan_in)
    return uniform(shape, s)


def orthogonal(shape, scale=1.1):
    """Random orthogonal init (from Lasagne)."""
    flat_shape = (shape[0], np.prod(shape[1:]))
    a = np.random.normal(0.0, 1.0, flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == flat_shape else v  # pick the factor with the right shape
    q = q.reshape(shape)
    return sharedX(scale * q[:shape[0], :shape[1]])


def identity(shape, scale=1):
    """Scaled identity matrix; only valid for 2-D square shapes.

    Raises:
        ValueError: if ``shape`` is not a square 2-D shape.  (Was a bare
        ``Exception``; ValueError is more precise and is still caught by any
        existing ``except Exception`` handler.)
    """
    if len(shape) != 2 or shape[0] != shape[1]:
        raise ValueError("Identity matrix initialization can only be used for 2D square matrices")
    return sharedX(scale * np.identity(shape[0]))


def zero(shape):
    return shared_zeros(shape)


def one(shape):
    return shared_ones(shape)


def get(identifier):
    """Resolve an initializer by name (import deferred so this module can be
    loaded without the full emolga package on the path)."""
    from emolga.utils.generic_utils import get_from_module
    return get_from_module(identifier, globals(), 'initialization')
# emolga/basic/objectives.py -- Theano loss functions for emolga models.
from __future__ import absolute_import
import theano
import theano.tensor as T
import numpy as np
from six.moves import range

# numerical fuzz factor matched to the configured float precision
if theano.config.floatX == 'float64':
    epsilon = 1.0e-9
else:
    epsilon = 1.0e-7


def mean_squared_error(y_true, y_pred):
    return T.sqr(y_pred - y_true).mean(axis=-1)


def mean_absolute_error(y_true, y_pred):
    return T.abs_(y_pred - y_true).mean(axis=-1)


def mean_absolute_percentage_error(y_true, y_pred):
    denom = T.clip(T.abs_(y_true), epsilon, np.inf)
    return T.abs_((y_true - y_pred) / denom).mean(axis=-1) * 100.


def mean_squared_logarithmic_error(y_true, y_pred):
    log_pred = T.log(T.clip(y_pred, epsilon, np.inf) + 1.)
    log_true = T.log(T.clip(y_true, epsilon, np.inf) + 1.)
    return T.sqr(log_pred - log_true).mean(axis=-1)


def squared_hinge(y_true, y_pred):
    return T.sqr(T.maximum(1. - y_true * y_pred, 0.)).mean(axis=-1)


def hinge(y_true, y_pred):
    return T.maximum(1. - y_true * y_pred, 0.).mean(axis=-1)


def categorical_crossentropy(y_true, y_pred):
    """Expects a binary class matrix instead of a vector of scalar classes."""
    y_pred = T.clip(y_pred, epsilon, 1.0 - epsilon)
    # rescale so each sample's class probabilities sum to 1
    y_pred /= y_pred.sum(axis=-1, keepdims=True)
    return T.nnet.categorical_crossentropy(y_pred, y_true)


def binary_crossentropy(y_true, y_pred):
    y_pred = T.clip(y_pred, epsilon, 1.0 - epsilon)
    return T.nnet.binary_crossentropy(y_pred, y_true).mean(axis=-1)


def poisson_loss(y_true, y_pred):
    return T.mean(y_pred - y_true * T.log(y_pred + epsilon), axis=-1)


####################################################
# Variational Auto-encoder

def gaussian_kl_divergence(mean, ln_var):
    """KL divergence of the diagonal Gaussian N(mean, exp(ln_var)) from the
    standard Gaussian N(0, I), summed over the feature axis (axis 1)."""
    var = T.exp(ln_var)
    return 0.5 * T.sum(mean * mean + var - ln_var - 1, 1)


# aliases
mse = MSE = mean_squared_error
mae = MAE = mean_absolute_error
mape = MAPE = mean_absolute_percentage_error
msle = MSLE = mean_squared_logarithmic_error
gkl = GKL = gaussian_kl_divergence

from emolga.utils.generic_utils import get_from_module


def get(identifier):
    return get_from_module(identifier, globals(), 'objective')
# emolga/basic/optimizers.py (part 1): gradient utilities, the Optimizer base
# class, SGD and RMSprop.  Theano names (T, shared_scalar, shared_zeros,
# floatX) come from the module's top-of-file imports.
import logging

logger = logging.getLogger(__name__)


def clip_norm(g, c, n):
    """Rescale gradient ``g`` down to norm ``c`` when global norm ``n`` exceeds it."""
    if c > 0:
        g = T.switch(T.ge(n, c), g * c / n, g)
    return g


def kl_divergence(p, p_hat):
    """Element-wise KL-style sparsity penalty between ``p`` and ``p_hat``."""
    return p_hat - p + p * T.log(p / p_hat)


class Optimizer(object):
    """Base optimizer: gradient computation with optional norm clipping, and
    (shared-variable) state save/restore via the ``updates`` list."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
        self.updates = []
        self.save_parm = []

    def add(self, v):
        # register a shared variable to be serialised with the optimizer state
        self.save_parm += [v]

    def get_state(self):
        return [u[0].get_value() for u in self.updates]

    def set_state(self, value_list):
        assert len(self.updates) == len(value_list)
        for u, v in zip(self.updates, value_list):
            u[0].set_value(floatX(v))

    def get_updates(self, params, loss):
        raise NotImplementedError

    def get_gradients(self, loss, params):
        """Return d(loss)/d(params), optionally norm-clipped.

        ``loss`` may be a list; then loss[0] is differentiated with loss[1:]
        treated as constants (weighted-gradient setup).
        """
        if isinstance(loss, list):
            grads = T.grad(loss[0], params, consider_constant=loss[1:])  # gradient of loss
        else:
            grads = T.grad(loss, params)

        if hasattr(self, 'clipnorm') and self.clipnorm > 0:
            print('use gradient clipping!!')
            print('clipnorm = %f' % self.clipnorm)
            norm = T.sqrt(sum([T.sum(g ** 2) for g in grads]))
            grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
        else:
            print('not use gradient clipping!!')
        return grads

    def get_config(self):
        return {"name": self.__class__.__name__}


class SGD(Optimizer):
    """Stochastic gradient descent with momentum, lr decay and Nesterov option."""

    def __init__(self, lr=0.05, momentum=0.9, decay=0.01, nesterov=True,
                 *args, **kwargs):
        super(SGD, self).__init__(**kwargs)
        self.__dict__.update(locals())  # NOTE: also stores self.self / self.args
        self.iterations = shared_scalar(0)
        self.lr = shared_scalar(lr)
        self.momentum = shared_scalar(momentum)

    def get_updates(self, params, loss):
        grads = self.get_gradients(loss, params)
        lr = self.lr * (1.0 / (1.0 + self.decay * self.iterations))
        self.updates = [(self.iterations, self.iterations + 1.)]

        for p, g in zip(params, grads):
            m = shared_zeros(p.get_value().shape)  # momentum buffer
            v = self.momentum * m - lr * g         # velocity
            self.updates.append((m, v))

            if self.nesterov:
                new_p = p + self.momentum * v - lr * g
            else:
                new_p = p + v
            self.updates.append((p, new_p))  # apply constraints
        return self.updates

    def get_config(self):
        # BUG FIX: ``decay`` is stored as a plain float (via locals()), not a
        # shared variable, so the original ``self.decay.get_value()`` raised
        # AttributeError whenever get_config() was called.
        return {"name": self.__class__.__name__,
                "lr": float(self.lr.get_value()),
                "momentum": float(self.momentum.get_value()),
                "decay": float(self.decay),
                "nesterov": self.nesterov}


class RMSprop(Optimizer):
    """RMSprop: divide the gradient by a running RMS of recent gradients."""

    def __init__(self, lr=0.001, rho=0.9, epsilon=1e-6, *args, **kwargs):
        super(RMSprop, self).__init__(**kwargs)
        self.__dict__.update(locals())
        self.lr = shared_scalar(lr)
        self.rho = shared_scalar(rho)
        self.iterations = shared_scalar(0)

    def get_updates(self, params, loss):
        grads = self.get_gradients(loss, params)
        accumulators = [shared_zeros(p.get_value().shape) for p in params]
        self.updates = [(self.iterations, self.iterations + 1.)]

        for p, g, a in zip(params, grads, accumulators):
            new_a = self.rho * a + (1 - self.rho) * g ** 2  # update accumulator
            self.updates.append((a, new_a))
            new_p = p - self.lr * g / T.sqrt(new_a + self.epsilon)
            self.updates.append((p, new_p))  # apply constraints
        return self.updates

    def get_config(self):
        return {"name": self.__class__.__name__,
                "lr": float(self.lr.get_value()),
                "rho": float(self.rho.get_value()),
                "epsilon": self.epsilon}
class Adagrad(Optimizer):
    """Adagrad: per-parameter learning rates from accumulated squared gradients.

    NOTE(review): unlike the sibling optimizers, ``get_updates`` here takes an
    extra ``constraints`` argument (one callable per parameter) -- kept as-is
    for caller compatibility.
    """

    def __init__(self, lr=0.01, epsilon=1e-6, *args, **kwargs):
        super(Adagrad, self).__init__(**kwargs)
        self.__dict__.update(locals())
        self.lr = shared_scalar(lr)

    def get_updates(self, params, constraints, loss):
        grads = self.get_gradients(loss, params)
        accumulators = [shared_zeros(p.get_value().shape) for p in params]
        self.updates = []

        for p, g, a, c in zip(params, grads, accumulators, constraints):
            new_a = a + g ** 2  # accumulate squared gradient
            self.updates.append((a, new_a))
            new_p = p - self.lr * g / T.sqrt(new_a + self.epsilon)
            self.updates.append((p, c(new_p)))  # apply constraints
        return self.updates

    def get_config(self):
        return {"name": self.__class__.__name__,
                "lr": float(self.lr.get_value()),
                "epsilon": self.epsilon}


class Adadelta(Optimizer):
    """Adadelta.  Reference: http://arxiv.org/abs/1212.5701"""

    def __init__(self, lr=0.1, rho=0.95, epsilon=1e-6, *args, **kwargs):
        super(Adadelta, self).__init__(**kwargs)
        self.__dict__.update(locals())
        self.lr = shared_scalar(lr)
        self.iterations = shared_scalar(0)

    def get_updates(self, params, loss):
        grads = self.get_gradients(loss, params)
        accumulators = [shared_zeros(p.get_value().shape) for p in params]
        delta_accumulators = [shared_zeros(p.get_value().shape) for p in params]
        self.updates = [(self.iterations, self.iterations + 1.)]

        for p, g, a, d_a in zip(params, grads, accumulators, delta_accumulators):
            new_a = self.rho * a + (1 - self.rho) * g ** 2  # running E[g^2]
            self.updates.append((a, new_a))
            # use the new accumulator together with the *old* delta accumulator
            update = g * T.sqrt(d_a + self.epsilon) / T.sqrt(new_a + self.epsilon)
            new_p = p - self.lr * update
            self.updates.append((p, new_p))
            new_d_a = self.rho * d_a + (1 - self.rho) * update ** 2  # running E[dx^2]
            self.updates.append((d_a, new_d_a))
        return self.updates

    def get_config(self):
        return {"name": self.__class__.__name__,
                "lr": float(self.lr.get_value()),
                "rho": self.rho,
                "epsilon": self.epsilon}


class Adam(Optimizer):
    """Adam (http://arxiv.org/abs/1412.6980v8), default parameters as in the
    paper, with (currently disabled) hooks for gradient noise and parameter
    forgetting."""

    def __init__(self, lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8,
                 save=False, rng=None, *args, **kwargs):
        print('args=%s' % str(args))
        print('kwargs=%s' % str(kwargs))
        super(Adam, self).__init__(**kwargs)
        self.__dict__.update(locals())
        print(locals())

        self.iterations = shared_scalar(0, name='iteration')
        self.lr = shared_scalar(lr, name='lr')
        self.noise = []       # names of params flagged for gradient noise
        self.forget = dict()  # param name -> shadow copy used for forgetting
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.epsilon = epsilon
        self.add(self.iterations)
        self.add(self.lr)

    def add_noise(self, param):
        if param.name not in self.noise:
            logger.info('add gradient noise to {}'.format(param))
            self.noise += [param.name]

    def add_forget(self, param):
        if param.name not in self.forget:
            logger.info('add forgetting list to {}'.format(param))
            self.forget[param.name] = theano.shared(param.get_value())

    def get_updates(self, params, loss):
        grads = self.get_gradients(loss, params)
        self.updates = [(self.iterations, self.iterations + 1.)]
        self.pu = []
        t = self.iterations + 1
        lr_t = self.lr * T.sqrt(1 - self.beta_2 ** t) / (1 - self.beta_1 ** t)

        for p, g in zip(params, grads):
            m = theano.shared(p.get_value() * 0., name=p.name + '_m')  # 1st moment
            v = theano.shared(p.get_value() * 0., name=p.name + '_v')  # 2nd moment
            self.add(m)
            self.add(v)

            g_deviated = g  # gradient-noise hook disabled

            m_t = (self.beta_1 * m) + (1 - self.beta_1) * g_deviated
            v_t = (self.beta_2 * v) + (1 - self.beta_2) * (g_deviated ** 2)
            u_t = -lr_t * m_t / (T.sqrt(v_t) + self.epsilon)
            p_t = p + u_t

            self.updates.append((m, m_t))
            self.updates.append((v, v_t))
            self.updates.append((p, p_t))  # apply constraints
            self.pu.append((p, p_t - p))

        if self.save:
            return self.updates, self.pu
        return self.updates

    def get_config(self):
        config = {'lr': float(theano.tensor.cast(self.lr, dtype='float32').eval()),
                  'beta_1': float(self.beta_1),
                  'beta_2': float(self.beta_2),
                  'iterations': int(theano.tensor.cast(self.iterations, dtype='int32').eval()),
                  'noise': self.noise
                  }
        base_config = super(Adam, self).get_config()
        return_config = dict(list(base_config.items()) + list(config.items()))
        print('Getting config of optimizer: \n\t\t %s' % str(return_config))
        return return_config


# aliases
sgd = SGD
rmsprop = RMSprop
adagrad = Adagrad
adadelta = Adadelta
adam = Adam


def get(identifier, kwargs=None):
    return get_from_module(identifier, globals(), 'optimizer',
                           instantiate=True, kwargs=kwargs)
# emolga/dataset/build_dataset.py -- (de)serialisation helpers and fuel stream
# builders.  ``hickle`` and the fuel names (datasets/transformers/schemes) come
# from this module's top-of-file imports.
import json
try:                        # Python 2 / 3 compatible pickle import
    import cPickle as pickle
except ImportError:
    import pickle
from collections import OrderedDict


def serialize_to_file_json(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
    """Dump ``obj`` as JSON.  ``protocol`` is accepted only for signature
    compatibility with :func:`serialize_to_file` and is unused."""
    with open(path, 'w') as f:
        json.dump(obj, f)


def serialize_to_file_hdf5(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
    """Dump ``obj`` with hickle (HDF5).  Requires the ``hickle`` package."""
    with open(path, 'w') as f:
        hickle.dump(obj, f)


def serialize_to_file(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
    """Pickle ``obj`` to ``path`` (binary mode, given protocol)."""
    print('serialize to %s' % path)
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=protocol)


def show_txt(array, path):
    """Write each token list in ``array`` as one space-joined line."""
    with open(path, 'w') as f:
        for line in array:
            f.write(' '.join(line) + '\n')


def divide_dataset(dataset, test_size, max_size):
    """Split each source: test = first ``test_size`` rows, train = rows
    ``test_size``..``max_size``; both cast to int32."""
    train_set = dict()
    test_set = dict()
    for w in dataset:
        train_set[w] = dataset[w][test_size:max_size].astype('int32')
        test_set[w] = dataset[w][:test_size].astype('int32')
    return train_set, test_set


def deserialize_from_file_json(path):
    with open(path, 'r') as f:
        return json.load(f)


def deserialize_from_file_hdf5(path):
    with open(path, 'r') as f:
        return hickle.load(f)


def deserialize_from_file(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


def build_fuel(data):
    """Wrap ``data`` in a fuel IndexableDataset with a shuffled example scheme."""
    dataset = datasets.IndexableDataset(indexables=OrderedDict([('data', data)]))
    dataset.example_iteration_scheme \
        = schemes.ShuffledExampleScheme(dataset.num_examples)
    return dataset, len(data)


def obtain_stream(dataset, batch_size, size=1):
    """Create ``size`` batched + padded example streams over ``dataset``.

    BUG FIX: ``mask_sources`` must be a tuple of source names; the original
    ``('data')`` is just the string 'data' (which fuel iterates character by
    character), so it is now ``('data',)``.
    """
    if size == 1:
        data_stream = dataset.get_example_stream()
        data_stream = transformers.Batch(
            data_stream, iteration_scheme=schemes.ConstantScheme(batch_size))
        # add padding and masks to the dataset
        data_stream = transformers.Padding(data_stream, mask_sources=('data',))
        return data_stream
    data_streams = [dataset.get_example_stream() for _ in range(size)]
    data_streams = [transformers.Batch(ds, iteration_scheme=schemes.ConstantScheme(batch_size))
                    for ds in data_streams]
    data_streams = [transformers.Padding(ds, mask_sources=('data',))
                    for ds in data_streams]
    return data_streams


def build_ptb():
    """Index the PTB corpus and report the longest training sentence.
    (The serialization calls are intentionally left commented out.)"""
    path = './ptbcorpus/'
    print(path)

    # make the dataset and vocabulary
    with open(path + 'ptb.train.txt') as f:
        X_train = [l.split() for l in f.readlines()]
    with open(path + 'ptb.test.txt') as f:
        X_test = [l.split() for l in f.readlines()]
    with open(path + 'ptb.valid.txt') as f:
        X_valid = [l.split() for l in f.readlines()]
    X = X_train + X_test + X_valid

    idx2word = dict(enumerate(set([w for l in X for w in l]), 1))
    idx2word[0] = ''
    word2idx = {v: k for k, v in idx2word.items()}

    ixwords_train = [[word2idx[w] for w in l] for l in X_train]
    ixwords_test = [[word2idx[w] for w in l] for l in X_test]
    ixwords_valid = [[word2idx[w] for w in l] for l in X_valid]
    ixwords_tv = [[word2idx[w] for w in l] for l in (X_train + X_valid)]

    max_len = max([len(w) for w in X_train])
    print(max_len)

    # serialization:
    # serialize_to_file(ixwords_train, path + 'data_train.pkl')
    # serialize_to_file(ixwords_test, path + 'data_test.pkl')
    # serialize_to_file(ixwords_valid, path + 'data_valid.pkl')
    # serialize_to_file(ixwords_tv, path + 'data_tv.pkl')
    # serialize_to_file([idx2word, word2idx], path + 'voc.pkl')
    # show_txt(X, 'data.txt')
    print('save done.')
def filter_unk(X, min_freq=5):
    """Build word<->index maps, keeping words whose corpus count is STRICTLY
    greater than ``min_freq``.

    Index 0 and the final index both map to '' -- NOTE(review): these look
    like pad/unk markers whose angle-bracket names were lost somewhere;
    behaviour is preserved exactly (the '' entry in word2idx ends up pointing
    at the final, unknown-word index).
    """
    counts = dict()
    for sent in X:
        for tok in sent:
            counts[tok] = counts.get(tok, 0) + 1

    word2idx = {'': 0}
    id2word = {0: ''}
    nxt = 1
    for tok in counts:
        if counts[tok] > min_freq:
            word2idx[tok] = nxt
            id2word[nxt] = tok
            nxt += 1
    # unknown-word slot (overwrites the '' -> 0 entry in word2idx)
    word2idx[''] = nxt
    id2word[nxt] = ''
    return word2idx, id2word


def build_msr():
    """Index the MSRSCC training corpus and pickle the id sequences."""
    # path = '/home/thoma/Work/Dial-DRL/dataset/MSRSCC/'
    path = '/Users/jiataogu/Work/Dial-DRL/dataset/MSRSCC/'
    print(path)

    X = [l.split() for l in open(path + 'train.txt').readlines()]
    word2idx, idx2word = filter_unk(X, min_freq=5)
    print('vocabulary size={0}. {1} samples'.format(len(word2idx), len(X)))

    mean_len = np.mean([len(w) for w in X])
    print('mean len = {}'.format(mean_len))

    # out-of-vocabulary words map to the unknown slot (word2idx[''])
    ixwords = [[word2idx[w] if w in word2idx else word2idx[''] for w in l] for l in X]
    print(ixwords[0])

    # serialization:
    serialize_to_file(ixwords, path + 'data_train.pkl')


if __name__ == '__main__':
    build_msr()
    # build_ptb()
    # build_dataset()
    # game = GuessOrder(size=8)
    # q = 'Is there any number smaller de than 6 in the last 3 numbers ?'
    # print(game.easy_parse(q))
"""
Attention Model.

<::: Two kinds of attention models ::::>
    -- Linear Transformation
    -- Inner Product
"""


class Attention(Layer):
    """Additive (Bahdanau-style) attention, optionally with a coverage term."""

    def __init__(self, target_dim, source_dim, hidden_dim,
                 init='glorot_uniform', name='attention',
                 coverage=False, max_len=50, shared=False):
        super(Attention, self).__init__()
        self.init = initializations.get(init)
        self.softmax = activations.get('softmax')
        self.tanh = activations.get('tanh')
        self.target_dim = target_dim
        self.source_dim = source_dim
        self.hidden_dim = hidden_dim
        self.max_len = max_len
        self.coverage = coverage
        if coverage:
            print('Use Coverage Trick!')

        self.Wa = self.init((self.target_dim, self.hidden_dim))
        self.Ua = self.init((self.source_dim, self.hidden_dim))
        self.va = self.init((self.hidden_dim, 1))
        self.Wa.name, self.Ua.name, self.va.name = \
            '{}_Wa'.format(name), '{}_Ua'.format(name), '{}_va'.format(name)
        self.params = [self.Wa, self.Ua, self.va]

        if coverage:
            self.Ca = self.init((1, self.hidden_dim))
            self.Ca.name = '{}_Ca'.format(name)
            self.params += [self.Ca]

    def __call__(self, X, S, Smask=None, return_log=False, Cov=None):
        # X    : decoder state at t-1, (nb_samples, target_dim)
        # S    : source annotations, (nb_samples, maxlen_s, source_dim)
        # Smask: (nb_samples, maxlen_s); 1 for real tokens, 0 for padding
        # Cov  : coverage vector, (nb_samples, maxlen_s)
        assert X.ndim + 1 == S.ndim, 'source should be one more dimension than target.'
        if X.ndim == 1:
            # single-sample call: prepend a batch axis everywhere.
            # BUG FIX: the original read ``if not Smask: Smask = Smask[None, :]``,
            # which fails both when Smask is None (None is not indexable) and
            # when it is a symbolic tensor (no Python truth value); the intent
            # is to expand an *existing* mask.
            X = X[None, :]
            S = S[None, :, :]
            if Smask is not None:
                Smask = Smask[None, :]

        # Concat Attention by Bahdanau et al. 2015 (nb_samples, source_num, hidden_dims)
        Eng = dot(X[:, None, :], self.Wa) + dot(S, self.Ua)
        Eng = self.tanh(Eng)

        # location aware: add previous coverage so the model can learn to handle it
        if self.coverage:
            Eng += dot(Cov[:, :, None], self.Ca)  # (nb_samples, source_num, hidden_dims)

        Eng = dot(Eng, self.va)
        Eng = Eng[:, :, 0]  # 3rd dim is 1, discard it (nb_samples, source_num)

        if Smask is not None:
            # masked (log-)softmax, computed stably through logSumExp
            EngSum = logSumExp(Eng, axis=1, mask=Smask)
            if return_log:
                return (Eng - EngSum) * Smask
            return T.exp(Eng - EngSum) * Smask
        if return_log:
            return T.log(self.softmax(Eng))
        return self.softmax(Eng)


class CosineAttention(Layer):
    """Dot-product attention with an optional key projection ("pipe")."""

    def __init__(self, target_dim, source_dim, init='glorot_uniform',
                 use_pipe=True, name='attention'):
        super(CosineAttention, self).__init__()
        self.init = initializations.get(init)
        self.softmax = activations.get('softmax')
        self.softplus = activations.get('softplus')
        self.tanh = activations.get('tanh')
        self.use_pipe = use_pipe
        self.target_dim = target_dim
        self.source_dim = source_dim

        # pipe: project the target into the source space (identity if dims match)
        if self.use_pipe:
            self.W_key = Dense(self.target_dim, self.source_dim, name='W_key')
        else:
            assert target_dim == source_dim
            self.W_key = Identity(name='W_key')
        self._add(self.W_key)

    def __call__(self, X, S, Smask=None, return_log=False):
        assert X.ndim + 1 == S.ndim, 'source should be one more dimension than target.'
        if X.ndim == 1:
            # BUG FIX: same broken mask expansion as in Attention.__call__;
            # only an existing mask can be expanded.
            X = X[None, :]
            S = S[None, :, :]
            if Smask is not None:
                Smask = Smask[None, :]

        key = self.W_key(X)   # (nb_samples, source_dim)
        Eng = dot_2d(key, S)  # (nb_samples, source_num)

        if Smask is not None:
            EngSum = logSumExp(Eng, axis=1, mask=Smask)
            if return_log:
                return (Eng - EngSum) * Smask
            return T.exp(Eng - EngSum) * Smask
        if return_log:
            return T.log(self.softmax(Eng))
        return self.softmax(Eng)
# emolga/layers/core.py (part): base Layer machinery and the dense layers.
class Layer(object):
    """Base class for all emolga layers: parameter bookkeeping, naming,
    weight (de)serialisation and monitoring hooks."""

    def __init__(self):
        self.params = []
        self.layers = []
        self.monitor = {}
        self.watchlist = []

    def init_updates(self):
        self.updates = []

    def _monitoring(self):
        # pull the monitored variables of sub-layers into this layer's table,
        # tagged "<var>@<sublayer name>"
        for sub in self.layers:
            for key in sub.monitor:
                tag = key + '@' + sub.name
                print(tag)
                self.monitor[tag] = sub.monitor[key]

    def __call__(self, X, *args, **kwargs):
        return X

    def _add(self, layer):
        if layer:
            self.layers.append(layer)
            self.params += layer.params

    def supports_masked_input(self):
        """Whether this layer honours the output mask of its predecessor."""
        return False

    def get_output_mask(self, train=None):
        """Return the output mask, or None when there is no masking.

        The mask has one dimension less than the output: for an output of
        (nb_samples, nb_timesteps, nb_dims) the mask is
        (nb_samples, nb_timesteps), with 1 for unmasked and 0 for masked
        datapoints.
        """
        return None

    def set_weights(self, weights):
        for param, w in zip(self.params, weights):
            if param.eval().shape != w.shape:
                raise Exception("Layer shape %s not compatible with weight shape %s." % (param.eval().shape, w.shape))
            param.set_value(floatX(w))

    def get_weights(self):
        return [param.get_value() for param in self.params]

    def get_params(self):
        return self.params

    def set_name(self, name):
        # give anonymous params positional names, prefix named ones
        for i in range(len(self.params)):
            if self.params[i].name is None:
                self.params[i].name = '%s_p%d' % (name, i)
            else:
                self.params[i].name = name + '_' + self.params[i].name
        self.name = name


class MaskedLayer(Layer):
    """Subclass this (instead of Layer) when the layer trivially supports
    masking by copying the input mask through to the output."""

    def supports_masked_input(self):
        return True


class Identity(Layer):
    """Pass-through layer."""

    def __init__(self, name='Identity'):
        super(Identity, self).__init__()
        if name is not None:
            self.set_name(name)

    def __call__(self, X):
        return X


class Dense(Layer):
    """Fully-connected layer: activation(X . W + b)."""

    def __init__(self, input_dim, output_dim, init='glorot_uniform',
                 activation='tanh', name='Dense', learn_bias=True,
                 negative_bias=False):
        super(Dense, self).__init__()
        self.init = initializations.get(init)
        self.activation = activations.get(activation)
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.linear = (activation == 'linear')

        self.W = self.init((self.input_dim, self.output_dim))
        if not negative_bias:
            self.b = shared_zeros((self.output_dim))
        else:
            self.b = shared_ones((self.output_dim))

        self.learn_bias = learn_bias
        self.params = [self.W, self.b] if self.learn_bias else [self.W]
        if name is not None:
            self.set_name(name)

    def set_name(self, name):
        self.W.name = '%s_W' % name
        self.b.name = '%s_b' % name

    def __call__(self, X):
        # (an earlier variant scaled the bias by 4.0; the plain bias is used now)
        return self.activation(T.dot(X, self.W) + self.b)

    def reverse(self, Y):
        """Invert a *linear* layer: (Y - b) . W^T."""
        assert self.linear
        return T.dot((Y - self.b), self.W.T)


class Dense2(Layer):
    """Two-input dense layer: activation(X1 . W1 + X2 . W2 + b)."""

    def __init__(self, input_dim1, input_dim2, output_dim,
                 init='glorot_uniform', activation='tanh', name='Dense',
                 learn_bias=True):
        super(Dense2, self).__init__()
        self.init = initializations.get(init)
        self.activation = activations.get(activation)
        self.input_dim1 = input_dim1
        self.input_dim2 = input_dim2
        self.output_dim = output_dim
        self.linear = (activation == 'linear')

        self.W1 = self.init((self.input_dim1, self.output_dim))
        self.W2 = self.init((self.input_dim2, self.output_dim))
        self.b = shared_zeros((self.output_dim))

        self.learn_bias = learn_bias
        self.params = [self.W1, self.W2, self.b] if self.learn_bias else [self.W1, self.W2]
        if name is not None:
            self.set_name(name)

    def set_name(self, name):
        self.W1.name = '%s_W1' % name
        self.W2.name = '%s_W2' % name
        self.b.name = '%s_b' % name

    def __call__(self, X1, X2):
        return self.activation(T.dot(X1, self.W1) + T.dot(X2, self.W2) + self.b)
output_dim, 'Bias Layer needs to have the same input/output nodes.' self.init = initializations.get(init) self.activation = activations.get(activation) self.input_dim = input_dim self.output_dim = output_dim self.b = shared_zeros(self.output_dim) self.params = [self.b] if name is not None: self.set_name(name) def set_name(self, name): self.b.name = '%s_b' % name def __call__(self, X=None): output = self.activation(self.b) if X: L = X.shape[0] output = T.extra_ops.repeat(output[None, :], L, axis=0) return output class MemoryLinear(Layer): def __init__(self, input_dim, input_wdth, init='glorot_uniform', activation='tanh', name='Bias', has_input=True): super(MemoryLinear, self).__init__() self.init = initializations.get(init) self.activation = activations.get(activation) self.input_dim = input_dim self.input_wdth = input_wdth self.b = self.init((self.input_dim, self.input_wdth)) self.params = [self.b] if has_input: self.P = self.init((self.input_dim, self.input_wdth)) self.params += [self.P] if name is not None: self.set_name(name) def __call__(self, X=None): out = self.b[None, :, :] if X: out += self.P[None, :, :] * X return self.activation(out) class Dropout(MaskedLayer): """ Hinton's dropout. """ def __init__(self, rng=None, p=1., name=None): super(Dropout, self).__init__() self.p = p self.rng = rng def __call__(self, X, train=True): if self.p > 0.: retain_prob = 1. - self.p if train: X *= self.rng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX) else: X *= retain_prob return X class Activation(MaskedLayer): """ Apply an activation function to an output. 
""" def __init__(self, activation): super(Activation, self).__init__() self.activation = activations.get(activation) def __call__(self, X): return self.activation(X) ================================================ FILE: emolga/layers/embeddings.py ================================================ # -*- coding: utf-8 -*- from .core import Layer from emolga.utils.theano_utils import * import emolga.basic.initializations as initializations class Embedding(Layer): ''' Turn positive integers (indexes) into denses vectors of fixed size. e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]] @input_dim: size of vocabulary (highest input integer + 1) @out_dim: size of dense representation ''' def __init__(self, input_dim, output_dim, init='uniform', name=None): super(Embedding, self).__init__() self.init = initializations.get(init) self.input_dim = input_dim self.output_dim = output_dim self.W = self.init((self.input_dim, self.output_dim)) self.params = [self.W] if name is not None: self.set_name(name) def get_output_mask(self, X): ''' T.ones_like(X): Return an array of ones with shape and type of input. T.eq(X, 0): X==0? 1 - T.eq(X, 0): X!=0? 
:return an array shows that which x!=0 ''' return T.ones_like(X) * (1 - T.eq(X, 0)) def __call__(self, X, mask_zero=False, context=None): ''' return the embedding of X :param X: a set of words, all the X have same length due to padding shape=[nb_sample, max_len] :param mask_zero: whether return the mask of X, a list of [0,1] showing which x!=0 :param context: :return emb_X: embedding of X, shape = [nb_sample, max_len, emb_dim] X_mask: mask of X, shape=[nb_sample, max_len] ''' if context is None: out = self.W[X] else: assert context.ndim == 3 flag = False if X.ndim == 1: flag = True X = X[:, None] b_size = context.shape[0] EMB = T.repeat(self.W[None, :, :], b_size, axis=0) EMB = T.concatenate([EMB, context], axis=1) m_size = EMB.shape[1] e_size = EMB.shape[2] maxlen = X.shape[1] EMB = EMB.reshape((b_size * m_size, e_size)) Z = (T.arange(b_size)[:, None] * m_size + X).reshape((b_size * maxlen,)) out = EMB[Z] # (b_size * maxlen, e_size) if not flag: out = out.reshape((b_size, maxlen, e_size)) else: out = out.reshape((b_size, e_size)) if mask_zero: return out, T.cast(self.get_output_mask(X), dtype='float32') else: return out class Zero(Layer): def __call__(self, X): out = T.zeros(X.shape) return out class Bias(Layer): def __call__(self, X): tmp = X.flatten() tmp = tmp.dimshuffle(0, 'x') return tmp ================================================ FILE: emolga/layers/gridlstm.py ================================================ __author__ = 'jiataogu' """ The file is the implementation of Grid-LSTM In this stage we only support 2D LSTM with Pooling. 
""" from recurrent import * from attention import Attention import logging import copy logger = logging.getLogger(__name__) class Grid(Recurrent): """ Grid Cell for Grid-LSTM =================================================== LSTM [h', m'] = LSTM(x, h, m): gi = sigmoid(Wi * x + Ui * h + Vi * m) # Vi is peep-hole gf = sigmoid(Wf * x + Uf * h + Vf * m) go = sigmoid(Wo * x + Uo * h + Vo * m) gc = tanh(Wc * x +Uc * h) m' = gf @ m + gi @ gc (@ represents element-wise dot.) h' = go @ tanh(m') =================================================== Grid (here is an example for 2D Grid LSTM with priority dimension = 1) ------------- | c' d' | Grid Block and Grid Updates. | a a'| | | [d' c'] = LSTM_d([b, d], c) | b b'| [a' b'] = LSTM_t([b, d'], a) | c d | ------------- =================================================== Details please refer to: "Grid Long Short-Term Memory", http://arxiv.org/abs/1507.01526 """ def __init__(self, output_dims, input_dims, # [0, ... 0], 0 represents no external inputs. priority=1, peephole=True, init='glorot_uniform', inner_init='orthogonal', forget_bias_init='one', activation='tanh', inner_activation='sigmoid', use_input=False, name=None, weights=None, identity_connect=None ): super(Grid, self).__init__() assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM' assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.' """ Initialization. """ self.input_dims = input_dims self.output_dims = output_dims self.N = len(output_dims) self.priority = priority self.peephole = peephole self.use_input = use_input self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.forget_bias_init = initializations.get(forget_bias_init) self.activation = activations.get(activation) self.inner_activation = activations.get(inner_activation) self.identity_connect = identity_connect self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now! """ Others info. 
""" if weights is not None: self.set_weights(weights) if name is not None: self.set_name(name) def build(self): """ Build the model weights """ logger.info("Building GridPool-LSTM !!") self.W = dict() self.U = dict() self.V = dict() self.b = dict() # ****************************************************************************************** for k in xrange(self.N): # N-Grids (for 2 dimensions, 0 is for time; 1 is for depth.) axis = self.axies[k] # input layers: if self.input_dims[k] > 0 and self.use_input: # use the data information. self.W[axis + '#i'], self.W[axis + '#f'], \ self.W[axis + '#o'], self.W[axis + '#c'] \ = [self.init((self.input_dims[k], self.output_dims[k])) for _ in xrange(4)] # hidden layers: for j in xrange(self.N): # every hidden states inputs. pos = self.axies[j] if k == j: self.U[axis + pos + '#i'], self.U[axis + pos + '#f'], \ self.U[axis + pos + '#o'], self.U[axis + pos + '#c'] \ = [self.inner_init((self.output_dims[j], self.output_dims[k])) for _ in xrange(4)] else: self.U[axis + pos + '#i'], self.U[axis + pos + '#f'], \ self.U[axis + pos + '#o'], self.U[axis + pos + '#c'] \ = [self.init((self.output_dims[j], self.output_dims[k])) for _ in xrange(4)] # bias layers: self.b[axis + '#i'], self.b[axis + '#o'], self.b[axis + '#c'] \ = [shared_zeros(self.output_dims[k]) for _ in xrange(3)] self.b[axis + '#f'] = self.forget_bias_init(self.output_dims[k]) # peep-hole layers: if self.peephole: self.V[axis + '#i'], self.V[axis + '#f'], self.V[axis + '#o'] \ = [self.init(self.output_dims[k]) for _ in xrange(3)] # ***************************************************************************************** # set names for these weights for A, n in zip([self.W, self.U, self.b, self.V], ['W', 'U', 'b', 'V']): for w in A: A[w].name = n + '_' + w # set parameters self.params = [self.W[s] for s in self.W] + \ [self.U[s] for s in self.U] + \ [self.b[s] for s in self.b] + \ [self.V[s] for s in self.V] def lstm_(self, k, H, m, x, identity=False): """ LSTM [h', m'] 
= LSTM(x, h, m): gi = sigmoid(Wi * x + Ui * h + Vi * m) # Vi is peep-hole gf = sigmoid(Wf * x + Uf * h + Vf * m) go = sigmoid(Wo * x + Uo * h + Vo * m) gc = tanh(Wc * x +Uc * h) m' = gf @ m + gi @ gc (@ represents element-wise dot.) h' = go @ tanh(m') """ assert len(H) == self.N, 'we have to use all the hidden states in Grid LSTM' axis = self.axies[k] # ************************************************************************* # bias energy ei, ef, eo, ec = [self.b[axis + p] for p in ['#i', '#f', '#o', '#c']] # hidden energy for j in xrange(self.N): pos = self.axies[j] ei += T.dot(H[j], self.U[axis + pos + '#i']) ef += T.dot(H[j], self.U[axis + pos + '#f']) eo += T.dot(H[j], self.U[axis + pos + '#o']) ec += T.dot(H[j], self.U[axis + pos + '#c']) # input energy (if any) if self.input_dims[k] > 0 and self.use_input: ei += T.dot(x, self.W[axis + '#i']) ef += T.dot(x, self.W[axis + '#f']) eo += T.dot(x, self.W[axis + '#o']) ec += T.dot(x, self.W[axis + '#c']) # peep-hole connections if self.peephole: ei += m * self.V[axis + '#i'][None, :] ef += m * self.V[axis + '#f'][None, :] eo += m * self.V[axis + '#o'][None, :] # ************************************************************************* # compute the gates. i = self.inner_activation(ei) f = self.inner_activation(ef) o = self.inner_activation(eo) c = self.activation(ec) # update the memory and hidden states. m_new = f * m + i * c h_new = o * self.activation(m_new) return h_new, m_new def grid_(self, hs_i, ms_i, xs_i, priority=1, identity=None): """ =================================================== Grid (2D as an example) ------------- | c' d' | Grid Block and Grid Updates. | a a'| | | [d' c'] = LSTM_d([b, d], c) | b b'| [a' b'] = LSTM_t([b, d'], a) priority | c d | ------------- a = my | b = hy | c = mx | d = hx =================================================== Currently masking is not considered in GridLSTM. 
""" # compute LSTM updates for non-priority dimensions H_new = hs_i M_new = ms_i for k in xrange(self.N): if k == priority: continue m = ms_i[k] x = xs_i[k] H_new[k], M_new[k] \ = self.lstm_(k, hs_i, m, x) if identity is not None: if identity[k]: H_new[k] += hs_i[k] # compute LSTM updates along the priority dimension if priority >= 0: hs_ii = H_new H_new[priority], M_new[priority] \ = self.lstm_(priority, hs_ii, ms_i[priority], xs_i[priority]) if identity is not None: if identity[priority]: H_new[priority] += hs_ii[priority] return H_new, M_new class SequentialGridLSTM(Grid): """ Details please refer to: "Grid Long Short-Term Memory", http://arxiv.org/abs/1507.01526 SequentialGridLSTM is a typical 2D-GridLSTM, which has one flexible dimension (time) and one fixed dimension (depth) Input information is added along x-axis. """ def __init__(self, # parameters for Grid. output_dims, input_dims, # [0, ... 0], 0 represents no external inputs. priority=1, peephole=True, init='glorot_uniform', inner_init='orthogonal', forget_bias_init='one', activation='tanh', inner_activation='sigmoid', use_input=False, name=None, weights=None, identity_connect=None, # parameters for 2D-GridLSTM depth=5, learn_init=False, pooling=True, attention=False, shared=True, dropout=0, rng=None, ): super(Grid, self).__init__() assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM' assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.' assert input_dims[1] == 0, 'we have no y-axis inputs here.' assert shared, 'we share the weights in this stage.' assert not (attention and pooling), 'attention and pooling cannot be set at the same time.' """ Initialization. 
""" logger.info(":::: Sequential Grid-Pool LSTM ::::") self.input_dims = input_dims self.output_dims = output_dims self.N = len(output_dims) self.depth = depth self.dropout = dropout self.priority = priority self.peephole = peephole self.use_input = use_input self.pooling = pooling self.attention = attention self.learn_init = learn_init self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.forget_bias_init = initializations.get(forget_bias_init) self.activation = activations.get(activation) self.relu = activations.get('relu') self.inner_activation = activations.get(inner_activation) self.identity_connect = identity_connect self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now! if self.identity_connect is not None: logger.info('Identity Connection: {}'.format(self.identity_connect)) """ Build the model weights. """ # build the centroid grid. self.build() # input projection layer (projected to time-axis) [x] self.Ph = Dense(input_dims[0], output_dims[0], name='Ph') self.Pm = Dense(input_dims[0], output_dims[0], name='Pm') self._add(self.Ph) self._add(self.Pm) # learn init for depth-axis hidden states/memory cells. [y] if self.learn_init: self.M0 = self.init((depth, output_dims[1])) if self.pooling: self.H0 = self.init(output_dims[1]) else: self.H0 = self.init((depth, output_dims[1])) self.M0.name, self.H0.name = 'M0', 'H0' self.params += [self.M0, self.H0] # if we use attention instead of max-pooling if self.pooling: self.PP = Dense(output_dims[1] + input_dims[0], output_dims[1], # init='orthogonal', name='PP', activation='linear') self._add(self.PP) if self.attention: self.A = Attention(target_dim=input_dims[0], source_dim=output_dims[1], hidden_dim=200, name='attender') self._add(self.A) # if self.dropout > 0: # logger.info(">>>>>> USE DropOut !! <<<<<<") # self.D = Dropout(rng=rng, p=self.dropout, name='Dropout') """ Others info. 
""" if weights is not None: self.set_weights(weights) if name is not None: self.set_name(name) def _step(self, *args): # since depth is not determined, we cannot decide the number of inputs # for one time step. # if pooling is True: # args = [raw_input] + (sequence) # [hy] + [my]*depth (output_info) # inputs = args[0] Hy_tm1 = [args[k] for k in range(1, 1 + self.depth)] My_tm1 = [args[k] for k in range(1 + self.depth, 1 + 2 * self.depth)] # x_axis input projection (get hx_t, mx_t) hx_t = self.Ph(inputs) # (nb_samples, output_dim0) mx_t = self.Pm(inputs) # (nb_samples, output_dim0) # build computation path from bottom to top. Hx_t = [hx_t] Mx_t = [mx_t] Hy_t = [] My_t = [] for d in xrange(self.depth): hs_i = [Hx_t[-1], Hy_tm1[d]] ms_i = [Mx_t[-1], My_tm1[d]] xs_i = [inputs, T.zeros_like(inputs)] hs_o, ms_o = self.grid_(hs_i, ms_i, xs_i, priority=self.priority, identity=self.identity_connect) Hx_t += [hs_o[0]] Hy_t += [hs_o[1]] Mx_t += [ms_o[0]] My_t += [ms_o[1]] hx_out = Hx_t[-1] mx_out = Mx_t[-1] # get the output (output_y, output_x) # MAX-Pooling if self.pooling: # hy_t = T.max([self.PP(hy) for hy in Hy_t], axis=0) hy_t = T.max([self.PP(T.concatenate([hy, inputs], axis=-1)) for hy in Hy_t], axis=0) Hy_t = [hy_t] * self.depth if self.attention: HHy_t = T.concatenate([hy[:, None, :] for hy in Hy_t], axis=1) # (nb_samples, n_depth, out_dim1) annotation = self.A(inputs, HHy_t) # (nb_samples, n_depth) hy_t = T.sum(HHy_t * annotation[:, :, None], axis=1) # (nb_samples, out_dim1) Hy_t = [hy_t] * self.depth R = Hy_t + My_t + [hx_out, mx_out] return tuple(R) def __call__(self, X, init_H=None, init_M=None, return_sequence=False, one_step=False, return_info='hy', train=True): # It is training/testing path self.train = train # recently we did not support masking. if X.ndim == 2: X = X[:, None, :] # one step if one_step: assert init_H is not None, 'previous state must be provided!' assert init_M is not None, 'previous cell must be provided!' 
X = X.dimshuffle((1, 0, 2)) if init_H is None: if self.learn_init: init_m = T.repeat(self.M0[:, None, :], X.shape[1], axis=1) if self.pooling: init_h = T.repeat(self.H0[None, :], self.depth, axis=0) else: init_h = self.H0 init_h = T.repeat(init_h[:, None, :], X.shape[1], axis=1) init_H = [] init_M = [] for j in xrange(self.depth): init_H.append(init_h[j]) init_M.append(init_m[j]) else: init_H = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * self.depth init_M = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * self.depth pass # computational graph ! if not one_step: sequences = [X] outputs_info = init_H + init_M + [None, None] outputs, _ = theano.scan( self._step, sequences=sequences, outputs_info=outputs_info ) else: outputs = self._step(*([X[0]] + init_H + init_M)) if return_info == 'hx': if return_sequence: return outputs[0].dimshuffle((1, 0, 2)) return outputs[-2][-1] elif return_info == 'hy': assert self.pooling or self.attention, 'y-axis hidden states are only used in the ``Pooling Mode".' if return_sequence: return outputs[2].dimshuffle((1, 0, 2)) return outputs[2][-1] elif return_info == 'hxhy': assert self.pooling or self.attention, 'y-axis hidden states are only used in the ``Pooling Mode".' if return_sequence: return outputs[-2].dimshuffle((1, 0, 2)), outputs[2].dimshuffle((1, 0, 2)) # x-y return outputs[-2][-1], outputs[2][-1] class PyramidGridLSTM2D(Grid): """ A variant version of Sequential LSTM where we introduce a Pyramid structure. """ def __init__(self, # parameters for Grid. output_dims, input_dims, # [0, ... 0], 0 represents no external inputs. 
priority=1, peephole=True, init='glorot_uniform', inner_init='orthogonal', forget_bias_init='one', activation='tanh', inner_activation='sigmoid', use_input=True, name=None, weights=None, identity_connect=None, # parameters for 2D-GridLSTM depth=5, learn_init=False, shared=True, dropout=0 ): super(Grid, self).__init__() assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM' assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.' assert output_dims[0] == output_dims[1], 'Here we only support square model.' assert shared, 'we share the weights in this stage.' assert use_input, 'use input and add them in the middle' """ Initialization. """ logger.info(":::: Sequential Grid-Pool LSTM ::::") self.input_dims = input_dims self.output_dims = output_dims self.N = len(output_dims) self.depth = depth self.dropout = dropout self.priority = priority self.peephole = peephole self.use_input = use_input self.learn_init = learn_init self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.forget_bias_init = initializations.get(forget_bias_init) self.activation = activations.get(activation) self.relu = activations.get('relu') self.inner_activation = activations.get(inner_activation) self.identity_connect = identity_connect self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now! """ Build the model weights. """ # build the centroid grid. 
self.build() # # input projection layer (projected to time-axis) [x] # self.Ph = Dense(input_dims[0], output_dims[0], name='Ph') # self.Pm = Dense(input_dims[0], output_dims[0], name='Pm') # # self._add(self.Ph) # self._add(self.Pm) # learn init/ if self.learn_init: self.hx0 = self.init((1, output_dims[0])) self.hy0 = self.init((1, output_dims[1])) self.mx0 = self.init((1, output_dims[0])) self.my0 = self.init((1, output_dims[1])) self.hx0.name, self.hy0.name = 'hx0', 'hy0' self.mx0.name, self.my0.name = 'mx0', 'my0' self.params += [self.hx0, self.hy0, self.mx0, self.my0] """ Others info. """ if weights is not None: self.set_weights(weights) if name is not None: self.set_name(name) def _step(self, *args): inputs = args[0] hx_tm1 = args[1] mx_tm1 = args[2] hy_tm1 = args[3] my_tm1 = args[4] # zero constant inputs. pre_info = [[[T.zeros_like(hx_tm1) for _ in xrange(self.depth)] for _ in xrange(self.depth)] for _ in xrange(4)] # hx, mx, hy, my pre_inputs = [[T.zeros_like(inputs) for _ in xrange(self.depth)] for _ in xrange(self.depth)] for kk in xrange(self.depth): pre_inputs[kk][kk] = inputs pre_info[0][0][0] = hx_tm1 pre_info[1][0][0] = mx_tm1 pre_info[2][0][0] = hy_tm1 pre_info[3][0][0] = my_tm1 for step_x in xrange(self.depth): for step_y in xrange(self.depth): # input hidden/memory/input information print pre_info[0][-1][-1], pre_info[2][-1][-1] hs_i = [pre_info[0][step_x][step_y], pre_info[2][step_x][step_y]] ms_i = [pre_info[1][step_x][step_y], pre_info[3][step_x][step_y]] xs_i = [pre_inputs[step_x][step_y], pre_inputs[step_x][step_y]] # compute grid-lstm hs_o, ms_o = self.grid_(hs_i, ms_i, xs_i, priority =-1) # output hidden/memory information if (step_x == self.depth - 1) and (step_y == self.depth - 1): hx_t, mx_t, hy_t, my_t = hs_o[0], ms_o[0], hs_o[1], ms_o[1] return hx_t, mx_t, hy_t, my_t if step_x + 1 < self.depth: pre_info[0][step_x + 1][step_y] = hs_o[0] pre_info[1][step_x + 1][step_y] = ms_o[0] if step_y + 1 < self.depth: pre_info[2][step_x][step_y + 1] 
= hs_o[1] pre_info[3][step_x][step_y + 1] = ms_o[1] def __call__(self, X, init_x=None, init_y=None, return_sequence=False, one_step=False): # recently we did not support masking. if X.ndim == 2: X = X[:, None, :] # one step if one_step: assert init_x is not None, 'previous x must be provided!' assert init_y is not None, 'previous y must be provided!' X = X.dimshuffle((1, 0, 2)) if init_x is None: if self.learn_init: init_mx = T.repeat(self.mx0, X.shape[1], axis=0) init_my = T.repeat(self.my0, X.shape[1], axis=0) init_hx = T.repeat(self.hx0, X.shape[1], axis=0) init_hy = T.repeat(self.hy0, X.shape[1], axis=0) init_input = [init_hx, init_mx, init_hy, init_my] else: init_x = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[0]), 1)] * 2 init_y = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * 2 init_input = init_x + init_y else: init_input = init_x + init_y if not one_step: sequence = [X] output_info = init_input outputs, _ = theano.scan( self._step, sequences=sequence, outputs_info=output_info ) else: outputs = self._step(*([X[0]] + init_x + init_y)) if return_sequence: hxs = outputs[0].dimshuffle((1, 0, 2)) hys = outputs[2].dimshuffle((1, 0, 2)) hs = T.concatenate([hxs, hys], axis=-1) return hs else: hx = outputs[0][-1] hy = outputs[2][-1] h = T.concatenate([hx, hy], axis=-1) return h class PyramidLSTM(Layer): """ A more flexible Pyramid LSTM structure! """ def __init__(self, # parameters for Grid. output_dims, input_dims, # [0, ... 0], 0 represents no external inputs. 
priority=1, peephole=True, init='glorot_uniform', inner_init='orthogonal', forget_bias_init='one', activation='tanh', inner_activation='sigmoid', use_input=True, name=None, weights=None, identity_connect=None, # parameters for 2D-GridLSTM depth=5, learn_init=False, shared=True, dropout=0 ): super(PyramidLSTM, self).__init__() assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM' assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.' assert output_dims[0] == output_dims[1], 'Here we only support square model.' assert shared, 'we share the weights in this stage.' assert use_input, 'use input and add them in the middle' """ Initialization. """ logger.info(":::: Sequential Grid-Pool LSTM ::::") self.N = len(output_dims) self.depth = depth self.dropout = dropout self.priority = priority self.peephole = peephole self.use_input = use_input self.learn_init = learn_init self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.forget_bias_init = initializations.get(forget_bias_init) self.activation = activations.get(activation) self.relu = activations.get('relu') self.inner_activation = activations.get(inner_activation) self.identity_connect = identity_connect self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now! """ Build the model weights. 
""" # build the centroid grid (3 grid versions) self.grids = [Grid(output_dims, input_dims, -1, peephole, init, inner_init, forget_bias_init, activation, inner_activation, use_input, name='Grid*{}'.format(k) ) for k in xrange(3)] for k in xrange(3): self.grids[k].build() self._add(self.grids[k]) # # input projection layer (projected to time-axis) [x] # self.Ph = Dense(input_dims[0], output_dims[0], name='Ph') # self.Pm = Dense(input_dims[0], output_dims[0], name='Pm') # # self._add(self.Ph) # self._add(self.Pm) # learn init/ if self.learn_init: self.hx0 = self.init((1, output_dims[0])) self.hy0 = self.init((1, output_dims[1])) self.mx0 = self.init((1, output_dims[0])) self.my0 = self.init((1, output_dims[1])) self.hx0.name, self.hy0.name = 'hx0', 'hy0' self.mx0.name, self.my0.name = 'mx0', 'my0' self.params += [self.hx0, self.hy0, self.mx0, self.my0] """ Others info. """ if weights is not None: self.set_weights(weights) if name is not None: self.set_name(name) def _step(self, *args): inputs = args[0] hx_tm1 = args[1] mx_tm1 = args[2] hy_tm1 = args[3] my_tm1 = args[4] # zero constant inputs. 
pre_info = [[[T.zeros_like(hx_tm1) for _ in xrange(self.depth)] for _ in xrange(self.depth)] for _ in xrange(4)] # hx, mx, hy, my pre_inputs = [[T.zeros_like(inputs) for _ in xrange(self.depth)] for _ in xrange(self.depth)] for kk in xrange(self.depth): pre_inputs[kk][kk] = inputs pre_info[0][0][0] = hx_tm1 pre_info[1][0][0] = mx_tm1 pre_info[2][0][0] = hy_tm1 pre_info[3][0][0] = my_tm1 for step_x in xrange(self.depth): for step_y in xrange(self.depth): # input hidden/memory/input information print pre_info[0][-1][-1], pre_info[2][-1][-1] hs_i = [pre_info[0][step_x][step_y], pre_info[2][step_x][step_y]] ms_i = [pre_info[1][step_x][step_y], pre_info[3][step_x][step_y]] xs_i = [pre_inputs[step_x][step_y], pre_inputs[step_x][step_y]] # compute grid-lstm if (step_x + step_y + 1) < self.depth: hs_o, ms_o = self.grids[0].grid_(hs_i, ms_i, xs_i, priority =-1) elif (step_x + step_y + 1) == self.depth: hs_o, ms_o = self.grids[1].grid_(hs_i, ms_i, xs_i, priority =-1) else: hs_o, ms_o = self.grids[2].grid_(hs_i, ms_i, xs_i, priority =-1) # output hidden/memory information if (step_x == self.depth - 1) and (step_y == self.depth - 1): hx_t, mx_t, hy_t, my_t = hs_o[0], ms_o[0], hs_o[1], ms_o[1] return hx_t, mx_t, hy_t, my_t if step_x + 1 < self.depth: pre_info[0][step_x + 1][step_y] = hs_o[0] pre_info[1][step_x + 1][step_y] = ms_o[0] if step_y + 1 < self.depth: pre_info[2][step_x][step_y + 1] = hs_o[1] pre_info[3][step_x][step_y + 1] = ms_o[1] def __call__(self, X, init_x=None, init_y=None, return_sequence=False, one_step=False): # recently we did not support masking. if X.ndim == 2: X = X[:, None, :] # one step if one_step: assert init_x is not None, 'previous x must be provided!' assert init_y is not None, 'previous y must be provided!' 
X = X.dimshuffle((1, 0, 2)) if init_x is None: if self.learn_init: init_mx = T.repeat(self.mx0, X.shape[1], axis=0) init_my = T.repeat(self.my0, X.shape[1], axis=0) init_hx = T.repeat(self.hx0, X.shape[1], axis=0) init_hy = T.repeat(self.hy0, X.shape[1], axis=0) init_input = [init_hx, init_mx, init_hy, init_my] else: init_x = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[0]), 1)] * 2 init_y = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * 2 init_input = init_x + init_y else: init_input = init_x + init_y if not one_step: sequence = [X] output_info = init_input outputs, _ = theano.scan( self._step, sequences=sequence, outputs_info=output_info ) else: outputs = self._step(*([X[0]] + init_x + init_y)) if return_sequence: hxs = outputs[0].dimshuffle((1, 0, 2)) hys = outputs[2].dimshuffle((1, 0, 2)) hs = T.concatenate([hxs, hys], axis=-1) return hs else: hx = outputs[0][-1] hy = outputs[2][-1] h = T.concatenate([hx, hy], axis=-1) return h ================================================ FILE: emolga/layers/ntm_minibatch.py ================================================ __author__ = 'jiataogu' import theano import theano.tensor as T import scipy.linalg as sl import numpy as np from .core import * from .recurrent import * import copy """ This implementation supports both minibatch learning and on-line training. We need a minibatch version for Neural Turing Machines. """ class Reader(Layer): """ "Reader Head" of the Neural Turing Machine. 
""" def __init__(self, input_dim, memory_width, shift_width, shift_conv, init='glorot_uniform', inner_init='orthogonal', name=None): super(Reader, self).__init__() self.input_dim = input_dim self.memory_dim = memory_width self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.tanh = activations.get('tanh') self.sigmoid = activations.get('sigmoid') self.softplus = activations.get('softplus') self.vec_softmax = activations.get('vector_softmax') self.softmax = activations.get('softmax') """ Reader Params. """ self.W_key = self.init((input_dim, memory_width)) self.W_shift = self.init((input_dim, shift_width)) self.W_beta = self.init(input_dim) self.W_gama = self.init(input_dim) self.W_g = self.init(input_dim) self.b_key = shared_zeros(memory_width) self.b_shift = shared_zeros(shift_width) self.b_beta = theano.shared(floatX(0)) self.b_gama = theano.shared(floatX(0)) self.b_g = theano.shared(floatX(0)) self.shift_conv = shift_conv # add params and set names. self.params = [self.W_key, self.W_shift, self.W_beta, self.W_gama, self.W_g, self.b_key, self.b_shift, self.b_beta, self.b_gama, self.b_g] self.W_key.name, self.W_shift.name, self.W_beta.name, \ self.W_gama.name, self.W_g.name = 'W_key', 'W_shift', 'W_beta', \ 'W_gama', 'W_g' self.b_key.name, self.b_shift.name, self.b_beta.name, \ self.b_gama.name, self.b_g.name = 'b_key', 'b_shift', 'b_beta', \ 'b_gama', 'b_g' def __call__(self, X, w_temp, m_temp): # input dimensions # X: (nb_samples, input_dim) # w_temp: (nb_samples, memory_dim) # m_temp: (nb_samples, memory_dim, memory_width) ::tensor_memory key = dot(X, self.W_key, self.b_key) # (nb_samples, memory_width) shift = self.softmax( dot(X, self.W_shift, self.b_shift)) # (nb_samples, shift_width) beta = self.softplus(dot(X, self.W_beta, self.b_beta))[:, None] # (nb_samples, x) gamma = self.softplus(dot(X, self.W_gama, self.b_gama)) + 1. 
# (nb_samples,) gamma = gamma[:, None] # (nb_samples, x) g = self.sigmoid(dot(X, self.W_g, self.b_g))[:, None] # (nb_samples, x) signal = [key, shift, beta, gamma, g] w_c = self.softmax( beta * cosine_sim2d(key, m_temp)) # (nb_samples, memory_dim) //content-based addressing w_g = g * w_c + (1 - g) * w_temp # (nb_samples, memory_dim) //history interpolation w_s = shift_convolve2d(w_g, shift, self.shift_conv) # (nb_samples, memory_dim) //convolutional shift w_p = w_s ** gamma # (nb_samples, memory_dim) //sharpening w_t = w_p / T.sum(w_p, axis=1)[:, None] # (nb_samples, memory_dim) return w_t class Writer(Reader): """ "Writer head" of the Neural Turing Machine """ def __init__(self, input_dim, memory_width, shift_width, shift_conv, init='glorot_uniform', inner_init='orthogonal', name=None): super(Writer, self).__init__(input_dim, memory_width, shift_width, shift_conv, init, inner_init, name) """ Writer Params. """ self.W_erase = self.init((input_dim, memory_width)) self.W_add = self.init((input_dim, memory_width)) self.b_erase = shared_zeros(memory_width) self.b_add = shared_zeros(memory_width) # add params and set names. self.params += [self.W_erase, self.W_add, self.b_erase, self.b_add] self.W_erase.name, self.W_add.name = 'W_erase', 'W_add' self.b_erase.name, self.b_add.name = 'b_erase', 'b_add' def get_fixer(self, X): erase = self.sigmoid(dot(X, self.W_erase, self.b_erase)) # (nb_samples, memory_width) add = self.sigmoid(dot(X, self.W_add, self.b_add)) # (nb_samples, memory_width) return erase, add class Controller(Recurrent): """ Controller used in Neural Turing Machine. - Core cell (Memory) - Reader head - Writer head It is a simple RNN version. In reality the Neural Turing Machine will use the LSTM cell. 
""" def __init__(self, input_dim, memory_dim, memory_width, hidden_dim, shift_width=3, init='glorot_uniform', inner_init='orthogonal', name=None, readonly=False, curr_input=False, recurrence=False, memorybook=None ): super(Controller, self).__init__() # Initialization of the dimensions. self.input_dim = input_dim self.memory_dim = memory_dim self.memory_width = memory_width self.hidden_dim = hidden_dim self.shift_width = shift_width self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.tanh = activations.get('tanh') self.softmax = activations.get('softmax') self.vec_softmax = activations.get('vector_softmax') self.readonly = readonly self.curr_input = curr_input self.recurrence = recurrence self.memorybook = memorybook """ Controller Module. """ # hidden projection: self.W_in = self.init((input_dim, hidden_dim)) self.b_in = shared_zeros(hidden_dim) self.W_rd = self.init((memory_width, hidden_dim)) self.W_in.name = 'W_in' self.b_in.name = 'b_in' self.W_rd.name = 'W_rd' self.params = [self.W_in, self.b_in, self.W_rd] # use recurrence: if self.recurrence: self.W_hh = self.inner_init((hidden_dim, hidden_dim)) self.W_hh.name = 'W_hh' self.params += [self.W_hh] # Shift convolution shift_conv = sl.circulant(np.arange(memory_dim)).T[ np.arange(-(shift_width // 2), (shift_width // 2) + 1)][::-1] # use the current input for weights. 
if self.curr_input: controller_size = self.input_dim + self.hidden_dim else: controller_size = self.hidden_dim # write head if not readonly: self.writer = Writer(controller_size, memory_width, shift_width, shift_conv, name='writer') self.writer.set_name('writer') self._add(self.writer) # read head self.reader = Reader(controller_size, memory_width, shift_width, shift_conv, name='reader') self.reader.set_name('reader') self._add(self.reader) # *********************************************************** # reserved for None initialization (we don't use these often) self.memory_init = self.init((memory_dim, memory_width)) self.w_write_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX)) self.w_read_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX)) self.contr_init = self.tanh(np.random.rand(1, hidden_dim).astype(theano.config.floatX)) if name is not None: self.set_name(name) def _controller(self, input_t, read_t, controller_tm1=None): # input_t : (nb_sample, input_dim) # read_t : (nb_sample, memory_width) # controller_tm1: (nb_sample, hidden_dim) if self.recurrence: return self.tanh(dot(input_t, self.W_in) + dot(controller_tm1, self.W_hh) + dot(read_t, self.W_rd) + self.b_in) else: return self.tanh(dot(input_t, self.W_in) + dot(read_t, self.W_rd) + self.b_in) @staticmethod def _read(w_read, memory): # w_read : (nb_sample, memory_dim) # memory : (nb_sample, memory_dim, memory_width) # return dot(w_read, memory) return T.sum(w_read[:, :, None] * memory, axis=1) @staticmethod def _write(w_write, memory, erase, add): # w_write: (nb_sample, memory_dim) # memory : (nb_sample, memory_dim, memory_width) # erase/add: (nb_sample, memory_width) w_write = w_write[:, :, None] erase = erase[:, None, :] add = add[:, None, :] m_erased = memory * (1 - w_write * erase) memory_t = m_erased + w_write * add # (nb_sample, memory_dim, memory_width) return memory_t def _step(self, input_t, mask_t, memory_tm1, w_write_tm1, w_read_tm1, 
controller_tm1): # input_t: (nb_sample, input_dim) # memory_tm1: (nb_sample, memory_dim, memory_width) # w_write_tm1: (nb_sample, memory_dim) # w_read_tm1: (nb_sample, memory_dim) # controller_tm1: (nb_sample, hidden_dim) # read the memory if self.curr_input: info = T.concatenate((controller_tm1, input_t), axis=1) w_read_t = self.reader(info, w_read_tm1, memory_tm1) read_tm1 = self._read(w_read_t, memory_tm1) else: read_tm1 = self._read(w_read_tm1, memory_tm1) # (nb_sample, memory_width) # get the new controller (hidden states.) if self.recurrence: controller_t = self._controller(input_t, read_tm1, controller_tm1) else: controller_t = self._controller(input_t, read_tm1) # (nb_sample, controller_size) # update the memory cell (if need) if not self.readonly: if self.curr_input: infow = T.concatenate((controller_t, input_t), axis=1) w_write_t = self.writer(infow, w_write_tm1, memory_tm1) # (nb_sample, memory_dim) erase_t, add_t = self.writer.get_fixer(infow) # (nb_sample, memory_width) else: w_write_t = self.writer(controller_t, w_write_tm1, memory_tm1) erase_t, add_t = self.writer.get_fixer(controller_t) memory_t = self._write(w_write_t, memory_tm1, erase_t, add_t) # (nb_sample, memory_dim, memory_width) else: w_write_t = w_write_tm1 memory_t = memory_tm1 # get the next reading weights. if not self.curr_input: w_read_t = self.reader(controller_t, w_read_tm1, memory_t) # (nb_sample, memory_dim) # over masking memory_t = memory_t * mask_t[:, :, None] + memory_tm1 * (1 - mask_t[:, :, None]) w_read_t = w_read_t * mask_t + w_read_tm1 * (1 - mask_t) w_write_t = w_write_t * mask_t + w_write_tm1 * (1 - mask_t) controller_t = controller_t * mask_t + controller_tm1 * (1 - mask_t) return memory_t, w_write_t, w_read_t, controller_t def __call__(self, X, mask=None, M=None, init_ww=None, init_wr=None, init_c=None, return_sequence=False, one_step=False, return_full=False): # recurrent cell only work for tensor. 
if X.ndim == 2: X = X[:, None, :] nb_samples = X.shape[0] # mask if mask is None: mask = T.alloc(1., X.shape[0], 1) padded_mask = self.get_padded_shuffled_mask(mask, pad=0) X = X.dimshuffle((1, 0, 2)) # *********************************************************************** # initialization states if M is None: memory_init = T.repeat(self.memory_init[None, :, :], nb_samples, axis=0) else: memory_init = M if init_wr is None: w_read_init = T.repeat(self.w_read_init, nb_samples, axis=0) else: w_read_init = init_wr if init_ww is None: w_write_init = T.repeat(self.w_write_init, nb_samples, axis=0) else: w_write_init = init_ww if init_c is None: contr_init = T.repeat(self.contr_init, nb_samples, axis=0) else: contr_init = init_c # ************************************************************************ outputs_info = [memory_init, w_write_init, w_read_init, contr_init] if one_step: seq = [X[0], padded_mask[0]] outputs = self._step(*(seq + outputs_info)) return outputs else: seq = [X, padded_mask] outputs, _ = theano.scan( self._step, sequences=seq, outputs_info=outputs_info, name='controller_recurrence' ) self.monitor['memory_info'] = outputs[0] self.monitor['write_weights'] = outputs[1] self.monitor['read_weights'] = outputs[2] if not return_full: if return_sequence: return outputs[-1].dimshuffle((1, 0, 2)) return outputs[-1][-1] else: if return_sequence: return [a.dimshuffle((1, 0, 2)) for a in outputs] return [a[-1] for a in outputs] class AttentionReader(Layer): """ "Reader Head" of the Neural Turing Machine. 
""" def __init__(self, input_dim, memory_width, shift_width, shift_conv, init='glorot_uniform', inner_init='orthogonal', name=None): super(AttentionReader, self).__init__() self.input_dim = input_dim self.memory_dim = memory_width self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.tanh = activations.get('tanh') self.sigmoid = activations.get('sigmoid') self.softplus = activations.get('softplus') self.vec_softmax = activations.get('vector_softmax') self.softmax = activations.get('softmax') """ Reader Params. """ self.W_key = self.init((input_dim, memory_width)) self.W_lock = self.inner_init((memory_width, memory_width)) self.W_shift = self.init((input_dim, shift_width)) self.W_beta = self.init(input_dim) self.W_gama = self.init(input_dim) self.W_g = self.init(input_dim) # self.v = self.init(memory_width) self.b_key = shared_zeros(memory_width) self.b_shift = shared_zeros(shift_width) self.b_beta = theano.shared(floatX(0)) self.b_gama = theano.shared(floatX(0)) self.b_g = theano.shared(floatX(0)) self.shift_conv = shift_conv # add params and set names. 
self.params = [self.W_key, self.W_shift, self.W_beta, self.W_gama, self.W_g, self.b_key, self.b_shift, self.b_beta, self.b_gama, self.b_g, self.W_lock] self.W_key.name, self.W_shift.name, self.W_beta.name, \ self.W_gama.name, self.W_g.name = 'W_key', 'W_shift', 'W_beta', \ 'W_gama', 'W_g' self.W_lock.name = 'W_lock' self.b_key.name, self.b_shift.name, self.b_beta.name, \ self.b_gama.name, self.b_g.name = 'b_key', 'b_shift', 'b_beta', \ 'b_gama', 'b_g' def __call__(self, X, w_temp, m_temp): # input dimensions # X: (nb_samples, input_dim) # w_temp: (nb_samples, memory_dim) # m_temp: (nb_samples, memory_dim, memory_width) ::tensor_memory key = dot(X, self.W_key, self.b_key) # (nb_samples, memory_width) lock = dot(m_temp, self.W_lock) # (nb_samples, memory_dim, memory_width) shift = self.softmax( dot(X, self.W_shift, self.b_shift)) # (nb_samples, shift_width) beta = self.softplus(dot(X, self.W_beta, self.b_beta))[:, None] # (nb_samples, x) gamma = self.softplus(dot(X, self.W_gama, self.b_gama)) + 1. 
# (nb_samples,) gamma = gamma[:, None] # (nb_samples, x) g = self.sigmoid(dot(X, self.W_g, self.b_g))[:, None] # (nb_samples, x) signal = [key, shift, beta, gamma, g] energy = T.sum(key[:, None, :] * lock, axis=2) # energy = T.tensordot(key[:, None, :] + lock, self.v, [2, 0]) w_c = self.softmax(beta * energy) # w_c = self.softmax( # beta * cosine_sim2d(key, m_temp)) # (nb_samples, memory_dim) //content-based addressing w_g = g * w_c + (1 - g) * w_temp # (nb_samples, memory_dim) //history interpolation w_s = shift_convolve2d(w_g, shift, self.shift_conv) # (nb_samples, memory_dim) //convolutional shift w_p = w_s ** gamma # (nb_samples, memory_dim) //sharpening w_t = w_p / T.sum(w_p, axis=1)[:, None] # (nb_samples, memory_dim) return w_t class AttentionWriter(AttentionReader): """ "Writer head" of the Neural Turing Machine """ def __init__(self, input_dim, memory_width, shift_width, shift_conv, init='glorot_uniform', inner_init='orthogonal', name=None): super(AttentionWriter, self).__init__(input_dim, memory_width, shift_width, shift_conv, init, inner_init, name) """ Writer Params. """ self.W_erase = self.init((input_dim, memory_width)) self.W_add = self.init((input_dim, memory_width)) self.b_erase = shared_zeros(memory_width) self.b_add = shared_zeros(memory_width) # add params and set names. self.params += [self.W_erase, self.W_add, self.b_erase, self.b_add] self.W_erase.name, self.W_add.name = 'W_erase', 'W_add' self.b_erase.name, self.b_add.name = 'b_erase', 'b_add' def get_fixer(self, X): erase = self.sigmoid(dot(X, self.W_erase, self.b_erase)) # (nb_samples, memory_width) add = self.sigmoid(dot(X, self.W_add, self.b_add)) # (nb_samples, memory_width) return erase, add class BernoulliController(Recurrent): """ Controller used in Neural Turing Machine. - Core cell (Memory): binary memory - Reader head - Writer head It is a simple RNN version. In reality the Neural Turing Machine will use the LSTM cell. 
""" def __init__(self, input_dim, memory_dim, memory_width, hidden_dim, shift_width=3, init='glorot_uniform', inner_init='orthogonal', name=None, readonly=False, curr_input=False, recurrence=False, memorybook=None ): super(BernoulliController, self).__init__() # Initialization of the dimensions. self.input_dim = input_dim self.memory_dim = memory_dim self.memory_width = memory_width self.hidden_dim = hidden_dim self.shift_width = shift_width self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.tanh = activations.get('tanh') self.softmax = activations.get('softmax') self.vec_softmax = activations.get('vector_softmax') self.sigmoid = activations.get('sigmoid') self.readonly = readonly self.curr_input = curr_input self.recurrence = recurrence self.memorybook = memorybook """ Controller Module. """ # hidden projection: self.W_in = self.init((input_dim, hidden_dim)) self.b_in = shared_zeros(hidden_dim) self.W_rd = self.init((memory_width, hidden_dim)) self.W_in.name = 'W_in' self.b_in.name = 'b_in' self.W_rd.name = 'W_rd' self.params = [self.W_in, self.b_in, self.W_rd] # use recurrence: if self.recurrence: self.W_hh = self.inner_init((hidden_dim, hidden_dim)) self.W_hh.name = 'W_hh' self.params += [self.W_hh] # Shift convolution shift_conv = sl.circulant(np.arange(memory_dim)).T[ np.arange(-(shift_width // 2), (shift_width // 2) + 1)][::-1] # use the current input for weights. 
if self.curr_input: controller_size = self.input_dim + self.hidden_dim else: controller_size = self.hidden_dim # write head if not readonly: self.writer = AttentionWriter(controller_size, memory_width, shift_width, shift_conv, name='writer') self.writer.set_name('writer') self._add(self.writer) # read head self.reader = AttentionReader(controller_size, memory_width, shift_width, shift_conv, name='reader') self.reader.set_name('reader') self._add(self.reader) # *********************************************************** # reserved for None initialization (we don't use these often) self.memory_init = self.sigmoid(self.init((memory_dim, memory_width))) self.w_write_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX)) self.w_read_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX)) self.contr_init = self.tanh(np.random.rand(1, hidden_dim).astype(theano.config.floatX)) if name is not None: self.set_name(name) def _controller(self, input_t, read_t, controller_tm1=None): # input_t : (nb_sample, input_dim) # read_t : (nb_sample, memory_width) # controller_tm1: (nb_sample, hidden_dim) if self.recurrence: return self.tanh(dot(input_t, self.W_in) + dot(controller_tm1, self.W_hh) + dot(read_t, self.W_rd) + self.b_in) else: return self.tanh(dot(input_t, self.W_in) + dot(read_t, self.W_rd) + self.b_in) @staticmethod def _read(w_read, memory): # w_read : (nb_sample, memory_dim) # memory : (nb_sample, memory_dim, memory_width) # return dot(w_read, memory) return T.sum(w_read[:, :, None] * memory, axis=1) @staticmethod def _write(w_write, memory, erase, add): # w_write: (nb_sample, memory_dim) # memory : (nb_sample, memory_dim, memory_width) # erase/add: (nb_sample, memory_width) w_write = w_write[:, :, None] erase = erase[:, None, :] # erase is a gate. 
add = add[:, None, :] # add is a bias # m_erased = memory * (1 - w_write * erase) # memory_t = m_erased + w_write * add # (nb_sample, memory_dim, memory_width) memory_t = memory * (1 - w_write * erase) + \ add * w_write * (1 - erase) return memory_t def _step(self, input_t, mask_t, memory_tm1, w_write_tm1, w_read_tm1, controller_tm1): # input_t: (nb_sample, input_dim) # memory_tm1: (nb_sample, memory_dim, memory_width) # w_write_tm1: (nb_sample, memory_dim) # w_read_tm1: (nb_sample, memory_dim) # controller_tm1: (nb_sample, hidden_dim) # read the memory if self.curr_input: info = T.concatenate((controller_tm1, input_t), axis=1) w_read_t = self.reader(info, w_read_tm1, memory_tm1) read_tm1 = self._read(w_read_t, memory_tm1) else: read_tm1 = self._read(w_read_tm1, memory_tm1) # (nb_sample, memory_width) # get the new controller (hidden states.) if self.recurrence: controller_t = self._controller(input_t, read_tm1, controller_tm1) else: controller_t = self._controller(input_t, read_tm1) # (nb_sample, controller_size) # update the memory cell (if need) if not self.readonly: if self.curr_input: infow = T.concatenate((controller_t, input_t), axis=1) w_write_t = self.writer(infow, w_write_tm1, memory_tm1) # (nb_sample, memory_dim) erase_t, add_t = self.writer.get_fixer(infow) # (nb_sample, memory_width) else: w_write_t = self.writer(controller_t, w_write_tm1, memory_tm1) erase_t, add_t = self.writer.get_fixer(controller_t) memory_t = self._write(w_write_t, memory_tm1, erase_t, add_t) # (nb_sample, memory_dim, memory_width) else: w_write_t = w_write_tm1 memory_t = memory_tm1 # get the next reading weights. 
if not self.curr_input: w_read_t = self.reader(controller_t, w_read_tm1, memory_t) # (nb_sample, memory_dim) # over masking memory_t = memory_t * mask_t[:, :, None] + memory_tm1 * (1 - mask_t[:, :, None]) w_read_t = w_read_t * mask_t + w_read_tm1 * (1 - mask_t) w_write_t = w_write_t * mask_t + w_write_tm1 * (1 - mask_t) controller_t = controller_t * mask_t + controller_tm1 * (1 - mask_t) return memory_t, w_write_t, w_read_t, controller_t def __call__(self, X, mask=None, M=None, init_ww=None, init_wr=None, init_c=None, return_sequence=False, one_step=False, return_full=False): # recurrent cell only work for tensor. if X.ndim == 2: X = X[:, None, :] nb_samples = X.shape[0] # mask if mask is None: mask = T.alloc(1., X.shape[0], 1) padded_mask = self.get_padded_shuffled_mask(mask, pad=0) X = X.dimshuffle((1, 0, 2)) # *********************************************************************** # initialization states if M is None: memory_init = T.repeat(self.memory_init[None, :, :], nb_samples, axis=0) else: memory_init = M if init_wr is None: w_read_init = T.repeat(self.w_read_init, nb_samples, axis=0) else: w_read_init = init_wr if init_ww is None: w_write_init = T.repeat(self.w_write_init, nb_samples, axis=0) else: w_write_init = init_ww if init_c is None: contr_init = T.repeat(self.contr_init, nb_samples, axis=0) else: contr_init = init_c # ************************************************************************ outputs_info = [memory_init, w_write_init, w_read_init, contr_init] if one_step: seq = [X[0], padded_mask[0]] outputs = self._step(*(seq + outputs_info)) return outputs else: seq = [X, padded_mask] outputs, _ = theano.scan( self._step, sequences=seq, outputs_info=outputs_info, name='controller_recurrence' ) self.monitor['memory_info'] = outputs if not return_full: if return_sequence: return outputs[-1].dimshuffle((1, 0, 2)) return outputs[-1][-1] else: if return_sequence: return [a.dimshuffle((1, 0, 2)) for a in outputs] return [a[-1] for a in outputs] 
================================================ FILE: emolga/layers/recurrent.py ================================================ # -*- coding: utf-8 -*- from abc import abstractmethod from .core import * class Recurrent(MaskedLayer): """ Recurrent Neural Network """ @staticmethod def get_padded_shuffled_mask(mask, pad=0): """ change the order of dims of mask, to match the dim of inputs outside [1] change the 2D matrix into 3D, (nb_samples, max_sent_len, 1) [2] dimshuffle to (max_sent_len, nb_samples, 1) the value on dim=0 could be either 0 or 1? :param: mask, shows x is a word (!=0) or not(==0), shape=(n_samples, max_sent_len) """ # mask is (n_samples, time) assert mask, 'mask cannot be None' # pad a dim of 1 to the right, (nb_samples, max_sent_len, 1) mask = T.shape_padright(mask) # mask = T.addbroadcast(mask, -1), make the new dim broadcastable mask = T.addbroadcast(mask, mask.ndim-1) # change the order of dims, to match the dim of inputs outside mask = mask.dimshuffle(1, 0, 2) # (max_sent_len, nb_samples, 1) if pad > 0: # left-pad in time with 0 padding = alloc_zeros_matrix(pad, mask.shape[1], 1) mask = T.concatenate([padding, mask], axis=0) return mask.astype('int8') class GRU(Recurrent): """ Gated Recurrent Unit - Cho et al. 2014 Acts as a spatio-temporal projection, turning a sequence of vectors into a single vector. 
    Eats inputs with shape:
    (nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)

    and returns outputs with shape:
        if not return_sequences:
            (nb_samples, output_dim)
        if return_sequences:
            (nb_samples, max_sample_length, output_dim)

    z_t = tanh(W_z*x + U_z*h_t-1 + b_z)
    r_t = tanh(W_r*x + U_r*h_t-1 + b_r)
    hh_t = tanh(W_h*x + U_r*(r_t*h_t-1) + b_h)
    h_t = z_t * h_t-1 + (1 - z_t) * hh_t

    The dot product computation regarding x is independent from time
    so it could be done out of the recurrent process (in advance)
        x_z = dot(X, self.W_z, self.b_z)
        x_r = dot(X, self.W_r, self.b_r)
        x_h = dot(X, self.W_h, self.b_h)

    References:
        On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
            http://www.aclweb.org/anthology/W14-4012
        Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
            http://arxiv.org/pdf/1412.3555v1.pdf
    """
    def __init__(self, input_dim, output_dim=128, context_dim=None,
                 init='glorot_uniform', inner_init='orthogonal',
                 activation='tanh', inner_activation='sigmoid',
                 name=None, weights=None):
        super(GRU, self).__init__()
        """
        Standard GRU.
        """
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.activation = activations.get(activation)
        self.inner_activation = activations.get(inner_activation)

        # W is a matrix to map input x_t
        self.W_z = self.init((self.input_dim, self.output_dim))
        self.W_r = self.init((self.input_dim, self.output_dim))
        self.W_h = self.init((self.input_dim, self.output_dim))
        # U is a matrix to map hidden state of last time h_t-1
        self.U_z = self.inner_init((self.output_dim, self.output_dim))
        self.U_r = self.inner_init((self.output_dim, self.output_dim))
        self.U_h = self.inner_init((self.output_dim, self.output_dim))
        # bias terms
        self.b_z = shared_zeros(self.output_dim)
        self.b_r = shared_zeros(self.output_dim)
        self.b_h = shared_zeros(self.output_dim)

        # set names
        self.W_z.name, self.U_z.name, self.b_z.name = 'Wz', 'Uz', 'bz'
        self.W_r.name, self.U_r.name, self.b_r.name = 'Wr', 'Ur', 'br'
        self.W_h.name, self.U_h.name, self.b_h.name = 'Wh', 'Uh', 'bh'

        self.params = [
            self.W_z, self.U_z, self.b_z,
            self.W_r, self.U_r, self.b_r,
            self.W_h, self.U_h, self.b_h,
        ]

        """
        GRU with context inputs.
        """
        if context_dim is not None:
            self.context_dim = context_dim
            self.C_z = self.init((self.context_dim, self.output_dim))
            self.C_r = self.init((self.context_dim, self.output_dim))
            self.C_h = self.init((self.context_dim, self.output_dim))
            self.C_z.name, self.C_r.name, self.C_h.name = 'Cz', 'Cr', 'Ch'
            self.params += [self.C_z, self.C_r, self.C_h]

        if weights is not None:
            self.set_weights(weights)

        if name is not None:
            self.set_name(name)

    def _step(self, xz_t, xr_t, xh_t, mask_t, h_tm1, u_z, u_r, u_h):
        """
        One step computation of GRU for a batch of data at time t
            sequences=[x_z, x_r, x_h, padded_mask],
            outputs_info=init_h,
            non_sequences=[self.U_z, self.U_r, self.U_h]
        :param xz_t, xr_t, xh_t: value of x of time t after gate z/r/h (computed beforehand)
                shape=(n_samples, output_emb_dim)
        :param mask_t: mask of time t, indicates whether t-th token is a word,
                shape=(n_samples, 1)
        :param h_tm1: hidden value (output) of last time, shape=(nb_samples, output_emb_dim)
        :param u_z, u_r, u_h: mapping matrix for hidden state of time t-1
                shape=(output_emb_dim, output_emb_dim)
        :return: h_t: output, hidden state of time t, shape=(nb_samples, output_emb_dim)
        """
        # h_mask_tm1 = mask_tm1 * h_tm1  # Here we use a GroundHog-like style which allows
        # activation value of update/reset gate, shape=(n_samples, 1)
        z = self.inner_activation(xz_t + T.dot(h_tm1, u_z))
        r = self.inner_activation(xr_t + T.dot(h_tm1, u_r))
        hh_t = self.activation(xh_t + T.dot(r * h_tm1, u_h))
        h_t = z * h_tm1 + (1 - z) * hh_t
        # why use mask_t to mix up h_t and h_tm1 again?
        # if current term is None (padding term, mask=0), then drop the update
        # (0*h_t) and keep using the last state (1*h_tm1)
        h_t = mask_t * h_t + (1 - mask_t) * h_tm1
        return h_t

    def _step_gate(self, xz_t, xr_t, xh_t, mask_t, h_tm1, u_z, u_r, u_h):
        """
        One step computation of GRU
        :returns
            h_t: output, hidden state of time t, shape=(n_samples, output_emb_dim)
            z: value of update gate (after activation), shape=(n_samples, 1)
            r: value of reset gate (after activation), shape=(n_samples, 1)
        """
        # h_mask_tm1 = mask_tm1 * h_tm1  # Here we use a GroundHog-like style which allows
        z = self.inner_activation(xz_t + T.dot(h_tm1, u_z))
        r = self.inner_activation(xr_t + T.dot(h_tm1, u_r))
        hh_t = self.activation(xh_t + T.dot(r * h_tm1, u_h))
        h_t = z * h_tm1 + (1 - z) * hh_t
        h_t = mask_t * h_t + (1 - mask_t) * h_tm1
        return h_t, z, r

    def __call__(self, X, mask=None, C=None, init_h=None,
                 return_sequence=False, one_step=False, return_gates=False):
        """
        :param X: input sequence, a list of word vectors, shape=(n_samples, max_sent_len, input_emb_dim)
        :param mask: input mask, shows x is a word (!=0) or not(==0), shape=(n_samples, max_sent_len)
        :param C: context, for encoder is none
        :param init_h: initial hidden state
        :param return_sequence: if True, return the encoding at each time, or only return the end state
        :param one_step: only go one step computation, or will be done by theano.scan()
        :param return_gates: whether return the gate state
        :return:
        """
        # recurrent cell only work for tensor
        if X.ndim == 2:
            # X.ndim == 3, shape=(n_samples, max_sent_len, input_emb_dim)
            X = X[:, None, :]
            if mask is not None:
                mask = mask[:, None]

        # mask, shape=(n_samples, max_sent_len)
        if mask is None:
            # sampling or beam-search
            mask = T.alloc(1., X.shape[0], 1)

        # one step
        if one_step:
            # NOTE(review): truth-tests the tensor, not `init_h is not None`
            assert init_h, 'previous state must be provided!'

        # reshape the mask to shape=(max_sent_len, n_samples, 1)
        padded_mask = self.get_padded_shuffled_mask(mask, pad=0)
        X = X.dimshuffle((1, 0, 2))  # X: (max_sent_len, nb_samples, input_emb_dim)

        # compute the gate values at each time in advance
        # shape of W = (input_emb_dim, output_emb_dim)
        x_z = dot(X, self.W_z, self.b_z)  # x_z: (max_sent_len, nb_samples, output_emb_dim)
        x_r = dot(X, self.W_r, self.b_r)  # x_r: (max_sent_len, nb_samples, output_emb_dim)
        x_h = dot(X, self.W_h, self.b_h)  # x_h: (max_sent_len, nb_samples, output_emb_dim)

        """
        GRU with constant context. (no attention here.)
        """
        if C is not None:
            assert C.ndim == 2
            ctx_step = C.dimshuffle('x', 0, 1)  # C: (nb_samples, context_dim)
            x_z += dot(ctx_step, self.C_z)
            x_r += dot(ctx_step, self.C_r)
            x_h += dot(ctx_step, self.C_h)

        """
        GRU with additional initial/previous state.
        """
        if init_h is None:
            init_h = T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1)

        if not return_gates:
            if one_step:
                seq = [x_z, x_r, x_h, padded_mask]  # A hidden BUG (1)+++(1) !?!!!?!!?!?
                outputs_info = [init_h]
                non_seq = [self.U_z, self.U_r, self.U_h]
                outputs = self._step(*(seq + outputs_info + non_seq))
            else:
                outputs, _ = theano.scan(
                    self._step,
                    sequences=[x_z, x_r, x_h, padded_mask],
                    outputs_info=init_h,
                    non_sequences=[self.U_z, self.U_r, self.U_h]
                )

            # return hidden state of all times, shape=(nb_samples, max_sent_len, input_emb_dim)
            if return_sequence:
                return outputs.dimshuffle((1, 0, 2))
            # hidden state of last time, shape=(nb_samples, output_emb_dim)
            return outputs[-1]
        else:
            if one_step:
                seq = [x_z, x_r, x_h, padded_mask]  # A hidden BUG (1)+++(1) !?!!!?!!?!?
                outputs_info = [init_h]
                non_seq = [self.U_z, self.U_r, self.U_h]
                outputs, zz, rr = self._step_gate(*(seq + outputs_info + non_seq))
            else:
                outputx, _ = theano.scan(
                    self._step_gate,
                    sequences=[x_z, x_r, x_h, padded_mask],
                    outputs_info=[init_h, None, None],
                    non_sequences=[self.U_z, self.U_r, self.U_h]
                )
                outputs, zz, rr = outputx

            # return hidden states plus the update/reset gate activations
            if return_sequence:
                return outputs.dimshuffle((1, 0, 2)), zz.dimshuffle((1, 0, 2)), rr.dimshuffle((1, 0, 2))
            return outputs[-1], zz[-1], rr[-1]


class JZS3(Recurrent):
    """
    Evolved recurrent neural network architectures from the evaluation of
    thousands of models, serving as alternatives to LSTMs and GRUs.
    See Jozefowicz et al. 2015.
    This corresponds to the `MUT3` architecture described in the paper.

    Takes inputs with shape:
    (nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)

    and returns outputs with shape:
        if not return_sequences:
            (nb_samples, output_dim)
        if return_sequences:
            (nb_samples, max_sample_length, output_dim)

    References:
        An Empirical Exploration of Recurrent Network Architectures
            http://www.jmlr.org/proceedings/papers/v37/jozefowicz15.pdf
    """
    def __init__(self, input_dim, output_dim=128, context_dim=None,
                 init='glorot_uniform', inner_init='orthogonal',
                 activation='tanh', inner_activation='sigmoid',
                 name=None, weights=None):
        super(JZS3, self).__init__()
        """
        Standard model
        """
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.activation = activations.get(activation)
        self.inner_activation = activations.get(inner_activation)

        # per-gate input (W), recurrent (U) and bias (b) parameters
        self.W_z = self.init((self.input_dim, self.output_dim))
        self.U_z = self.inner_init((self.output_dim, self.output_dim))
        self.b_z = shared_zeros(self.output_dim)

        self.W_r = self.init((self.input_dim, self.output_dim))
        self.U_r = self.inner_init((self.output_dim, self.output_dim))
        self.b_r = shared_zeros(self.output_dim)

        self.W_h = self.init((self.input_dim,
                              self.output_dim))
        self.U_h = self.inner_init((self.output_dim, self.output_dim))
        self.b_h = shared_zeros(self.output_dim)

        # set names
        self.W_z.name, self.U_z.name, self.b_z.name = 'Wz', 'Uz', 'bz'
        self.W_r.name, self.U_r.name, self.b_r.name = 'Wr', 'Ur', 'br'
        self.W_h.name, self.U_h.name, self.b_h.name = 'Wh', 'Uh', 'bh'

        self.params = [
            self.W_z, self.U_z, self.b_z,
            self.W_r, self.U_r, self.b_r,
            self.W_h, self.U_h, self.b_h,
        ]

        """
        context inputs.
        """
        if context_dim is not None:
            self.context_dim = context_dim
            self.C_z = self.init((self.context_dim, self.output_dim))
            self.C_r = self.init((self.context_dim, self.output_dim))
            self.C_h = self.init((self.context_dim, self.output_dim))
            self.C_z.name, self.C_r.name, self.C_h.name = 'Cz', 'Cr', 'Ch'
            self.params += [self.C_z, self.C_r, self.C_h]

        if weights is not None:
            self.set_weights(weights)

        if name is not None:
            self.set_name(name)

    def _step(self, xz_t, xr_t, xh_t, mask_t, h_tm1, u_z, u_r, u_h):
        """One MUT3 step; note the tanh(h_tm1) inside the update gate."""
        # h_mask_tm1 = mask_tm1 * h_tm1
        z = self.inner_activation(xz_t + T.dot(T.tanh(h_tm1), u_z))
        r = self.inner_activation(xr_t + T.dot(h_tm1, u_r))
        hh_t = self.activation(xh_t + T.dot(r * h_tm1, u_h))
        # masked update: padded steps keep h_tm1
        h_t = (hh_t * z + h_tm1 * (1 - z)) * mask_t + (1 - mask_t) * h_tm1
        return h_t

    def __call__(self, X, mask=None, C=None, init_h=None,
                 return_sequence=False, one_step=False):
        """Run MUT3 over X; same contract as GRU.__call__ (without gates)."""
        # recurrent cell only work for tensor
        if X.ndim == 2:
            X = X[:, None, :]

        # mask
        if mask is None:
            # sampling or beam-search
            mask = T.alloc(1., X.shape[0], X.shape[1])

        # one step
        if one_step:
            assert init_h, 'previous state must be provided!'

        padded_mask = self.get_padded_shuffled_mask(mask, pad=0)
        X = X.dimshuffle((1, 0, 2))

        # precompute input projections for all time steps
        x_z = dot(X, self.W_z, self.b_z)
        x_r = dot(X, self.W_r, self.b_r)
        x_h = dot(X, self.W_h, self.b_h)

        """
        JZS3 with constant context. (not attention here.)
""" if C is not None: assert C.ndim == 2 ctx_step = C.dimshuffle('x', 0, 1) # C: (nb_samples, context_dim) x_z += dot(ctx_step, self.C_z) x_r += dot(ctx_step, self.C_r) x_h += dot(ctx_step, self.C_h) """ JZS3 with additional initial/previous state. """ if init_h is None: init_h = T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1) if one_step: seq = [x_z, x_r, x_h, padded_mask] outputs_info = [init_h] non_seq = [self.U_z, self.U_r, self.U_h] outputs = self._step(*(seq + outputs_info + non_seq)) else: outputs, updates = theano.scan( self._step, sequences=[x_z, x_r, x_h, padded_mask], outputs_info=init_h, non_sequences=[self.U_z, self.U_r, self.U_h], ) if return_sequence: return outputs.dimshuffle((1, 0, 2)) return outputs[-1] class LSTM(Recurrent): def __init__(self, input_dim=0, output_dim=128, context_dim=None, init='glorot_uniform', inner_init='orthogonal', forget_bias_init='one', activation='tanh', inner_activation='sigmoid', name=None, weights=None): super(LSTM, self).__init__() """ Standard model """ self.input_dim = input_dim self.output_dim = output_dim self.init = initializations.get(init) self.inner_init = initializations.get(inner_init) self.forget_bias_init = initializations.get(forget_bias_init) self.activation = activations.get(activation) self.inner_activation = activations.get(inner_activation) # input gate param. self.W_i = self.init((self.input_dim, self.output_dim)) self.U_i = self.inner_init((self.output_dim, self.output_dim)) self.b_i = shared_zeros(self.output_dim) # forget gate param. self.W_f = self.init((self.input_dim, self.output_dim)) self.U_f = self.inner_init((self.output_dim, self.output_dim)) self.b_f = self.forget_bias_init(self.output_dim) # forget gate needs one bias. # output gate param. self.W_o = self.init((self.input_dim, self.output_dim)) self.U_o = self.inner_init((self.output_dim, self.output_dim)) self.b_o = shared_zeros(self.output_dim) # memory param. 
self.W_c = self.init((self.input_dim, self.output_dim)) self.U_c = self.inner_init((self.output_dim, self.output_dim)) self.b_c = shared_zeros(self.output_dim) # set names self.W_i.name, self.U_i.name, self.b_i.name = 'Wi', 'Ui', 'bi' self.W_f.name, self.U_f.name, self.b_f.name = 'Wf', 'Uf', 'bf' self.W_o.name, self.U_o.name, self.b_o.name = 'Wo', 'Uo', 'bo' self.W_c.name, self.U_c.name, self.b_c.name = 'Wc', 'Uc', 'bc' self.params = [ self.W_i, self.U_i, self.b_i, self.W_f, self.U_f, self.b_f, self.W_o, self.U_o, self.b_o, self.W_c, self.U_c, self.b_c, ] """ context inputs. """ if context_dim is not None: self.context_dim = context_dim self.C_i = self.init((self.context_dim, self.output_dim)) self.C_f = self.init((self.context_dim, self.output_dim)) self.C_o = self.init((self.context_dim, self.output_dim)) self.C_c = self.init((self.context_dim, self.output_dim)) self.C_i.name, self.C_f.name, self.C_o.name, self.C_c.name = 'Ci', 'Cf', 'Co', 'Cc' self.params += [self.C_i, self.C_f, self.C_o, self.C_c] if weights is not None: self.set_weights(weights) if name is not None: self.set_name(name) def _step(self, xi_t, xf_t, xo_t, xc_t, mask_t, h_tm1, c_tm1, u_i, u_f, u_o, u_c): # h_mask_tm1 = mask_tm1 * h_tm1 i = self.inner_activation(xi_t + T.dot(h_tm1, u_i)) # input gate f = self.inner_activation(xf_t + T.dot(h_tm1, u_f)) # forget gate o = self.inner_activation(xo_t + T.dot(h_tm1, u_o)) # output gate c = self.activation(xc_t + T.dot(h_tm1, u_c)) # memory updates # update the memory cell. c_t = f * c_tm1 + i * c h_t = o * self.activation(c_t) # masking c_t = c_t * mask_t + (1 - mask_t) * c_tm1 h_t = h_t * mask_t + (1 - mask_t) * h_tm1 return h_t, c_t def input_embed(self, X, C=None): x_i = dot(X, self.W_i, self.b_i) x_f = dot(X, self.W_f, self.b_f) x_o = dot(X, self.W_o, self.b_o) x_c = dot(X, self.W_c, self.b_c) """ LSTM with constant context. (not attention here.) 
""" if C is not None: assert C.ndim == 2 ctx_step = C.dimshuffle('x', 0, 1) # C: (nb_samples, context_dim) x_i += dot(ctx_step, self.C_i) x_f += dot(ctx_step, self.C_f) x_o += dot(ctx_step, self.C_o) x_c += dot(ctx_step, self.C_c) return x_i, x_f, x_o, x_c def __call__(self, X, mask=None, C=None, init_h=None, init_c=None, return_sequence=False, one_step=False): # recurrent cell only work for tensor if X.ndim == 2: X = X[:, None, :] # mask if mask is None: # sampling or beam-search mask = T.alloc(1., X.shape[0], X.shape[1]) # one step if one_step: assert init_h, 'previous state must be provided!' padded_mask = self.get_padded_shuffled_mask(mask, pad=0) X = X.dimshuffle((1, 0, 2)) x_i, x_f, x_o, x_c = self.input_embed(X, C) """ LSTM with additional initial/previous state. """ if init_h is None: init_h = T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1) if init_c is None: init_c = init_h if one_step: seq = [x_i, x_f, x_o, x_c, padded_mask] outputs_info = [init_h, init_c] non_seq = [self.U_i, self.U_f, self.U_o, self.U_c] outputs = self._step(*(seq + outputs_info + non_seq)) else: outputs, updates = theano.scan( self._step, sequences=[x_i, x_f, x_o, x_c, padded_mask], outputs_info=[init_h, init_c], non_sequences=[self.U_i, self.U_f, self.U_o, self.U_c], ) if return_sequence: return outputs[0].dimshuffle((1, 0, 2)), outputs[1].dimshuffle((1, 0, 2)) # H, C return outputs[0][-1], outputs[1][-1] ================================================ FILE: emolga/models/__init__.py ================================================ __author__ = 'jiataogu' ================================================ FILE: emolga/models/core.py ================================================ import json import numpy from keyphrase import config __author__ = 'jiataogu' import theano import logging import deepdish as dd from emolga.dataset.build_dataset import serialize_to_file, deserialize_from_file, serialize_to_file_json from emolga.utils.theano_utils import floatX logger = 
class Model(object):
    """
    Base container for a network: tracks sub-layers, their shared
    parameters, monitored variables, and provides save/load of weights.
    """

    def __init__(self):
        self.layers = []      # registered sub-layers
        self.params = []      # flattened parameter list of all layers
        self.monitor = {}     # name -> monitored theano variable
        self.watchlist = []   # ordered names compiled into self.watch

    def _add(self, layer):
        """Register a sub-layer and absorb its parameters; ignores None."""
        if layer:
            self.layers.append(layer)
            self.params += layer.params

    def _monitoring(self):
        # add monitoring variables, qualified as '<var>@<layer name>'
        for l in self.layers:
            for v in l.monitor:
                name = v + '@' + l.name
                print(name)
                self.monitor[name] = l.monitor[v]

    def compile_monitoring(self, inputs, updates=None):
        """Compile self.watch, a function returning all monitored variables."""
        logger.info('compile monitoring')
        for i, v in enumerate(self.monitor):
            self.watchlist.append(v)
            logger.info('monitoring [{0}]: {1}'.format(i, v))

        self.watch = theano.function(inputs,
                                     [self.monitor[v] for v in self.watchlist],
                                     updates=updates
                                     )
        logger.info('done.')

    def set_weights(self, weights):
        """
        Load a list of numpy arrays into the shared parameters, in order.
        Raises if any shape mismatches.
        """
        if hasattr(self, 'save_parm'):
            # extra non-trainable variables persisted alongside params
            params = self.params + self.save_parm
        else:
            params = self.params
        for p, w in zip(params, weights):
            # print(p.name)
            if p.eval().shape != w.shape:
                raise Exception("Layer shape %s not compatible with weight shape %s." %
                                (p.eval().shape, w.shape))
            p.set_value(floatX(w))

    def get_weights(self):
        """Return all parameter values (plus save_parm extras) as a list."""
        weights = []
        for p in self.params:
            weights.append(p.get_value())
        if hasattr(self, 'save_parm'):
            for v in self.save_parm:
                weights.append(v.get_value())
        return weights

    def set_name(self, name):
        """Prefix every parameter name with the model name (or number unnamed ones)."""
        for i in range(len(self.params)):
            if self.params[i].name is None:
                self.params[i].name = '%s_p%d' % (name, i)
            else:
                self.params[i].name = name + '@' + self.params[i].name
        self.name = name

    def save(self, filename):
        """Serialize all weights to `filename`."""
        if hasattr(self, 'save_parm'):
            params = self.params + self.save_parm
        else:
            params = self.params

        ps = 'save: <\n'
        for p in params:
            ps += '{0}: {1}\n'.format(p.name, p.eval().shape)
        ps += '> to ... {}'.format(filename)
        # logger.info(ps)

        # hdf5 module seems works abnormal !!
        # dd.io.save(filename, self.get_weights())
        serialize_to_file(self.get_weights(), filename)

    def load(self, filename):
        """Deserialize weights from `filename` and install them."""
        logger.info('load the weights.')
        # hdf5 module seems works abnormal !!
        # weights = dd.io.load(filename)
        weights = deserialize_from_file(filename)
        # print(len(weights))
        self.set_weights(weights)

    def save_weight_json(self, filename):
        '''
        Write weights into file for manually check (not useful, too many parameters in (-1,1))
        :param filename:
        :return:
        '''
        print('Save weights into file: %s' % filename)
        with open(filename, 'w') as f:
            for p in self.params:
                # one JSON list per parameter tensor
                # (renamed from `str`, which shadowed the builtin)
                dump = json.dumps(p.get_value().tolist()) + '\n'
                f.write(dump)

    def load_weight_json(self, filename):
        '''
        Write weights into file for manually check (not useful, too many parameters in (-1,1))
        :param filename:
        :return:
        '''
        count_1 = 0
        count_5 = 0
        count_10 = 0
        total = 0
        # renamed from `max`/`list`, which shadowed the builtins
        max_val = 0.
        with open(filename, 'r') as f:
            for line in f:
                values = numpy.array(json.loads(line)).ravel()
                for e in values:
                    total += 1
                    if abs(e) > 1:
                        count_1 += 1
                    if abs(e) > 5:
                        count_5 += 1
                    if abs(e) > 10:
                        count_10 += 1
                        print(e)
                    if abs(e) > max_val:
                        max_val = abs(e)
                        print('new max = %f' % e)
        print('total = %d' % total)
        print('count < 1/5/10 = %d / %d / %d' % (count_1, count_5, count_10))
        print('max = %f' % max_val)
class Encoder(Model):
    """
    Recurrent Neural Network-based Encoder
    It is used to compute the context vector.
    """

    def __init__(self, config, rng, prefix='enc',
                 mode='Evaluation', embed=None, use_context=False):
        """
        Build the encoder's layers from `config`.

        embed: optionally reuse an externally created Embedding layer.
        use_context: if True, the initial hidden state is computed from
        an external context vector via a Dense 'Initializer' layer.
        """
        super(Encoder, self).__init__()
        self.config = config
        self.rng = rng
        self.prefix = prefix
        self.mode = mode
        self.name = prefix
        self.use_context = use_context

        self.return_embed = False
        self.return_sequence = False

        """
        Create all elements of the Encoder's Computational graph
        """
        # create Embedding layers
        logger.info("{}_create embedding layers.".format(self.prefix))
        if embed:
            self.Embed = embed
        else:
            self.Embed = Embedding(
                self.config['enc_voc_size'],
                self.config['enc_embedd_dim'],
                name="{}_embed".format(self.prefix))
            # NOTE(review): registered only when created here, so a shared
            # embedding passed in via `embed` is not double-registered.
            self._add(self.Embed)

        if self.use_context:
            self.Initializer = Dense(
                config['enc_contxt_dim'],
                config['enc_hidden_dim'],
                activation='tanh',
                name="{}_init".format(self.prefix)
            )
            self._add(self.Initializer)

        """
        Encoder Core
        """
        # create RNN cells: a single cell, or a forward/backward pair
        # when config['bidirectional'] is set.
        if not self.config['bidirectional']:
            logger.info("{}_create RNN cells.".format(self.prefix))
            self.RNN = RNN(
                self.config['enc_embedd_dim'],
                self.config['enc_hidden_dim'],
                None if not use_context else self.config['enc_contxt_dim'],
                name="{}_cell".format(self.prefix)
            )
            self._add(self.RNN)
        else:
            logger.info("{}_create forward RNN cells.".format(self.prefix))
            self.forwardRNN = RNN(
                self.config['enc_embedd_dim'],
                self.config['enc_hidden_dim'],
                None if not use_context else self.config['enc_contxt_dim'],
                name="{}_fw_cell".format(self.prefix)
            )
            self._add(self.forwardRNN)

            logger.info("{}_create backward RNN cells.".format(self.prefix))
            self.backwardRNN = RNN(
                self.config['enc_embedd_dim'],
                self.config['enc_hidden_dim'],
                None if not use_context else self.config['enc_contxt_dim'],
                name="{}_bw_cell".format(self.prefix)
            )
            self._add(self.backwardRNN)

        logger.info("create encoder ok.")

    def build_encoder(self, source, context=None, return_embed=False,
                      return_sequence=False, return_gates=False,
                      clean_mask=False):
        """
        Build the Encoder Computational Graph

        For the copynet default configurations (with attention)
            return_embed=True, return_sequence=True, return_gates=True, clean_mask=False
        Input:
            source : source text, a list of indexes, shape=[nb_sample, max_len]
            context: None
        Return:
            For Attention model:
                return_sequence=True: to return the embedding at each time, not just the end state
                return_embed=True:
                    X_out: a list of vectors [nb_sample, max_len, 2*enc_hidden_dim], encoding of each time state (concatenate both forward and backward RNN)
                    X: embedding of text X [nb_sample, max_len, enc_embedd_dim]
                    X_mask: mask, an array showing which elements in X are not 0 [nb_sample, src_max_len]
                    X_tail: encoding of ending of X, seems not make sense for bidirectional model (head+tail) [nb_sample, 2*enc_hidden_dim]

            nb_sample: number of samples, defined by batch size
            max_len: max length of sentence (lengths of input are same after padding)
        """
        # clean_mask means we set the hidden states of masked places as 0.
        # sometimes it will help the program to solve something
        # note that this option only works when return_sequence.
        # we recommend to leave at least one mask in the end of encoded sequence.

        # Initial state
        Init_h = None
        if self.use_context:
            Init_h = self.Initializer(context)

        # word embedding
        if not self.config['bidirectional']:
            X, X_mask = self.Embed(source, True)
            if return_gates:
                X_out, Z, R = self.RNN(X, X_mask, C=context, init_h=Init_h,
                                       return_sequence=return_sequence,
                                       return_gates=True)
            else:
                X_out = self.RNN(X, X_mask, C=context, init_h=Init_h,
                                 return_sequence=return_sequence,
                                 return_gates=False)
            if return_sequence:
                X_tail = X_out[:, -1]

                if clean_mask:
                    X_out = X_out * X_mask[:, :, None]
            else:
                X_tail = X_out
        else:
            # feed the reversed source to the "forward" cell; outputs are
            # re-reversed below so both directions align per time step.
            source2 = source[:, ::-1]
            '''
            Get the embedding of inputs
                shape(X)=[nb_sample, max_len, emb_dim]
                shape(X_mask)=[nb_sample, max_len]
            '''
            X, X_mask = self.Embed(source , mask_zero=True)
            X2, X2_mask = self.Embed(source2, mask_zero=True)

            '''
            Get the output after RNN
                return_sequence=True
            '''
            if not return_gates:
                '''
                X_out: hidden state of all times, shape=(nb_samples, max_sent_len, input_emb_dim)
                '''
                X_out1 = self.backwardRNN(X, X_mask, C=context, init_h=Init_h,
                                          return_sequence=return_sequence)
                X_out2 = self.forwardRNN(X2, X2_mask, C=context, init_h=Init_h,
                                         return_sequence=return_sequence)
            else:
                '''
                X_out: hidden state of all times, shape=(nb_samples, max_sent_len, input_emb_dim)
                Z:     update gate value, shape=(n_samples, 1)
                R:     reset gate value, shape=(n_samples, 1)
                '''
                X_out1, Z1, R1 = self.backwardRNN(X, X_mask, C=context, init_h=Init_h,
                                                  return_sequence=return_sequence,
                                                  return_gates=True)
                X_out2, Z2, R2 = self.forwardRNN(X2, X2_mask, C=context, init_h=Init_h,
                                                 return_sequence=return_sequence,
                                                 return_gates=True)
                # re-reverse the forward-cell gates to source order before concat
                Z = T.concatenate([Z1, Z2[:, ::-1, :]], axis=2)
                R = T.concatenate([R1, R2[:, ::-1, :]], axis=2)

            if not return_sequence:
                X_out = T.concatenate([X_out1, X_out2], axis=1)
                X_tail = X_out
            else:
                # per-timestep concat of both directions (forward re-reversed)
                X_out = T.concatenate([X_out1, X_out2[:, ::-1, :]], axis=2)
                X_tail = T.concatenate([X_out1[:, -1], X_out2[:, -1]], axis=1)

                if clean_mask:
                    X_out = X_out * X_mask[:, :, None]

        X_mask = T.cast(X_mask, dtype='float32')
        if not return_gates:
            if return_embed:
                return X_out, X, X_mask, X_tail
            return X_out
        else:
            if return_embed:
                return X_out, X, X_mask, X_tail, Z, R
            return X_out, Z, R

    def compile_encoder(self, with_context=False, return_embed=False, return_sequence=False):
        """
        Compile self.encode and self.gtenc (encode + gate values) as
        callable theano functions over a source index matrix.
        """
        source = T.imatrix()
        self.return_embed = return_embed
        self.return_sequence = return_sequence
        if with_context:
            context = T.matrix()

            self.encode = theano.function([source, context],
                                          self.build_encoder(source, context,
                                                             return_embed=return_embed,
                                                             return_sequence=return_sequence),
                                          allow_input_downcast=True)
            self.gtenc = theano.function([source, context],
                                         self.build_encoder(source, context,
                                                            return_embed=return_embed,
                                                            return_sequence=return_sequence,
                                                            return_gates=True),
                                         allow_input_downcast=True)
        else:
            """
            return
                X_out: a list of vectors [nb_sample, max_len, 2*enc_hidden_dim], encoding of each time state (concatenate both forward and backward RNN)
                X: embedding of text X [nb_sample, max_len, enc_embedd_dim]
                X_mask: mask, an array showing which elements in X are not 0 [nb_sample, max_len]
                X_tail: encoding of end of X, seems not make sense for bidirectional model (head+tail) [nb_sample, 2*enc_hidden_dim]
            """
            self.encode = theano.function([source],
                                          self.build_encoder(source, None,
                                                             return_embed=return_embed,
                                                             return_sequence=return_sequence),
                                          allow_input_downcast=True)
            """
            return
                Z: value of update gate, shape=(nb_sample, 1)
                R: value of update gate, shape=(nb_sample, 1)
            """
            self.gtenc = theano.function([source],
                                         self.build_encoder(source, None,
                                                            return_embed=return_embed,
                                                            return_sequence=return_sequence,
                                                            return_gates=True),
                                         allow_input_downcast=True)
""" def __init__(self, config, rng, prefix='dec', mode='RNN', embed=None, highway=False): """ mode = RNN: use a RNN Decoder """ super(Decoder, self).__init__() self.config = config self.rng = rng self.prefix = prefix self.name = prefix self.mode = mode self.highway = highway self.init = initializations.get('glorot_uniform') self.sigmoid = activations.get('sigmoid') # use standard drop-out for input & output. # I believe it should not use for context vector. self.dropout = config['dropout'] if self.dropout > 0: logger.info('Use standard-dropout!!!!') self.D = Dropout(rng=self.rng, p=self.dropout, name='{}_Dropout'.format(prefix)) """ Create all elements of the Decoder's computational graph. """ # create Embedding layers logger.info("{}_create embedding layers.".format(self.prefix)) if embed: self.Embed = embed else: self.Embed = Embedding( self.config['dec_voc_size'], self.config['dec_embedd_dim'], name="{}_embed".format(self.prefix)) self._add(self.Embed) # create Initialization Layers logger.info("{}_create initialization layers.".format(self.prefix)) if not config['bias_code']: self.Initializer = Zero() else: self.Initializer = Dense( config['dec_contxt_dim'], config['dec_hidden_dim'], activation='tanh', name="{}_init".format(self.prefix) ) # create RNN cells logger.info("{}_create RNN cells.".format(self.prefix)) if 'location_embed' in self.config: if config['location_embed']: dec_embedd_dim = 2 * self.config['dec_embedd_dim'] else: dec_embedd_dim = self.config['dec_embedd_dim'] else: dec_embedd_dim = self.config['dec_embedd_dim'] self.RNN = RNN( dec_embedd_dim, self.config['dec_hidden_dim'], self.config['dec_contxt_dim'], name="{}_cell".format(self.prefix) ) self._add(self.Initializer) self._add(self.RNN) # HighWay Gating if highway: logger.info("HIGHWAY CONNECTION~~~!!!") assert self.config['context_predict'] assert self.config['dec_contxt_dim'] == self.config['dec_hidden_dim'] self.C_x = self.init((self.config['dec_contxt_dim'], 
self.config['dec_hidden_dim'])) self.H_x = self.init((self.config['dec_hidden_dim'], self.config['dec_hidden_dim'])) self.b_x = initializations.get('zero')(self.config['dec_hidden_dim']) self.C_x.name = '{}_Cx'.format(self.prefix) self.H_x.name = '{}_Hx'.format(self.prefix) self.b_x.name = '{}_bx'.format(self.prefix) self.params += [self.C_x, self.H_x, self.b_x] # create readout layers logger.info("_create Readout layers") # 1. hidden layers readout. self.hidden_readout = Dense( self.config['dec_hidden_dim'], self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_hidden_readout".format(self.prefix) ) # 2. previous word readout self.prev_word_readout = None if self.config['bigram_predict']: self.prev_word_readout = Dense( dec_embedd_dim, self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_prev_word_readout".format(self.prefix), learn_bias=False ) # 3. context readout self.context_readout = None if self.config['context_predict']: if not self.config['leaky_predict']: self.context_readout = Dense( self.config['dec_contxt_dim'], self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_context_readout".format(self.prefix), learn_bias=False ) else: assert self.config['dec_contxt_dim'] == self.config['dec_hidden_dim'] self.context_readout = self.hidden_readout # option: deep output (maxout) if self.config['deep_out']: self.activ = Activation(config['deep_out_activ']) # self.dropout = Dropout(rng=self.rng, p=config['dropout']) self.output_nonlinear = [self.activ] # , self.dropout] self.output = Dense( self.config['output_dim'] / 2 if config['deep_out_activ'] == 'maxout2' else self.config['output_dim'], self.config['dec_voc_size'], activation='softmax', name="{}_output".format(self.prefix), learn_bias=False ) else: self.output_nonlinear = [] self.output = Activation('softmax') # 
registration: self._add(self.hidden_readout) if not self.config['leaky_predict']: self._add(self.context_readout) self._add(self.prev_word_readout) self._add(self.output) if self.config['deep_out']: self._add(self.activ) # self._add(self.dropout) logger.info("create decoder ok.") @staticmethod def _grab_prob(probs, X, block_unk=False): assert probs.ndim == 3 batch_size = probs.shape[0] max_len = probs.shape[1] vocab_size = probs.shape[2] probs = probs.reshape((batch_size * max_len, vocab_size)) return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape) # advanced indexing """ Build the decoder for evaluation """ def prepare_xy(self, target): # Word embedding Y, Y_mask = self.Embed(target, True) # (nb_samples, max_len, embedding_dim) if self.config['use_input']: X = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, Y.shape[2]), Y[:, :-1, :]], axis=1) else: X = 0 * Y # option ## drop words. X_mask = T.concatenate([T.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1) Count = T.cast(T.sum(X_mask, axis=1), dtype=theano.config.floatX) return X, X_mask, Y, Y_mask, Count def build_decoder(self, target, context=None, return_count=False, train=True): """ Build the Decoder Computational Graph For training/testing """ X, X_mask, Y, Y_mask, Count = self.prepare_xy(target) # input drop-out if any. if self.dropout > 0: X = self.D(X, train=train) # Initial state of RNN Init_h = self.Initializer(context) if not self.highway: X_out = self.RNN(X, X_mask, C=context, init_h=Init_h, return_sequence=True) # Readout readout = self.hidden_readout(X_out) if self.dropout > 0: readout = self.D(readout, train=train) if self.config['context_predict']: readout += self.context_readout(context).dimshuffle(0, 'x', 1) else: X = X.dimshuffle((1, 0, 2)) X_mask = X_mask.dimshuffle((1, 0)) def _recurrence(x, x_mask, prev_h, c): # compute the highway gate for context vector. xx = dot(c, self.C_x, self.b_x) + dot(prev_h, self.H_x) # highway gate. 
xx = self.sigmoid(xx) cy = xx * c # the path without using RNN x_out = self.RNN(x, mask=x_mask, C=c, init_h=prev_h, one_step=True) hx = (1 - xx) * x_out cy = T.cast(cy, 'float32') x_out = T.cast(x_out, 'float32') hx = T.cast(hx, 'float32') return x_out, hx, cy outputs, _ = theano.scan( _recurrence, sequences=[X, X_mask], outputs_info=[Init_h, None, None], non_sequences=[context] ) # hidden readout + context readout readout = self.hidden_readout( outputs[1].dimshuffle((1, 0, 2))) if self.dropout > 0: readout = self.D(readout, train=train) readout += self.context_readout(outputs[2].dimshuffle((1, 0, 2))) # return to normal size. X = X.dimshuffle((1, 0, 2)) X_mask = X_mask.dimshuffle((1, 0)) if self.config['bigram_predict']: readout += self.prev_word_readout(X) for l in self.output_nonlinear: readout = l(readout) prob_dist = self.output(readout) # (nb_samples, max_len, vocab_size) # log_old = T.sum(T.log(self._grab_prob(prob_dist, target)), axis=1) log_prob = T.sum(T.log(self._grab_prob(prob_dist, target) + err) * X_mask, axis=1) log_ppl = log_prob / Count if return_count: return log_prob, Count else: return log_prob, log_ppl """ Sample one step """ def _step_sample(self, prev_word, prev_stat, context): # word embedding (note that for the first word, embedding should be all zero) if self.config['use_input']: X = T.switch( prev_word[:, None] < 0, alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']), self.Embed(prev_word) ) else: X = alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']) if self.dropout > 0: X = self.D(X, train=False) # apply one step of RNN if not self.highway: X_proj = self.RNN(X, C=context, init_h=prev_stat, one_step=True) next_stat = X_proj # compute the readout probability distribution and sample it # here the readout is a matrix, different from the learner. 
readout = self.hidden_readout(next_stat) if self.dropout > 0: readout = self.D(readout, train=False) if self.config['context_predict']: readout += self.context_readout(context) else: xx = dot(context, self.C_x, self.b_x) + dot(prev_stat, self.H_x) # highway gate. xx = self.sigmoid(xx) X_proj = self.RNN(X, C=context, init_h=prev_stat, one_step=True) next_stat = X_proj readout = self.hidden_readout((1 - xx) * X_proj) if self.dropout > 0: readout = self.D(readout, train=False) readout += self.context_readout(xx * context) if self.config['bigram_predict']: readout += self.prev_word_readout(X) for l in self.output_nonlinear: readout = l(readout) next_prob = self.output(readout) next_sample = self.rng.multinomial(pvals=next_prob).argmax(1) return next_prob, next_sample, next_stat """ Build the sampler for sampling/greedy search/beam search """ def build_sampler(self): """ Build a sampler which only steps once. Typically it only works for one word a time? """ logger.info("build sampler ...") if self.config['sample_stoch'] and self.config['sample_argmax']: logger.info("use argmax search!") elif self.config['sample_stoch'] and (not self.config['sample_argmax']): logger.info("use stochastic sampling!") elif self.config['sample_beam'] > 1: logger.info("use beam search! (beam_size={})".format(self.config['sample_beam'])) # initial state of our Decoder. context = T.matrix() # theano variable. 
init_h = self.Initializer(context) logger.info('compile the function: get_init_state') self.get_init_state \ = theano.function([context], init_h, name='get_init_state', allow_input_downcast=True) logger.info('done.') # word sampler: 1 x 1 prev_word = T.vector('prev_word', dtype='int32') prev_stat = T.matrix('prev_state', dtype='float32') next_prob, next_sample, next_stat \ = self._step_sample(prev_word, prev_stat, context) # next word probability logger.info('compile the function: sample_next') inputs = [prev_word, prev_stat, context] outputs = [next_prob, next_sample, next_stat] self.sample_next = theano.function(inputs, outputs, name='sample_next', allow_input_downcast=True) logger.info('done') pass """ Build a Stochastic Sampler which can use SCAN to work on GPU. However it cannot be used in Beam-search. """ def build_stochastic_sampler(self): context = T.matrix() init_h = self.Initializer(context) logger.info('compile the function: sample') pass """ Generate samples, either with stochastic sampling or beam-search! """ def get_sample(self, context, k=1, maxlen=30, stochastic=True, argmax=False, fixlen=False): # beam size if k > 1: assert not stochastic, 'Beam search does not support stochastic sampling!!' # fix length cannot use beam search # if fixlen: # assert k == 1 # prepare for searching sample = [] score = [] if stochastic: score = 0 live_k = 1 dead_k = 0 hyp_samples = [[]] * live_k hyp_scores = np.zeros(live_k).astype(theano.config.floatX) hyp_states = [] # get initial state of decoder RNN with context next_state = self.get_init_state(context) next_word = -1 * np.ones((1,)).astype('int32') # indicator for the first target word (bos target) # Start searching! for ii in range(maxlen): # print(next_word) ctx = np.tile(context, [live_k, 1]) next_prob, next_word, next_state \ = self.sample_next(next_word, next_state, ctx) # wtf. if stochastic: # using stochastic sampling (or greedy sampling.) 
if argmax: nw = next_prob[0].argmax() next_word[0] = nw else: nw = next_word[0] sample.append(nw) score += next_prob[0, nw] if (not fixlen) and (nw == 0): # sample reached the end break else: # using beam-search # we can only computed in a flatten way! cand_scores = hyp_scores[:, None] - np.log(next_prob) cand_flat = cand_scores.flatten() ranks_flat = cand_flat.argsort()[:(k - dead_k)] # fetch the best results. voc_size = next_prob.shape[1] trans_index = ranks_flat / voc_size word_index = ranks_flat % voc_size costs = cand_flat[ranks_flat] # get the new hyp samples new_hyp_samples = [] new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX) new_hyp_states = [] for idx, [ti, wi] in enumerate(zip(trans_index, word_index)): new_hyp_samples.append(hyp_samples[ti] + [wi]) new_hyp_scores[idx] = copy.copy(costs[idx]) new_hyp_states.append(copy.copy(next_state[ti])) # check the finished samples new_live_k = 0 hyp_samples = [] hyp_scores = [] hyp_states = [] for idx in range(len(new_hyp_samples)): if (new_hyp_states[idx][-1] == 0) and (not fixlen): sample.append(new_hyp_samples[idx]) score.append(new_hyp_scores[idx]) dead_k += 1 else: new_live_k += 1 hyp_samples.append(new_hyp_samples[idx]) hyp_scores.append(new_hyp_scores[idx]) hyp_states.append(new_hyp_states[idx]) hyp_scores = np.array(hyp_scores) live_k = new_live_k if new_live_k < 1: break if dead_k >= k: break next_word = np.array([w[-1] for w in hyp_samples]) next_state = np.array(hyp_states) pass pass # end. 
if not stochastic: # dump every remaining one if live_k > 0: for idx in range(live_k): sample.append(hyp_samples[idx]) score.append(hyp_scores[idx]) return sample, score class DecoderAtt(Decoder): """ Recurrent Neural Network-based Decoder [for CopyNet-b Only] with Attention Mechanism """ def __init__(self, config, rng, prefix='dec', mode='RNN', embed=None, copynet=False, identity=False): super(DecoderAtt, self).__init__( config, rng, prefix, mode, embed, False) self.init = initializations.get('glorot_uniform') self.copynet = copynet self.identity = identity # attention reader self.attention_reader = Attention( self.config['dec_hidden_dim'], self.config['dec_contxt_dim'], 1000, name='source_attention', coverage=self.config['coverage'] ) self._add(self.attention_reader) # if use copynet if self.copynet: if not self.identity: self.Is = Dense( self.config['dec_contxt_dim'], self.config['dec_embedd_dim'], name='in-trans' ) else: assert self.config['dec_contxt_dim'] == self.config['dec_embedd_dim'] self.Is = Identity(name='ini') self.Os = Dense( self.config['dec_readout_dim'] if not self.config['location_embed'] else self.config['dec_readout_dim'] + self.config['dec_embedd_dim'], self.config['dec_contxt_dim'], name='out-trans' ) if self.config['copygate']: self.Gs = Dense( self.config['dec_readout_dim'] + self.config['dec_embedd_dim'], 1, name='copy-gate', activation='linear', learn_bias=True, negative_bias=True ) self._add(self.Gs) if self.config['location_embed']: self._add(self.Is) self._add(self.Os) logger.info('adjust decoder ok.') """ Build the decoder for evaluation """ def prepare_xy(self, target, cc_matrix): ''' create target input for decoder (append a zero to the head of each sequence) :param target: indexes of target words :param cc_matrix: copy-matrix, (batch_size, trg_len, src_len), cc_matrix[i][j][k]=1 if j-th word in target matches the k-th word in source :return: X: embedding of target sequences(batch_size, trg_len, embedding_dim) X_mask: if x is a real 
word or padding (batch_size, trg_len) LL: simply the copy-matrix (batch_size, trg_len, src_len) XL_mask: if word ll in LL has any copyable word in source (batch_size, trg_len) Y_mask: original mask of target sequences, but why do we need this? (batch_size, trg_len) Count: number of real words in target, original length of each target sequences. size=(batch_size, 1) ''' # target: (nb_samples, index_seq) # cc_matrix: (nb_samples, maxlen_t, maxlen_s) # context: (nb_samples) # create the embedding of target words and their masks Y, Y_mask = self.Embed(target, True) # (batch_size, trg_len, embedding_dim), (batch_size, trg_len) # append a zero array to the beginning of input # first word of each target sequence to be zero (just like ) as the initial input of decoder # create a zero array and concate to Y: (batch_size, 1, embedding_dim) + (batch_size, maxlen_t - 1, embedding_dim) # as it's sure that there's a least one in the end of Y, so feel free to drop the last word (Y[:, :-1, :]) X = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, Y.shape[2]), Y[:, :-1, :]], axis=1) # LL = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, cc_matrix.shape[2]), cc_matrix[:, :-1, :]], axis=1) # LL is the copy matrix LL = cc_matrix # a mask of copy mask, XL_mask[i][j]=1 shows the word in target has copyable/matching words in source text (batch_size, trg_len) XL_mask = T.cast(T.gt(T.sum(LL, axis=2), 0), dtype='float32') # 'use_input' means teacher forcing? if not, make decoder input to be zero if not self.config['use_input']: X *= 0 # create the mask of target input, append an [1] array to show , size=(batch_size, trg_len) X_mask = T.concatenate([T.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1) # how many real words (non-zero/non-padding) in each target sequence, size=(batch_size, 1) Count = T.cast(T.sum(X_mask, axis=1), dtype=theano.config.floatX) return X, X_mask, LL, XL_mask, Y_mask, Count """ The most different part. Be cautious!! Very different from traditional RNN search. 
""" def build_decoder(self, target, cc_matrix, context, c_mask, return_count=False, train=True): """ Build the Computational Graph ::> Context is essential target: (batch_size, trg_len) cc_matrix: (batch_size, trg_len, src_len), cc_matrix[i][j][k]=1 if in the i-th sample, the j-th word in target matches the k-th word in source context: (batch_size, src_len, 2 * enc_hidden_dim), encoding of each time step (concatenate both forward and backward RNN encodings) context: (nb_samples, max_len, contxt_dim) c_mask: (batch_size, src_len) mask, X_mask[i][j]=1 means j-th word of sample i in X is not 0 (index of ) """ assert c_mask is not None, 'c_mask must be supplied for this decoder.' assert context.ndim == 3, 'context must have 3 dimentions.' # A bridge layer transforming context vector of encoder_dim to decoder_dim, Is=(2 * enc_hidden_dim, dec_embedd_dim) if it's bidirectional context_A = self.Is(context) # (nb_samples, max_src_len, dec_embedd_dim) ''' X: embedding of target sequences(batch_size, trg_len, embedding_dim) X_mask: if x is a real word or padding (batch_size, trg_len) LL: simply the copy-matrix (batch_size, trg_len, src_len) XL_mask: if word ll in LL has any copyable word in source (batch_size, trg_len) Y_mask: original mask of target sequences, but why do we need this? (batch_size, trg_len) Count: number of real words in target, original length of each target sequences. size=(batch_size, 1) ''' X, X_mask, LL, XL_mask, Y_mask, Count = self.prepare_xy(target, cc_matrix) # input drop-out if any. 
if self.dropout > 0: X = self.D(X, train=train) # Initial state of RNN Init_h = self.Initializer(context[:, 0, :]) # initialize hidden vector by converting the last state Init_a = T.zeros((context.shape[0], context.shape[1]), dtype='float32') # (batch_size, src_len) coverage = T.zeros((context.shape[0], context.shape[1]), dtype='float32') # (batch_size, src_len) # permute to make dim of trg_len first X = X.dimshuffle((1, 0, 2)) # (trg_len, batch_size, embedding_dim) X_mask = X_mask.dimshuffle((1, 0)) # (trg_len, batch_size) LL = LL.dimshuffle((1, 0, 2)) # (trg_len, batch_size, src_len) XL_mask = XL_mask.dimshuffle((1, 0)) # (trg_len, batch_size) def _recurrence(x, x_mask, ll, xl_mask, prev_h, prev_a, cov, cc, cm, ca): """ x: (nb_samples, embed_dims) embedding of word in target sequence of current time step x_mask: (nb_samples, ) if x is a real word (1) or padding (0) ll: (nb_samples, maxlen_s) if x can be copied from the i-th word in source sequence (1) or not (0) xl_mask:(nb_samples, ) if x has any copyable word in source sequence ----------------------------------------- prev_h: (nb_samples, hidden_dims) hidden vector of previous step prev_a: (nb_samples, maxlen_s) a distribution of source telling which words are copy-attended in the previous step (initialized with zero) cov: (nb_samples, maxlen_s) a coverage vector telling which parts have been covered, implemented by attention (initialized with zero) ----------------------------------------- cc: (nb_samples, maxlen_s, context_dim) original context, encoding of source text cm: (nb_samples, maxlen_s) mask, (batch_size, src_len), X_mask[i][j]=1 means j-th word of sample i in X is not ca: (nb_samples, maxlen_s, decoder_dim) converted context_A, the context vector transformed by the bridge layer Is() """ ''' Generative Decoding ''' # Compute the attention and get the context weight, in source is masked prob = self.attention_reader(prev_h, cc, Smask=cm, Cov=cov) ncov = cov + prob # Compute the weighted context vector 
cxt = T.sum(cc * prob[:, :, None], axis=1) # Input feeding: obtain the new input by concatenating current input word x and previous attended context x_in = T.concatenate([x, T.sum(ca * prev_a[:, :, None], axis=1)], axis=-1) # compute the current hidden states of the RNN. hidden state of last time, shape=(nb_samples, output_emb_dim) next_h = self.RNN(x_in, mask=x_mask, C=cxt, init_h=prev_h, one_step=True) # compute the current readout vector. r_in = [next_h] if self.config['context_predict']: r_in += [cxt] if self.config['bigram_predict']: r_in += [x_in] # readout the word logits r_in = T.concatenate(r_in, axis=-1) # shape=(nb_samples, output_emb_dim) r_out = self.hidden_readout(next_h) # obtain the generative logits, (nb_samples, voc_size) if self.config['context_predict']: r_out += self.context_readout(cxt) if self.config['bigram_predict']: r_out += self.prev_word_readout(x_in) # Get the generate-mode output = tanh(r_out), note it's not logit nor prob for l in self.output_nonlinear: r_out = l(r_out) ''' Copying Decoding ''' # Eq.8, key=h_j*W_c. Os layer=tanh(dec_readout_dim+dec_embedd_dim, dec_contxt_dim), dec_readout_dim=output_emb_dim + enc_context_dim + dec_embed_dim key = self.Os(r_in) # output=(nb_samples, dec_contxt_dim) :: key for locating where to copy # Eq.8, compute the copy attention weights # (nb_samples, 1, dec_contxt_dim) * (nb_samples, src_maxlen, cxt_dim) -> sum(nb_samples, src_maxlen, 1) -> (nb_samples, src_maxlen) Eng = T.sum(key[:, None, :] * cc, axis=-1) # Copy gating, determine the contribution from generative and copying if self.config['copygate']: gt = self.sigmoid(self.Gs(r_in)) # Gs=(dec_readout_dim + dec_embedd_dim, 1), output=(nb_samples, 1) # plus a log prob to stabilize the computation. but are r_out and Eng log-ed probs? 
r_out += T.log(gt.flatten()[:, None]) Eng += T.log(1 - gt.flatten()[:, None]) # r_out *= gt.flatten()[:, None] # Eng *= 1 - gt.flatten()[:, None] # compute the logSumExp of both generative and copying probs, in source is masked EngSum = logSumExp(Eng, axis=-1, mask=cm, c=r_out) # (nb_samples, vocab_size + maxlen_s): T.exp(r_out - EngSum) is generate_prob, T.exp(Eng - EngSum) * cm is copy_prob next_p = T.concatenate([T.exp(r_out - EngSum), T.exp(Eng - EngSum) * cm], axis=-1) ''' self.config['dec_voc_size'] = 50000 next_b: the first 50000 probs in next_p is p_generate next_c: probs after 50000 in next_p is p_copy ''' next_c = next_p[:, self.config['dec_voc_size']:] * ll # copy_prob, mask off (ignore) the non-copyable words: (nb_samples, maxlen_s) * (nb_samples, maxlen_s) = (nb_samples, maxlen_s) next_b = next_p[:, :self.config['dec_voc_size']] # generate_prob sum_a = T.sum(next_c, axis=1, keepdims=True) # sum of copy_prob, telling how helpful the copy part is (nb_samples,) next_a = (next_c / (sum_a + err)) * xl_mask[:, None] # normalize the copy_prob for numerically consideration (nb_samples, maxlen_s), ignored if there is not word can be copied from source next_c = T.cast(next_c, 'float32') next_a = T.cast(next_a, 'float32') return next_h, next_a, ncov, sum_a, next_b outputs, _ = theano.scan( _recurrence, sequences=[X, X_mask, LL, XL_mask], outputs_info=[Init_h, Init_a, coverage, None, None], non_sequences=[context, c_mask, context_A] ) ''' shuffle (trg_len, batch_size, x) -> (batch_size, trg_len, x) X_out: hidden vector of each decoding step (not useful for computing error) source_prob: normalized copy_prob distribution of each decoding step (not useful for computing error) coverages: coverage vector (not useful for computing error) source_sum: generate_prob distribution of each decoding step prob_dist: sum of copy_prob of each decoding step ''' X_out, source_prob, coverages, source_sum, prob_dist = [z.dimshuffle((1, 0, 2)) for z in outputs] X = X.dimshuffle((1, 0, 
2)) X_mask = X_mask.dimshuffle((1, 0)) XL_mask = XL_mask.dimshuffle((1, 0)) # unk masking U_mask = T.ones_like(target, dtype='float32') * (1 - T.eq(target, 1)) U_mask += (1 - U_mask) * (1 - XL_mask) # The most different part is here !! # self._grab_prob(prob_dist, target) computes the error of generative part, source_sum.sum(axis=-1) gives the error of copying part log_prob = T.sum(T.log( T.clip(self._grab_prob(prob_dist, target) * U_mask + source_sum.sum(axis=-1) + err, 1e-7, 1.0) ) * X_mask, dtype='float32', axis=1) log_ppl = log_prob / (Count + err) if return_count: return log_prob, Count else: return log_prob, log_ppl """ Sample one step """ def _step_sample(self, prev_word, prev_stat, prev_loc, prev_cov, context, c_mask, context_A): """ Get the probability of next word, sec 3.2 and 3.3 :param prev_word : index of previous words, size=(1, live_k) :param prev_stat : output encoding of last time, size=(1, live_k, output_dim) :param prev_loc : information needed for copy-based predicting :param prev_cov : information needed for copy-based predicting :param context : encoding of source text, shape = [live_k, sent_len, 2*output_dim] :param c_mask : mask fof source text, shape = [live_k, sent_len] :param context_A: an identity layer (do nothing but return the context) :returns: next_prob : probabilities of next word, shape=(1, voc_size+sent_len) next_prob0[:voc_size] is generative probability next_prob0[voc_size:voc_size+sent_len] is copy probability next_sample : only useful for stochastic next_stat : output (decoding) vector after time t ncov : next_stat : """ assert c_mask is not None, 'we need the source mask.' 
        # Word embedding of the previous word.
        # For the start symbol (index -1) feed zeros; otherwise concatenate the
        # embedding with the selective read of the bridged context, weighted by
        # the previous copy distribution.
        X = T.switch(
            prev_word[:, None] < 0,
            alloc_zeros_matrix(prev_word.shape[0], 2 * self.config['dec_embedd_dim']),
            T.concatenate([self.Embed(prev_word),
                           T.sum(context_A * prev_loc[:, :, None], axis=1)
                           ], axis=-1)
        )

        if self.dropout > 0:
            X = self.D(X, train=False)

        # Attention with coverage, then one RNN step (mirrors _recurrence in build_decoder).
        Probs = self.attention_reader(prev_stat, context, c_mask, Cov=prev_cov)
        ncov = prev_cov + Probs
        cxt = T.sum(context * Probs[:, :, None], axis=1)
        X_proj, zz, rr = self.RNN(X, C=cxt,
                                  init_h=prev_stat,
                                  one_step=True,
                                  return_gates=True)
        next_stat = X_proj

        # Assemble the readout input; here the readouts are matrices (batched),
        # unlike the per-step vectors inside the training scan.
        readin = [next_stat]
        if self.config['context_predict']:
            readin += [cxt]
        if self.config['bigram_predict']:
            readin += [X]

        # if gating
        # if self.config['copygate']:
        #     gt = self.sigmoid(self.Gs(readin))  # (nb_samples, dim)
        #     readin *= 1 - gt
        #     readout = self.hidden_readout(next_stat * gt[:, :self.config['dec_hidden_dim']])
        #     if self.config['context_predict']:
        #         readout += self.context_readout(
        #             cxt * gt[:, self.config['dec_hidden_dim']:
        #                      self.config['dec_hidden_dim'] + self.config['dec_contxt_dim']])
        #     if self.config['bigram_predict']:
        #         readout += self.prev_word_readout(
        #             X * gt[:, -2 * self.config['dec_embedd_dim']:])
        # else:
        readout = self.hidden_readout(next_stat)
        if self.config['context_predict']:
            readout += self.context_readout(cxt)
        if self.config['bigram_predict']:
            readout += self.prev_word_readout(X)

        for l in self.output_nonlinear:
            readout = l(readout)

        readin = T.concatenate(readin, axis=-1)
        # Copy-mode scores, same formulation as in build_decoder.
        key = self.Os(readin)
        Eng = T.sum(key[:, None, :] * context, axis=-1)

        # # gating
        if self.config['copygate']:
            gt = self.sigmoid(self.Gs(readin))  # (nb_samples, 1)
            readout += T.log(gt.flatten()[:, None])
            Eng += T.log(1 - gt.flatten()[:, None])

        # Joint softmax over [vocabulary ; source positions].
        EngSum = logSumExp(Eng, axis=-1, mask=c_mask, c=readout)
        next_prob = T.concatenate([T.exp(readout - EngSum), T.exp(Eng - EngSum) * c_mask], axis=-1)
        next_sample = self.rng.multinomial(pvals=next_prob).argmax(1)
        # NOTE(review): next_stat is returned twice; the second copy fills the
        # "alpha" slot consumed by build_sampler/get_sample -- confirm intended.
        return next_prob, next_sample, next_stat, ncov, next_stat

    def build_sampler(self):
        """
        Compile the one-step sampling functions:
          - self.get_init_state: encoder context -> (init_h, init_a, cov)
          - self.sample_next:    one decoding step (wraps _step_sample)
        """
        logger.info("build sampler ...")
        if self.config['sample_stoch'] and self.config['sample_argmax']:
            logger.info("use argmax search!")
        elif self.config['sample_stoch'] and (not self.config['sample_argmax']):
            logger.info("use stochastic sampling!")
        elif self.config['sample_beam'] > 1:
            logger.info("use beam search! (beam_size={})".format(self.config['sample_beam']))

        # Symbolic inputs: encoder states and the source mask.
        context = T.tensor3()  # (n_sample, sent_len, 2*output_dim)
        c_mask = T.matrix()    # mask of the input sentence.
        context_A = self.Is(context)  # bridge layer

        # Initial decoder state from the first encoder state; zero copy/coverage vectors.
        init_h = self.Initializer(context[:, 0, :])
        init_a = T.zeros((context.shape[0], context.shape[1]))
        cov = T.zeros((context.shape[0], context.shape[1]))

        logger.info('compile the function: get_init_state')
        self.get_init_state \
            = theano.function([context], [init_h, init_a, cov], name='get_init_state',
                              allow_input_downcast=True)
        logger.info('done.')

        # One-step word sampler.
        prev_word = T.vector('prev_word', dtype='int32')
        prev_stat = T.matrix('prev_state', dtype='float32')
        prev_a = T.matrix('prev_a', dtype='float32')
        prev_cov = T.matrix('prev_cov', dtype='float32')
        next_prob, next_sample, next_stat, ncov, alpha \
            = self._step_sample(prev_word, prev_stat, prev_a, prev_cov, context, c_mask, context_A)

        # next word probability
        logger.info('compile the function: sample_next')
        inputs = [prev_word, prev_stat, prev_a, prev_cov, context, c_mask]
        outputs = [next_prob, next_sample, next_stat, ncov, alpha]
        self.sample_next = theano.function(inputs, outputs, name='sample_next',
                                           allow_input_downcast=True)
        logger.info('done')

    """
    Generate samples, either with stochastic sampling or beam-search!
    [:-:] I have to think over how to modify the BEAM-Search!!
    """

    def get_sample(self, context,  # encoder states, shape = [1, sent_len, 2*output_dim]
                   c_mask,         # shape = [1, sent_len]
                   sources,        # source word indices, shape = [1, sent_len]
                   k=1, maxlen=30, stochastic=True,  # k = config['sample_beam'], maxlen = config['max_len']
                   argmax=False, fixlen=False,
                   return_attend=False,
                   type='extractive',     # NOTE(review): shadows builtin `type`; kept for API compatibility
                   generate_ngram=True
                   ):
        """
        Decode keyphrases from one encoded source, by stochastic/argmax sampling
        (k == 1) or beam search (k > 1). Returns (sample, score, attention_probs,
        state); word indices >= dec_voc_size denote copied source positions.
        """
        # beam size
        if k > 1:
            assert not stochastic, 'Beam search does not support stochastic sampling!!'

        # fix length cannot use beam search
        # if fixlen:
        #     assert k == 1

        # prepare for searching
        Lmax = self.config['dec_voc_size']
        sample = []            # predicted sequences
        attention_probs = []   # per-word probability bookkeeping (see below)
        attend = []            # NOTE(review): never populated
        score = []             # scores of predicted sequences
        state = []             # decoder states of predicted sequences
        if stochastic:
            score = 0

        live_k = 1
        dead_k = 0

        hyp_samples = [[]] * live_k
        hyp_scores = np.zeros(live_k).astype(theano.config.floatX)
        hyp_attention_probs = [[]] * live_k
        hyp_attends = [[]] * live_k  # NOTE(review): never populated

        # Initial decoder state from the encoder context; copy distribution and
        # coverage start at zero (see build_sampler).
        previous_state, copy_word_prob, coverage = self.get_init_state(context)

        # Indicator for the first target word: the start symbol is index -1.
        previous_word = -1 * np.ones((1,)).astype('int32')

        # For extractive decoding, precompute the set of source words and
        # (optionally) the set of all source n-grams up to maxlen, so predicted
        # sequences can be restricted to substrings of the source.
        if type == 'extractive':
            input = sources[0]
            input_set = set(input)
            sequence_set = set()
            if generate_ngram:
                for i in range(len(input)):        # loop over start
                    for j in range(1, maxlen):     # loop over length
                        if i + j > len(input) - 1:
                            break
                        hash_token = [str(s) for s in input[i:i + j]]
                        sequence_set.add('-'.join(hash_token))
                logger.info("Possible n-grams: %d" % len(sequence_set))

        # Start searching!
        for ii in range(maxlen):
            # Make live_k copies of context, c_mask and sources so all live
            # hypotheses are advanced in one batched call.
            context_copies = np.tile(context, [live_k, 1, 1])  # [live_k, sent_len, 2*output_dim]
            c_mask_copies = np.tile(c_mask, [live_k, 1])       # [live_k, sent_len]
            source_copies = np.tile(sources, [live_k, 1])      # [live_k, sent_len]

            def process_():
                """
                Build copy_mask marking which source positions match each
                previous word, size = source_copies.shape.
                A previous index >= Lmax encodes "copied from source position
                (index - Lmax)": mark that position and map the index back to
                the real word id; otherwise mark every matching source position.
                (word2idx['<eol>'] = 0, word2idx['<unk>'] = 1 per the author's notes.)
                """
                copy_mask = np.zeros((source_copies.shape[0], source_copies.shape[1]), dtype='float32')
                for i in range(previous_word.shape[0]):
                    if previous_word[i] >= Lmax:
                        copy_mask[i][previous_word[i] - Lmax] = 1.
                        previous_word[i] = source_copies[i][previous_word[i] - Lmax]
                    else:
                        copy_mask[i] = (source_copies[i] == previous_word[i, None])
                        # for k in range(sss.shape[1]):
                        #     ll[i][k] = (sss[i][k] == next_word[i])
                return copy_mask, previous_word

            copy_mask, previous_word = process_()
            copy_flag = (np.sum(copy_mask, axis=1, keepdims=True) > 0)  # any copy available?

            # Previous-step copy distribution restricted to matching positions,
            # renormalized; zeroed when nothing was copyable.
            next_a = copy_word_prob * copy_mask
            next_a = next_a / (err + np.sum(next_a, axis=1, keepdims=True)) * copy_flag

            # One decoding step (see _step_sample):
            #   next_prob0[:, :Lmax]  generative probabilities
            #   next_prob0[:, Lmax:]  copy probabilities per source position
            next_prob0, next_word, next_state, coverage, alpha \
                = self.sample_next(previous_word, previous_state, next_a, coverage,
                                   context_copies, c_mask_copies)
            if not self.config['decode_unk']:
                # Eliminate <unk> and renormalize.
                next_prob0[:, 1] = 0.
                next_prob0 /= np.sum(next_prob0, axis=1, keepdims=True)

            def merge_():
                """
                Merge modes: p(w) = p_generate(w) + p_copy(w).
                For every in-vocabulary, non-<unk> source word, fold its copy
                probability into the generative entry and zero the copy entry;
                OOV source words keep their copy entries.
                """
                temple_prob = copy.copy(next_prob0)
                source_prob = copy.copy(next_prob0[:, Lmax:])
                for i in range(next_prob0.shape[0]):
                    for j in range(source_copies.shape[1]):
                        if (source_copies[i, j] < Lmax) and (source_copies[i, j] != 1):
                            temple_prob[i, source_copies[i, j]] += source_prob[i, j]
                            temple_prob[i, Lmax + j] = 0.
                return temple_prob, source_prob

            generate_word_prob, copy_word_prob = merge_()
            # After merging, the copy tail of next_prob0 is redundant.
            next_prob0[:, Lmax:] = 0.
            # print('0', next_prob0[:, 3165])
            # print('01', next_prob[:, 3165])
            # print(ss_prob[0, :])

            if stochastic:
                # stochastic (or greedy/argmax) sampling; only one live hypothesis.
                if argmax:
                    nw = generate_word_prob[0].argmax()
                    next_word[0] = nw
                else:
                    nw = self.rng.multinomial(pvals=generate_word_prob).argmax(1)
                sample.append(nw)
                # NOTE(review): accumulates raw probability, not log-probability,
                # unlike the beam branch -- confirm intended.
                score += generate_word_prob[0, nw]
                if (not fixlen) and (nw == 0):
                    # sampled <eol>: sequence finished
                    break
            else:
                '''
                Beam search: keep the top (k - dead_k) continuations
                (dead_k counting is disabled). Scores are accumulated negative
                log-probabilities computed on the flattened
                (hypothesis x word) score matrix.
                '''
                # size(hyp_scores)=[live_k], size(generate_word_prob)=[live_k, voc_size+sent_len]
                cand_scores = hyp_scores[:, None] - np.log(generate_word_prob + 1e-10)
                cand_flat = cand_scores.flatten()
                ranks_flat = cand_flat.argsort()[:(k - dead_k)]  # indices of top candidates

                # Recover (hypothesis, word) indices from the flat ranking.
                voc_size = generate_word_prob.shape[1]
                sequence_index = ranks_flat / voc_size  # originating hypothesis (cast to int below)
                next_word_index = ranks_flat % voc_size # chosen word index
                costs = cand_flat[ranks_flat]

                # Extend each selected hypothesis by its chosen word.
                new_hyp_samples = []
                new_hyp_attention_probs = []
                new_hyp_attends = []
                new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX)
                new_hyp_states = []
                new_hyp_coverage = []
                new_hyp_copy_word_prob = []
                for idx, [ti, wi] in enumerate(zip(sequence_index, next_word_index)):
                    ti = int(ti)
                    wi = int(wi)
                    new_hyp_samples.append(hyp_samples[ti] + [wi])
                    new_hyp_scores[idx] = copy.copy(costs[idx])
                    new_hyp_states.append(copy.copy(next_state[ti]))
                    new_hyp_coverage.append(copy.copy(coverage[ti]))
                    new_hyp_copy_word_prob.append(copy.copy(copy_word_prob[ti]))
                    if not return_attend:
                        # record (raw generative prob, merged prob) of the chosen word
                        new_hyp_attention_probs.append(
                            hyp_attention_probs[ti] + [[next_prob0[ti][wi], generate_word_prob[ti][wi]]])
                    else:
                        # record the copy distribution and attention for this step
                        new_hyp_attention_probs.append(
                            hyp_attention_probs[ti] + [(copy_word_prob[ti], alpha[ti])])

                # Separate finished hypotheses from live ones.
                new_live_k = 0
                hyp_samples = []
                hyp_scores = []
                hyp_states = []
                hyp_coverage = []
                hyp_attention_probs = []
                hyp_copy_word_prob = []
                for idx in range(len(new_hyp_samples)):
                    # [bug] change to new_hyp_samples[idx][-1] == 0
                    # if (new_hyp_states[idx][-1] == 0) and (not fixlen):
                    if (new_hyp_samples[idx][-1] == 0 and not fixlen):
                        '''
                        Predicted <eol>: this sequence is done; move it into the
                        result lists. Word indices >= voc_size denote OOV words
                        copied from the source.
                        '''
                        sample.append(new_hyp_samples[idx])
                        attention_probs.append(new_hyp_attention_probs[idx])
                        score.append(new_hyp_scores[idx])
                        state.append(new_hyp_states[idx])
                        # dead_k += 1
                    if new_hyp_samples[idx][-1] != 0:
                        '''
                        Prediction not complete: keep as a candidate for the
                        next round.
                        '''
                        # restrict predictions to words/n-grams of the source text
                        if type == 'extractive':
                            if new_hyp_samples[idx][-1] not in input_set:
                                continue
                            if generate_ngram:
                                if '-'.join([str(s) for s in new_hyp_samples[idx]]) not in sequence_set:
                                    continue
                        new_live_k += 1
                        hyp_samples.append(new_hyp_samples[idx])
                        hyp_attention_probs.append(new_hyp_attention_probs[idx])
                        hyp_scores.append(new_hyp_scores[idx])
                        hyp_states.append(new_hyp_states[idx])
                        hyp_coverage.append(new_hyp_coverage[idx])
                        hyp_copy_word_prob.append(new_hyp_copy_word_prob[idx])

                hyp_scores = np.array(hyp_scores)
                live_k = new_live_k

                if new_live_k < 1:
                    break
                # if dead_k >= k:
                #     break

                # Prepare the variables for the next round.
                previous_word = np.array([w[-1] for w in hyp_samples])
                previous_state = np.array(hyp_states)
                coverage = np.array(hyp_coverage)
                copy_word_prob = np.array(hyp_copy_word_prob)
                pass
            logger.info('\t Depth=%d, #(hypotheses)=%d, #(completed)=%d' % (ii, len(hyp_samples), len(sample)))

        # end: dump every hypothesis still alive after maxlen steps.
        if not stochastic:
            if live_k > 0:
                for idx in range(live_k):
                    sample.append(hyp_samples[idx])
                    attention_probs.append(hyp_attention_probs[idx])
                    score.append(hyp_scores[idx])
                    state.append(hyp_states[idx])

            # Sort ascending by accumulated cost (lower negative-log-prob = better).
            result = zip(sample, score, attention_probs, state)
            sorted_result = sorted(result, key=lambda entry: entry[1], reverse=False)
            sample, score, attention_probs, state = zip(*sorted_result)
        return sample, score, attention_probs, state


class FnnDecoder(Model):
    def __init__(self, config, rng, prefix='fnndec'):
        """
        Feed-forward decoder: a Dense(maxout) + Dense(softmax) predictor over
        the context vector (no recurrence).
        """
        super(FnnDecoder, self).__init__()
        self.config = config
        self.rng = rng
        self.prefix = prefix
        self.name = prefix

        """
        Create Dense Predictor.
""" self.Tr = Dense(self.config['dec_contxt_dim'], self.config['dec_hidden_dim'], activation='maxout2', name='{}_Tr'.format(prefix)) self._add(self.Tr) self.Pr = Dense(self.config['dec_hidden_dim'] / 2, self.config['dec_voc_size'], activation='softmax', name='{}_Pr'.format(prefix)) self._add(self.Pr) logger.info("FF decoder ok.") @staticmethod def _grab_prob(probs, X): assert probs.ndim == 3 batch_size = probs.shape[0] max_len = probs.shape[1] vocab_size = probs.shape[2] probs = probs.reshape((batch_size * max_len, vocab_size)) return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape) # advanced indexing def build_decoder(self, target, context): """ Build the Decoder Computational Graph """ prob_dist = self.Pr(self.Tr(context[:, None, :])) log_prob = T.sum(T.log(self._grab_prob(prob_dist, target) + err), axis=1) return log_prob def build_sampler(self): context = T.matrix() prob_dist = self.Pr(self.Tr(context)) next_sample = self.rng.multinomial(pvals=prob_dist).argmax(1) self.sample_next = theano.function([context], [prob_dist, next_sample], name='sample_next_{}'.format(self.prefix), allow_input_downcast=True) logger.info('done') def get_sample(self, context, argmax=True): prob, sample = self.sample_next(context) if argmax: return prob[0].argmax() else: return sample[0] ######################################################################################################################## # Encoder-Decoder Models :::: # class RNNLM(Model): """ RNN-LM, with context vector = 0. It is very similar with the implementation of VAE. 
""" def __init__(self, config, n_rng, rng, mode='Evaluation'): super(RNNLM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'rnnlm' def build_(self): logger.info("build the RNN-decoder") self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode) # registration: self._add(self.decoder) # objectives and optimizers self.optimizer = optimizers.get('adadelta') # saved the initial memories if self.config['mode'] == 'NTM': self.memory = initializations.get('glorot_uniform')( (self.config['dec_memory_dim'], self.config['dec_memory_wdth'])) logger.info("create the RECURRENT language model. ok") def compile_(self, mode='train', contrastive=False): # compile the computational graph. # INFO: the parameters. # mode: 'train'/ 'display'/ 'policy' / 'all' ps = 'params: {\n' for p in self.params: ps += '{0}: {1}\n'.format(p.name, p.eval().shape) ps += '}.' logger.info(ps) param_num = np.sum([np.prod(p.shape.eval()) for p in self.params]) logger.info("total number of the parameters of the model: {}".format(param_num)) if mode == 'train' or mode == 'all': if not contrastive: self.compile_train() else: self.compile_train_CE() if mode == 'display' or mode == 'all': self.compile_sample() if mode == 'inference' or mode == 'all': self.compile_inference() def compile_train(self): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['dec_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError # decoding. 
target = inputs logPxz, logPPL = self.decoder.build_decoder(target, context) # reconstruction loss loss_rec = T.mean(-logPxz) loss_ppl = T.exp(T.mean(-logPPL)) L1 = T.sum([T.sum(abs(w)) for w in self.params]) loss = loss_rec updates = self.optimizer.get_updates(self.params, loss) logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs] self.train_ = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_fun', allow_input_downcast=True ) logger.info("pre-training functions compile done.") # add monitoring: self.monitor['context'] = context self._monitoring() # compiling monitoring self.compile_monitoring(train_inputs) @abstractmethod def compile_train_CE(self): pass def compile_sample(self): # context vectors (as) self.decoder.build_sampler() logger.info("display functions compile done.") @abstractmethod def compile_inference(self): pass def default_context(self): if self.config['mode'] == 'RNN': return np.zeros(shape=(1, self.config['dec_contxt_dim']), dtype=theano.config.floatX) elif self.config['mode'] == 'NTM': memory = self.memory.get_value() memory = memory.reshape((1, memory.shape[0], memory.shape[1])) return memory def generate_(self, context=None, max_len=None, mode='display'): """ :param action: action vector to guide the question. If None, use a Gaussian to simulate the action. :return: question sentence in natural language. 
""" # assert self.config['sample_stoch'], 'RNNLM sampling must be stochastic' # assert not self.config['sample_argmax'], 'RNNLM sampling cannot use argmax' if context is None: context = self.default_context() args = dict(k=self.config['sample_beam'], maxlen=self.config['max_len'] if not max_len else max_len, stochastic=self.config['sample_stoch'] if mode == 'display' else None, argmax=self.config['sample_argmax'] if mode == 'display' else None) sample, score = self.decoder.get_sample(context, **args) if not args['stochastic']: score = score / np.array([len(s) for s in sample]) sample = sample[score.argmin()] score = score.min() else: score /= float(len(sample)) return sample, np.exp(score) class AutoEncoder(RNNLM): """ Regular Auto-Encoder: RNN Encoder/Decoder """ def __init__(self, config, n_rng, rng, mode='Evaluation'): super(RNNLM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'vae' def build_(self): logger.info("build the RNN auto-encoder") self.encoder = Encoder(self.config, self.rng, prefix='enc') if self.config['shared_embed']: self.decoder = Decoder(self.config, self.rng, prefix='dec', embed=self.encoder.Embed) else: self.decoder = Decoder(self.config, self.rng, prefix='dec') """ Build the Transformation """ if self.config['nonlinear_A']: self.action_trans = Dense( self.config['enc_hidden_dim'], self.config['action_dim'], activation='tanh', name='action_transform' ) else: assert self.config['enc_hidden_dim'] == self.config['action_dim'], \ 'hidden dimension must match action dimension' self.action_trans = Identity(name='action_transform') if self.config['nonlinear_B']: self.context_trans = Dense( self.config['action_dim'], self.config['dec_contxt_dim'], activation='tanh', name='context_transform' ) else: assert self.config['dec_contxt_dim'] == self.config['action_dim'], \ 'action dimension must match context dimension' self.context_trans = 
Identity(name='context_transform') # registration self._add(self.action_trans) self._add(self.context_trans) self._add(self.encoder) self._add(self.decoder) # objectives and optimizers self.optimizer = optimizers.get(self.config['optimizer'], kwargs={'lr': self.config['lr']}) logger.info("create Helmholtz RECURRENT neural network. ok") def compile_train(self, mode='train'): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) context = alloc_zeros_matrix(inputs.shape[0], self.config['dec_contxt_dim']) assert context.ndim == 2 # decoding. target = inputs logPxz, logPPL = self.decoder.build_decoder(target, context) # reconstruction loss loss_rec = T.mean(-logPxz) loss_ppl = T.exp(T.mean(-logPPL)) L1 = T.sum([T.sum(abs(w)) for w in self.params]) loss = loss_rec updates = self.optimizer.get_updates(self.params, loss) logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs] self.train_ = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_fun', allow_input_downcast=True) logger.info("pre-training functions compile done.") if mode == 'display' or mode == 'all': """ build the sampler function here <:::> """ # context vectors (as) self.decoder.build_sampler() logger.info("display functions compile done.") # add monitoring: self._monitoring() # compiling monitoring self.compile_monitoring(train_inputs) class NRM(Model): """ Neural Responding Machine A Encoder-Decoder based responding model. 
""" def __init__(self, config, n_rng, rng, mode='Evaluation', use_attention=False, copynet=False, identity=False): super(NRM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'nrm' self.attend = use_attention self.copynet = copynet self.identity = identity def build_(self, lr=None, iterations=None): logger.info("build the Neural Responding Machine") # encoder-decoder:: <<==>> self.encoder = Encoder(self.config, self.rng, prefix='enc', mode=self.mode) if not self.attend: self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode) else: self.decoder = DecoderAtt(self.config, self.rng, prefix='dec', mode=self.mode, copynet=self.copynet, identity=self.identity) self._add(self.encoder) self._add(self.decoder) # objectives and optimizers if self.config['optimizer'] == 'adam': self.optimizer = optimizers.get(self.config['optimizer'], kwargs=dict(rng=self.rng, save=False, clipnorm = self.config['clipnorm'] )) else: self.optimizer = optimizers.get(self.config['optimizer']) if lr is not None: self.optimizer.lr.set_value(floatX(lr)) self.optimizer.iterations.set_value(floatX(iterations)) logger.info("build ok.") def compile_(self, mode='all', contrastive=False): # compile the computational graph. # INFO: the parameters. # mode: 'train'/ 'display'/ 'policy' / 'all' # ps = 'params: {\n' # for p in self.params: # ps += '{0}: {1}\n'.format(p.name, p.eval().shape) # ps += '}.' 
# logger.info(ps) param_num = np.sum([np.prod(p.shape.eval()) for p in self.params]) logger.info("total number of the parameters of the model: {}".format(param_num)) if mode == 'train' or mode == 'all': self.compile_train() if mode == 'display' or mode == 'all': self.compile_sample() if mode == 'inference' or mode == 'all': self.compile_inference() def compile_train(self): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) target = T.imatrix() # padded target word sequence (for training) cc_matrix = T.tensor3(dtype='int16') # encoding & decoding # enc_context=[nb_sample, src_max_len, 2*enc_hidden_dim], c_mask=[nb_sample, src_max_len] enc_context, _, c_mask, _ = self.encoder.build_encoder(inputs, None, return_sequence=True, return_embed=True) # code: (nb_samples, max_len, contxt_dim) if 'explicit_loc' in self.config: if self.config['explicit_loc']: print('use explicit location!!') max_len = enc_context.shape[1] expLoc = T.eye(max_len, self.config['encode_max_len'], dtype='float32')[None, :, :] expLoc = T.repeat(expLoc, enc_context.shape[0], axis=0) enc_context = T.concatenate([enc_context, expLoc], axis=2) # self.decoder.build_decoder(target, cc_matrix, code, c_mask) # feed target(index vector of target), cc_matrix(copy matrix), code(encoding of source text), c_mask (mask of source text) into decoder, get objective value # logPxz,logPPL are tensors in [nb_samples,1], cross-entropy and Perplexity of each sample # normal seq2seq logPxz, logPPL = self.decoder.build_decoder(target, cc_matrix, enc_context, c_mask) # responding loss loss_rec = -logPxz loss_ppl = T.exp(-logPPL) loss = T.mean(loss_rec, dtype='float32') updates = self.optimizer.get_updates(self.params, loss) logger.info("compiling the compuational graph ::training function::") # input contains inputs, target and cc_matrix # inputs=(batch_size, src_len), target=(batch_size, trg_len) # cc_matrix=(batch_size, trg_len, src_len), cc_matrix[i][j][k]=1 if j-th word in 
target matches the k-th word in source train_inputs = [inputs, target, cc_matrix] self.train_ = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_fun', allow_input_downcast=True) self.train_guard = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_nanguard_fun', allow_input_downcast=True, mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True)) self.validate_ = theano.function(train_inputs, [loss_rec, loss_ppl], name='validate_fun', allow_input_downcast=True) logger.info("training functions compile done.") # # add monitoring: # self.monitor['context'] = context # self._monitoring() # # # compiling monitoring # self.compile_monitoring(train_inputs) def compile_sample(self): if not self.attend: self.encoder.compile_encoder(with_context=False) else: self.encoder.compile_encoder(with_context=False, return_sequence=True, return_embed=True) self.decoder.build_sampler() logger.info("sampling functions compile done.") def compile_inference(self): pass def generate_(self, inputs, mode='display', return_attend=False, return_all=False): # assert self.config['sample_stoch'], 'RNNLM sampling must be stochastic' # assert not self.config['sample_argmax'], 'RNNLM sampling cannot use argmax' args = dict(k=self.config['sample_beam'], maxlen=self.config['max_len'], stochastic=self.config['sample_stoch'] if mode == 'display' else None, argmax=self.config['sample_argmax'] if mode == 'display' else None, return_attend=return_attend) context, _, c_mask, _, Z, R = self.encoder.gtenc(inputs) # c_mask[0, 3] = c_mask[0, 3] * 0 # L = context.shape[1] # izz = np.concatenate([np.arange(3), np.asarray([1,2]), np.arange(3, L)]) # context = context[:, izz, :] # c_mask = c_mask[:, izz] # inputs = inputs[:, izz] # context, _, c_mask, _ = self.encoder.encode(inputs) # import pylab as plt # # visualize_(plt.subplots(), Z[0][:, 300:], normal=False) # visualize_(plt.subplots(), context[0], normal=False) if 'explicit_loc' in 
self.config: if self.config['explicit_loc']: max_len = context.shape[1] expLoc = np.eye(max_len, self.config['encode_max_len'], dtype='float32')[None, :, :] expLoc = np.repeat(expLoc, context.shape[0], axis=0) context = np.concatenate([context, expLoc], axis=2) sample, score, ppp, _ = self.decoder.get_sample(context, c_mask, inputs, **args) if return_all: return sample, score, ppp if not args['stochastic']: score = score / np.array([len(s) for s in sample]) idz = score.argmin() sample = sample[idz] score = score.min() ppp = ppp[idz] else: score /= float(len(sample)) return sample, np.exp(score), ppp def generate_multiple(self, inputs, mode='display', return_attend=False, return_all=True, return_encoding=False): # assert self.config['sample_stoch'], 'RNNLM sampling must be stochastic' # assert not self.config['sample_argmax'], 'RNNLM sampling cannot use argmax' args = dict(k=self.config['sample_beam'], maxlen=self.config['max_len'], stochastic=self.config['sample_stoch'] if mode == 'display' else None, argmax=self.config['sample_argmax'] if mode == 'display' else None, return_attend=return_attend, type=self.config['predict_type'] ) ''' Return the encoding of input. Similar to encoder.encode(), but gate values are returned as well I think only gtenc with attention default: with_context=False, return_sequence=True, return_embed=True ''' """ return context: a list of vectors [nb_sample, max_len, 2*enc_hidden_dim], encoding of each time state (concatenate both forward and backward RNN) _: embedding of text X [nb_sample, max_len, enc_embedd_dim] c_mask: mask, an array showing which elements in context are not 0 [nb_sample, max_len] _: encoding of end of X, seems not make sense for bidirectional model (head+tail) [nb_sample, 2*enc_hidden_dim] Z: value of update gate, shape=(nb_sample, 1) R: value of update gate, shape=(nb_sample, 1) but.. 
Z and R are not used here """ context, _, c_mask, _, Z, R = self.encoder.gtenc(inputs) # c_mask[0, 3] = c_mask[0, 3] * 0 # L = context.shape[1] # izz = np.concatenate([np.arange(3), np.asarray([1,2]), np.arange(3, L)]) # context = context[:, izz, :] # c_mask = c_mask[:, izz] # inputs = inputs[:, izz] # context, _, c_mask, _ = self.encoder.encode(inputs) # import pylab as plt # # visualize_(plt.subplots(), Z[0][:, 300:], normal=False) # visualize_(plt.subplots(), context[0], normal=False) if 'explicit_loc' in self.config: # no if self.config['explicit_loc']: max_len = context.shape[1] expLoc = np.eye(max_len, self.config['encode_max_len'], dtype='float32')[None, :, :] expLoc = np.repeat(expLoc, context.shape[0], axis=0) context = np.concatenate([context, expLoc], axis=2) sample, score, ppp, output_encoding = self.decoder.get_sample(context, c_mask, inputs, **args) if return_all: if return_encoding: return context, sample, score, output_encoding else: return sample, score return sample, score def evaluate_(self, inputs, outputs, idx2word, inputs_unk=None, encode=True): def cut_zero_yes(sample, idx2word, ppp=None, Lmax=None): if Lmax is None: Lmax = self.config['dec_voc_size'] if ppp is None: if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample] return ['{}'.format(idx2word[w].encode('utf-8')) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample[:sample.index(0)]] else: if 0 not in sample: return ['{0} ({1:1.1f})'.format( idx2word[w].encode('utf-8'), p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]].encode('utf-8'), p) for w, p in zip(sample, ppp)] idz = sample.index(0) return ['{0} ({1:1.1f})'.format( idx2word[w].encode('utf-8'), p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]].encode('utf-8'), p) for w, p in zip(sample[:idz], ppp[:idz])] def evaluate_multiple(self, inputs, outputs, 
original_input, original_outputs, samples, scores, idx2word, number_to_predict=10): ''' inputs_unk is same as inputs except for filtered out all the low-freq words to 1 () return the top few keywords, number is set in config :param: original_input, same as inputs, the vector of one input sentence :param: original_outputs, vectors of corresponding multiple outputs (e.g. keyphrases) :return: ''' def cut_zero(sample, idx2word, Lmax=None): sample = list(sample) if Lmax is None: Lmax = self.config['dec_voc_size'] if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample] # return the string before 0 () return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample[:sample.index(0)]] stemmer = PorterStemmer() # Generate keyphrases # if inputs_unk is None: # samples, scores = self.generate_multiple(inputs[None, :], return_all=True) # else: # samples, scores = self.generate_multiple(inputs_unk[None, :], return_all=True) # Evaluation part outs = [] metrics = [] # load stopword with open(self.config['path'] + '/dataset/stopword/stopword_en.txt') as stopword_file: stopword_set = set([stemmer.stem(w.strip()) for w in stopword_file]) for input_sentence, target_list, predict_list, score_list in zip(inputs, original_outputs, samples, scores): ''' enumerate each document, process target/predict/score and measure via p/r/f1 ''' target_outputs = [] predict_outputs = [] predict_scores = [] predict_set = set() correctly_matched = np.asarray([0] * max(len(target_list), len(predict_list)), dtype='int32') # stem the original input stemmed_input = [stemmer.stem(w) for w in cut_zero(input_sentence, idx2word)] # convert target index into string for target in target_list: target = cut_zero(target, idx2word) target = [stemmer.stem(w) for w in target] keep = True # whether do filtering on groundtruth phrases. 
if config['target_filter']==None, do nothing if self.config['target_filter']: match = None for i in range(len(stemmed_input) - len(target) + 1): match = None j = 0 for j in range(len(target)): if target[j] != stemmed_input[i + j]: match = False break if j == len(target) - 1 and match == None: match = True break if match == True: # if match and 'appear-only', keep this phrase if self.config['target_filter'] == 'appear-only': keep = keep and True elif self.config['target_filter'] == 'non-appear-only': keep = keep and False elif match == False: # if not match and 'appear-only', discard this phrase if self.config['target_filter'] == 'appear-only': keep = keep and False # if not match and 'non-appear-only', keep this phrase elif self.config['target_filter'] == 'non-appear-only': keep = keep and True if not keep: continue target_outputs.append(target) # convert predict index into string for id, (predict, score) in enumerate(zip(predict_list, score_list)): predict = cut_zero(predict, idx2word) predict = [stemmer.stem(w) for w in predict] # filter some not good ones keep = True if len(predict) == 0: keep = False number_digit = 0 for w in predict: if w.strip() == '': keep = False if w.strip() == '': number_digit += 1 if len(predict) >= 1 and (predict[0] in stopword_set or predict[-1] in stopword_set): keep = False if len(predict) <= 1: keep = False # whether do filtering on predicted phrases. 
if config['predict_filter']==None, do nothing if self.config['predict_filter']: match = None for i in range(len(stemmed_input) - len(predict) + 1): match = None j = 0 for j in range(len(predict)): if predict[j] != stemmed_input[i + j]: match = False break if j == len(predict) - 1 and match == None: match = True break if match == True: # if match and 'appear-only', keep this phrase if self.config['predict_filter'] == 'appear-only': keep = keep and True elif self.config['predict_filter'] == 'non-appear-only': keep = keep and False elif match == False: # if not match and 'appear-only', discard this phrase if self.config['predict_filter'] == 'appear-only': keep = keep and False # if not match and 'non-appear-only', keep this phrase elif self.config['predict_filter'] == 'non-appear-only': keep = keep and True key = '-'.join(predict) # remove this phrase and its score from list if not keep or number_digit == len(predict) or key in predict_set: continue predict_outputs.append(predict) predict_scores.append(score) predict_set.add(key) # check whether correct for target in target_outputs: if len(target) == len(predict): flag = True for i, w in enumerate(predict): if predict[i] != target[i]: flag = False if flag: correctly_matched[len(predict_outputs) - 1] = 1 # print('%s correct!!!' % predict) predict_outputs = np.asarray(predict_outputs) predict_scores = np.asarray(predict_scores) # normalize the score? 
if self.config['normalize_score']: predict_scores = np.asarray([math.log(math.exp(score) / len(predict)) for predict, score in zip(predict_outputs, predict_scores)]) score_list_index = np.argsort(predict_scores) predict_outputs = predict_outputs[score_list_index] predict_scores = predict_scores[score_list_index] correctly_matched = correctly_matched[score_list_index] metric_dict = {} metric_dict['p'] = float(sum(correctly_matched[:number_to_predict])) / float(number_to_predict) if len(target_outputs) != 0: metric_dict['r'] = float(sum(correctly_matched[:number_to_predict])) / float(len(target_outputs)) else: metric_dict['r'] = 0 if metric_dict['p'] + metric_dict['r'] != 0: metric_dict['f1'] = 2 * metric_dict['p'] * metric_dict['r'] / float( metric_dict['p'] + metric_dict['r']) else: metric_dict['f1'] = 0 metric_dict['valid_target_number'] = len(target_outputs) metric_dict['target_number'] = len(target_list) metric_dict['correct_number'] = sum(correctly_matched[:number_to_predict]) metrics.append(metric_dict) # print(stuff) a = '[SOURCE]: {}\n'.format(' '.join(cut_zero(input_sentence, idx2word))) logger.info(a) b = '[TARGET]: %d/%d targets\n\t\t' % (len(target_outputs), len(target_list)) for id, target in enumerate(target_outputs): b += ' '.join(target) + '; ' b += '\n' logger.info(b) c = '[DECODE]: %d/%d predictions' % (len(predict_outputs), len(predict_list)) for id, (predict, score) in enumerate(zip(predict_outputs, predict_scores)): if correctly_matched[id] == 1: c += ('\n\t\t[%.3f]' % score) + ' '.join(predict) + ' [correct!]' # print(('\n\t\t[%.3f]'% score) + ' '.join(predict) + ' [correct!]') else: c += ('\n\t\t[%.3f]' % score) + ' '.join(predict) # print(('\n\t\t[%.3f]'% score) + ' '.join(predict)) c += '\n' # c = '[DECODE]: {}'.format(' '.join(cut_zero(phrase, idx2word))) # if inputs_unk is not None: # k = '[_INPUT]: {}\n'.format(' '.join(cut_zero(inputs_unk.tolist(), idx2word, Lmax=len(idx2word)))) # logger.info(k) # a += k logger.info(c) a += b + c d = 
'Precision=%.4f, Recall=%.4f, F1=%.4f\n' % (metric_dict['p'], metric_dict['r'], metric_dict['f1']) logger.info(d) a += d outs.append(a) return outs, metrics def cut_zero_no(sample, idx2word, ppp=None, Lmax=None): if Lmax is None: Lmax = self.config['dec_voc_size'] if ppp is None: if 0 not in sample: return ['{}'.format(idx2word[w]) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample] return ['{}'.format(idx2word[w]) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample[:sample.index(0)]] else: if 0 not in sample: return ['{0} ({1:1.1f})'.format( idx2word[w], p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]], p) for w, p in zip(sample, ppp)] idz = sample.index(0) return ['{0} ({1:1.1f})'.format( idx2word[w].encode('utf-8'), p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]], p) for w, p in zip(sample[:idz], ppp[:idz])] if inputs_unk is None: result, _, ppp = self.generate_(inputs[None, :]) else: result, _, ppp = self.generate_(inputs_unk[None, :]) if encode: cut_zero = cut_zero_yes else: cut_zero = cut_zero_no pp0, pp1 = [np.asarray(p) for p in zip(*ppp)] pp = (pp1 - pp0) / pp1 # pp = (pp1 - pp0) / pp1 logger.info(len(ppp)) logger.info(' [lr={0}][iter={1}]'.format(self.optimizer.lr.get_value(), self.optimizer.iterations.get_value())) a = '[SOURCE]: {}\n'.format(' '.join(cut_zero(inputs.tolist(), idx2word, Lmax=len(idx2word)))) b = '[TARGET]: {}\n'.format(' '.join(cut_zero(outputs.tolist(), idx2word, Lmax=len(idx2word)))) c = '[DECODE]: {}\n'.format(' '.join(cut_zero(result, idx2word))) d = '[CpRate]: {}\n'.format(' '.join(cut_zero(result, idx2word, pp.tolist()))) e = '[CpRate]: {}\n'.format(' '.join(cut_zero(result, idx2word, result))) logger.info(a) logger.info( '{0} -> {1}'.format(len(a.split()), len(b.split()))) if inputs_unk is not None: k = '[_INPUT]: {}\n'.format(' '.join(cut_zero(inputs_unk.tolist(), idx2word, Lmax=len(idx2word)))) logger.info( k 
) a += k logger.info(b) logger.info(c) logger.info(d) # print(e) a += b + c + d return a def analyse_(self, inputs, outputs, idx2word, inputs_unk=None, return_attend=False, name=None, display=False): def cut_zero(sample, idx2word, ppp=None, Lmax=None): if Lmax is None: Lmax = self.config['dec_voc_size'] if ppp is None: if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample] return ['{}'.format(idx2word[w].encode('utf-8')) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample[:sample.index(0)]] else: if 0 not in sample: return ['{0} ({1:1.1f})'.format( idx2word[w].encode('utf-8'), p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]].encode('utf-8'), p) for w, p in zip(sample, ppp)] idz = sample.index(0) return ['{0} ({1:1.1f})'.format( idx2word[w].encode('utf-8'), p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]].encode('utf-8'), p) for w, p in zip(sample[:idz], ppp[:idz])] if inputs_unk is None: result, _, ppp = self.generate_(inputs[None, :], return_attend=return_attend) else: result, _, ppp = self.generate_(inputs_unk[None, :], return_attend=return_attend) source = '{}'.format(' '.join(cut_zero(inputs.tolist(), idx2word, Lmax=len(idx2word)))) target = '{}'.format(' '.join(cut_zero(outputs.tolist(), idx2word, Lmax=len(idx2word)))) decode = '{}'.format(' '.join(cut_zero(result, idx2word))) if display: print(source) print(target) print(decode) idz = result.index(0) p1, p2 = [np.asarray(p) for p in zip(*ppp)] print(p1.shape) import pylab as plt # plt.rc('text', usetex=True) # plt.rc('font', family='serif') visualize_(plt.subplots(), 1 - p1[:idz, :].T, grid=True, name=name) visualize_(plt.subplots(), 1 - p2[:idz, :].T, name=name) # visualize_(plt.subplots(), 1 - np.mean(p2[:idz, :], axis=1, keepdims=True).T) return target == decode def analyse_cover(self, inputs, outputs, idx2word, 
inputs_unk=None, return_attend=False, name=None, display=False): def cut_zero(sample, idx2word, ppp=None, Lmax=None): if Lmax is None: Lmax = self.config['dec_voc_size'] if ppp is None: if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample] return ['{}'.format(idx2word[w].encode('utf-8')) if w < Lmax else '{}'.format(idx2word[inputs[w - Lmax]].encode('utf-8')) for w in sample[:sample.index(0)]] else: if 0 not in sample: return ['{0} ({1:1.1f})'.format( idx2word[w].encode('utf-8'), p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]].encode('utf-8'), p) for w, p in zip(sample, ppp)] idz = sample.index(0) return ['{0} ({1:1.1f})'.format( idx2word[w].encode('utf-8'), p) if w < Lmax else '{0} ({1:1.1f})'.format( idx2word[inputs[w - Lmax]].encode('utf-8'), p) for w, p in zip(sample[:idz], ppp[:idz])] if inputs_unk is None: results, _, ppp = self.generate_(inputs[None, :], return_attend=return_attend, return_all=True) else: results, _, ppp = self.generate_(inputs_unk[None, :], return_attend=return_attend, return_all=True) source = '{}'.format(' '.join(cut_zero(inputs.tolist(), idx2word, Lmax=len(idx2word)))) target = '{}'.format(' '.join(cut_zero(outputs.tolist(), idx2word, Lmax=len(idx2word)))) # decode = '{}'.format(' '.join(cut_zero(result, idx2word))) score = [target == '{}'.format(' '.join(cut_zero(result, idx2word))) for result in results] return max(score) ================================================ FILE: emolga/models/encdec.py ================================================ import math __author__ = 'jiataogu, memray' import theano import logging import copy import emolga.basic.objectives as objectives import emolga.basic.optimizers as optimizers from theano.compile.nanguardmode import NanGuardMode from emolga.layers.core import Dropout, Dense, Dense2, Identity from emolga.layers.recurrent import * from emolga.layers.ntm_minibatch 
import Controller from emolga.layers.embeddings import * from emolga.layers.attention import * from emolga.models.core import Model from nltk.stem.porter import * logger = logging.getLogger(__name__) RNN = GRU # change it here for other RNN models. ######################################################################################################################## # Encoder/Decoder Blocks :::: # # Encoder Back-up # class Encoder(Model): # """ # Recurrent Neural Network-based Encoder # It is used to compute the context vector. # """ # # def __init__(self, # config, rng, prefix='enc', # mode='Evaluation', embed=None, use_context=False): # super(Encoder, self).__init__() # self.config = config # self.rng = rng # self.prefix = prefix # self.mode = mode # self.name = prefix # self.use_context = use_context # # """ # Create all elements of the Encoder's Computational graph # """ # # create Embedding layers # logger.info("{}_create embedding layers.".format(self.prefix)) # if embed: # self.Embed = embed # else: # self.Embed = Embedding( # self.config['enc_voc_size'], # self.config['enc_embedd_dim'], # name="{}_embed".format(self.prefix)) # self._add(self.Embed) # # if self.use_context: # self.Initializer = Dense( # config['enc_contxt_dim'], # config['enc_hidden_dim'], # activation='tanh', # name="{}_init".format(self.prefix) # ) # self._add(self.Initializer) # # """ # Encoder Core # """ # if self.config['encoder'] == 'RNN': # # create RNN cells # if not self.config['bidirectional']: # logger.info("{}_create RNN cells.".format(self.prefix)) # self.RNN = RNN( # self.config['enc_embedd_dim'], # self.config['enc_hidden_dim'], # None if not use_context # else self.config['enc_contxt_dim'], # name="{}_cell".format(self.prefix) # ) # self._add(self.RNN) # else: # logger.info("{}_create forward RNN cells.".format(self.prefix)) # self.forwardRNN = RNN( # self.config['enc_embedd_dim'], # self.config['enc_hidden_dim'], # None if not use_context # else 
self.config['enc_contxt_dim'], # name="{}_fw_cell".format(self.prefix) # ) # self._add(self.forwardRNN) # # logger.info("{}_create backward RNN cells.".format(self.prefix)) # self.backwardRNN = RNN( # self.config['enc_embedd_dim'], # self.config['enc_hidden_dim'], # None if not use_context # else self.config['enc_contxt_dim'], # name="{}_bw_cell".format(self.prefix) # ) # self._add(self.backwardRNN) # # logger.info("create encoder ok.") # # elif self.config['encoder'] == 'WS': # # create weighted sum layers. # if self.config['ws_weight']: # self.WS = Dense(self.config['enc_embedd_dim'], # self.config['enc_hidden_dim'], name='{}_ws'.format(self.prefix)) # self._add(self.WS) # # logger.info("create encoder ok.") # # def build_encoder(self, source, context=None, return_embed=False): # """ # Build the Encoder Computational Graph # """ # # Initial state # Init_h = None # if self.use_context: # Init_h = self.Initializer(context) # # # word embedding # if self.config['encoder'] == 'RNN': # if not self.config['bidirectional']: # X, X_mask = self.Embed(source, True) # if not self.config['pooling']: # X_out = self.RNN(X, X_mask, C=context, init_h=Init_h, return_sequence=False) # else: # X_out = self.RNN(X, X_mask, C=context, init_h=Init_h, return_sequence=True) # else: # source2 = source[:, ::-1] # X, X_mask = self.Embed(source, True) # X2, X2_mask = self.Embed(source2, True) # # if not self.config['pooling']: # X_out1 = self.backwardRNN(X, X_mask, C=context, init_h=Init_h, return_sequence=False) # X_out2 = self.forwardRNN( X2, X2_mask, C=context, init_h=Init_h, return_sequence=False) # X_out = T.concatenate([X_out1, X_out2], axis=1) # else: # X_out1 = self.backwardRNN(X, X_mask, C=context, init_h=Init_h, return_sequence=True) # X_out2 = self.forwardRNN( X2, X2_mask, C=context, init_h=Init_h, return_sequence=True) # X_out = T.concatenate([X_out1, X_out2], axis=2) # # if self.config['pooling'] == 'max': # X_out = T.max(X_out, axis=1) # elif self.config['pooling'] == 'mean': # 
#                 X_out = T.mean(X_out, axis=1)
#
#         elif self.config['encoder'] == 'WS':
#             X, X_mask = self.Embed(source, True)
#             if self.config['ws_weight']:
#                 X_out = T.sum(self.WS(X) * X_mask[:, :, None], axis=1) / T.sum(X_mask, axis=1, keepdims=True)
#             else:
#                 assert self.config['enc_embedd_dim'] == self.config['enc_hidden_dim'], \
#                     'directly sum should match the dimension'
#                 X_out = T.sum(X * X_mask[:, :, None], axis=1) / T.sum(X_mask, axis=1, keepdims=True)
#         else:
#             raise NotImplementedError
#
#         if return_embed:
#             return X_out, X, X_mask
#         return X_out
#
#     def compile_encoder(self, with_context=False):
#         source = T.imatrix()
#         if with_context:
#             context = T.matrix()
#             self.encode = theano.function([source, context],
#                                           self.build_encoder(source, context))
#         else:
#             self.encode = theano.function([source],
#                                           self.build_encoder(source, None))


class Encoder(Model):
    """
    Recurrent Neural Network-based Encoder
    It is used to compute the context vector.
    """

    def __init__(self,
                 config, rng, prefix='enc',
                 mode='Evaluation', embed=None, use_context=False):
        super(Encoder, self).__init__()
        self.config = config
        self.rng = rng
        self.prefix = prefix
        self.mode = mode
        self.name = prefix
        self.use_context = use_context

        self.return_embed = False
        self.return_sequence = False

        """
        Create all elements of the Encoder's Computational graph
        """
        # create Embedding layers
        logger.info("{}_create embedding layers.".format(self.prefix))
        if embed:
            self.Embed = embed
        else:
            self.Embed = Embedding(
                self.config['enc_voc_size'],
                self.config['enc_embedd_dim'],
                name="{}_embed".format(self.prefix))
        self._add(self.Embed)

        if self.use_context:
            self.Initializer = Dense(
                config['enc_contxt_dim'],
                config['enc_hidden_dim'],
                activation='tanh',
                name="{}_init".format(self.prefix)
            )
            self._add(self.Initializer)

        """
        Encoder Core
        """
        # create RNN cells
        if not self.config['bidirectional']:
            logger.info("{}_create RNN cells.".format(self.prefix))
            self.RNN = RNN(
                self.config['enc_embedd_dim'],
                self.config['enc_hidden_dim'],
                None if not use_context else self.config['enc_contxt_dim'],
                name="{}_cell".format(self.prefix)
            )
            self._add(self.RNN)
        else:
            logger.info("{}_create forward RNN cells.".format(self.prefix))
            self.forwardRNN = RNN(
                self.config['enc_embedd_dim'],
                self.config['enc_hidden_dim'],
                None if not use_context else self.config['enc_contxt_dim'],
                name="{}_fw_cell".format(self.prefix)
            )
            self._add(self.forwardRNN)

            logger.info("{}_create backward RNN cells.".format(self.prefix))
            self.backwardRNN = RNN(
                self.config['enc_embedd_dim'],
                self.config['enc_hidden_dim'],
                None if not use_context else self.config['enc_contxt_dim'],
                name="{}_bw_cell".format(self.prefix)
            )
            self._add(self.backwardRNN)

        logger.info("create encoder ok.")

    def build_encoder(self, source, context=None, return_embed=False, return_sequence=False):
        """
        Build the Encoder Computational Graph

        For the default configurations (with attention)
            with_context=False, return_sequence=True, return_embed=True
        Input:
            source : source text, a list of indexes [nb_sample * max_len]
            context: None
        Return:
            For Attention model:
                return_sequence=True: to return the embedding at each time, not just the end state
                return_embed=True:
                    X_out:  a list of vectors [nb_sample, max_len, 2*enc_hidden_dim],
                            encoding of each time state (concatenate both forward and backward RNN)
                    X:      embedding of text X [nb_sample, max_len, enc_embedd_dim]
                    X_mask: mask, an array showing which elements in X are not 0 [nb_sample, max_len]
                    X_tail: encoding of ending of X, seems not make sense for bidirectional
                            model (head+tail) [nb_sample, 2*enc_hidden_dim]
                            there's bug on X_tail, but luckily we don't use it often
            nb_sample: number of samples, defined by batch size
            max_len:   max length of sentence (should be same after padding)
        """
        # Initial state
        Init_h = None
        if self.use_context:
            Init_h = self.Initializer(context)

        # word embedding
        if not self.config['bidirectional']:
            X, X_mask = self.Embed(source, True)
            X_out = self.RNN(X, X_mask, C=context, init_h=Init_h, return_sequence=return_sequence)
            if return_sequence:
                X_tail = X_out[:, -1]
            else:
                X_tail = X_out
        else:
            # reverse the source for backwardRNN
            source2 = source[:, ::-1]

            # map text to embedding
            X, X_mask = self.Embed(source, True)
            X2, X2_mask = self.Embed(source2, True)

            # get the encoding at each time t. [Bug?] run forwardRNN on the reverse text?
            X_out1 = self.backwardRNN(X, X_mask, C=context, init_h=Init_h, return_sequence=return_sequence)
            X_out2 = self.forwardRNN(X2, X2_mask, C=context, init_h=Init_h, return_sequence=return_sequence)

            # concatenate vectors of both forward and backward
            if not return_sequence:
                # [Bug] I think the X_out of backwardRNN is time 0, but for forwardRNN is ending time
                X_out = T.concatenate([X_out1, X_out2], axis=1)
                X_tail = X_out
            else:
                # reverse the encoding of forwardRNN (actually backwardRNN), so the X_out is backward
                X_out = T.concatenate([X_out1, X_out2[:, ::-1, :]], axis=2)
                # [Bug] X_out1[-1] is time 0, but X_out2[-1] is ending time
                X_tail = T.concatenate([X_out1[:, -1], X_out2[:, -1]], axis=1)

        X_mask = T.cast(X_mask, dtype='float32')
        if return_embed:
            return X_out, X, X_mask, X_tail
        return X_out

    def compile_encoder(self, with_context=False, return_embed=False, return_sequence=False):
        """Compile self.encode as a Theano function over a source index matrix."""
        source = T.imatrix()
        self.return_embed = return_embed
        self.return_sequence = return_sequence
        if with_context:
            context = T.matrix()
            self.encode = theano.function([source, context],
                                          self.build_encoder(source, context,
                                                             return_embed=return_embed,
                                                             return_sequence=return_sequence))
        else:
            self.encode = theano.function([source],
                                          self.build_encoder(source, None,
                                                             return_embed=return_embed,
                                                             return_sequence=return_sequence))


class Decoder(Model):
    """
    Recurrent Neural Network-based Decoder.
    It is used for:
        (1) Evaluation: compute the probability P(Y|X)
        (2) Prediction: sample the best result based on P(Y|X)
        (3) Beam-search
        (4) Scheduled Sampling (how to implement it?)
    """
""" def __init__(self, config, rng, prefix='dec', mode='RNN', embed=None, highway=False): """ mode = RNN: use a RNN Decoder """ super(Decoder, self).__init__() self.config = config self.rng = rng self.prefix = prefix self.name = prefix self.mode = mode self.highway = highway self.init = initializations.get('glorot_uniform') self.sigmoid = activations.get('sigmoid') # use standard drop-out for input & output. # I believe it should not use for context vector. self.dropout = config['dropout'] if self.dropout > 0: logger.info('Use standard-dropout!!!!') self.D = Dropout(rng=self.rng, p=self.dropout, name='{}_Dropout'.format(prefix)) """ Create all elements of the Decoder's computational graph. """ # create Embedding layers logger.info("{}_create embedding layers.".format(self.prefix)) if embed: self.Embed = embed else: self.Embed = Embedding( self.config['dec_voc_size'], self.config['dec_embedd_dim'], name="{}_embed".format(self.prefix)) self._add(self.Embed) # create Initialization Layers logger.info("{}_create initialization layers.".format(self.prefix)) if not config['bias_code']: self.Initializer = Zero() else: self.Initializer = Dense( config['dec_contxt_dim'], config['dec_hidden_dim'], activation='tanh', name="{}_init".format(self.prefix) ) # create RNN cells logger.info("{}_create RNN cells.".format(self.prefix)) self.RNN = RNN( self.config['dec_embedd_dim'], self.config['dec_hidden_dim'], self.config['dec_contxt_dim'], name="{}_cell".format(self.prefix) ) self._add(self.Initializer) self._add(self.RNN) # HighWay Gating if highway: logger.info("HIGHWAY CONNECTION~~~!!!") assert self.config['context_predict'] assert self.config['dec_contxt_dim'] == self.config['dec_hidden_dim'] self.C_x = self.init((self.config['dec_contxt_dim'], self.config['dec_hidden_dim'])) self.H_x = self.init((self.config['dec_hidden_dim'], self.config['dec_hidden_dim'])) self.b_x = initializations.get('zero')(self.config['dec_hidden_dim']) self.C_x.name = '{}_Cx'.format(self.prefix) 
self.H_x.name = '{}_Hx'.format(self.prefix) self.b_x.name = '{}_bx'.format(self.prefix) self.params += [self.C_x, self.H_x, self.b_x] # create readout layers logger.info("_create Readout layers") # 1. hidden layers readout. self.hidden_readout = Dense( self.config['dec_hidden_dim'], self.config['output_dim'] if self.config['deep_out'] # what's deep out? else self.config['dec_voc_size'], activation='linear', name="{}_hidden_readout".format(self.prefix) ) # 2. previous word readout self.prev_word_readout = None if self.config['bigram_predict']: self.prev_word_readout = Dense( self.config['dec_embedd_dim'], self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_prev_word_readout".format(self.prefix), learn_bias=False ) # 3. context readout self.context_readout = None if self.config['context_predict']: if not self.config['leaky_predict']: self.context_readout = Dense( self.config['dec_contxt_dim'], self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_context_readout".format(self.prefix), learn_bias=False ) else: assert self.config['dec_contxt_dim'] == self.config['dec_hidden_dim'] self.context_readout = self.hidden_readout # option: deep output (maxout) if self.config['deep_out']: self.activ = Activation(config['deep_out_activ']) # self.dropout = Dropout(rng=self.rng, p=config['dropout']) self.output_nonlinear = [self.activ] # , self.dropout] self.output = Dense( self.config['output_dim'] / 2 if config['deep_out_activ'] == 'maxout2' else self.config['output_dim'], self.config['dec_voc_size'], activation='softmax', name="{}_output".format(self.prefix), learn_bias=False ) else: self.output_nonlinear = [] self.output = Activation('softmax') # registration: self._add(self.hidden_readout) if not self.config['leaky_predict']: self._add(self.context_readout) self._add(self.prev_word_readout) self._add(self.output) if self.config['deep_out']: 
self._add(self.activ) # self._add(self.dropout) logger.info("create decoder ok.") @staticmethod def _grab_prob(probs, X): ''' return the predicted probabilities of target term :param probs: [nb_samples, max_len_target, vocab_size], predicted probabilities of all terms(size=vocab_size) on each target position (size=max_len_target) :param X: [nb_sample, max_len_target], contains the index of target term :return: probs_target: [nb_sample, max_len_target], predicted probabilities of each target term ''' assert probs.ndim == 3 batch_size = probs.shape[0] max_len = probs.shape[1] vocab_size = probs.shape[2] # reshape to a 2D list, axis0 is batch-term, axis1 is vocabulary probs = probs.reshape((batch_size * max_len, vocab_size)) ''' return the predicting probability of target term T.arange(batch_size * max_len) indicates the index of each prediction X.flatten(1), convert X into a 1-D list, index of target terms reshape(X.shape), reshape to X's shape [nb_sample, max_len_target] ''' return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape) # advanced indexing """ Build the decoder for evaluation """ def prepare_xy(self, target): # Word embedding of target, mask is a list of [0,1] shows which elements are not zero Y, Y_mask = self.Embed(target, True) # (nb_samples, max_len, embedding_dim) if self.config['use_input']: X = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, Y.shape[2]), Y[:, :-1, :]], axis=1) else: X = 0 * Y # option ## drop words. X_mask = T.concatenate([T.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1) Count = T.cast(T.sum(X_mask, axis=1), dtype=theano.config.floatX) return X, X_mask, Y, Y_mask, Count def build_decoder(self, target, context=None, return_count=False, train=True): """ Build the Decoder Computational Graph For training/testing """ X, X_mask, Y, Y_mask, Count = self.prepare_xy(target) # input drop-out if any. 
if self.dropout > 0: X = self.D(X, train=train) # Initial state of RNN Init_h = self.Initializer(context) if not self.highway: X_out = self.RNN(X, X_mask, C=context, init_h=Init_h, return_sequence=True) # Readout readout = self.hidden_readout(X_out) if self.dropout > 0: readout = self.D(readout, train=train) if self.config['context_predict']: readout += self.context_readout(context).dimshuffle(0, 'x', 1) else: X = X.dimshuffle((1, 0, 2)) X_mask = X_mask.dimshuffle((1, 0)) def _recurrence(x, x_mask, prev_h, c): # compute the highway gate for context vector. xx = dot(c, self.C_x, self.b_x) + dot(prev_h, self.H_x) # highway gate. xx = self.sigmoid(xx) cy = xx * c # the path without using RNN x_out = self.RNN(x, mask=x_mask, C=c, init_h=prev_h, one_step=True) hx = (1 - xx) * x_out return x_out, hx, cy outputs, _ = theano.scan( _recurrence, sequences=[X, X_mask], outputs_info=[Init_h, None, None], non_sequences=[context] ) # hidden readout + context readout readout = self.hidden_readout( outputs[1].dimshuffle((1, 0, 2))) if self.dropout > 0: readout = self.D(readout, train=train) readout += self.context_readout(outputs[2].dimshuffle((1, 0, 2))) # return to normal size. 
    def _step_sample(self, prev_word, prev_stat, context):
        """
        Sample one step

        :param prev_word: previous word indices [nb_samples]; a negative index
                          marks the bos position and yields an all-zero embedding
        :param prev_stat: previous hidden state [nb_samples, dec_hidden_dim]
        :param context: context matrix [nb_samples, dec_contxt_dim]
        :return: (next_prob, next_sample, next_stat)
        """

        # word embedding (note that for the first word, embedding should be all zero)
        if self.config['use_input']:
            X = T.switch(
                prev_word[:, None] < 0,
                alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']),
                self.Embed(prev_word)
            )
        else:
            X = alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim'])

        if self.dropout > 0:
            X = self.D(X, train=False)

        # apply one step of RNN
        if not self.highway:
            X_proj = self.RNN(X, C=context, init_h=prev_stat, one_step=True)
            next_stat = X_proj

            # compute the readout probability distribution and sample it
            # here the readout is a matrix, different from the learner.
            readout = self.hidden_readout(next_stat)
            if self.dropout > 0:
                readout = self.D(readout, train=False)

            if self.config['context_predict']:
                readout += self.context_readout(context)
        else:
            # highway variant: gate decides how much context vs. RNN output to read out
            xx = dot(context, self.C_x, self.b_x) + dot(prev_stat, self.H_x)  # highway gate.
            xx = self.sigmoid(xx)

            X_proj = self.RNN(X, C=context, init_h=prev_stat, one_step=True)
            next_stat = X_proj

            readout = self.hidden_readout((1 - xx) * X_proj)
            if self.dropout > 0:
                readout = self.D(readout, train=False)

            readout += self.context_readout(xx * context)

        if self.config['bigram_predict']:
            readout += self.prev_word_readout(X)

        for l in self.output_nonlinear:
            readout = l(readout)

        next_prob = self.output(readout)
        next_sample = self.rng.multinomial(pvals=next_prob).argmax(1)
        return next_prob, next_sample, next_stat

    """
    Build the sampler for sampling/greedy search/beam search
    """
    def build_sampler(self):
        """
        Build a sampler which only steps once.
        Typically it only works for one word a time?

        Compiles two theano functions:
          self.get_init_state : context -> initial hidden state
          self.sample_next    : (prev_word, prev_state, context) ->
                                (next_prob, next_sample, next_state)
        """
        logger.info("build sampler ...")
        if self.config['sample_stoch'] and self.config['sample_argmax']:
            logger.info("use argmax search!")
        elif self.config['sample_stoch'] and (not self.config['sample_argmax']):
            logger.info("use stochastic sampling!")
        elif self.config['sample_beam'] > 1:
            logger.info("use beam search! (beam_size={})".format(self.config['sample_beam']))

        # initial state of our Decoder.
        context = T.matrix()  # theano variable.

        init_h = self.Initializer(context)
        logger.info('compile the function: get_init_state')
        self.get_init_state \
            = theano.function([context], init_h, name='get_init_state')
        logger.info('done.')

        # word sampler: 1 x 1
        prev_word = T.vector('prev_word', dtype='int32')
        prev_stat = T.matrix('prev_state', dtype='float32')

        next_prob, next_sample, next_stat \
            = self._step_sample(prev_word, prev_stat, context)

        # next word probability
        logger.info('compile the function: sample_next')
        inputs = [prev_word, prev_stat, context]
        outputs = [next_prob, next_sample, next_stat]

        self.sample_next = theano.function(inputs, outputs, name='sample_next')
        logger.info('done')
        pass
""" def build_stochastic_sampler(self): context = T.matrix() init_h = self.Initializer(context) logger.info('compile the function: sample') pass """ Generate samples, either with stochastic sampling or beam-search! """ def get_sample(self, context, k=1, maxlen=30, stochastic=True, argmax=False, fixlen=False): # beam size if k > 1: assert not stochastic, 'Beam search does not support stochastic sampling!!' # fix length cannot use beam search # if fixlen: # assert k == 1 # prepare for searching sample = [] score = [] if stochastic: score = 0 live_k = 1 dead_k = 0 hyp_samples = [[]] * live_k hyp_scores = np.zeros(live_k).astype(theano.config.floatX) hyp_states = [] # get initial state of decoder RNN with context next_state = self.get_init_state(context) next_word = -1 * np.ones((1,)).astype('int32') # indicator for the first target word (bos target) # Start searching! for ii in xrange(maxlen): # print(next_word) ctx = np.tile(context, [live_k, 1]) # copy context live_k times, into a list size=[live_k] next_prob, next_word, next_state \ = self.sample_next(next_word, next_state, ctx) # wtf. if stochastic: # using stochastic sampling (or greedy sampling.) if argmax: nw = next_prob[0].argmax() next_word[0] = nw else: nw = next_word[0] sample.append(nw) score += next_prob[0, nw] if (not fixlen) and (nw == 0): # sample reached the end break else: # using beam-search # we can only computed in a flatten way! cand_scores = hyp_scores[:, None] - np.log(next_prob) cand_flat = cand_scores.flatten() ranks_flat = cand_flat.argsort()[:(k - dead_k)] # fetch the best results. 
voc_size = next_prob.shape[1] trans_index = ranks_flat / voc_size word_index = ranks_flat % voc_size costs = cand_flat[ranks_flat] # get the new hyp samples new_hyp_samples = [] new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX) new_hyp_states = [] for idx, [ti, wi] in enumerate(zip(trans_index, word_index)): new_hyp_samples.append(hyp_samples[ti] + [wi]) new_hyp_scores[idx] = copy.copy(costs[idx]) new_hyp_states.append(copy.copy(next_state[ti])) # check the finished samples new_live_k = 0 hyp_samples = [] hyp_scores = [] hyp_states = [] for idx in xrange(len(new_hyp_samples)): if (new_hyp_states[idx][-1] == 0) and (not fixlen): sample.append(new_hyp_samples[idx]) score.append(new_hyp_scores[idx]) dead_k += 1 else: new_live_k += 1 hyp_samples.append(new_hyp_samples[idx]) hyp_scores.append(new_hyp_scores[idx]) hyp_states.append(new_hyp_states[idx]) hyp_scores = np.array(hyp_scores) live_k = new_live_k if new_live_k < 1: break if dead_k >= k: break next_word = np.array([w[-1] for w in hyp_samples]) next_state = np.array(hyp_states) pass pass # end. 
    def __init__(self, config, rng, prefix='dec', mode='RNN', embed=None, copynet=False, identity=False):
        """
        Attention decoder constructor: builds the base Decoder (highway off),
        then adds the attention reader and, optionally, the CopyNet layers.

        :param copynet: if True, add the Is/Os projections used by the copy
                        mechanism over the source encodings
        :param identity: if True (copynet only), use an Identity layer for Is
                         instead of a learned Dense (dims must match)
        """
        super(DecoderAtt, self).__init__(
            config, rng, prefix, mode, embed, False)
        self.copynet = copynet
        self.identity = identity

        # attention reader: scores each source position against the decoder state
        self.attention_reader = Attention(
            self.config['dec_hidden_dim'],
            self.config['dec_contxt_dim'],
            1000,
            name='source_attention'
        )
        self._add(self.attention_reader)

        # if use copynet
        if self.copynet:
            if not self.identity:
                # Is: project source context into the embedding space
                self.Is = Dense(
                    self.config['dec_contxt_dim'],
                    self.config['dec_embedd_dim'],
                    name='in-trans'
                )
            else:
                assert self.config['dec_contxt_dim'] == self.config['dec_embedd_dim']
                self.Is = Identity(name='ini')

            # Os: project the readout back into context space for copy scoring
            self.Os = Dense(
                self.config['dec_readout_dim'],
                self.config['dec_contxt_dim'],
                name='out-trans'
            )

            self._add(self.Is)
            self._add(self.Os)

        logger.info('adjust decoder ok.')
    def prepare_xy(self, target, context=None):
        """
        Build the decoder for evaluation
            We have padded the last column of target (target[-1]) to be 0s
            The Y and Y_mask here are embedding and mask of target.
            Here for training purpose, we create a new input X and X_mask, in which
            pad one column of 0s at the beginning as start signal, and delete the last column of 0s
            self.config['use_input']: True, if False, may mean that don't input the
            current word, only h_t = g(h_t-1)+h(0)
        :param context: source encodings; only used by the copynet embedding path
        :return X: a matrix of 0s [nb_sample, 1, enc_embedd_dim], concatenate with
            Y [nb_sample, max_len-1, enc_embedd_dim]; first one is 0, latter is y[t-1]
        :return X_mask: ones column [nb_sample, 1] concatenated with Y_mask[:, :-1]
        :return Y: embedding of target Y [nb_sample, max_len_target, enc_embedd_dim]
        :return Y_mask: mask of target Y [nb_sample, max_len_target]
        :return Count: number of words in X, T.sum(X_mask, axis=1),
            used for computing entropy and ppl
        """
        if not self.copynet:
            # Word embedding
            Y, Y_mask = self.Embed(target, True)  # (nb_samples, max_len, enc_embedd_dim)
        else:
            # copynet: embedding may mix in the (projected) source context
            Y, Y_mask = self.Embed(target, True, context=self.Is(context))

        if self.config['use_input']:
            # shift right by one: position 0 is an all-zero bos vector
            X = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, Y.shape[2]), Y[:, :-1, :]], axis=1)
        else:
            X = 0 * Y

        X_mask = T.concatenate([T.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1)
        Count = T.cast(T.sum(X_mask, axis=1), dtype=theano.config.floatX)
        return X, X_mask, Y, Y_mask, Count
# context: (nb_samples, max_len, contxt_dim) X, X_mask, Y, Y_mask, Count = self.prepare_xy(target, context) # input drop-out if any. if self.dropout > 0: X = self.D(X, train=train) # Initial state of RNN Init_h = self.Initializer(context[:, 0, :]) # time 0 of each sequence embedding X = X.dimshuffle((1, 0, 2)) # shuffle to [max_len_target, nb_sample, 2*enc_hidden_dim] X_mask = X_mask.dimshuffle((1, 0)) # shuffle to [max_len_target, nb_sample] def _recurrence(x, x_mask, prev_h, cc, cm): # compute the attention prob = self.attention_reader(prev_h, cc, Smask=cm) # get the context vector after attention by context * atten_prob c = T.sum(cc * prob[:, :, None], axis=1) # get the RNN output vector of this step x_out = self.RNN(x, mask=x_mask, C=c, init_h=prev_h, one_step=True) # return RNN output, attention prob and attentioned context # x_out is used as the prev_h of next iteration return x_out, prob, c outputs, _ = theano.scan( _recurrence, sequences=[X, X_mask], outputs_info=[Init_h, None, None], non_sequences=[context, c_mask], name='decoder_scan' ) X_out, Probs, Ctx = [z.dimshuffle((1, 0, 2)) for z in outputs] #shuffle to [nb_sample, max_len_target, dec_hidden_dim] # return to normal size. 
X = X.dimshuffle((1, 0, 2)) # shuffle back to [nb_sample, max_len_target, 2*enc_hidden_dim] X_mask = X_mask.dimshuffle((1, 0)) # shuffle back to [nb_sample, max_len_target] # Readout readin = [X_out] # RNN output at each time t # a linear activation layer, take input size=dec_hidden_dim, output size=dec_voc_size # just return W*X+b [nb_sample, max_len_target, dec_voc_size] readout = self.hidden_readout(X_out) if self.dropout > 0: readout = self.D(readout, train=train) # don't know what's this # take another linear layer context_readout, return [nb_sample, 1] if self.config['context_predict']: readin += [Ctx] readout += self.context_readout(Ctx) # don't know what's this # another linear layer prev_word_readout, return [nb_sample, 1] if self.config['bigram_predict']: readin += [X] readout += self.prev_word_readout(X) # only have non-linear for maxout, so not work here for l in self.output_nonlinear: readout = l(readout) if self.copynet: readin = T.concatenate(readin, axis=-1) key = self.Os(readin) # (nb_samples, max_len_T, embed_size) :: key # (nb_samples, max_len_S, embed_size) :: context Eng = T.sum(key[:, :, None, :] * context[:, None, :, :], axis=-1) # (nb_samples, max_len_T, max_len_S) :: Eng EngSum = logSumExp(Eng, axis=2, mask=c_mask[:, None, :], c=readout) prob_dist = T.concatenate([T.exp(readout - EngSum), T.exp(Eng - EngSum) * c_mask[:, None, :]], axis=-1) else: # output after a simple softmax, return a tensor size in [nb_samples, max_len, vocab_size], indicates the probabilities of next (predicted) word prob_dist = self.output(readout) # (nb_samples, max_len, vocab_size) ''' Compute Cross-entropy! 1. _grab_prob(prob_dist, target), predicted probabilities of each target term, shape=[nb_sample, max_len_target] prob_dist is [nb_samples, max_len_source, vocab_size], target is [nb_sample, max_len_target] 2. T.log(self._grab_prob(prob_dist, target)) * X_mask, to remove the probabilities of padding terms (index=0) 3. 
    def build_representer(self, target, context, c_mask, train=True):
        """
        Very similar to build_decoder, but instead of returning the
        cross-entropy of generating the target sequences, it returns the
        next-word probability distribution and the final decoder state
        (similar to _step_sample).

        :param target: index vector of target [nb_sample, max_len]
        :param context: encoding of source text, [nb_sample, max_len, 2*enc_hidden_dim]
        :param c_mask: mask of source text
        :return prob_dist: probability of generating target sequences
        :return final_state: decoder state after consuming the target
        """
        assert context.ndim == 3, 'context must have 3 dimentions.'
        # context: (nb_samples, max_len, contxt_dim)

        # prepare the inputs
        # X : embedding of targets, padded first word as 0 [nb_sample, max_len, enc_embedd_dim]
        # X_mask : mask of targets, padded first word as 0 [nb_sample, max_len]
        # Y, Y_mask, Count : not used
        X, X_mask, _, _, _ = self.prepare_xy(target, context)

        # input drop-out if any.
        if self.dropout > 0:
            X = self.D(X, train=train)

        # Initial state of RNN
        Init_h = self.Initializer(context[:, 0, :])  # time 0 of each sequence embedding
        X = X.dimshuffle((1, 0, 2))    # shuffle to [max_len_target, nb_sample, enc_hidden_dim]
        X_mask = X_mask.dimshuffle((1, 0))  # shuffle to [max_len_target, nb_sample]

        def _recurrence(x, x_mask, prev_h, source_context, cm):
            '''
            :param x: word embedding of current target word [nb_sample, enc_hidden_dim]
            :param x_mask: mask of target [nb_sample]
            :param prev_h: hidden state of previous time t-1
            :param source_context: encoding vector of source
            :param cm: mask of source
            :return: x_out (decoding vector), attention_prob, attended context c
            '''
            # compute the attention
            attention_prob = self.attention_reader(prev_h, source_context, Smask=cm)
            # get the context vector after attention by context * atten_prob
            c = T.sum(source_context * attention_prob[:, :, None], axis=1)
            # get the RNN output vector of this step
            x_out = self.RNN(x, mask=x_mask, C=c, init_h=prev_h, one_step=True)
            # x_out is used as the prev_h of next iteration
            return x_out, attention_prob, c

        # return the outputs of _recurrence, update is ignored (the _)
        outputs, _ = theano.scan(
            _recurrence,
            sequences=[X, X_mask],
            outputs_info=[Init_h, None, None],
            non_sequences=[context, c_mask],
            name='decoder_scan'
        )

        # X_out is the decoding output [nb_sample, max_len_target, dec_hidden_dim]
        X_out, Probs, Ctx = [z.dimshuffle((1, 0, 2)) for z in outputs]
        # we are only interested in the states at final time step
        final_state = X_out[ :, -1, :]
        Ctx = Ctx[ :, -1, :]

        # return to normal size.
        X = X.dimshuffle((1, 0, 2))
        X = X[ :, -1, :]
        X_mask = X_mask.dimshuffle((1, 0))

        # Readout (final step only)
        readin = [final_state]
        readout = self.hidden_readout(final_state)
        if self.dropout > 0:
            readout = self.D(readout, train=train)

        if self.config['context_predict']:
            readin += [Ctx]
            readout += self.context_readout(Ctx)

        if self.config['bigram_predict']:
            readin += [X]
            readout += self.prev_word_readout(X)

        # only have non-linear for maxout, so not work here
        for l in self.output_nonlinear:
            readout = l(readout)

        if self.copynet:
            readin = T.concatenate(readin, axis=-1)
            key = self.Os(readin)
            # (nb_samples, max_len_T, embed_size)  :: key
            # (nb_samples, max_len_S, embed_size)  :: context
            Eng = T.sum(key[:, :, None, :] * context[:, None, :, :], axis=-1)
            # (nb_samples, max_len_T, max_len_S)   :: Eng
            EngSum = logSumExp(Eng, axis=2, mask=c_mask[:, None, :], c=readout)

            prob_dist = T.concatenate([T.exp(readout - EngSum), T.exp(Eng - EngSum) * c_mask[:, None, :]], axis=-1)
        else:
            prob_dist = self.output(readout)  # (nb_samples, vocab_size)

        return prob_dist, final_state
X = X.dimshuffle((1, 0, 2)) # shuffle back to [nb_sample, max_len_target, 2*enc_hidden_dim] X = X[ :, -1, :] X_mask = X_mask.dimshuffle((1, 0)) # shuffle back to [nb_sample, max_len_target] # Readout readin = [final_state] # RNN output at each time t # a linear activation layer, take input size=[nb_sample, max_len_target, dec_hidden_dim], output size=[nb_sample, max_len_target, dec_voc_size] # just return W*X+b [nb_sample, max_len_target, dec_voc_size] readout = self.hidden_readout(final_state) if self.dropout > 0: readout = self.D(readout, train=train) # don't know what's this # take another linear layer context_readout, return [nb_sample, 1] if self.config['context_predict']: readin += [Ctx] readout += self.context_readout(Ctx) # don't know what's this # another linear layer prev_word_readout, return [nb_sample, 1] if self.config['bigram_predict']: readin += [X] readout += self.prev_word_readout(X) # only have non-linear for maxout, so not work here for l in self.output_nonlinear: readout = l(readout) if self.copynet: readin = T.concatenate(readin, axis=-1) key = self.Os(readin) # (nb_samples, max_len_T, embed_size) :: key # (nb_samples, max_len_S, embed_size) :: context Eng = T.sum(key[:, :, None, :] * context[:, None, :, :], axis=-1) # (nb_samples, max_len_T, max_len_S) :: Eng EngSum = logSumExp(Eng, axis=2, mask=c_mask[:, None, :], c=readout) prob_dist = T.concatenate([T.exp(readout - EngSum), T.exp(Eng - EngSum) * c_mask[:, None, :]], axis=-1) else: # output after a simple softmax, return a tensor size in [nb_samples, max_len, vocab_size], indicates the probabilities of next (predicted) word prob_dist = self.output(readout) # (nb_samples, vocab_size) return prob_dist, final_state def _step_sample(self, prev_word, prev_stat, context, c_mask): """ Sample one step """ assert c_mask is not None, 'we need the source mask.' 
# word embedding (note that for the first word, embedding should be all zero) if self.config['use_input']: if not self.copynet: X = T.switch( prev_word[:, None] < 0, alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']), self.Embed(prev_word) ) else: X = T.switch( prev_word[:, None] < 0, alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']), self.Embed(prev_word, context=self.Is(context)) ) else: X = alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']) if self.dropout > 0: X = self.D(X, train=False) # apply one step of RNN Probs = self.attention_reader(prev_stat, context, c_mask) cxt = T.sum(context * Probs[:, :, None], axis=1) X_proj = self.RNN(X, C=cxt, init_h=prev_stat, one_step=True) next_stat = X_proj # compute the readout probability distribution and sample it # here the readout is a matrix, different from the learner. readout = self.hidden_readout(next_stat) readin = [next_stat] if self.dropout > 0: readout = self.D(readout, train=False) if self.config['context_predict']: readout += self.context_readout(cxt) readin += [cxt] if self.config['bigram_predict']: readout += self.prev_word_readout(X) readin += [X] for l in self.output_nonlinear: readout = l(readout) if self.copynet: readin = T.concatenate(readin, axis=-1) key = self.Os(readin) # (nb_samples, embed_size) :: key # (nb_samples, max_len_S, embed_size) :: context Eng = T.sum(key[:, None, :] * context[:, :, :], axis=-1) # (nb_samples, max_len_S) :: Eng EngSum = logSumExp(Eng, axis=-1, mask=c_mask, c=readout) next_prob = T.concatenate([T.exp(readout - EngSum), T.exp(Eng - EngSum) * c_mask], axis=-1) else: next_prob = self.output(readout) # (nb_samples, max_len, vocab_size) next_sample = self.rng.multinomial(pvals=next_prob).argmax(1) return next_prob, next_sample, next_stat def build_sampler(self): """ Build a sampler which only steps once. Typically it only works for one word a time? 
""" logger.info("build sampler ...") if self.config['sample_stoch'] and self.config['sample_argmax']: logger.info("use argmax search!") elif self.config['sample_stoch'] and (not self.config['sample_argmax']): logger.info("use stochastic sampling!") elif self.config['sample_beam'] > 1: logger.info("use beam search! (beam_size={})".format(self.config['sample_beam'])) # initial state of our Decoder. context = T.tensor3() # theano variable. c_mask = T.matrix() # mask of the input sentence. init_h = self.Initializer(context[:, 0, :]) logger.info('compile the function: get_init_state') self.get_init_state \ = theano.function([context], init_h, name='get_init_state') logger.info('done.') # word sampler: 1 x 1 prev_word = T.vector('prev_word', dtype='int32') prev_stat = T.matrix('prev_state', dtype='float32') next_prob, next_sample, next_stat \ = self._step_sample(prev_word, prev_stat, context, c_mask) # next word probability logger.info('compile the function: sample_next') inputs = [prev_word, prev_stat, context, c_mask] outputs = [next_prob, next_sample, next_stat] self.sample_next = theano.function(inputs, outputs, name='sample_next', allow_input_downcast=True) logger.info('done') pass def get_sample(self, encoding, c_mask, inputs, k=1, maxlen=30, stochastic=True, argmax=False, fixlen=False, type='extractive'): ''' Generate samples, either with stochastic sampling or beam-search! both inputs and context contain multiple sentences, so could this function generate sequence with regard to each input spontaneously? 
    def get_sample(self, encoding, c_mask, inputs,
                   k=1, maxlen=30, stochastic=True, argmax=False, fixlen=False,
                   type='extractive'):
        '''
        Generate samples, either with stochastic sampling or beam-search.
        Both inputs and context contain multiple sentences.

        :param inputs: the source text (word indices), used for extraction
        :param encoding: the encoding of input sequence on each word,
                         shape=[len(sent), 2*D], 2*D is due to bidirectional
        :param c_mask: whether x in input is not zero (is padding)
        :param k: config['sample_beam']
        :param maxlen: config['max_len']
        :param stochastic: config['sample_stoch']
        :param argmax: config['sample_argmax']
        :param fixlen: if True, never stop on the eos token
        :param type: 'extractive' restricts predictions to n-grams of the source
        :return: (sample, score, state), sorted by ascending cost
        '''
        # beam size
        if k > 1:
            assert not stochastic, 'Beam search does not support stochastic sampling!!'

        # fix length cannot use beam search
        # if fixlen:
        #     assert k == 1

        # prepare for searching
        sample = []
        score = []
        state = []
        # if stochastic:
        #     score = 0

        live_k = 1   # hypotheses still being extended
        dead_k = 0   # finished hypotheses (kept but no longer limits the beam)

        # initial prediction pool
        hyp_samples = [[]] * live_k
        hyp_scores = np.zeros(live_k).astype(theano.config.floatX)
        hyp_states = []

        # get initial state of decoder RNN with context, size = 1*D
        previous_state = self.get_init_state(encoding)
        # indicator for the first target word (bos target). Why it's [-1]?
        previous_word = -1 * np.ones((1,)).astype('int32')

        # if aim is extractive, build the set of source n-grams up front so
        # every partial prediction can be checked against the original text
        if type == 'extractive':
            input = inputs[0]
            input_set = set(input)
            sequence_set = set()
            for i in range(len(input)):      # loop over start
                for j in range(1, maxlen):   # loop over length
                    if i + j > len(input) - 1:
                        break
                    hash_token = [str(s) for s in input[i:i + j]]
                    sequence_set.add('-'.join(hash_token))
            logger.info("Possible n-grams: %d" % len(sequence_set))

        # Start searching!
        for ii in xrange(maxlen):
            # make live_k copies of context and c_mask, to predict next word at once
            encoding_copies = np.tile(encoding, [live_k, 1, 1])
            c_mask_copies = np.tile(c_mask, [live_k, 1])
            '''
            based on the live_k alive predictions, predict their next word at a time.
            Inputs:
                previous_word:   last word index of each hypothesis, size=[live_k]
                previous_state:  decoder state at t-1, size=[live_k, dec_hidden_dim]
                encoding_copies: source encodings, size=[live_k, len_source, 2*enc_hidden_dim]
                c_mask_copies:   source mask, size=[live_k, len_source]
            Returns:
                next_prob:  [live_k, voc_size] next-word probabilities
                next_word:  [live_k] multinomial samples (not useful for beam-search)
                next_state: [live_k, dec_hidden_dim] current decoder states
            '''
            next_prob, next_word, next_state \
                = self.sample_next(previous_word, previous_state, encoding_copies, c_mask_copies)

            if stochastic:
                # using stochastic sampling (or greedy sampling.)
                if argmax:
                    nw = next_prob[0].argmax()
                    next_word[0] = nw
                else:
                    nw = next_word[0]
                # sample.append(nw)
                # score += next_prob[0, nw]

                if (not fixlen) and (nw == 0):  # sample reached the end
                    break
            else:
                # using beam-search; we can only compute in a flatten way!
                # epsilon guards against log(0) = -inf
                cand_scores = hyp_scores[:, None] - np.log(next_prob + 1e-10)  # the smaller the better
                cand_flat = cand_scores.flatten()  # transform the k*V into a list of [1*kV]
                # get the (global) top k prediction words
                ranks_flat = cand_flat.argsort()[:(k - dead_k)]

                # fetch the best results: previous_sequence_index is which
                # hypothesis a flat index came from, next_word_index the word
                # NOTE(review): integer division here relies on Python 2 `/`
                # semantics; under Python 3 this would need `//` — confirm.
                voc_size = next_prob.shape[1]
                previous_sequence_index = ranks_flat / voc_size
                next_word_index = ranks_flat % voc_size
                costs = cand_flat[ranks_flat]

                # get the new hyp samples
                new_hyp_samples = []
                new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX)
                new_hyp_states = []

                # for each (hypothesis, predicted word) pair, record the extended
                # sequence, its accumulated cost, and the decoder state
                for idx, [ti, wi] in enumerate(zip(previous_sequence_index, next_word_index)):
                    new_hyp_samples.append(hyp_samples[ti] + [wi])
                    new_hyp_scores[idx] = copy.copy(costs[idx])
                    new_hyp_states.append(copy.copy(next_state[ti]))

                # check the finished samples
                new_live_k = 0
                hyp_samples = []
                hyp_scores = []
                hyp_states = []

                # check all the predictions
                for idx in xrange(len(new_hyp_samples)):
                    if (new_hyp_samples[idx][-1] == 0) and (not fixlen):
                        # the predicted word is <eos> (index 0): move this
                        # hypothesis to the final output lists
                        # (dead_k is deliberately not incremented, so finishing
                        # hypotheses do not shrink the beam)
                        # dead_k += 1
                        sample.append(new_hyp_samples[idx])
                        score.append(new_hyp_scores[idx])
                        state.append(new_hyp_states[idx])
                    else:
                        # not ended: for extractive mode, only keep hypotheses
                        # whose word sequence appears verbatim in the source
                        if type == 'extractive':
                            '''
                            only predict the subsequences that appear in the original text;
                            a prediction only reaches the final results after an <eos>
                            '''
                            if new_hyp_samples[idx][-1] not in input_set:
                                continue
                            # TODO something wrong here
                            if '-'.join([str(s) for s in new_hyp_samples[idx]]) not in sequence_set:
                                continue

                        new_live_k += 1
                        hyp_samples.append(new_hyp_samples[idx])
                        hyp_scores.append(new_hyp_scores[idx])
                        hyp_states.append(new_hyp_states[idx])

                hyp_scores = np.array(hyp_scores)
                live_k = new_live_k

                # ending condition
                if new_live_k < 1:
                    break
                # if dead_k >= k:
                #     break

                '''
                set the predicted word and hidden_state as the next_word and next_state
                '''
                previous_word = np.array([w[-1] for w in hyp_samples])
                previous_state = np.array(hyp_states)

            logger.info('\t Depth=%d, get %d outputs' % (ii, len(sample)))

        # end.
        if not stochastic:
            # dump every remaining one
            if live_k > 0:
                for idx in xrange(live_k):
                    sample.append(hyp_samples[idx])
                    score.append(hyp_scores[idx])
                    state.append(hyp_states[idx])

        # sort the result by ascending cost (best first)
        result = zip(sample, score, state)
        sorted_result = sorted(result, key=lambda entry: entry[1], reverse=False)
        sample, score, state = zip(*sorted_result)
        return sample, score, state
class FnnDecoder(Model):
    """
    Feed-forward (non-recurrent) decoder: predicts a word distribution
    directly from the context vector through two Dense layers.
    """
    def __init__(self, config, rng, prefix='fnndec'):
        """
        :param config: dict of hyper-parameters (dec_contxt_dim,
                       dec_hidden_dim, dec_voc_size)
        :param rng: Theano random stream used for multinomial sampling
        :param prefix: name prefix for this decoder's parameters
        """
        super(FnnDecoder, self).__init__()
        self.config = config
        self.rng = rng
        self.prefix = prefix
        self.name = prefix

        """
        Create Dense Predictor.
        """
        # maxout hidden layer followed by a softmax output layer
        self.Tr = Dense(self.config['dec_contxt_dim'],
                        self.config['dec_hidden_dim'],
                        activation='maxout2',
                        name='{}_Tr'.format(prefix))
        self._add(self.Tr)

        # maxout2 halves the hidden dimension, hence dec_hidden_dim / 2
        self.Pr = Dense(self.config['dec_hidden_dim'] / 2,
                        self.config['dec_voc_size'],
                        activation='softmax',
                        name='{}_Pr'.format(prefix))
        self._add(self.Pr)
        logger.info("FF decoder ok.")

    @staticmethod
    def _grab_prob(probs, X):
        # pick the probability assigned to each target index (same as
        # Decoder._grab_prob): flatten time/batch then advanced-index
        assert probs.ndim == 3

        batch_size = probs.shape[0]
        max_len = probs.shape[1]
        vocab_size = probs.shape[2]

        probs = probs.reshape((batch_size * max_len, vocab_size))
        return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape)  # advanced indexing

    def build_decoder(self, target, context):
        """
        Build the Decoder Computational Graph:
        returns the summed log-likelihood of `target` given `context`.
        """
        prob_dist = self.Pr(self.Tr(context[:, None, :]))
        log_prob = T.sum(T.log(self._grab_prob(prob_dist, target)), axis=1)
        return log_prob

    def build_sampler(self):
        # compile self.sample_next: context -> (prob_dist, multinomial sample)
        context = T.matrix()
        prob_dist = self.Pr(self.Tr(context))
        next_sample = self.rng.multinomial(pvals=prob_dist).argmax(1)
        self.sample_next = theano.function([context], [prob_dist, next_sample],
                                           name='sample_next_{}'.format(self.prefix))
        logger.info('done')

    def get_sample(self, context, argmax=True):
        """
        Draw one word for the given context: the argmax of the distribution
        when argmax=True, otherwise the pre-drawn multinomial sample.
        """
        prob, sample = self.sample_next(context)
        if argmax:
            return prob[0].argmax()
        else:
            return sample[0]
########################################################################################################################
# Encoder-Decoder Models                                                                                      ::::     #
########################################################################################################################


class RNNLM(Model):
    """
    RNN-LM, with context vector = 0.
    It is very similar with the implementation of VAE.
    """

    def __init__(self, config, n_rng, rng, mode='Evaluation'):
        super(RNNLM, self).__init__()

        self.config = config
        self.n_rng = n_rng  # numpy random stream
        self.rng = rng  # Theano random stream
        self.mode = mode
        self.name = 'rnnlm'

    def build_(self):
        """Instantiate the decoder and the optimizer (graph not compiled yet)."""
        logger.info("build the RNN-decoder")
        self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode)

        # registration:
        self._add(self.decoder)

        # objectives and optimizers
        self.optimizer = optimizers.get('adadelta')

        # saved the initial memories
        # (only the NTM variant carries an explicit external memory)
        if self.config['mode'] == 'NTM':
            self.memory = initializations.get('glorot_uniform')(
                (self.config['dec_memory_dim'], self.config['dec_memory_wdth']))

        logger.info("create the RECURRENT language model. ok")

    def compile_(self, mode='train', contrastive=False):
        # compile the computational graph.
        # INFO: the parameters.
        # mode: 'train'/ 'display'/ 'policy' / 'all'
        ps = 'params: {\n'
        for p in self.params:
            ps += '{0}: {1}\n'.format(p.name, p.eval().shape)
        ps += '}.'
        logger.info(ps)

        param_num = np.sum([np.prod(p.shape.eval()) for p in self.params])
        logger.info("total number of the parameters of the model: {}".format(param_num))

        if mode == 'train' or mode == 'all':
            if not contrastive:
                self.compile_train()
            else:
                self.compile_train_CE()

        if mode == 'display' or mode == 'all':
            self.compile_sample()

        if mode == 'inference' or mode == 'all':
            self.compile_inference()

    def compile_train(self):
        """Compile the training function (reconstruction loss + perplexity)."""
        # questions (theano variables)
        inputs = T.imatrix()  # padded input word sequence (for training)

        # the LM has no real context: zeros for RNN, the learned memory for NTM
        if self.config['mode'] == 'RNN':
            context = alloc_zeros_matrix(inputs.shape[0], self.config['dec_contxt_dim'])
        elif self.config['mode'] == 'NTM':
            context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0)
        else:
            raise NotImplementedError

        # decoding: a language model reconstructs its own input
        target = inputs
        logPxz, logPPL = self.decoder.build_decoder(target, context)

        # reconstruction loss
        loss_rec = T.mean(-logPxz)
        loss_ppl = T.exp(T.mean(-logPPL))

        # NOTE(review): L1 is computed but never added to the loss below
        L1 = T.sum([T.sum(abs(w)) for w in self.params])
        loss = loss_rec

        updates = self.optimizer.get_updates(self.params, loss)

        logger.info("compiling the compuational graph ::training function::")
        train_inputs = [inputs]

        self.train_ = theano.function(train_inputs,
                                      [loss_rec, loss_ppl],
                                      updates=updates,
                                      name='train_fun',
                                      allow_input_downcast=True
                                      )
        logger.info("pre-training functions compile done.")

        # add monitoring:
        self.monitor['context'] = context
        self._monitoring()

        # compiling monitoring
        self.compile_monitoring(train_inputs)

    @abstractmethod
    def compile_train_CE(self):
        pass

    def compile_sample(self):
        # context vectors (as)
        self.decoder.build_sampler()
        logger.info("display functions compile done.")

    @abstractmethod
    def compile_inference(self):
        pass

    def default_context(self):
        # zero context for the plain RNN-LM; the (reshaped) learned memory
        # for the NTM variant.  NOTE(review): falls through returning None
        # for any other mode — presumably unreachable, confirm.
        if self.config['mode'] == 'RNN':
            return np.zeros(shape=(1, self.config['dec_contxt_dim']), dtype=theano.config.floatX)
        elif self.config['mode'] == 'NTM':
            memory = self.memory.get_value()
            memory = memory.reshape((1, memory.shape[0], memory.shape[1]))
            return memory

    def generate_(self, context=None, max_len=None, mode='display'):
        """
        :param action: action vector to guide the question.
                       If None, use a Gaussian to simulate the action.
        :return: question sentence in natural language.
        """
        # assert self.config['sample_stoch'], 'RNNLM sampling must be stochastic'
        # assert not self.config['sample_argmax'], 'RNNLM sampling cannot use argmax'
        if context is None:
            context = self.default_context()

        args = dict(k=self.config['sample_beam'],
                    maxlen=self.config['max_len'] if not max_len else max_len,
                    stochastic=self.config['sample_stoch'] if mode == 'display' else None,
                    argmax=self.config['sample_argmax'] if mode == 'display' else None)
        sample, score = self.decoder.get_sample(context, **args)
        if not args['stochastic']:
            # beam search: length-normalize and keep the lowest-cost sample
            score = score / np.array([len(s) for s in sample])
            sample = sample[score.argmin()]
            score = score.min()
        else:
            score /= float(len(sample))

        return sample, np.exp(score)


class AutoEncoder(RNNLM):
    """
    Regular Auto-Encoder: RNN Encoder/Decoder
    """

    def __init__(self, config, n_rng, rng, mode='Evaluation'):
        # NOTE(review): super(RNNLM, self) deliberately skips RNNLM.__init__
        # and invokes Model.__init__ directly (RNNLM.__init__ needs the same
        # args, so this only avoids re-assignment) — confirm intent.
        super(RNNLM, self).__init__()

        self.config = config
        self.n_rng = n_rng  # numpy random stream
        self.rng = rng  # Theano random stream
        self.mode = mode
        self.name = 'vae'

    def build_(self):
        """Instantiate encoder, decoder and the two transformation layers."""
        logger.info("build the RNN auto-encoder")
        self.encoder = Encoder(self.config, self.rng, prefix='enc')
        if self.config['shared_embed']:
            # tie the decoder embedding to the encoder's
            self.decoder = Decoder(self.config, self.rng, prefix='dec', embed=self.encoder.Embed)
        else:
            self.decoder = Decoder(self.config, self.rng, prefix='dec')

        """
        Build the Transformation
        """
        # hidden state -> action vector (identity when dims already match)
        if self.config['nonlinear_A']:
            self.action_trans = Dense(
                self.config['enc_hidden_dim'],
                self.config['action_dim'],
                activation='tanh',
                name='action_transform'
            )
        else:
            assert self.config['enc_hidden_dim'] == self.config['action_dim'], \
                'hidden dimension must match action dimension'
            self.action_trans = Identity(name='action_transform')

        # action vector -> decoder context (identity when dims already match)
        if self.config['nonlinear_B']:
            self.context_trans = Dense(
                self.config['action_dim'],
                self.config['dec_contxt_dim'],
                activation='tanh',
                name='context_transform'
            )
        else:
            assert self.config['dec_contxt_dim'] == self.config['action_dim'], \
                'action dimension must match context dimension'
            self.context_trans = Identity(name='context_transform')

        # registration
        self._add(self.action_trans)
        self._add(self.context_trans)
        self._add(self.encoder)
        self._add(self.decoder)

        # objectives and optimizers
        self.optimizer = optimizers.get(self.config['optimizer'],
                                        kwargs={'lr': self.config['lr']})

        logger.info("create Helmholtz RECURRENT neural network. ok")

    def compile_train(self, mode='train'):
        """Compile training (and optionally sampling) functions."""
        # questions (theano variables)
        inputs = T.imatrix()  # padded input word sequence (for training)
        # NOTE(review): context is all-zeros here — the encoder output is not
        # wired into this training graph; confirm against the full file.
        context = alloc_zeros_matrix(inputs.shape[0], self.config['dec_contxt_dim'])

        assert context.ndim == 2

        # decoding.
        target = inputs
        logPxz, logPPL = self.decoder.build_decoder(target, context)

        # reconstruction loss
        loss_rec = T.mean(-logPxz)
        loss_ppl = T.exp(T.mean(-logPPL))

        # NOTE(review): L1 is computed but never added to the loss below
        L1 = T.sum([T.sum(abs(w)) for w in self.params])
        loss = loss_rec

        updates = self.optimizer.get_updates(self.params, loss)

        logger.info("compiling the compuational graph ::training function::")
        train_inputs = [inputs]

        self.train_ = theano.function(train_inputs,
                                      [loss_rec, loss_ppl],
                                      updates=updates,
                                      name='train_fun')
        logger.info("pre-training functions compile done.")

        if mode == 'display' or mode == 'all':
            """
            build the sampler function here <:::>
            """
            # context vectors (as)
            self.decoder.build_sampler()

            logger.info("display functions compile done.")

        # add monitoring:
        self._monitoring()

        # compiling monitoring
        self.compile_monitoring(train_inputs)


class NRM(Model):
    """
    Neural Responding Machine
    A Encoder-Decoder based responding model.
""" def __init__(self, config, n_rng, rng, mode='Evaluation', use_attention=False, copynet=False, identity=False): super(NRM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'nrm' self.attend = use_attention self.copynet = copynet self.identity = identity def build_(self): logger.info("build the Neural Responding Machine") # encoder-decoder:: <<==>> self.encoder = Encoder(self.config, self.rng, prefix='enc', mode=self.mode) if not self.attend: self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode) else: self.decoder = DecoderAtt(self.config, self.rng, prefix='dec', mode=self.mode, copynet=self.copynet, identity=self.identity) self._add(self.encoder) self._add(self.decoder) # objectives and optimizers # self.optimizer = optimizers.get(self.config['optimizer']) assert self.config['optimizer'] == 'adam' self.optimizer = optimizers.get(self.config['optimizer'], kwargs=dict(rng=self.rng, save=False, clipnorm = self.config['clipnorm'])) logger.info("build ok.") def compile_(self, mode='all', contrastive=False): # compile the computational graph. # INFO: the parameters. # mode: 'train'/ 'display'/ 'policy' / 'all' ps = 'params: {\n' for p in self.params: ps += '{0}: {1}\n'.format(p.name, p.eval().shape) ps += '}.' 
logger.info(ps) param_num = np.sum([np.prod(p.shape.eval()) for p in self.params]) logger.info("total number of the parameters of the model: {}".format(param_num)) if mode == 'train' or mode == 'all': self.compile_train() if mode == 'display' or mode == 'all': self.compile_sample() if mode == 'inference' or mode == 'all': self.compile_inference() def compile_train(self): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) target = T.imatrix() # padded target word sequence (for training) # encoding & decoding if not self.attend: code = self.encoder.build_encoder(inputs, None) logPxz, logPPL = self.decoder.build_decoder(target, code) else: # encode text by encoder, return encoded vector at each time (code) and mask showing non-zero elements code, _, c_mask, _ = self.encoder.build_encoder(inputs, None, return_sequence=True, return_embed=True) # feed target(index vector of target), code(encoding of source text), c_mask (mask of source text) into decoder, get objective value # logPxz,logPPL are tensors in [nb_samples,1], cross-entropy and Perplexity of each sample logPxz, logPPL = self.decoder.build_decoder(target, code, c_mask) prob, stat = self.decoder.build_representer(target, code, c_mask) # responding loss loss_rec = -logPxz # get the mean of cross-entropy of this batch loss_ppl = T.exp(-logPPL) loss = T.mean(loss_rec) updates = self.optimizer.get_updates(self.params, loss) logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs, target] self.train_ = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_fun' # , mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True) # , mode='DebugMode' ) self.train_guard = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_nanguard_fun', mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True)) self.validate_ = theano.function(train_inputs, [loss_rec, 
loss_ppl], name='validate_fun', allow_input_downcast=True) self.represent_ = theano.function(train_inputs, [prob, stat], name='represent_fun', allow_input_downcast=True ) logger.info("training functions compile done.") # # add monitoring: # self.monitor['context'] = context # self._monitoring() # # # compiling monitoring # self.compile_monitoring(train_inputs) def compile_sample(self): if not self.attend: self.encoder.compile_encoder(with_context=False) else: self.encoder.compile_encoder(with_context=False, return_sequence=True, return_embed=True) self.decoder.build_sampler() logger.info("sampling functions compile done.") def compile_inference(self): pass def generate_(self, inputs, mode='display', return_all=False): ''' Generate output sequence with regards to the given input sequences :param inputs: :param mode: :param return_all: :return: ''' # assert self.config['sample_stoch'], 'RNNLM sampling must be stochastic' # assert not self.config['sample_argmax'], 'RNNLM sampling cannot use argmax' args = dict(k=self.config['sample_beam'], maxlen=self.config['max_len'], stochastic=self.config['sample_stoch'] if mode == 'display' else None, argmax=self.config['sample_argmax'] if mode == 'display' else None) if not self.attend: context = self.encoder.encode(inputs) sample, score = self.decoder.get_sample(context, **args) else: # context: input sentence embedding, the last state of sequence # c_mask: whether x in input is not zero (is padding) context, _, c_mask, _ = self.encoder.encode(inputs) sample, score = self.decoder.get_sample(context, c_mask, **args) if return_all: return sample, score if not args['stochastic']: score = score / np.array([len(s) for s in sample]) sample = sample[score.argmin()] score = score.min() else: score /= float(len(sample)) return sample, np.exp(score) def generate_multiple(self, inputs, mode='display', return_all=True, all_ngram=True, generate_ngram=True): ''' Generate output sequence ''' # assert self.config['sample_stoch'], 'RNNLM 
sampling must be stochastic' # assert not self.config['sample_argmax'], 'RNNLM sampling cannot use argmax' args = dict(k=self.config['sample_beam'], maxlen=self.config['max_len'], stochastic=self.config['sample_stoch'] if mode == 'display' else None, argmax=self.config['sample_argmax'] if mode == 'display' else None, type=self.config['predict_type']) if not self.attend: # get the encoding of the inputs context = self.encoder.encode(inputs) # generate outputs sample, score, _ = self.decoder.get_sample(context, inputs, **args) else: # context: input sentence embedding # c_mask: whether x in input is not zero (is padding) context, _, c_mask, _ = self.encoder.encode(inputs) sample, score, _ = self.decoder.get_sample(context, c_mask, inputs, **args) if return_all: return sample, score if not args['stochastic']: score = score / np.array([len(s) for s in sample]) sample = sample[score.argmin()] score = score.min() else: score /= float(len(sample)) return sample, np.exp(score) # def evaluate_(self, inputs, outputs, idx2word, # origin=None, idx2word_o=None): # ''' # This function doesn't support the , don't use this if voc_size is set # :param inputs: # :param outputs: # :param idx2word: # :param origin: # :param idx2word_o: # :return: # ''' # # def cut_zero(sample, idx2word, idx2word_o): # Lmax = len(idx2word) # if not self.copynet: # if 0 not in sample: # return [idx2word[w] for w in sample] # return [idx2word[w] for w in sample[:sample.index(0)]] # else: # if 0 not in sample: # if origin is None: # return [idx2word[w] if w < Lmax else idx2word[inputs[w - Lmax]] # for w in sample] # else: # return [idx2word[w] if w < Lmax else idx2word_o[origin[w - Lmax]] # for w in sample] # if origin is None: # return [idx2word[w] if w < Lmax else idx2word[inputs[w - Lmax]] # for w in sample[:sample.index(0)]] # else: # return [idx2word[w] if w < Lmax else idx2word_o[origin[w - Lmax]] # for w in sample[:sample.index(0)]] # # result, _ = self.generate_(inputs[None, :]) # # if origin is 
not None: # logger.info( '[ORIGIN]: {}'.format(' '.join(cut_zero(origin.tolist(), idx2word_o, idx2word_o)))) # logger.info('[DECODE]: {}'.format(' '.join(cut_zero(result, idx2word, idx2word_o)))) # logger.info('[SOURCE]: {}'.format(' '.join(cut_zero(inputs.tolist(), idx2word, idx2word_o)))) # logger.info('[TARGET]: {}'.format(' '.join(cut_zero(outputs.tolist(), idx2word, idx2word_o)))) # # return True # def evaluate_(self, inputs, outputs, idx2word, inputs_unk=None): def cut_zero(sample, idx2word, Lmax=None): if Lmax is None: Lmax = self.config['dec_voc_size'] if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample] return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample[:sample.index(0)]] if inputs_unk is None: result, _ = self.generate_(inputs[None, :]) else: result, _ = self.generate_(inputs_unk[None, :]) a = '[SOURCE]: {}'.format(' '.join(cut_zero(inputs.tolist(), idx2word))) b = '[TARGET]: {}'.format(' '.join(cut_zero(outputs.tolist(), idx2word))) c = '[DECODE]: {}'.format(' '.join(cut_zero(result, idx2word))) print(a) if inputs_unk is not None: k = '[_INPUT]: {}\n'.format(' '.join(cut_zero(inputs_unk.tolist(), idx2word, Lmax=len(idx2word)))) print(k) a += k print(b) print(c) a += b + c return a def evaluate_multiple(self, inputs, outputs, original_input, original_outputs, samples, scores, idx2word): ''' inputs_unk is same as inputs except for filtered out all the low-freq words to 1 () return the top few keywords, number is set in config :param: original_input, same as inputs, the vector of one input sentence :param: original_outputs, vectors of corresponding multiple outputs (e.g. 
keyphrases) :return: ''' def cut_zero(sample, idx2word, Lmax=None): sample = list(sample) if Lmax is None: Lmax = self.config['dec_voc_size'] if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample] # return the string before 0 () return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample[:sample.index(0)]] # Generate keyphrases # if inputs_unk is None: # samples, scores = self.generate_multiple(inputs[None, :], return_all=True) # else: # samples, scores = self.generate_multiple(inputs_unk[None, :], return_all=True) stemmer = PorterStemmer() # Evaluation part outs = [] metrics = [] # load stopword with open(self.config['path']+'/dataset/stopword/stopword_en.txt') as stopword_file: stopword_set = set([stemmer.stem(w.strip()) for w in stopword_file]) for input_sentence, target_list, predict_list, score_list in zip(inputs, original_outputs, samples, scores): ''' enumerate each document, process target/predict/score and measure via p/r/f1 ''' target_outputs = [] predict_outputs = [] predict_scores = [] predict_set = set() correctly_matched = np.asarray([0]*max(len(target_list), len(predict_list)), dtype='int32') # stem the original input stemmed_input = [stemmer.stem(w) for w in cut_zero(input_sentence, idx2word)] # convert target index into string for target in target_list: target = cut_zero(target, idx2word) target = [stemmer.stem(w) for w in target] keep = True # whether do filtering on groundtruth phrases. 
if config['target_filter']==None, do nothing if self.config['target_filter']: match = None for i in range(len(stemmed_input) - len(target) + 1): match = None j = 0 for j in range(len(target)): if target[j] != stemmed_input[i + j]: match = False break if j == len(target)-1 and match == None: match = True break if match == True: # if match and 'appear-only', keep this phrase if self.config['target_filter'] == 'appear-only': keep = keep and True elif self.config['target_filter'] == 'non-appear-only': keep = keep and False elif match == False: # if not match and 'appear-only', discard this phrase if self.config['target_filter'] == 'appear-only': keep = keep and False # if not match and 'non-appear-only', keep this phrase elif self.config['target_filter'] == 'non-appear-only': keep = keep and True if not keep: continue target_outputs.append(target) # convert predict index into string for id, (predict, score) in enumerate(zip(predict_list, score_list)): predict = cut_zero(predict, idx2word) predict = [stemmer.stem(w) for w in predict] # filter some not good ones keep = True if len(predict) == 0: keep = False number_digit = 0 for w in predict: if w.strip() == '': keep = False if w.strip() == '': number_digit += 1 if len(predict) >= 1 and (predict[0] in stopword_set or predict[-1] in stopword_set): keep = False if len(predict) <= 1: keep = False # whether do filtering on predicted phrases. 
if config['predict_filter']==None, do nothing if self.config['predict_filter']: match = None for i in range(len(stemmed_input) - len(predict) + 1): match = None j = 0 for j in range(len(predict)): if predict[j] != stemmed_input[i + j]: match = False break if j == len(predict)-1 and match == None: match = True break if match == True: # if match and 'appear-only', keep this phrase if self.config['predict_filter'] == 'appear-only': keep = keep and True elif self.config['predict_filter'] == 'non-appear-only': keep = keep and False elif match == False: # if not match and 'appear-only', discard this phrase if self.config['predict_filter'] == 'appear-only': keep = keep and False # if not match and 'non-appear-only', keep this phrase elif self.config['predict_filter'] == 'non-appear-only': keep = keep and True key = '-'.join(predict) # remove this phrase and its score from list if not keep or number_digit == len(predict) or key in predict_set: continue predict_outputs.append(predict) predict_scores.append(score) predict_set.add(key) # check whether correct for target in target_outputs: if len(target)==len(predict): flag = True for i,w in enumerate(predict): if predict[i]!=target[i]: flag = False if flag: correctly_matched[len(predict_outputs) - 1] = 1 # print('%s correct!!!' % predict) predict_outputs = np.asarray(predict_outputs) predict_scores = np.asarray(predict_scores) # normalize the score? 
if self.config['normalize_score']: predict_scores = np.asarray([math.log(math.exp(score)/len(predict)) for predict, score in zip(predict_outputs, predict_scores)]) score_list_index = np.argsort(predict_scores) predict_outputs = predict_outputs[score_list_index] predict_scores = predict_scores[score_list_index] correctly_matched = correctly_matched[score_list_index] metric_dict = {} for number_to_predict in [5,10,15]: metric_dict['p@%d' % number_to_predict] = float(sum(correctly_matched[:number_to_predict]))/float(number_to_predict) if len(target_outputs) != 0: metric_dict['r@%d' % number_to_predict] = float(sum(correctly_matched[:number_to_predict]))/float(len(target_outputs)) else: metric_dict['r@%d' % number_to_predict] = 0 if metric_dict['p@%d' % number_to_predict]+metric_dict['r@%d' % number_to_predict] != 0: metric_dict['f1@%d' % number_to_predict]= 2*metric_dict['p@%d' % number_to_predict]*metric_dict['r@%d' % number_to_predict]/float(metric_dict['p@%d' % number_to_predict]+metric_dict['r@%d' % number_to_predict]) else: metric_dict['f1@%d' % number_to_predict] = 0 metric_dict['valid_target_number'] = len(target_outputs) metric_dict['target_number'] = len(target_list) metric_dict['correct_number@%d' % number_to_predict] = sum(correctly_matched[:number_to_predict]) metrics.append(metric_dict) # print(stuff) a = '[SOURCE]: {}\n'.format(' '.join(cut_zero(input_sentence, idx2word))) logger.info(a) b = '[TARGET]: %d/%d targets\n\t\t' % (len(target_outputs), len(target_list)) for id, target in enumerate(target_outputs): b += ' '.join(target) + '; ' b += '\n' logger.info(b) c = '[DECODE]: %d/%d predictions' % (len(predict_outputs), len(predict_list)) for id, (predict, score) in enumerate(zip(predict_outputs, predict_scores)): if correctly_matched[id]==1: c += ('\n\t\t[%.3f]'% score) + ' '.join(predict) + ' [correct!]' # print(('\n\t\t[%.3f]'% score) + ' '.join(predict) + ' [correct!]') else: c += ('\n\t\t[%.3f]'% score) + ' '.join(predict) # print(('\n\t\t[%.3f]'% 
score) + ' '.join(predict)) c += '\n' # c = '[DECODE]: {}'.format(' '.join(cut_zero(phrase, idx2word))) # if inputs_unk is not None: # k = '[_INPUT]: {}\n'.format(' '.join(cut_zero(inputs_unk.tolist(), idx2word, Lmax=len(idx2word)))) # logger.info(k) # a += k logger.info(c) a += b + c for number_to_predict in [5,10,15]: d = '@%d - Precision=%.4f, Recall=%.4f, F1=%.4f\n' % (number_to_predict, metric_dict['p@%d' % number_to_predict],metric_dict['r@%d' % number_to_predict],metric_dict['f1@%d' % number_to_predict]) logger.info(d) a += d outs.append(a) return outs, metrics def analyse_(self, inputs, outputs, idx2word): Lmax = len(idx2word) def cut_zero(sample, idx2word): if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample] return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample[:sample.index(0)]] result, _ = self.generate_(inputs[None, :]) flag = 0 source = '{}'.format(' '.join(cut_zero(inputs.tolist(), idx2word))) target = '{}'.format(' '.join(cut_zero(outputs.tolist(), idx2word))) result = '{}'.format(' '.join(cut_zero(result, idx2word))) return target == result def analyse_cover(self, inputs, outputs, idx2word): Lmax = len(idx2word) def cut_zero(sample, idx2word): if 0 not in sample: return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample] return ['{}'.format(idx2word[w].encode('utf-8')) for w in sample[:sample.index(0)]] results, _ = self.generate_(inputs[None, :], return_all=True) flag = 0 source = '{}'.format(' '.join(cut_zero(inputs.tolist(), idx2word))) target = '{}'.format(' '.join(cut_zero(outputs.tolist(), idx2word))) score = [target == '{}'.format(' '.join(cut_zero(result, idx2word))) for result in results] return max(score) ================================================ FILE: emolga/models/ntm_encdec.py ================================================ __author__ = 'jiataogu' import theano theano.config.exception_verbosity = 'high' import logging import copy import emolga.basic.objectives as objectives 
import emolga.basic.optimizers as optimizers

from emolga.layers.recurrent import *
from emolga.layers.ntm_minibatch import Controller, BernoulliController
from emolga.layers.embeddings import *
from core import Model

logger = logging.getLogger(__name__)

RNN = JZS3  # change it here for other RNN models.


class RecurrentBase(Model):
    """
    The recurrent base for SimpleRNN, GRU, JZS3, LSTM and Neural Turing Machines
    """

    def __init__(self, config, model='RNN', prefix='enc', use_contxt=True, name=None):
        super(RecurrentBase, self).__init__()
        self.config = config
        self.model = model
        self.prefix = prefix
        self.use_contxt = use_contxt
        if not name:
            self.name = self.prefix
        else:
            self.name = name

        # choose the NTM controller flavour from config
        if self.config['binary']:
            NTM = BernoulliController
        else:
            NTM = Controller

        def _build_RNN():
            # build the gated-recurrent core plus its initial-state layer
            logger.info('BUILD::>>>>>>>> Gated Recurrent Units.')
            core = RNN(
                self.config['{}_embedd_dim'.format(self.prefix)],
                self.config['{}_hidden_dim'.format(self.prefix)],
                self.config['{}_contxt_dim'.format(self.prefix)] if use_contxt else None,
                name='{}_rnn'.format(self.prefix)
            )
            if self.config['bias_code']:
                # learn the initial hidden state from the context
                init = Dense(
                    self.config['{}_contxt_dim'.format(self.prefix)],
                    self.config['{}_hidden_dim'.format(self.prefix)],
                    activation='tanh',
                    name='{}_init'.format(self.prefix)
                )
            else:
                init = Zero()
            return core, [init]

        def _build_NTM():
            """
            Build a simple Neural Turing Machine.
            We use a feedforward controller here.
            """
            logger.info('BUILD::>>>>>>>> Controller Units.')
            core = NTM(
                self.config['{}_embedd_dim'.format(self.prefix)],
                self.config['{}_memory_dim'.format(self.prefix)],
                self.config['{}_memory_wdth'.format(self.prefix)],
                self.config['{}_hidden_dim'.format(self.prefix)],
                self.config['{}_shift_width'.format(self.prefix)],
                name="{}_ntm".format(self.prefix),
                readonly=self.config['{}_read-only'.format(self.prefix)],
                curr_input=self.config['{}_curr_input'.format(self.prefix)],
                recurrence=self.config['{}_recurrence'.format(self.prefix)]
            )
            if self.config['bias_code']:
                raise NotImplementedError
            else:
                # initial write/read attention (softmax over memory slots)
                # and initial controller state
                init_w = T.nnet.softmax(initializations.get('glorot_uniform')((1, self.config['{}_memory_dim'.format(self.prefix)])))
                init_r = T.nnet.softmax(initializations.get('glorot_uniform')((1, self.config['{}_memory_dim'.format(self.prefix)])))
                init_c = initializations.get('glorot_uniform')((1, self.config['{}_hidden_dim'.format(self.prefix)]))
            return core, [init_w, init_r, init_c]

        if model == 'RNN':
            self.core, self.init = _build_RNN()
        elif model == 'NTM':
            self.core, self.init = _build_NTM()
        else:
            raise NotImplementedError

        self._add(self.core)
        # NOTE(review): only the RNN init layers are registered as
        # sub-layers; the NTM init tensors are raw shared variables.
        if model == 'RNN':
            for init in self.init:
                self._add(init)
        self.set_name(name)

    # *****************************************************************
    # For Theano inputs.
    def get_context(self, context):
        # get context if "use_context" is True
        info = dict()
        # if self.use_contxt:
        if self.model == 'RNN':
            # context is a matrix (nb_samples, context_dim)
            info['C'] = context
            info['init_h'] = self.init[0](context)
        elif self.model == 'NTM':
            # context is a tensor (nb_samples, memory_dim, memory_width)
            info['M'] = context
            if self.config['bias_code']:
                raise NotImplementedError
            else:
                # broadcast the learned 1-row initializers over the batch
                info['init_ww'] = T.repeat(self.init[0], context.shape[0], axis=0)
                info['init_wr'] = T.repeat(self.init[1], context.shape[0], axis=0)
                info['init_c'] = T.repeat(self.init[2], context.shape[0], axis=0)
        else:
            raise NotImplementedError
        return info

    def loop(self, X, X_mask, info=None, return_sequence=False, return_full=False):
        # run the whole recurrence over a (masked) sequence
        if self.model == 'NTM':
            info['return_full'] = return_full
        Z = self.core(X, X_mask, return_sequence=return_sequence, **info)
        self._monitoring()
        return Z

    def step(self, X, prev_info):
        # run one step of the Recurrence
        if self.model == 'RNN':
            out = self.core(X, one_step=True, **prev_info)
            next_state = out
            next_info = {'init_h': out, 'C': prev_info['C']}
        elif self.model == 'NTM':
            # out = (memory, write-weights, read-weights, controller state)
            out = self.core(X, one_step=True, **prev_info)
            next_state = out[3]
            next_info = dict()
            next_info['M'] = out[0]
            next_info['init_ww'] = out[1]
            next_info['init_wr'] = out[2]
            next_info['init_c'] = out[3]
        else:
            raise NotImplementedError
        return next_state, next_info

    def build_(self):
        # build a sampler in theano function for sampling.
        if self.model == 'RNN':
            context = T.matrix()  # theano variable.

            logger.info('compile the function: get_init_state')
            info = self.get_context(context)
            self.get_init_state \
                = theano.function([context], info['init_h'], name='get_init_state')

            # ****************************************************
            # context = T.matrix()       # theano variable.
            prev_X = T.matrix('prev_X', dtype='float32')
            prev_stat = T.matrix('prev_state', dtype='float32')
            prev_info = dict()
            prev_info['C'] = context
            prev_info['init_h'] = prev_stat

            next_stat, next_info \
                = self.step(prev_X, prev_info)

            logger.info('compile the function: sample_next_state')
            inputs = [prev_X, prev_stat, context]
            outputs = next_stat

            self.sample_next_state = theano.function(inputs, outputs, name='sample_next_state')
        elif self.model == 'NTM':
            memory = T.tensor3()  # theano variable

            logger.info('compile the funtion: get_init_state')
            info = self.get_context(memory)
            self.get_init_wr = theano.function([memory], info['init_wr'], name='get_init_wr')
            self.get_init_ww = theano.function([memory], info['init_ww'], name='get_init_ww')
            self.get_init_c = theano.function([memory], info['init_c'], name='get_init_c')

            # ****************************************************
            # memory = T.tensor3()       # theano variable
            prev_X = T.matrix('prev_X', dtype='float32')
            prev_ww = T.matrix('prev_ww', dtype='float32')
            prev_wr = T.matrix('prev_wr', dtype='float32')
            prev_stat = T.matrix('prev_stat', dtype='float32')

            prev_info = {'M': memory, 'init_ww': prev_ww,
                         'init_wr': prev_wr, 'init_c': prev_stat}

            logger.info('compile the function: sample_next_0123')
            next_stat, next_info = self.step(prev_X, prev_info)

            inputs = [prev_X, prev_ww, prev_wr, memory, prev_stat]
            outputs = [next_info['M'], next_info['init_ww'],
                       next_info['init_wr'], next_stat]

            self.sample_next_state = theano.function(inputs, outputs, name='sample_next_state')
        else:
            raise NotImplementedError
        logger.info('done.')

    # *****************************************************************
    # For Numpy inputs.
    def get_init(self, context):
        # numpy-side counterpart of get_context (uses the compiled functions)
        info = dict()
        if self.model == 'RNN':
            info['init_h'] = self.get_init_state(context)
            info['C'] = context
        elif self.model == 'NTM':
            if hasattr(self, 'get_init_ww'):
                info['init_ww'] = self.get_init_ww(context)
            if hasattr(self, 'get_init_wr'):
                info['init_wr'] = self.get_init_wr(context)
            if hasattr(self, 'get_init_c'):
                info['init_c'] = self.get_init_c(context)
            info['M'] = context
        else:
            raise NotImplementedError
        return info

    def get_next_state(self, prev_X, prev_info):
        # numpy-side one-step transition (uses the compiled sampler)
        if self.model == 'RNN':
            next_state = self.sample_next_state(
                prev_X, prev_info['init_h'], prev_info['C'])
            next_info = dict()
            next_info['C'] = prev_info['C']
            next_info['init_h'] = next_state
        elif self.model == 'NTM':
            next_info = dict()
            assert 'init_ww' in prev_info
            assert 'init_wr' in prev_info
            assert 'init_c' in prev_info
            assert 'M' in prev_info

            next_info['M'], next_info['init_ww'], \
                next_info['init_wr'], next_info['init_c'] = self.sample_next_state(
                    prev_X, prev_info['init_ww'], prev_info['init_wr'],
                    prev_info['M'], prev_info['init_c'])
            next_state = next_info['init_c']
        else:
            raise NotImplementedError
        return next_state, next_info


class Encoder(Model):
    """
    Recurrent Neural Network/Neural Turing Machine-based Encoder
    It is used to compute the context vector.
""" def __init__(self, config, rng, prefix='enc', mode='RNN', embed=None): """ mode = RNN: use a RNN Encoder mode = NTM: use a NTM Encoder """ super(Encoder, self).__init__() self.config = config self.rng = rng self.prefix = prefix self.mode = mode self.name = prefix """ Create all elements of the Encoder's Computational graph """ # create Embedding layers logger.info("{}_create embedding layers.".format(self.prefix)) if embed: self.Embed = embed else: self.Embed = Embedding( self.config['enc_voc_size'], self.config['enc_embedd_dim'], name="{}_embed".format(self.prefix)) self._add(self.Embed) # create Recurrent Base logger.info("{}_create Recurrent layers.".format(self.prefix)) if self.mode == 'RNN' and self.config['bidirectional']: self.Forward = RecurrentBase(self.config, model=self.mode, name='forward', prefix='enc', use_contxt=self.config['enc_use_contxt']) self.Bakward = RecurrentBase(self.config, model=self.mode, name='backward', prefix='enc', use_contxt=self.config['enc_use_contxt']) self._add(self.Forward) self._add(self.Bakward) else: self.Recurrence = RecurrentBase(self.config, model=self.mode, name='encoder', prefix='enc', use_contxt=self.config['enc_use_contxt']) self._add(self.Recurrence) # there is no readout layers for encoder. 
def build_encoder(self, source, context=None): """ Build the Encoder Computational Graph """ if self.mode == 'RNN': # we use a Recurrent Neural Network Encoder (GRU) if not self.config['bidirectional']: X, X_mask = self.Embed(source, True) info = self.Recurrence.get_context(context) X_out = self.Recurrence.loop(X, X_mask, info, return_sequence=False) else: source_back = source[:, ::-1] X1, X1_mask = self.Embed(source, True) X2, X2_mask = self.Embed(source_back, True) info = self.Forward.get_context(context) X_out1 = self.Forward.loop(X1, X1_mask, info, return_sequence=False) info = self.Bakward.get_context(context) X_out2 = self.Bakward.loop(X2, X2_mask, info, return_sequence=False) # X_out = T.concatenate([X_out1, X_out2], axis=1) X_out = 0.5 * X_out1 + 0.5 * X_out2 elif self.mode == 'NTM': if not self.config['bidirectional']: X, X_mask = self.Embed(source, True) else: source_back = source[:, ::-1] X1, X1_mask = self.Embed(source, True) X2, X2_mask = self.Embed(source_back, True) X = T.concatenate([X1, X2], axis=1) X_mask = T.concatenate([X1_mask, X2_mask], axis=1) info = self.Recurrence.get_context(context) # X_out here is the extracted memorybook. which can be used as a the initial memory of NTM Decoder. X_out = self.Recurrence.loop(X, X_mask, info, return_sequence=False, return_full=True)[0] else: raise NotImplementedError self._monitoring() return X_out class Decoder(Model): """ Recurrent Neural Network-based Decoder. It is used for: (1) Evaluation: compute the probability P(Y|X) (2) Prediction: sample the best result based on P(Y|X) (3) Beam-search (4) Scheduled Sampling (how to implement it?) """ def __init__(self, config, rng, prefix='dec', mode='RNN', embed=None): """ mode = RNN: use a RNN Decoder mode = NTM: use a NTM Decoder (Neural Turing Machine) """ super(Decoder, self).__init__() self.config = config self.rng = rng self.prefix = prefix self.name = prefix self.mode = mode """ Create all elements of the Decoder's computational graph. 
""" # create Embedding layers logger.info("{}_create embedding layers.".format(self.prefix)) if embed: self.Embed = embed else: self.Embed = Embedding( self.config['dec_voc_size'], self.config['dec_embedd_dim'], name="{}_embed".format(self.prefix)) self._add(self.Embed) # create Recurrent Base. logger.info("{}_create Recurrent layers.".format(self.prefix)) self.Recurrence = RecurrentBase(self.config, model=self.mode, name='decoder', prefix='dec', use_contxt=self.config['dec_use_contxt']) # create readout layers logger.info("_create Readout layers") # 1. hidden layers readout. self.hidden_readout = Dense( self.config['dec_hidden_dim'], self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_hidden_readout".format(self.prefix) ) # 2. previous word readout self.prev_word_readout = None if self.config['bigram_predict']: self.prev_word_readout = Dense( self.config['dec_embedd_dim'], self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_prev_word_readout".format(self.prefix), learn_bias=False ) # 3. 
context readout self.context_readout = None if self.config['context_predict']: self.context_readout = Dense( self.config['dec_contxt_dim'], self.config['output_dim'] if self.config['deep_out'] else self.config['dec_voc_size'], activation='linear', name="{}_context_readout".format(self.prefix), learn_bias=False ) # option: deep output (maxout) if self.config['deep_out']: self.activ = Activation(config['deep_out_activ']) # self.dropout = Dropout(rng=self.rng, p=config['dropout']) self.output_nonlinear = [self.activ] # , self.dropout] self.output = Dense( self.config['output_dim'] / 2 if config['deep_out_activ'] == 'maxout2' else self.config['output_dim'], self.config['dec_voc_size'], activation='softmax', name="{}_output".format(self.prefix), learn_bias=False ) else: self.output_nonlinear = [] self.output = Activation('softmax') # registration: self._add(self.Recurrence) self._add(self.hidden_readout) self._add(self.context_readout) self._add(self.prev_word_readout) self._add(self.output) if self.config['deep_out']: self._add(self.activ) # self._add(self.dropout) logger.info("create decoder ok.") @staticmethod def _grab_prob(probs, X): assert probs.ndim == 3 batch_size = probs.shape[0] max_len = probs.shape[1] vocab_size = probs.shape[2] probs = probs.reshape((batch_size * max_len, vocab_size)) return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape) # advanced indexing """ Build the decoder for evaluation """ def prepare_xy(self, target): # Word embedding Y, Y_mask = self.Embed(target, True) # (nb_samples, max_len, embedding_dim) if self.config['use_input']: X = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, Y.shape[2]), Y[:, :-1, :]], axis=1) else: X = 0 * Y # option ## drop words. 
X_mask = T.concatenate([T.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1) Count = T.cast(T.sum(X_mask, axis=1), dtype=theano.config.floatX) return X, X_mask, Y, Y_mask, Count def build_decoder(self, target, context=None, return_count=False): """ Build the Decoder Computational Graph """ X, X_mask, Y, Y_mask, Count = self.prepare_xy(target) info = self.Recurrence.get_context(context) X_out = self.Recurrence.loop(X, X_mask, info=info, return_sequence=True) # Readout readout = self.hidden_readout(X_out) if self.config['context_predict']: # warning: only supports RNN, cannot supports Memory readout += self.context_readout(context).dimshuffle(0, 'x', 1) \ if self.config['bigram_predict']: readout += self.prev_word_readout(X) for l in self.output_nonlinear: readout = l(readout) prob_dist = self.output(readout) # (nb_samples, max_len, vocab_size) # log_old = T.sum(T.log(self._grab_prob(prob_dist, target)), axis=1) log_prob = T.sum(T.log(self._grab_prob(prob_dist, target)) * X_mask, axis=1) log_ppl = log_prob / Count self._monitoring() if return_count: return log_prob, Count else: return log_prob, log_ppl """ Sampling Functions. """ def _step_embed(self, prev_word): # word embedding (note that for the first word, embedding should be all zero) if self.config['use_input']: X = T.switch( prev_word[:, None] < 0, alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']), self.Embed(prev_word) ) else: X = alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']) return X def _step_sample(self, X, next_stat, context): # compute the readout probability distribution and sample it # here the readout is a matrix, different from the learner. 
readout = self.hidden_readout(next_stat) if context.ndim == 2 and self.config['context_predict']: # warning: only supports RNN, cannot supports Memory readout += self.context_readout(context) if self.config['bigram_predict']: readout += self.prev_word_readout(X) for l in self.output_nonlinear: readout = l(readout) next_prob = self.output(readout) next_sample = self.rng.multinomial(pvals=next_prob).argmax(1) return next_prob, next_sample """ Build the sampler for sampling/greedy search/beam search """ def build_sampler(self): """ Build a sampler which only steps once. Typically it only works for one word a time? """ prev_word = T.vector('prev_word', dtype='int32') prev_X = self._step_embed(prev_word) self.prev_embed = theano.function([prev_word], prev_X) self.Recurrence.build_() prev_X = T.matrix('prev_X', dtype='float32') next_stat = T.matrix('next_state', dtype='float32') logger.info('compile the function: sample_next') if self.config['mode'] == 'RNN': context = T.matrix('context') else: context = T.tensor3('memory') next_prob, next_sample = self._step_sample(prev_X, next_stat, context) self.sample_next = theano.function([prev_X, next_stat, context], [next_prob, next_sample], name='sample_next', on_unused_input='warn') logger.info('done') """ Generate samples, either with stochastic sampling or beam-search! """ def get_sample(self, context, k=1, maxlen=30, stochastic=True, argmax=False): # beam size if k > 1: assert not stochastic, 'Beam search does not support stochastic sampling!!' 
# prepare for searching sample = [] score = [] if stochastic: score = 0 live_k = 1 dead_k = 0 hyp_samples = [[]] * live_k hyp_scores = np.zeros(live_k).astype(theano.config.floatX) hyp_states = [] hyp_infos = [] # get initial state of decoder Recurrence next_info = self.Recurrence.get_init(context) # print 'sample with memory:\t', next_info['M'][0] # next_state = next_info['init_h'] next_word = -1 * np.ones((1,)).astype('int32') # indicator for the first target word (bos target) print '<0e~k>' # Start searching! for ii in xrange(maxlen): # print next_word ctx = np.tile(context, [live_k, 1]) next_embedding = self.prev_embed(next_word) next_state, next_info = self.Recurrence.get_next_state(next_embedding, next_info) next_prob, next_word = self.sample_next(next_embedding, next_state, ctx) # wtf. if stochastic: # using stochastic sampling (or greedy sampling.) if argmax: nw = next_prob[0].argmax() next_word[0] = nw else: nw = next_word[0] sample.append(nw) score += next_prob[0, nw] if nw == 0: # sample reached the end break else: # using beam-search # we can only computed in a flatten way! # Recently beam-search does not support NTM !! cand_scores = hyp_scores[:, None] - np.log(next_prob) cand_flat = cand_scores.flatten() ranks_flat = cand_flat.argsort()[:(k - dead_k)] # fetch the best results. 
voc_size = next_prob.shape[1] trans_index = ranks_flat / voc_size word_index = ranks_flat % voc_size costs = cand_flat[ranks_flat] # get the new hyp samples new_hyp_samples = [] new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX) new_hyp_states = [] new_hyp_infos = {w: [] for w in next_info} for idx, [ti, wi] in enumerate(zip(trans_index, word_index)): new_hyp_samples.append(hyp_samples[ti] + [wi]) new_hyp_scores[idx] = copy.copy(costs[idx]) new_hyp_states.append(copy.copy(next_state[ti])) for w in next_info: new_hyp_infos[w].append(copy.copy(next_info[w][ti])) # check the finished samples new_live_k = 0 hyp_samples = [] hyp_scores = [] hyp_states = [] hyp_infos = {w: [] for w in next_info} for idx in xrange(len(new_hyp_samples)): if new_hyp_states[idx][-1] == 0: sample.append(new_hyp_samples[idx]) score.append(new_hyp_scores[idx]) dead_k += 1 else: new_live_k += 1 hyp_samples.append(new_hyp_samples[idx]) hyp_scores.append(new_hyp_scores[idx]) hyp_states.append(new_hyp_states[idx]) for w in next_info: hyp_infos[w].append(copy.copy(new_hyp_infos[w][ti])) hyp_scores = np.array(hyp_scores) live_k = new_live_k if new_live_k < 1: break if dead_k >= k: break next_word = np.array([w[-1] for w in hyp_samples]) next_state = np.array(hyp_states) for w in hyp_infos: next_info[w] = np.array(hyp_infos[w]) pass pass # end. if not stochastic: # dump every remaining one if live_k > 0: for idx in xrange(live_k): sample.append(hyp_samples[idx]) score.append(hyp_scores[idx]) return sample, score class RNNLM(Model): """ RNN-LM, with context vector = 0. It is very similar with the implementation of VAE. 
""" def __init__(self, config, n_rng, rng, mode='Evaluation'): super(RNNLM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'rnnlm' def build_(self): logger.info("build the RNN/NTM-decoder") self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode) # registration: self._add(self.decoder) # objectives and optimizers self.optimizer = optimizers.get('adadelta') # saved the initial memories self.memory = initializations.get('glorot_uniform')( (self.config['dec_memory_dim'], self.config['dec_memory_wdth'])) logger.info("create the RECURRENT language model. ok") def compile_(self, mode='train', contrastive=False): # compile the computational graph. # INFO: the parameters. # mode: 'train'/ 'display'/ 'policy' / 'all' ps = 'params: {\n' for p in self.params: ps += '{0}: {1}\n'.format(p.name, p.eval().shape) ps += '}.' logger.info(ps) param_num = np.sum([np.prod(p.shape.eval()) for p in self.params]) logger.info("total number of the parameters of the model: {}".format(param_num)) if mode == 'train' or mode == 'all': if not contrastive: self.compile_train() else: self.compile_train_CE() if mode == 'display' or mode == 'all': self.compile_sample() if mode == 'inference' or mode == 'all': self.compile_inference() def compile_train(self): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['dec_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError # decoding. 
target = inputs logPxz, logPPL = self.decoder.build_decoder(target, context) # reconstruction loss loss_rec = T.mean(-logPxz) loss_ppl = T.exp(T.mean(-logPPL)) L1 = T.sum([T.sum(abs(w)) for w in self.params]) loss = loss_rec updates = self.optimizer.get_updates(self.params, loss) logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs] self.train_ = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_fun') logger.info("pre-training functions compile done.") # add monitoring: self.monitor['context'] = context self._monitoring() # compiling monitoring self.compile_monitoring(train_inputs) def compile_train_CE(self): pass def compile_sample(self): # context vectors (as) self.decoder.build_sampler() logger.info("display functions compile done.") def compile_inference(self): pass def default_context(self): if self.config['mode'] == 'RNN': return np.zeros(shape=(1, self.config['dec_contxt_dim']), dtype=theano.config.floatX) elif self.config['mode'] == 'NTM': memory = self.memory.get_value() memory = memory.reshape((1, memory.shape[0], memory.shape[1])) return memory def generate_(self, context=None, mode='display', max_len=None): """ :param action: action vector to guide the question. If None, use a Gaussian to simulate the action. :return: question sentence in natural language. 
""" # assert self.config['sample_stoch'], 'RNNLM sampling must be stochastic' # assert not self.config['sample_argmax'], 'RNNLM sampling cannot use argmax' if context is None: context = self.default_context() args = dict(k=self.config['sample_beam'], maxlen=self.config['max_len'] if not max_len else max_len, stochastic=self.config['sample_stoch'] if mode == 'display' else None, argmax=self.config['sample_argmax'] if mode == 'display' else None) sample, score = self.decoder.get_sample(context, **args) if not args['stochastic']: score = score / np.array([len(s) for s in sample]) sample = sample[score.argmin()] score = score.min() else: score /= float(len(sample)) return sample, np.exp(score) class Helmholtz(RNNLM): """ Helmholtz Machine as an probabilistic version AutoEncoder It is very similar with Variational Auto-Encoder We implement the Helmholtz RNN as well as Helmholtz Turing Machine here. Reference: Reweighted Wake-Sleep http://arxiv.org/abs/1406.2751 """ def __init__(self, config, n_rng, rng, mode='RNN'): super(RNNLM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'helmholtz' def build_(self): logger.info("build the Helmholtz auto-encoder") if self.mode == 'NTM': assert self.config['enc_memory_dim'] == self.config['dec_memory_dim'] assert self.config['enc_memory_wdth'] == self.config['dec_memory_wdth'] self.encoder = Encoder(self.config, self.rng, prefix='enc', mode=self.mode) if self.config['shared_embed']: self.decoder = Decoder(self.config, self.rng, prefix='dec', embed=self.encoder.Embed, mode=self.mode) else: self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode) # registration self._add(self.encoder) self._add(self.decoder) # The main difference between VAE and HM is that we can use # a more flexible prior instead of Gaussian here. # for example, we use a sigmoid prior here. 
# prior distribution is a bias layer if self.mode == 'RNN': # here we first forcus on Helmholtz Turing Machine # Thus the RNN version will be copied from Dial-DRL projects. raise NotImplementedError elif self.mode == 'NTM': self.Prior = MemoryLinear( self.config['enc_memory_dim'], self.config['enc_memory_wdth'], activation='sigmoid', name='prior_proj', has_input=False ) self.Post = MemoryLinear( self.config['enc_memory_dim'], self.config['enc_memory_wdth'], activation='sigmoid', name='post_proj', has_input=True ) self.Trans = MemoryLinear( self.config['enc_memory_dim'], self.config['enc_memory_wdth'], activation='linear', name='trans_proj', has_input=True ) # registration self._add(self.Prior) self._add(self.Post) self._add(self.Trans) else: raise NotImplementedError # objectives and optimizers self.optimizer = optimizers.get(self.config['optimizer']) # saved the initial memories self.memory = initializations.get('glorot_uniform')( (self.config['dec_memory_dim'], self.config['dec_memory_wdth'])) logger.info("create Helmholtz Machine. 
ok") def compile_train(self): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) batch_size = inputs.shape[0] if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError # encoding memorybook = self.encoder.build_encoder(inputs, context) # get Q(a|y) = sigmoid q_dis = self.Post(memorybook) # repeats L = self.config['repeats'] target = T.repeat(inputs[:, None, :], L, axis=1).reshape((inputs.shape[0] * L, inputs.shape[1])) q_dis = T.repeat(q_dis[:, None, :, :], L, axis=1).reshape((q_dis.shape[0] * L, q_dis.shape[1], q_dis.shape[2])) # sample actions u = self.rng.uniform(q_dis.shape) action = T.cast(u <= q_dis, dtype=theano.config.floatX) # compute the exact probability for actions logQax = action * T.log(q_dis) + (1 - action) * T.log(1 - q_dis) logQax = logQax.sum(axis=-1).sum(axis=-1) # decoding. memorybook2 = self.Trans(action) logPxa, count = self.decoder.build_decoder(target, memorybook2, return_count=True) # prior. p_dis = self.Prior() logPa = action * T.log(p_dis) + (1 - action) * T.log(1 - p_dis) logPa = logPa.sum(axis=-1).sum(axis=-1) """ Compute the weights """ # reshape logQax = logQax.reshape((batch_size, L)) logPa = logPa.reshape((batch_size, L)) logPxa = logPxa.reshape((batch_size, L)) logPx_a = logPa + logPxa # normalizing the weights log_wk = logPx_a - logQax log_bpk = logPa - logQax log_w_sum = logSumExp(log_wk, axis=1) log_bp_sum = logSumExp(log_bpk, axis=1) log_wnk = log_wk - log_w_sum log_bpnk = log_bpk - log_bp_sum # unbiased log-likelihood estimator logPx = T.mean(log_w_sum - T.log(L)) perplexity = T.exp(-T.mean((log_w_sum - T.log(L)) / count)) """ Compute the Loss function """ # loss = weights * log [p(a)p(x|a)/q(a|x)] weights = T.exp(log_wnk) bp = T.exp(log_bpnk) bq = 1. 
/ L ess = T.mean(1 / T.sum(weights ** 2, axis=1)) factor = self.config['factor'] if self.config['variant_control']: lossQ = -T.mean(T.sum(logQax * (weights - bq), axis=1)) # log q(a|x) lossPa = -T.mean(T.sum(logPa * (weights - bp), axis=1)) # log p(a) lossPxa = -T.mean(T.sum(logPxa * weights, axis=1)) # log p(x|a) lossP = lossPxa + lossPa updates = self.optimizer.get_updates(self.params, [lossP + factor * lossQ, weights, bp]) else: lossQ = -T.mean(T.sum(logQax * weights, axis=1)) # log q(a|x) lossPa = -T.mean(T.sum(logPa * weights, axis=1)) # log p(a) lossPxa = -T.mean(T.sum(logPxa * weights, axis=1)) # log p(x|a) lossP = lossPxa + lossPa updates = self.optimizer.get_updates(self.params, [lossP + factor * lossQ, weights]) logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs] self.train_ = theano.function(train_inputs, [lossPa, lossPxa, lossQ, perplexity, ess], updates=updates, name='train_fun') logger.info("pre-training functions compile done.") def compile_sample(self): # # for Typical Auto-encoder, only conditional generation is useful. # inputs = T.imatrix() # padded input word sequence (for training) # if self.config['mode'] == 'RNN': # context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) # elif self.config['mode'] == 'NTM': # context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) # else: # raise NotImplementedError # pass # sample the memorybook p_dis = self.Prior() l = T.iscalar() u = self.rng.uniform((l, p_dis.shape[-2], p_dis.shape[-1])) binarybook = T.cast(u <= p_dis, dtype=theano.config.floatX) memorybook = self.Trans(binarybook) self.take = theano.function([l], [binarybook, memorybook], name='take_action') # compile the sampler. self.decoder.build_sampler() logger.info('sampler function compile done.') def compile_inference(self): """ build the hidden action prediction. 
""" inputs = T.imatrix() # padded input word sequence (for training) if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError # encoding memorybook = self.encoder.build_encoder(inputs, context) # get Q(a|y) = sigmoid(.|Posterior * encoded) q_dis = self.Post(memorybook) p_dis = self.Prior() self.inference_ = theano.function([inputs], [memorybook, q_dis, p_dis]) logger.info("inference function compile done.") def default_context(self): return self.take(1)[-1] class BinaryHelmholtz(RNNLM): """ Helmholtz Machine as an probabilistic version AutoEncoder It is very similar with Variational Auto-Encoder We implement the Helmholtz RNN as well as Helmholtz Turing Machine here. Reference: Reweighted Wake-Sleep http://arxiv.org/abs/1406.2751 """ def __init__(self, config, n_rng, rng, mode='RNN'): super(RNNLM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'helmholtz' def build_(self): logger.info("build the Binary-Helmholtz auto-encoder") if self.mode == 'NTM': assert self.config['enc_memory_dim'] == self.config['dec_memory_dim'] assert self.config['enc_memory_wdth'] == self.config['dec_memory_wdth'] self.encoder = Encoder(self.config, self.rng, prefix='enc', mode=self.mode) if self.config['shared_embed']: self.decoder = Decoder(self.config, self.rng, prefix='dec', embed=self.encoder.Embed, mode=self.mode) else: self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode) # registration self._add(self.encoder) self._add(self.decoder) # The main difference between VAE and HM is that we can use # a more flexible prior instead of Gaussian here. # for example, we use a sigmoid prior here. 
# prior distribution is a bias layer if self.mode == 'RNN': # here we first forcus on Helmholtz Turing Machine # Thus the RNN version will be copied from Dial-DRL projects. raise NotImplementedError elif self.mode == 'NTM': self.Prior = MemoryLinear( self.config['enc_memory_dim'], self.config['enc_memory_wdth'], activation='sigmoid', name='prior_proj', has_input=False ) # registration self._add(self.Prior) else: raise NotImplementedError # objectives and optimizers self.optimizer = optimizers.get(self.config['optimizer']) # saved the initial memories self.memory = T.nnet.sigmoid(initializations.get('glorot_uniform')( (self.config['dec_memory_dim'], self.config['dec_memory_wdth']))) logger.info("create Helmholtz Machine. ok") def compile_train(self): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) batch_size = inputs.shape[0] if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError # encoding memorybook = self.encoder.build_encoder(inputs, context) # get Q(a|y) = sigmoid q_dis = memorybook # repeats L = self.config['repeats'] target = T.repeat(inputs[:, None, :], L, axis=1).reshape((inputs.shape[0] * L, inputs.shape[1])) q_dis = T.repeat(q_dis[:, None, :, :], L, axis=1).reshape((q_dis.shape[0] * L, q_dis.shape[1], q_dis.shape[2])) # sample actions u = self.rng.uniform(q_dis.shape) action = T.cast(u <= q_dis, dtype=theano.config.floatX) # compute the exact probability for actions logQax = action * T.log(q_dis) + (1 - action) * T.log(1 - q_dis) logQax = logQax.sum(axis=-1).sum(axis=-1) # decoding. memorybook2 = action logPxa, count = self.decoder.build_decoder(target, memorybook2, return_count=True) # prior. 
p_dis = self.Prior() logPa = action * T.log(p_dis) + (1 - action) * T.log(1 - p_dis) logPa = logPa.sum(axis=-1).sum(axis=-1) """ Compute the weights """ # reshape logQax = logQax.reshape((batch_size, L)) logPa = logPa.reshape((batch_size, L)) logPxa = logPxa.reshape((batch_size, L)) logPx_a = logPa + logPxa # normalizing the weights log_wk = logPx_a - logQax log_bpk = logPa - logQax log_w_sum = logSumExp(log_wk, axis=1) log_bp_sum = logSumExp(log_bpk, axis=1) log_wnk = log_wk - log_w_sum log_bpnk = log_bpk - log_bp_sum # unbiased log-likelihood estimator logPx = T.mean(log_w_sum - T.log(L)) perplexity = T.exp(-T.mean((log_w_sum - T.log(L)) / count)) """ Compute the Loss function """ # loss = weights * log [p(a)p(x|a)/q(a|x)] weights = T.exp(log_wnk) bp = T.exp(log_bpnk) bq = 1. / L ess = T.mean(1 / T.sum(weights ** 2, axis=1)) factor = self.config['factor'] if self.config['variant_control']: lossQ = -T.mean(T.sum(logQax * (weights - bq), axis=1)) # log q(a|x) lossPa = -T.mean(T.sum(logPa * (weights - bp), axis=1)) # log p(a) lossPxa = -T.mean(T.sum(logPxa * weights, axis=1)) # log p(x|a) lossP = lossPxa + lossPa updates = self.optimizer.get_updates(self.params, [lossP + factor * lossQ, weights, bp]) else: lossQ = -T.mean(T.sum(logQax * weights, axis=1)) # log q(a|x) lossPa = -T.mean(T.sum(logPa * weights, axis=1)) # log p(a) lossPxa = -T.mean(T.sum(logPxa * weights, axis=1)) # log p(x|a) lossP = lossPxa + lossPa updates = self.optimizer.get_updates(self.params, [lossP + factor * lossQ, weights]) logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs] self.train_ = theano.function(train_inputs, [lossPa, lossPxa, lossQ, perplexity, ess], updates=updates, name='train_fun') logger.info("pre-training functions compile done.") def compile_sample(self): # # for Typical Auto-encoder, only conditional generation is useful. 
# inputs = T.imatrix() # padded input word sequence (for training) # if self.config['mode'] == 'RNN': # context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) # elif self.config['mode'] == 'NTM': # context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) # else: # raise NotImplementedError # pass # sample the memorybook p_dis = self.Prior() l = T.iscalar() u = self.rng.uniform((l, p_dis.shape[-2], p_dis.shape[-1])) binarybook = T.cast(u <= p_dis, dtype=theano.config.floatX) self.take = theano.function([l], binarybook, name='take_action') # compile the sampler. self.decoder.build_sampler() logger.info('sampler function compile done.') def compile_inference(self): """ build the hidden action prediction. """ inputs = T.imatrix() # padded input word sequence (for training) if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError # encoding memorybook = self.encoder.build_encoder(inputs, context) # get Q(a|y) = sigmoid(.|Posterior * encoded) q_dis = memorybook p_dis = self.Prior() self.inference_ = theano.function([inputs], [memorybook, q_dis, p_dis]) logger.info("inference function compile done.") def default_context(self): return self.take(1) class AutoEncoder(RNNLM): """ Regular Auto-Encoder: RNN Encoder/Decoder Regular Neural Turing Machine """ def __init__(self, config, n_rng, rng, mode='Evaluation'): super(RNNLM, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.mode = mode self.name = 'autoencoder' def build_(self): logger.info("build the RNN/NTM auto-encoder") self.encoder = Encoder(self.config, self.rng, prefix='enc', mode=self.mode) if self.config['shared_embed']: self.decoder = Decoder(self.config, self.rng, prefix='dec', embed=self.encoder.Embed, mode=self.mode) else: 
self.decoder = Decoder(self.config, self.rng, prefix='dec', mode=self.mode) # registration self._add(self.encoder) self._add(self.decoder) # objectives and optimizers self.optimizer = optimizers.get(self.config['optimizer']) # saved the initial memories self.memory = initializations.get('glorot_uniform')( (self.config['dec_memory_dim'], self.config['dec_memory_wdth'])) logger.info("create Autoencoder Network. ok") def compile_train(self, mode='train'): # questions (theano variables) inputs = T.imatrix() # padded input word sequence (for training) if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError # encoding memorybook = self.encoder.build_encoder(inputs, context) # decoding. target = inputs logPxz, logPPL = self.decoder.build_decoder(target, memorybook) # reconstruction loss loss_rec = T.mean(-logPxz) loss_ppl = T.exp(T.mean(-logPPL)) loss = loss_rec updates = self.optimizer.get_updates(self.params, loss) logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs] self.train_ = theano.function(train_inputs, [loss_rec, loss_ppl], updates=updates, name='train_fun') self.test = theano.function(train_inputs, [loss_rec, loss_ppl], name='test_fun') logger.info("pre-training functions compile done.") def compile_sample(self): # for Typical Auto-encoder, only conditional generation is useful. 
inputs = T.imatrix() # padded input word sequence (for training) if self.config['mode'] == 'RNN': context = alloc_zeros_matrix(inputs.shape[0], self.config['enc_contxt_dim']) elif self.config['mode'] == 'NTM': context = T.repeat(self.memory[None, :, :], inputs.shape[0], axis=0) else: raise NotImplementedError pass # encoding memorybook = self.encoder.build_encoder(inputs, context) self.memorize = theano.function([inputs], memorybook, name='memorize') # compile the sampler. self.decoder.build_sampler() logger.info('sampler function compile done.') ================================================ FILE: emolga/models/pointers.py ================================================ __author__ = 'jiataogu' import theano import logging import copy from emolga.layers.recurrent import * from emolga.layers.ntm_minibatch import Controller from emolga.layers.embeddings import * from emolga.layers.attention import * from emolga.layers.highwayNet import * from emolga.models.encdec import * from core import Model # theano.config.exception_verbosity = 'high' logger = logging #.getLogger(__name__) RNN = GRU # change it here for other RNN models. class PtrDecoder(Model): """ RNN-Decoder for Pointer Networks """ def __init__(self, config, rng, prefix='ptrdec'): super(PtrDecoder, self).__init__() self.config = config self.rng = rng self.prefix = prefix """ Create all elements of the Decoder's computational graph. 
""" # create Initialization Layers logger.info("{}_create initialization layers.".format(self.prefix)) self.Initializer = Dense( config['ptr_contxt_dim'], config['ptr_hidden_dim'], activation='tanh', name="{}_init".format(self.prefix) ) # create RNN cells logger.info("{}_create RNN cells.".format(self.prefix)) self.RNN = RNN( self.config['ptr_embedd_dim'], self.config['ptr_hidden_dim'], self.config['ptr_contxt_dim'], name="{}_cell".format(self.prefix) ) self._add(self.Initializer) self._add(self.RNN) # create readout layers logger.info("_create Attention-Readout layers") self.attender = Attention( self.config['ptr_hidden_dim'], self.config['ptr_source_dim'], self.config['ptr_middle_dim'], name='{}_attender'.format(self.prefix) ) self._add(self.attender) @staticmethod def grab_prob(probs, X): assert probs.ndim == 3 batch_size = probs.shape[0] max_len = probs.shape[1] vocab_size = probs.shape[2] probs = probs.reshape((batch_size * max_len, vocab_size)) return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape) # advanced indexing @staticmethod def grab_source(source, target): # source : (nb_samples, source_num, source_dim) # target : (nb_samples, target_num) assert source.ndim == 3 batch_size = source.shape[0] source_num = source.shape[1] source_dim = source.shape[2] target_num = target.shape[1] source_flt = source.reshape((batch_size * source_num, source_dim)) target_idx = (target + (T.arange(batch_size) * source_num)[:, None]).reshape((batch_size * target_num,)) value = source_flt[target_idx].reshape((batch_size, target_num, source_dim)) return value def build_decoder(self, inputs, source, target, smask=None, tmask=None, context=None): """ Build the Pointer Network Decoder Computational Graph """ # inputs : (nb_samples, source_num, ptr_embedd_dim) # source : (nb_samples, source_num, source_dim) # smask : (nb_samples, source_num) # target : (nb_samples, target_num) # tmask : (nb_samples, target_num) # context: (nb_sample, context_dim) # initialized 
hidden state. assert context is not None Init_h = self.Initializer(context) # target is the source inputs. X = self.grab_source(inputs, target) # (nb_samples, target_num, source_dim) X = T.concatenate([alloc_zeros_matrix(X.shape[0], 1, X.shape[2]), X[:, :-1, :]], axis=1) X = X.dimshuffle((1, 0, 2)) # tmask = tmask.dimshuffle((1, 0)) # eat by recurrent net def _recurrence(x, prev_h, c, s, s_mask): # RNN read-out x_out = self.RNN(x, mask=None, C=c, init_h=prev_h, one_step=True) s_out = self.attender(x_out, s, s_mask, return_log=True) return x_out, s_out outputs, _ = theano.scan( _recurrence, sequences=[X], outputs_info=[Init_h, None], non_sequences=[context, source, smask] ) log_prob_dist = outputs[-1].dimshuffle((1, 0, 2)) # tmask = tmask.dimshuffle((1, 0)) log_prob = T.sum(self.grab_prob(log_prob_dist, target) * tmask, axis=1) return log_prob """ Sample one step """ def _step_sample(self, prev_idx, prev_stat, context, inputs, source, smask): X = T.switch( prev_idx[:, None] < 0, alloc_zeros_matrix(prev_idx.shape[0], self.config['ptr_embedd_dim']), self.grab_source(inputs, prev_idx[:, None]) ) # one step RNN X_out = self.RNN(X, C=context, init_h=prev_stat, one_step=True) next_stat = X_out # compute the attention read-out next_prob = self.attender(X_out, source, smask) next_sample = self.rng.multinomial(pvals=next_prob).argmax(1) return next_prob, next_sample, next_stat def build_sampler(self): """ Build a sampler which only steps once. """ logger.info("build sampler ...") if self.config['sample_stoch'] and self.config['sample_argmax']: logger.info("use argmax search!") elif self.config['sample_stoch'] and (not self.config['sample_argmax']): logger.info("use stochastic sampling!") elif self.config['sample_beam'] > 1: logger.info("use beam search! (beam_size={})".format(self.config['sample_beam'])) # initial state of our Decoder. context = T.matrix() # theano variable. 
init_h = self.Initializer(context) logger.info('compile the function: get_init_state') self.get_init_state \ = theano.function([context], init_h, name='get_init_state') logger.info('done.') # sampler: 1 x 1 prev_idx = T.vector('prev_idx', dtype='int32') prev_stat = T.matrix('prev_state', dtype='float32') inputs = T.tensor3() source = T.tensor3() smask = T.imatrix() next_prob, next_sample, next_stat \ = self._step_sample(prev_idx, prev_stat, context, inputs, source, smask) # next word probability logger.info('compile the function: sample_next') inputs = [prev_idx, prev_stat, context, inputs, source, smask] outputs = [next_prob, next_sample, next_stat] self.sample_next = theano.function(inputs, outputs, name='sample_next') logger.info('done') pass """ Generate samples, either with stochastic sampling or beam-search! """ def get_sample(self, context, inputs, source, smask, k=1, maxlen=30, stochastic=True, argmax=False, fixlen=False): # beam size if k > 1: assert not stochastic, 'Beam search does not support stochastic sampling!!' # fix length cannot use beam search # if fixlen: # assert k == 1 # prepare for searching sample = [] score = [] if stochastic: score = 0 live_k = 1 dead_k = 0 hyp_samples = [[]] * live_k hyp_scores = np.zeros(live_k).astype(theano.config.floatX) hyp_states = [] # get initial state of decoder RNN with context next_state = self.get_init_state(context) next_word = -1 * np.ones((1,)).astype('int32') # indicator for the first target word (bos target) # Start searching! for ii in xrange(maxlen): # print next_word ctx = np.tile(context, [live_k, 1]) ipt = np.tile(inputs, [live_k, 1, 1]) sor = np.tile(source, [live_k, 1, 1]) smk = np.tile(smask, [live_k, 1]) next_prob, next_word, next_state \ = self.sample_next(next_word, next_state, ctx, ipt, sor, smk) # wtf. if stochastic: # using stochastic sampling (or greedy sampling.) 
if argmax: nw = next_prob[0].argmax() next_word[0] = nw else: nw = next_word[0] sample.append(nw) score += next_prob[0, nw] if (not fixlen) and (nw == 0): # sample reached the end break else: # using beam-search # we can only computed in a flatten way! cand_scores = hyp_scores[:, None] - np.log(next_prob) cand_flat = cand_scores.flatten() ranks_flat = cand_flat.argsort()[:(k - dead_k)] # fetch the best results. voc_size = next_prob.shape[1] trans_index = ranks_flat / voc_size word_index = ranks_flat % voc_size costs = cand_flat[ranks_flat] # get the new hyp samples new_hyp_samples = [] new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX) new_hyp_states = [] for idx, [ti, wi] in enumerate(zip(trans_index, word_index)): new_hyp_samples.append(hyp_samples[ti] + [wi]) new_hyp_scores[idx] = copy.copy(costs[idx]) new_hyp_states.append(copy.copy(next_state[ti])) # check the finished samples new_live_k = 0 hyp_samples = [] hyp_scores = [] hyp_states = [] for idx in xrange(len(new_hyp_samples)): if (new_hyp_states[idx][-1] == 0) and (not fixlen): sample.append(new_hyp_samples[idx]) score.append(new_hyp_scores[idx]) dead_k += 1 else: new_live_k += 1 hyp_samples.append(new_hyp_samples[idx]) hyp_scores.append(new_hyp_scores[idx]) hyp_states.append(new_hyp_states[idx]) hyp_scores = np.array(hyp_scores) live_k = new_live_k if new_live_k < 1: break if dead_k >= k: break next_word = np.array([w[-1] for w in hyp_samples]) next_state = np.array(hyp_states) pass pass # end. if not stochastic: # dump every remaining one if live_k > 0: for idx in xrange(live_k): sample.append(hyp_samples[idx]) score.append(hyp_scores[idx]) return sample, score class PointerDecoder(Model): """ RNN-Decoder for Pointer Networks [version 2] Pointer to 2 place once a time. """ def __init__(self, config, rng, prefix='ptrdec'): super(PointerDecoder, self).__init__() self.config = config self.rng = rng self.prefix = prefix """ Create all elements of the Decoder's computational graph. 
""" # create Initialization Layers logger.info("{}_create initialization layers.".format(self.prefix)) self.Initializer = Dense( config['ptr_contxt_dim'], config['ptr_hidden_dim'], activation='tanh', name="{}_init".format(self.prefix) ) # create RNN cells logger.info("{}_create RNN cells.".format(self.prefix)) self.RNN = RNN( self.config['ptr_embedd_dim'], self.config['ptr_hidden_dim'], self.config['ptr_contxt_dim'], name="{}_cell".format(self.prefix) ) self._add(self.Initializer) self._add(self.RNN) # create 2 attention heads logger.info("_create Attention-Readout layers") self.att_head = Attention( self.config['ptr_hidden_dim'], self.config['ptr_source_dim'], self.config['ptr_middle_dim'], name='{}_head_attender'.format(self.prefix) ) self.att_tail = Attention( self.config['ptr_hidden_dim'], self.config['ptr_source_dim'], self.config['ptr_middle_dim'], name='{}_tail_attender'.format(self.prefix) ) self._add(self.att_head) self._add(self.att_tail) @staticmethod def grab_prob(probs, X): assert probs.ndim == 3 batch_size = probs.shape[0] max_len = probs.shape[1] vocab_size = probs.shape[2] probs = probs.reshape((batch_size * max_len, vocab_size)) return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape) # advanced indexing @staticmethod def grab_source(source, target): # source : (nb_samples, source_num, source_dim) # target : (nb_samples, target_num) assert source.ndim == 3 batch_size = source.shape[0] source_num = source.shape[1] source_dim = source.shape[2] target_num = target.shape[1] source_flt = source.reshape((batch_size * source_num, source_dim)) target_idx = (target + (T.arange(batch_size) * source_num)[:, None]).reshape((batch_size * target_num,)) value = source_flt[target_idx].reshape((batch_size, target_num, source_dim)) return value def build_decoder(self, inputs, source, target, smask=None, tmask=None, context=None): """ Build the Pointer Network Decoder Computational Graph """ # inputs : (nb_samples, source_num, ptr_embedd_dim) # 
source : (nb_samples, source_num, source_dim) # smask : (nb_samples, source_num) # target : (nb_samples, target_num) # tmask : (nb_samples, target_num) # context: (nb_sample, context_dim) # initialized hidden state. assert context is not None Init_h = self.Initializer(context) # target is the source inputs. X = self.grab_source(inputs, target) # (nb_samples, target_num, source_dim) nb_dim = X.shape[0] tg_num = X.shape[1] sc_dim = X.shape[2] # since it changes to two pointers once a time: # concatenate + reshape def _get_ht(A, mask=False): if A.ndim == 2: B = A[:, -1:] if mask: B *= 0. A = T.concatenate([A, B], axis=1) return A[:, ::2], A[:, 1::2] else: B = A[:, -1:, :] print B.ndim if mask: B *= 0. A = T.concatenate([A, B], axis=1) return A[:, ::2, :], A[:, 1::2, :] Xh, Xt = _get_ht(X) Th, Tt = _get_ht(target) Mh, Mt = _get_ht(tmask, mask=True) Xa = Xh + Xt Xa = T.concatenate([alloc_zeros_matrix(nb_dim, 1, sc_dim), Xa[:, :-1, :, :]], axis=1) Xa = Xa.dimshuffle((1, 0, 2)) # eat by recurrent net def _recurrence(x, prev_h, c, s, s_mask): # RNN read-out x_out = self.RNN(x, mask=None, C=c, init_h=prev_h, one_step=True) h_out = self.att_head(x_out, s, s_mask, return_log=True) t_out = self.att_tail(x_out, s, s_mask, return_log=True) return x_out, h_out, t_out outputs, _ = theano.scan( _recurrence, sequences=[Xa], outputs_info=[Init_h, None, None], non_sequences=[context, source, smask] ) log_prob_head = outputs[1].dimshuffle((1, 0, 2)) log_prob_tail = outputs[2].dimshuffle((1, 0, 2)) log_prob = T.sum(self.grab_prob(log_prob_head, Th) * Mh, axis=1) \ + T.sum(self.grab_prob(log_prob_tail, Tt) * Mt, axis=1) return log_prob """ Sample one step """ def _step_sample(self, prev_idx_h, prev_idx_t, prev_stat, context, inputs, source, smask): X = T.switch( prev_idx_h[:, None] < 0, alloc_zeros_matrix(prev_idx_h.shape[0], self.config['ptr_embedd_dim']), self.grab_source(inputs, prev_idx_h[:, None]) + self.grab_source(inputs, prev_idx_t[:, None]) ) # one step RNN X_out = self.RNN(X, 
C=context, init_h=prev_stat, one_step=True) next_stat = X_out # compute the attention read-out next_prob_h = self.att_head(X_out, source, smask) next_sample_h = self.rng.multinomial(pvals=next_prob_h).argmax(1) next_prob_t = self.att_tail(X_out, source, smask) next_sample_t = self.rng.multinomial(pvals=next_prob_t).argmax(1) return next_prob_h, next_sample_h, next_prob_t, next_sample_t, next_stat def build_sampler(self): """ Build a sampler which only steps once. """ logger.info("build sampler ...") if self.config['sample_stoch'] and self.config['sample_argmax']: logger.info("use argmax search!") elif self.config['sample_stoch'] and (not self.config['sample_argmax']): logger.info("use stochastic sampling!") elif self.config['sample_beam'] > 1: logger.info("use beam search! (beam_size={})".format(self.config['sample_beam'])) # initial state of our Decoder. context = T.matrix() # theano variable. init_h = self.Initializer(context) logger.info('compile the function: get_init_state') self.get_init_state \ = theano.function([context], init_h, name='get_init_state') logger.info('done.') # sampler: 1 x 1 prev_idxh = T.vector('prev_idxh', dtype='int32') prev_idxt = T.vector('prev_idxt', dtype='int32') prev_stat = T.matrix('prev_state', dtype='float32') inputs = T.tensor3() source = T.tensor3() smask = T.imatrix() next_prob_h, next_sample_h, next_prob_t, next_sample_t, next_stat \ = self._step_sample(prev_idxh, prev_idxt, prev_stat, context, inputs, source, smask) # next word probability logger.info('compile the function: sample_next') inputs = [prev_idxh, prev_idxt, prev_stat, context, inputs, source, smask] outputs = [next_prob_h, next_sample_h, next_prob_t, next_sample_t, next_stat] self.sample_next = theano.function(inputs, outputs, name='sample_next') logger.info('done') pass """ Generate samples, either with stochastic sampling or beam-search! 
""" def get_sample(self, context, inputs, source, smask, k=1, maxlen=30, stochastic=True, argmax=False, fixlen=False): # beam size if k > 1: assert not stochastic, 'Beam search does not support stochastic sampling!!' # fix length cannot use beam search # if fixlen: # assert k == 1 # prepare for searching sample = [] score = [] if stochastic: score = 0 live_k = 1 dead_k = 0 hyp_samples = [[]] * live_k hyp_scores = np.zeros(live_k).astype(theano.config.floatX) hyp_states = [] # get initial state of decoder RNN with context next_state = self.get_init_state(context) next_wordh = -1 * np.ones((1,)).astype('int32') # indicator for the first target word (bos target) next_wordt = -1 * np.ones((1,)).astype('int32') # Start searching! for ii in xrange(maxlen): # print next_word ctx = np.tile(context, [live_k, 1]) ipt = np.tile(inputs, [live_k, 1, 1]) sor = np.tile(source, [live_k, 1, 1]) smk = np.tile(smask, [live_k, 1]) next_probh, next_wordh, next_probt, next_wordt, next_state \ = self.sample_next(next_wordh, next_wordt, next_state, ctx, ipt, sor, smk) # wtf. if stochastic: # using stochastic sampling (or greedy sampling.) if argmax: nw = next_probh[0].argmax() next_wordh[0] = nw else: nw = next_wordh[0] sample.append(nw) score += next_probh[0, nw] if (not fixlen) and (nw == 0): # sample reached the end break if argmax: nw = next_probt[0].argmax() next_wordt[0] = nw else: nw = next_wordt[0] sample.append(nw) score += next_probt[0, nw] if (not fixlen) and (nw == 0): # sample reached the end break else: # using beam-search # I don't know how to apply 2 point beam-search # we can only computed in a flatten way! assert True, 'In this stage, I do not know how to use Beam-search for this problem.' 
return sample, score class MemNet(Model): """ Memory Networks: ==> Assign a Matrix to store rules """ def __init__(self, config, rng, learn_memory=False, prefix='mem'): super(MemNet, self).__init__() self.config = config self.rng = rng # Theano random stream self.prefix = prefix self.init = initializations.get('glorot_uniform') if learn_memory: self.memory = self.init((self.config['mem_size'], self.config['mem_source_dim'])) self.memory.name = '{}_inner_memory'.format(self.prefix) self.params += [self.memory] """ Create the read-head of the MemoryNets """ if self.config['mem_type'] == 'dnn': self.attender = Attention( config['mem_hidden_dim'], config['mem_source_dim'], config['mem_middle_dim'], name='{}_attender'.format(self.prefix) ) else: self.attender = CosineAttention( config['mem_hidden_dim'], config['mem_source_dim'], use_pipe=config['mem_use_pipe'], name='{}_attender'.format(self.prefix) ) self._add(self.attender) def __call__(self, key, memory=None, mem_mask=None, out_memory=None): # key: (nb_samples, mem_hidden_dim) # memory: (nb_samples, mem_size, mem_source_dim) nb_samples = key.shape[0] if not memory: memory = T.repeat(self.memory[None, :, :], nb_samples, axis=0) mem_mask = None if memory.ndim == 2: memory = T.repeat(memory[None, :, :], nb_samples, axis=0) probout = self.attender(key, memory, mem_mask) # (nb_samples, mem_size) if self.config['mem_att_drop'] > 0: probout = T.clip(probout - self.config['mem_att_drop'], 0, 1) if out_memory is None: readout = T.sum(memory * probout[:, :, None], axis=1) else: readout = T.sum(out_memory * probout[:, :, None], axis=1) return readout, probout class PtrNet(Model): """ Pointer Networks [with/without] External Rule Memory """ def __init__(self, config, n_rng, rng, name='PtrNet', w_mem=True): super(PtrNet, self).__init__() self.config = config self.n_rng = n_rng # numpy random stream self.rng = rng # Theano random stream self.name = name self.w_mem = w_mem def build_(self, encoder=None): logger.info("build the 
Pointer Networks") # encoder if not encoder: self.encoder = Encoder(self.config, self.rng, prefix='enc1') self._add(self.encoder) else: self.encoder = encoder if self.config['mem_output_mem']: self.encoder_out = Encoder(self.config, self.rng, prefix='enc_out') self._add(self.encoder_out) # twice encoding if self.config['ptr_twice_enc']: self.encoder2 = Encoder(self.config, self.rng, prefix='enc2', use_context=True) self._add(self.encoder2) # pointer decoder self.ptrdec = PtrDecoder(self.config, self.rng) # PtrDecoder(self.config, self.rng) self._add(self.ptrdec) # memory grabber self.grabber = MemNet(self.config, self.rng) self._add(self.grabber) # memory predictor :: alternative :: if self.config['use_predict']: logger.info('create a predictor AS Long Term Memory.s') if self.config['pred_type'] == 'highway': self.predictor = HighwayNet(self.config['mem_hidden_dim'], self.config['pred_depth'], activation='relu', name='phw') elif self.config['pred_type'] == 'dense': self.predictor = Dense(self.config['mem_hidden_dim'], self.config['mem_hidden_dim'], name='pdnn') elif self.config['pred_type'] == 'encoder': config = self.config # config['enc_embedd_dim'] = 300 # config['enc_hidden_dim'] = 300 self.predictor = Encoder(config, self.rng, prefix='enc3', use_context=False) else: NotImplementedError self._add(self.predictor) # objectives and optimizers assert self.config['optimizer'] == 'adam' self.optimizer = optimizers.get(self.config['optimizer'], kwargs=dict(rng=self.rng, save=self.config['save_updates'])) def build_train(self, memory=None, out_memory=None, compile_train=False, guide=None): # training function for Pointer Networks indices = T.imatrix() # padded word indices (for training) target = T.imatrix() # target indices (leading to relative locations) tmask = T.imatrix() # target masks pmask = T.cast(1 - T.eq(target[:, 0], 0), dtype='float32') assert memory is not None, 'we must have an input memory' if self.config['mem_output_mem']: assert out_memory is not None, 
'we must have an output memory' # L1 of memory loss_mem = T.sum(abs(T.mean(memory, axis=0))) # encoding if not self.config['ptr_twice_enc']: source, inputs, smask, tail = self.encoder.build_encoder(indices, None, return_embed=True, return_sequence=True) # grab memory readout, probout = self.grabber(tail, memory) if not self.config['use_tail']: tailx = tail * 0.0 else: tailx = tail if not self.config['use_memory']: readout *= 0.0 # concatenate context = T.concatenate([tailx, readout], axis=1) # if predict ? # predictor: minimize || readout - predict ||^2 if self.config['use_predict']: if self.config['pred_type'] == 'encoder': predict = self.predictor.build_encoder(indices, None, return_sequence=False) else: predict = self.predictor(tail) # reconstruction loss [note that we only compute loss for correct memory read.] loss_r = 0.5 * T.sum(pmask * T.sum(T.sqr(predict - readout), axis=-1).reshape(pmask.shape)) / T.sum(pmask) # use predicted readout to compute loss contextz = T.concatenate([tailx, predict], axis=1) sourcez, inputsz, smaskz = source, inputs, smask else: tail = self.encoder.build_encoder(indices, None, return_sequence=False) # grab memory readout, probout = self.grabber(tail, memory, out_memory) # get PrtNet input if not self.config['use_tail']: tailx = tail * 0.0 else: tailx = tail if not self.config['use_memory']: readout *= 0.0 # concatenate context0 = T.concatenate([tailx, readout], axis=1) # twice encoding ? source, inputs, smask, context = self.encoder2.build_encoder( indices, context=context0, return_embed=True, return_sequence=True) # if predict ? # predictor: minimize | readout - predict ||^2 if self.config['use_predict']: if self.config['pred_type'] == 'encoder': predict = self.predictor.build_encoder(indices, None, return_sequence=False) else: predict = self.predictor(tail) # reconstruction loss [note that we only compute loss for correct memory read.] 
loss_r = 0.5 * T.sum(pmask * T.sum(T.sqr(predict - readout), axis=-1).reshape(pmask.shape)) / T.sum(pmask) dist = T.sum(T.sum(T.sqr(tail - readout), axis=-1).reshape(pmask.shape) * pmask) / T.sum(pmask) # use predicted readout to compute loss context1 = T.concatenate([tailx, predict], axis=1) # twice encoding.. sourcez, inputsz, smaskz, contextz = self.encoder2.build_encoder( indices, context=context1, return_embed=True, return_sequence=True) # pointer decoder & loss logProb = self.ptrdec.build_decoder(inputs, source, target, smask, tmask, context) loss = T.mean(-logProb) # if predict? if self.config['use_predict']: logProbz = self.ptrdec.build_decoder( inputsz, sourcez, target, smaskz, tmask, contextz) loss_z = -T.sum(pmask * logProbz.reshape(pmask.shape)) / T.sum(pmask) # if guidance ? if guide: # attention loss # >>>>>>> BE CAUTION !!! <<<<<< # guide vector may contains '-1' which needs a mask for that. mask = T.ones_like(guide) * (1 - T.eq(guide, -1)) loss_g = T.mean( -T.sum( T.log(PtrDecoder.grab_prob(probout[:, None, :], guide)), axis=1).reshape(mask.shape) * mask ) # attention accuracy attend = probout.argmax(axis=1, keepdims=True) maxp = T.sum(probout.max(axis=1).reshape(mask.shape) * mask) / T.cast(T.sum(mask), 'float32') error = T.sum((abs(attend - guide) * mask) > 0) / T.cast(T.sum(mask), 'float32') if self.config['mem_learn_guide']: loss += loss_g # loss += 0.1 * loss_mem if compile_train: train_inputs = [indices, target, tmask, memory] if guide: train_inputs += [guide] logger.info("compiling the compuational graph ::training function::") updates = self.optimizer.get_updates(self.params, loss) self.train_ = theano.function(train_inputs, loss, updates=updates, name='train_sub') logger.info("training functions compile done.") # output the building results for Training outputs = [loss] if guide: outputs += [maxp, error] outputs += [indices, target, tmask] if self.config['use_predict']: outputs += [loss_r, loss_z, dist, readout] return outputs def 
build_sampler(self, memory=None, out_mem=None): # training function for Pointer Networks indices = T.imatrix() # padded word indices (for training) # encoding if not self.config['ptr_twice_enc']: # encoding source, inputs, smask, tail = self.encoder.build_encoder(indices, None, return_embed=True, return_sequence=True) # grab memory readout, probout = self.grabber(tail, memory, out_mem) if not self.config['use_tail']: tail *= 0.0 if not self.config['use_memory']: readout *= 0.0 # concatenate context = T.concatenate([tail, readout], axis=1) else: tail = self.encoder.build_encoder(indices, None, return_sequence=False) # grab memory readout, probout = self.grabber(tail, memory, out_mem) if not self.config['use_tail']: tail *= 0.0 if not self.config['use_memory']: readout *= 0.0 # concatenate context0 = T.concatenate([tail, readout], axis=1) # twice encoding ? source, inputs, smask, context = self.encoder2.build_encoder( indices, context=context0, return_embed=True, return_sequence=True) # monitoring self.monitor['attention_prob'] = probout self._monitoring() return context, source, smask, inputs, indices def build_predict_sampler(self): # training function for Pointer Networks indices = T.imatrix() # padded word indices (for training) flag = True # encoding if not self.config['ptr_twice_enc']: # encoding source, inputs, smask, tail = self.encoder.build_encoder(indices, None, return_embed=True, return_sequence=True) # predict memory if self.config['pred_type'] == 'encoder': readout = self.predictor.build_encoder(indices, None, return_sequence=False) else: readout = self.predictor(tail) if not self.config['use_tail']: tail *= 0.0 if not self.config['use_memory']: readout *= 0.0 # concatenate context = T.concatenate([tail, readout], axis=1) else: tail = self.encoder.build_encoder(indices, None, return_sequence=False) # predict memory if self.config['pred_type'] == 'encoder': readout = self.predictor.build_encoder(indices, None, return_sequence=False) else: readout = 
self.predictor(tail) if not self.config['use_tail']: tail *= 0.0 if not self.config['use_memory']: readout *= 0.0 # concatenate context0 = T.concatenate([tail, readout], axis=1) # twice encoding ? source, inputs, smask, context = self.encoder2.build_encoder( indices, context=context0, return_embed=True, return_sequence=True) return context, source, smask, inputs, indices def generate_(self, inputs, context, source, smask): args = dict(k=4, maxlen=5, stochastic=False, argmax=False) sample, score = self.ptrdec.get_sample(context, inputs, source, smask, **args) if not args['stochastic']: score = score / np.array([len(s) for s in sample]) sample = sample[score.argmin()] score = score.min() else: score /= float(len(sample)) return sample, np.exp(score) ================================================ FILE: emolga/models/variational.py ================================================ __author__ = 'jiataogu' import theano # theano.config.exception_verbosity = 'high' import logging import emolga.basic.objectives as objectives import emolga.basic.optimizers as optimizers from emolga.layers.recurrent import * from emolga.layers.embeddings import * from emolga.models.encdec import RNNLM, Encoder, Decoder from emolga.models.sandbox import SkipDecoder logger = logging RNN = JZS3 # change it here for other RNN models. # Decoder = SkipDecoder class VAE(RNNLM): """ Variational Auto-Encoder: RNN-Variational Encoder/Decoder, in order to model the sentence generation. We implement the original VAE and a better version, IWAE. 
    References:
        Auto-Encoding Variational Bayes
        http://arxiv.org/abs/1312.6114
        Importance Weighted Autoencoders
        http://arxiv.org/abs/1509.00519
    """
    def __init__(self, config, n_rng, rng, mode='Evaluation'):
        # NOTE: deliberately skips RNNLM.__init__ and calls its parent instead
        # (super(RNNLM, ...)), so no encoder/decoder is built here.
        super(RNNLM, self).__init__()

        self.config = config
        self.n_rng = n_rng  # numpy random stream
        self.rng = rng      # Theano random stream
        self.mode = mode
        self.name = 'vae'
        self.tparams= dict()  # tagged parameter groups ('p' / 'q'), see _add_tag

    def _add_tag(self, layer, tag):
        # Register a layer's parameters under a tag ('p' = generative,
        # 'q' = inference), creating the group on first use.
        if tag not in self.tparams:
            self.tparams[tag] = []
        if layer:
            self.tparams[tag] += layer.params

    def build_(self):
        """Create encoder, decoder and the Gaussian (mean/std/transform) layers."""
        logger.info("build the variational auto-encoder")
        self.encoder = Encoder(self.config, self.rng, prefix='enc')
        if self.config['shared_embed']:
            self.decoder = Decoder(self.config, self.rng, prefix='dec', embed=self.encoder.Embed)
        else:
            self.decoder = Decoder(self.config, self.rng, prefix='dec')

        # additional parameters for building Gaussian:
        logger.info("create Gaussian layers.")
        """
        Build the Gaussian distribution.
        """
        self.action_activ = activations.get('tanh')
        # encoder output is doubled when the encoder is bidirectional
        self.context_mean = Dense(
            self.config['enc_hidden_dim'] * 2
            if self.config['bidirectional']
            else self.config['enc_hidden_dim'],
            self.config['action_dim'],
            activation='linear',
            name="weight_mean"
        )
        self.context_std = Dense(
            self.config['enc_hidden_dim'] * 2
            if self.config['bidirectional']
            else self.config['enc_hidden_dim'],
            self.config['action_dim'],
            activation='linear',
            name="weight_std"
        )
        self.context_trans = Dense(
            self.config['action_dim'],
            self.config['dec_contxt_dim'],
            activation='tanh',
            name="transform"
        )
        # registration:
        self._add(self.context_mean)
        self._add(self.context_std)
        self._add(self.context_trans)
        self._add(self.encoder)
        self._add(self.decoder)

        # Q-layers:
        self._add_tag(self.encoder, 'q')
        self._add_tag(self.context_mean, 'q')
        self._add_tag(self.context_std, 'q')

        # P-layers:
        self._add_tag(self.decoder, 'p')
        self._add_tag(self.context_trans, 'p')

        # objectives and optimizers
        self.optimizer = optimizers.get(self.config['optimizer'])
        logger.info("create variational RECURRENT auto-encoder. ok")

    def compile_train(self):
        """
        build the training function here <:::>
        """
        # questions (theano variables)
        inputs = T.imatrix()  # padded input word sequence (for training)

        # encoding. (use backward encoding.)
        encoded = self.encoder.build_encoder(inputs[:, ::-1])

        # gaussian distribution
        mean = self.context_mean(encoded)
        ln_var = self.context_std(encoded)  # predicts log-variance

        # [important] use multiple samples.
        if self.config['repeats'] > 1:
            L = self.config['repeats']
            # repeat mean, ln_var and targets.
            func_r = lambda x: T.extra_ops.repeat(
                x[:, None, :], L, axis=1).reshape((x.shape[0] * L, x.shape[1]))
            mean, ln_var, target \
                = [func_r(x) for x in [mean, ln_var, inputs]]
        else:
            target = inputs

        # reparameterization trick: a = mu + sigma * eps
        action = mean + T.exp(ln_var / 2.) * self.rng.normal(mean.shape)
        context = self.context_trans(action)

        # decoding.
        logPxz, logPPL = self.decoder.build_decoder(target, context)

        # loss function for variational auto-encoding
        # regulation loss + reconstruction loss
        loss_reg = T.mean(objectives.get('GKL')(mean, ln_var))
        loss_rec = T.mean(-logPxz)
        loss_ppl = T.exp(T.mean(-logPPL))

        m_mean = T.mean(abs(mean))
        m_ln_var = T.mean(abs(ln_var))
        L1 = T.sum([T.sum(abs(w)) for w in self.params])

        loss = loss_reg + loss_rec
        updates = self.optimizer.get_updates(self.params, loss)

        logger.info("compiling the compuational graph ::training function::")
        train_inputs = [inputs]
        self.train_ = theano.function(train_inputs,
                                      [loss_reg, loss_rec, L1, m_ln_var],
                                      updates=updates,
                                      name='train_fun')

        # add monitoring:
        self.monitor['action'] = action
        self._monitoring()

        # compiling monitoring
        self.compile_monitoring(train_inputs)
        logger.info("pre-training functions compile done.")

    def compile_sample(self):
        """
        build the sampler function here <:::>
        """
        # context vectors (as)
        self.decoder.build_sampler()

        l = T.iscalar()
        logger.info("compiling the computational graph :: action sampler")
        # draws l actions from the standard-normal prior
        self.action_sampler = theano.function([l], self.rng.normal((l, self.config['action_dim'])))

        action = T.matrix()
        logger.info("compiling the compuational graph ::transform function::")
        self.transform = theano.function([action], self.context_trans(action))
        logger.info("display functions compile done.")

    def compile_inference(self):
        """
        build the hidden action prediction.
        """
        inputs = T.imatrix()  # padded input word sequence (for training)

        # encoding. (use backward encoding.)
        encoded = self.encoder.build_encoder(inputs[:, ::-1])

        # gaussian distribution
        mean = self.context_mean(encoded)
        ln_var = self.context_std(encoded)

        # returns encoding, mean and standard deviation (sqrt(exp(ln_var)))
        self.inference_ = theano.function([inputs], [encoded, mean, T.sqrt(T.exp(ln_var))])
        logger.info("inference function compile done.")

    def default_context(self):
        # Draw one action from the prior and map it to a decoder context.
        return self.transform(self.action_sampler(1))


class Helmholtz(VAE):
    """
    Another alternative I can think about is the Helmholtz Machine
    It is trained using a Reweighted Wake Sleep Algorithm.

    Reference:
        Reweighted Wake-Sleep
        http://arxiv.org/abs/1406.2751
    """
    def __init__(self,
                 config, n_rng, rng,
                 mode='Evaluation',
                 dynamic_prior=False,
                 ):
        # NOTE: super(VAE, ...) skips VAE.__init__ and calls its parent with
        # (config, n_rng, rng) — intentional re-use of the grandparent setup.
        super(VAE, self).__init__(config, n_rng, rng)
        # self.config = config
        # self.n_rng = n_rng  # numpy random stream
        # self.rng = rng  # Theano random stream
        self.mode = mode
        self.name = 'multitask_helmholtz'
        self.tparams = dict()
        self.dynamic_prior = dynamic_prior

    def build_(self):
        """Create encoder/decoder plus the sigmoid prior/posterior layers."""
        logger.info('Build Helmholtz Recurrent Neural Networks')
        self.encoder = Encoder(self.config, self.rng, prefix='enc')
        if self.config['shared_embed']:
            self.decoder = Decoder(self.config, self.rng, prefix='dec',
                                   embed=self.encoder.Embed, highway=self.config['highway'])
        else:
            self.decoder = Decoder(self.config, self.rng, prefix='dec',
                                   highway=self.config['highway'])

        # The main difference between VAE and HM is that we can use
        # a more flexible prior instead of Gaussian here.
        # for example, we use a sigmoid prior here.
""" Build the Sigmoid Layers """ # prior distribution (bias layer) self.Prior = Constant( self.config['action_dim'], self.config['action_dim'], activation='sigmoid', name='prior_proj' ) # Fake Posterior (Q-function) self.Posterior = Dense( self.config['enc_hidden_dim'] * 2 if self.config['bidirectional'] else self.config['enc_hidden_dim'], self.config['action_dim'], activation='sigmoid', name = 'posterior_proj' ) # Action transform to context self.context_trans = Dense( self.config['action_dim'], self.config['dec_contxt_dim'], activation='linear', name="transform" ) # registration: self._add(self.Posterior) self._add(self.Prior) self._add(self.context_trans) self._add(self.encoder) self._add(self.decoder) # Q-layers: self._add_tag(self.encoder, 'q') self._add_tag(self.Posterior, 'q') # P-layers: self._add_tag(self.Prior, 'p') self._add_tag(self.decoder, 'p') self._add_tag(self.context_trans, 'p') # objectives and optimizers self.optimizer_p = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5}) self.optimizer_q = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5}) logger.info("create Helmholtz RECURRENT neural network. ok") def dynamic(self): self.Prior = Dense( self.config['state_dim'], self.config['action_dim'], activation='sigmoid', name='prior_proj' ) self.params = [] self.layers = [] self.tparams= dict() # add layers again! # registration: self._add(self.Posterior) self._add(self.Prior) self._add(self.context_trans) self._add(self.encoder) self._add(self.decoder) # Q-layers: self._add_tag(self.encoder, 'q') self._add_tag(self.Posterior, 'q') # P-layers: self._add_tag(self.Prior, 'p') self._add_tag(self.decoder, 'p') self._add_tag(self.context_trans, 'p') def compile_(self, mode='train', contrastive=False): # compile the computational graph. # INFO: the parameters. # mode: 'train'/ 'display'/ 'policy' / 'all' ps = 'params: {\n' for p in self.params: ps += '{0}: {1}\n'.format(p.name, p.eval().shape) ps += '}.' 
logger.info(ps) param_num = np.sum([np.prod(p.shape.eval()) for p in self.params]) logger.info("total number of the parameters of the model: {}".format(param_num)) if mode == 'train' or mode == 'all': if not contrastive: self.compile_train() else: self.compile_train_CE() if mode == 'display' or mode == 'all': self.compile_sample() if mode == 'inference' or mode == 'all': self.compile_inference() def compile_train(self): """ build the training function here <:::> """ # get input sentence (x) inputs = T.imatrix() # padded input word sequence (for training) batch_size = inputs.shape[0] """ The Computational Flow. """ # encoding. (use backward encoding.) encoded = self.encoder.build_encoder(inputs[:, ::-1]) # get Q(a|y) = sigmoid(.|Posterior * encoded) q_dis = self.Posterior(encoded) # use multiple samples L = T.iscalar('repeats') #self.config['repeats'] def func_r(x): return T.extra_ops.repeat(x[:, None, :], L, axis=1).reshape((-1, x.shape[1])) # ? q_dis, target = [func_r(x) for x in [q_dis, inputs]] # sample actions u = self.rng.uniform(q_dis.shape) action = T.cast(u <= q_dis, dtype=theano.config.floatX) # compute the exact probability for actions logQax = T.sum(action * T.log(q_dis) + (1 - action) * T.log(1 - q_dis), axis=1) # decoding. context = self.context_trans(action) logPxa, count = self.decoder.build_decoder(target, context, return_count=True) logPPL = logPxa / count # logPxa, logPPL = self.decoder.build_decoder(target, context) # prior. 
p_dis = self.Prior(action) logPa = T.sum(action * T.log(p_dis) + (1 - action) * T.log(1 - p_dis), axis=1) """ Compute the weights """ # reshape logQax = logQax.reshape((batch_size, L)) logPa = logPa.reshape((batch_size, L)) logPxa = logPxa.reshape((batch_size, L)) count = count.reshape((batch_size, L))[:, :1] # P(x, a) = P(a) * P(x|a) logPx_a = logPa + logPxa log_wk = logPx_a - logQax log_bpk = logPa - logQax log_w_sum = logSumExp(log_wk, axis=1) log_bp_sum = logSumExp(log_bpk, axis=1) log_wnk = log_wk - log_w_sum log_bpnk = log_bpk - log_bp_sum # unbiased log-likelihood estimator # nll = -T.mean(log_w_sum - T.log(L)) nll = T.mean(-(log_w_sum - T.log(L))) perplexity = T.exp(T.mean(-(log_w_sum - T.log(L)) / count)) # perplexity = T.exp(-T.mean((log_w_sum - T.log(L)) / count)) """ Compute the Loss function """ # loss = weights * log [p(a)p(x|a)/q(a|x)] weights = T.exp(log_wnk) bp = T.exp(log_bpnk) bq = 1. / L ess = T.mean(1 / T.sum(weights ** 2, axis=1)) # monitoring # self.monitor['action'] = action if self.config['variant_control']: lossQ = -T.mean(T.sum(logQax * (weights - bq), axis=1)) # log q(a|x) lossPa = -T.mean(T.sum(logPa * (weights - bp), axis=1)) # log p(a) lossPxa = -T.mean(T.sum(logPxa * weights, axis=1)) # log p(x|a) lossP = lossPxa + lossPa updates_p = self.optimizer_p.get_updates(self.tparams['p'], [lossP, weights, bp]) updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, weights]) else: lossQ = -T.mean(T.sum(logQax * weights, axis=1)) # log q(a|x) lossPa = -T.mean(T.sum(logPa * weights, axis=1)) # log p(a) lossPxa = -T.mean(T.sum(logPxa * weights, axis=1)) # log p(x|a) lossP = lossPxa + lossPa # lossRes = -T.mean(T.nnet.relu(T.sum((logPa + logPxa - logPx0) * weights, axis=1))) # lossP = 0.1 * lossRes + lossP updates_p = self.optimizer_p.get_updates(self.tparams['p'], [lossP, weights]) updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, weights]) updates = updates_p + updates_q logger.info("compiling the compuational 
graph ::training function::") train_inputs = [inputs] + [theano.Param(L, default=10)] self.train_ = theano.function(train_inputs, [lossPa, lossPxa, lossQ, perplexity, nll], updates=updates, name='train_fun') logger.info("compile the computational graph:: >__< :: explore function") self.explore_ = theano.function(train_inputs, [log_wk, count], name='explore_fun') # add monitoring: # self._monitoring() # compiling monitoring # self.compile_monitoring(train_inputs) logger.info("pre-training functions compile done.") def build_dynamics(self, states, action, Y): # this funtion is used to compute probabilities for language generation. # compute the probability of action assert self.dynamic_prior, 'only supports dynamic prior' p_dis = self.Prior(states) logPa = T.sum(action * T.log(p_dis) + (1 - action) * T.log(1 - p_dis), axis=1) context = self.context_trans(action) logPxa, count = self.decoder.build_decoder(Y, context, return_count=True) return logPa, logPxa, count def compile_sample(self): """ build the sampler function here <:::> """ # context vectors (as) self.decoder.build_sampler() logger.info("compiling the computational graph :: action sampler") if self.dynamic_prior: states = T.matrix() p_dis = self.Prior(states) u = self.rng.uniform(p_dis.shape) else: p_dis = self.Prior() l = T.iscalar() u = self.rng.uniform((l, p_dis.shape[-1])) action = T.cast(u <= p_dis, dtype=theano.config.floatX) if self.dynamic_prior: self.action_sampler = theano.function([states], action) else: self.action_sampler = theano.function([l], action) # compute the action probability logPa = T.sum(action * T.log(p_dis) + (1 - action) * T.log(1 - p_dis), axis=1) if self.dynamic_prior: self.action_prob = theano.function([states, action], logPa) else: self.action_prob = theano.function([action], logPa) action = T.matrix() logger.info("compiling the computational graph ::transform function::") self.transform = theano.function([action], self.context_trans(action)) logger.info("display functions 
compile done.") def compile_inference(self): """ build the hidden action prediction. """ inputs = T.imatrix() # padded input word sequence (for training) # encoding. (use backward encoding.) encoded = self.encoder.build_encoder(inputs[:, ::-1]) # get Q(a|y) = sigmoid(.|Posterior * encoded) q_dis = self.Posterior(encoded) p_dis = self.Prior(inputs) self.inference_ = theano.function([inputs], [encoded, q_dis, p_dis]) logger.info("inference function compile done.") def evaluate_(self, inputs): """ build the evaluation function for valid/testing Note that we need multiple sampling for this! """ log_wks = [] count = None N = self.config['eval_N'] L = self.config['eval_repeats'] for _ in xrange(N): log_wk, count = self.explore_(inputs, L) log_wks.append(log_wk) log_wk = np.concatenate(log_wks, axis=1) log_wk_sum = logSumExp(log_wk, axis=1, status='numpy') nll = np.mean(-(log_wk_sum - np.log(N * L))) perplexity = np.exp(np.mean(-(log_wk_sum - np.log(N * L)) / count)) return nll, perplexity """ OLD CODE:: >>> It doesn't work ! """ def compile_train_CE(self): # compile the computation graph (use contrastive noise, for 1 sample here. ) """ build the training function here <:::> """ # get input sentence (x) inputs = T.imatrix() # padded input word sequence x (for training) noises = T.imatrix() # padded noise word sequence y (it stands for another question.) batch_size = inputs.shape[0] """ The Computational Flow. """ # encoding. (use backward encoding.) encodex = self.encoder.build_encoder(inputs[:, ::-1]) encodey = self.encoder.build_encoder(noises[:, ::-1]) # get Q(a|y) = sigmoid(.|Posterior * encoded) q_dis_x = self.Posterior(encodex) q_dis_y = self.Posterior(encodey) # use multiple samples if self.config['repeats'] > 1: L = self.config['repeats'] # repeat mean, ln_var and targets. 
func_r = lambda x: T.extra_ops.repeat( x[:, None, :], L, axis=1).reshape((x.shape[0] * L, x.shape[1])) q_dis_x, q_dis_y, target \ = [func_r(x) for x in [q_dis_x, q_dis_y, inputs]] else: target = inputs L = 1 # sample actions u = self.rng.uniform(q_dis_x.shape) action = T.cast(u <= q_dis_x, dtype=theano.config.floatX) # compute the exact probability for actions (for data distribution) logQax = T.sum(action * T.log(q_dis_x) + (1 - action) * T.log(1 - q_dis_x), axis=1) # compute the exact probability for actions (for noise distribution) logQay = T.sum(action * T.log(q_dis_y) + (1 - action) * T.log(1 - q_dis_y), axis=1) # decoding. context = self.context_trans(action) logPxa, count = self.decoder.build_decoder(target, context, return_count=True) # prior. p_dis = self.Prior(target) logPa = T.sum(action * T.log(p_dis) + (1 - action) * T.log(1 - p_dis), axis=1) """ Compute the weights """ # reshape logQax = logQax.reshape((batch_size, L)) logQay = logQay.reshape((batch_size, L)) logPa = logPa.reshape((batch_size, L)) logPxa = logPxa.reshape((batch_size, L)) # P(x, a) = P(a) * P(x|a) # logPx_a = logPa + logPxa logPx_a = logPa + logPxa # normalizing the weights log_wk = logPx_a - logQax log_bpk = logPa - logQax log_w_sum = logSumExp(log_wk, axis=1) log_bp_sum = logSumExp(log_bpk, axis=1) log_wnk = log_wk - log_w_sum log_bpnk = log_bpk - log_bp_sum # unbiased log-likelihood estimator logPx = T.mean(log_w_sum - T.log(L)) perplexity = T.exp(-T.mean((log_w_sum - T.log(L)) / count)) """ Compute the Loss function """ # loss = weights * log [p(a)p(x|a)/q(a|x)] weights = T.exp(log_wnk) bp = T.exp(log_bpnk) bq = 1. 
/ L ess = T.mean(1 / T.sum(weights ** 2, axis=1)) """ Contrastive Estimation """ # lossQ = -T.mean(T.sum(logQax * (weights - bq), axis=1)) # log q(a|x) logC = logQax - logQay weightC = weights * (1 - T.nnet.sigmoid(logC)) lossQ = -T.mean(T.sum(logC * weightC, axis=1)) # lossQT = -T.mean(T.sum(T.log(T.nnet.sigmoid(logC)) * weights, axis=1)) # monitoring self.monitor['action'] = logC """ Maximum-likelihood Estimation """ lossPa = -T.mean(T.sum(logPa * (weights - bp), axis=1)) # log p(a) lossPxa = -T.mean(T.sum(logPxa * weights, axis=1)) # log p(x|a) lossP = lossPxa + lossPa # loss = lossQT + lossPa + lossPxa updates_p = self.optimizer_p.get_updates(self.tparams['p'], [lossP, weights, bp]) updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, weightC]) updates = updates_p + updates_q logger.info("compiling the compuational graph ::training function::") train_inputs = [inputs, noises] self.train_ce_ = theano.function(train_inputs, [lossPa, lossPxa, lossQ, perplexity, ess], updates=updates, name='train_fun') # add monitoring: self._monitoring() # compiling monitoring self.compile_monitoring(train_inputs) logger.info("pre-training functions compile done.") class HarX(Helmholtz): """ Another alternative I can think about is the Helmholtz Machine It is trained using a Reweighted Wake Sleep Algorithm. Reference: Reweighted Wake-Sleep http://arxiv.org/abs/1406.2751 We extend the original Helmholtz Machine to a recurrent way. 
    """
    def __init__(self,
                 config, n_rng, rng,
                 mode='Evaluation',
                 dynamic_prior=False,
                 ):
        # NOTE(review): super(VAE, self) skips VAE.__init__ — same pattern as
        # Helmholtz.__init__; confirm intended.
        super(VAE, self).__init__(config, n_rng, rng)

        # self.config = config
        # self.n_rng = n_rng  # numpy random stream
        # self.rng = rng      # Theano random stream

        self.mode = mode
        self.name = 'multitask_helmholtz'
        self.tparams = dict()
        self.dynamic_prior = dynamic_prior

    def build_(self):
        """Instantiate encoder/decoder plus per-step sigmoid prior/posterior layers."""
        logger.info('Build Helmholtz Recurrent Neural Networks')

        # backward encoder
        self.encoder = Encoder(self.config, self.rng, prefix='enc')

        # feedforward + hidden content decoder
        self.decoder = Decoder(self.config, self.rng, prefix='dec',
                               embed=self.encoder.Embed if self.config['shared_embed'] else None)

        # The main difference between VAE and HM is that we can use
        # a more flexible prior instead of a Gaussian here:
        # for example, we use a sigmoid prior here.

        """
        Build the Sigmoid Layers
        """
        # prior distribution (conditional distribution on the decoder state)
        self.Prior = Dense(
            self.config['dec_hidden_dim'],
            self.config['action_dim'],
            activation='sigmoid',
            name='prior_proj'
        )

        # Fake Posterior (Q-function)
        if self.config['decposterior']:
            # conditions on both the encoder state and the decoder state.
            self.Posterior = Dense2(
                self.config['enc_hidden_dim']
                if not self.config['bidirectional']
                else 2 * self.config['enc_hidden_dim'],
                self.config['dec_hidden_dim'],
                self.config['action_dim'],
                activation='sigmoid',
                name='posterior_proj'
            )
        else:
            self.Posterior = Dense(
                self.config['enc_hidden_dim']
                if not self.config['bidirectional']
                else 2 * self.config['enc_hidden_dim'],
                self.config['action_dim'],
                activation='sigmoid',
                name='posterior_proj'
            )

        # Action transform to context
        self.context_trans = Dense(
            self.config['action_dim'],
            self.config['dec_contxt_dim'],
            activation='linear',
            name="transform"
        )

        # registration:
        self._add(self.Posterior)
        self._add(self.Prior)
        self._add(self.context_trans)
        self._add(self.encoder)
        self._add(self.decoder)

        # Q-layers:
        self._add_tag(self.encoder, 'q')
        self._add_tag(self.Posterior, 'q')

        # P-layers:
        self._add_tag(self.Prior, 'p')
        self._add_tag(self.decoder, 'p')
        self._add_tag(self.context_trans, 'p')

        # objectives and optimizers
        self.optimizer_p = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5})
        self.optimizer_q = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5})

        logger.info("create Helmholtz RECURRENT neural network. ok")

    def compile_(self, mode='train', contrastive=False):
        # compile the computational graph.
        # INFO: the parameters.
        # mode: 'train' / 'display' / 'policy' / 'all'
        ps = 'params: {\n'
        for p in self.params:
            ps += '{0}: {1}\n'.format(p.name, p.eval().shape)
        ps += '}.'
        logger.info(ps)

        param_num = np.sum([np.prod(p.shape.eval()) for p in self.params])
        logger.info("total number of the parameters of the model: {}".format(param_num))

        if mode == 'train' or mode == 'all':
            self.compile_train()

        if mode == 'display' or mode == 'all':
            self.compile_sample()

        if mode == 'inference' or mode == 'all':
            self.compile_inference()

    """
    Training
    """
    def compile_train(self):
        """
        Build the training function here <:::>

        Recurrent reweighted wake-sleep: one binary latent vector is sampled at
        every decoding step inside a single theano.scan.
        """
        # get input sentence (x)
        inputs = T.imatrix()  # padded input word sequence (for training)
        batch_size = inputs.shape[0]

        logger.info(
            """
            The Computational Flow. ---> In a recurrent fashion [= v =] <:::
            Inference-Generation in one scan
            >>>> Encoding without hidden variable. (use backward encoding.)
            """
        )
        embeded, mask \
            = self.decoder.Embed(inputs, True)  # (nb_samples, max_len, embedding_dim)
        encoded = self.encoder.build_encoder(inputs[:, ::-1], return_sequence=True)[:, ::-1, :]
        count = T.cast(T.sum(mask, axis=1), dtype=theano.config.floatX)[:, None]  # (nb_samples,)

        logger.info(
            """
            >>>> Repeat
            """
        )
        L = T.iscalar('repeats')  # self.config['repeats']

        def _repeat(x, dimshuffle=True):
            # Tile each sample L times; optionally move time to the leading axis
            # (scan iterates over axis 0).
            if x.ndim == 3:
                y = T.extra_ops.repeat(x[:, None, :, :], L, axis=1).reshape((-1, x.shape[1], x.shape[2]))
                if dimshuffle:
                    y = y.dimshuffle(1, 0, 2)
            else:
                y = T.extra_ops.repeat(x[:, None, :], L, axis=1).reshape((-1, x.shape[1]))
                if dimshuffle:
                    y = y.dimshuffle(1, 0)
            return y

        embeded = _repeat(embeded)       # (max_len, nb_samples * L, embedding_dim)
        encoded = _repeat(encoded)       # (max_len, nb_samples * L, enc_hidden_dim)
        target = _repeat(inputs, False)  # (nb_samples * L, max_len)
        mask = _repeat(mask, False)      # (nb_samples * L, max_len)

        init_dec = T.zeros((encoded.shape[1], self.config['dec_hidden_dim']),
                           dtype='float32')  # zero initialization
        uniform = self.rng.uniform((embeded.shape[0],
                                    embeded.shape[1],
                                    self.config['action_dim']))  # uniform distribution pre-sampled.

        logger.info(
            """
            >>>> Recurrence
            """
        )

        def _recurrence(embed_t, enc_t, u_t, dec_tm1):
            """
            One decoding step.

            x_t:   (nb_samples, dec_embedd_dim)
            enc_t: (nb_samples, enc_hidden_dim)
            dec_t: (nb_samples, dec_hidden_dim)
            """
            # get q(z_t|dec_t, enc_t); sample z_t; compute the Posterior (inference) prob.
            if self.config['decposterior']:
                q_dis_t = self.Posterior(enc_t, dec_tm1)
            else:
                q_dis_t = self.Posterior(enc_t)
            z_t = T.cast(u_t <= q_dis_t, dtype='float32')
            log_qzx_t = T.sum(z_t * T.log(q_dis_t) + (1 - z_t) * T.log(1 - q_dis_t),
                              axis=1)  # (nb_samples * L, )

            # compute the prior probability
            p_dis_t = self.Prior(dec_tm1)
            log_pz0_t = T.sum(z_t * T.log(p_dis_t) + (1 - z_t) * T.log(1 - p_dis_t), axis=1)

            # compute the decoding probability
            context_t = self.context_trans(z_t)
            readout_t = self.decoder.hidden_readout(dec_tm1) + self.decoder.context_readout(context_t)
            for l in self.decoder.output_nonlinear:
                readout_t = l(readout_t)
            pxz_dis_t = self.decoder.output(readout_t)

            # compute recurrence
            dec_t = self.decoder.RNN(embed_t, C=context_t, init_h=dec_tm1, one_step=True)
            return dec_t, z_t, log_qzx_t, log_pz0_t, pxz_dis_t

        # (max_len, nb_samples, ?)
        outputs, _ = theano.scan(
            _recurrence,
            sequences=[embeded, encoded, uniform],
            outputs_info=[init_dec, None, None, None, None])

        _, z, log_qzx, log_pz0, pxz_dis = outputs

        # summary of scan / dimshuffle / reshape
        def _grab_prob(probs, x):
            # Pick the probability assigned to each target word. # advanced indexing
            assert probs.ndim == 3

            b_size = probs.shape[0]
            max_len = probs.shape[1]
            vocab_size = probs.shape[2]

            probs = probs.reshape((b_size * max_len, vocab_size))
            return probs[T.arange(b_size * max_len), x.flatten(1)].reshape(x.shape)

        log_qzx = T.sum(log_qzx.dimshuffle(1, 0) * mask, axis=-1).reshape((batch_size, L))
        log_pz0 = T.sum(log_pz0.dimshuffle(1, 0) * mask, axis=-1).reshape((batch_size, L))
        log_pxz = T.sum(T.log(_grab_prob(pxz_dis.dimshuffle(1, 0, 2), target)) * mask,
                        axis=-1).reshape((batch_size, L))

        logger.info(
            """
            >>>> Compute the weights [+ _ =]
            """
        )
        log_pxnz = log_pz0 + log_pxz  # log p(X, Z)
        log_wk = log_pxnz - log_qzx   # log[p(X, Z)/q(Z|X)]
        log_bpk = log_pz0 - log_qzx   # log[p(Z)/q(Z|X)]

        log_w_sum = logSumExp(log_wk, axis=1)
        log_bp_sum = logSumExp(log_bpk, axis=1)

        log_wnk = log_wk - log_w_sum
        log_bpnk = log_bpk - log_bp_sum

        # unbiased log-likelihood estimator [+ _ =]
        # Finally come to this place
        nll = T.mean(-(log_w_sum - T.log(L)))
        perplexity = T.exp(T.mean(-(log_w_sum - T.log(L)) / count))
        # perplexity = T.exp(-T.mean((log_w_sum - T.log(L)) / count))

        logger.info(
            """
            >>>> Compute the gradients [+ _ =]
            """
        )
        # loss = weights * log [p(a)p(x|a)/q(a|x)]
        weights = T.exp(log_wnk)
        bp = T.exp(log_bpnk)
        bq = 1. / L
        ess = T.mean(1 / T.sum(weights ** 2, axis=1))

        # monitoring
        self.monitor['hidden state'] = z

        if self.config['variant_control']:
            lossQ = -T.mean(T.sum(log_qzx * (weights - bq), axis=1))   # log q(z|x)
            lossPa = -T.mean(T.sum(log_pz0 * (weights - bp), axis=1))  # log p(z)
            lossPxa = -T.mean(T.sum(log_pxz * weights, axis=1))        # log p(x|z)
            lossP = lossPxa + lossPa

            # L2 regu
            lossP += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['p']])
            lossQ += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['q']])

            updates_p = self.optimizer_p.get_updates(self.tparams['p'], [lossP, weights, bp])
            updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, weights])
        else:
            lossQ = -T.mean(T.sum(log_qzx * weights, axis=1))    # log q(a|x)
            lossPa = -T.mean(T.sum(log_pz0 * weights, axis=1))   # log p(a)
            lossPxa = -T.mean(T.sum(log_pxz * weights, axis=1))  # log p(x|a)
            lossP = lossPxa + lossPa

            # L2 regu
            # NOTE(review): leftover debug print (Python 2 statement).
            print 'L2 ?'
            lossP += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['p']])
            lossQ += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['q']])

            updates_p = self.optimizer_p.get_updates(self.tparams['p'], [lossP, weights])
            updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, weights])

        updates = updates_p + updates_q

        logger.info("compiling the compuational graph:: >__< ::training function::")
        train_inputs = [inputs] + [theano.Param(L, default=10)]

        self.train_ = theano.function(train_inputs,
                                      [lossPa, lossPxa, lossQ, perplexity, nll],
                                      updates=updates,
                                      name='train_fun')

        logger.info("compile the computational graph:: >__< :: explore function")
        self.explore_ = theano.function(train_inputs, [log_wk, count], name='explore_fun')

        # add monitoring:
        self._monitoring()

        # compiling monitoring
        self.compile_monitoring(train_inputs)
        logger.info("pre-training functions compile done.")

    def generate_(self, context=None, max_len=None, mode='display'):
        # overwrite the RNNLM generator as there are hidden variables every time step
        # NOTE(review): `args` is built but never used — the body appears unfinished.
        args = dict(k=self.config['sample_beam'],
                    maxlen=self.config['max_len'] if not max_len else max_len,
                    stochastic=self.config['sample_stoch'] if mode == 'display' else None,
                    argmax=self.config['sample_argmax'] if mode == 'display' else None)


class THarX(Helmholtz):
    """
    Another alternative I can think about is the Helmholtz Machine.
    It is trained using a Reweighted Wake-Sleep Algorithm.
    Reference:
        Reweighted Wake-Sleep
        http://arxiv.org/abs/1406.2751
    We extend the original Helmholtz Machine to a recurrent way.
    """
    def __init__(self,
                 config, n_rng, rng,
                 mode='Evaluation',
                 dynamic_prior=False,
                 ):
        # NOTE(review): super(VAE, self) skips VAE.__init__ — same pattern as
        # Helmholtz.__init__; confirm intended.
        super(VAE, self).__init__(config, n_rng, rng)

        # self.config = config
        # self.n_rng = n_rng  # numpy random stream
        # self.rng = rng      # Theano random stream

        self.mode = mode
        self.name = 'multitask_helmholtz'
        self.tparams = dict()
        self.dynamic_prior = dynamic_prior

    def build_(self):
        """Instantiate encoder/decoder plus categorical (softmax) prior/posterior layers."""
        logger.info('Build Helmholtz Recurrent Neural Networks')

        # backward encoder
        self.encoder = Encoder(self.config, self.rng, prefix='enc')

        # feedforward + hidden content decoder
        self.decoder = Decoder(self.config, self.rng, prefix='dec',
                               embed=self.encoder.Embed if self.config['shared_embed'] else None)

        # The main difference between VAE and HM is that we can use
        # a more flexible prior instead of a Gaussian here:
        # here the latent is categorical (softmax) rather than Bernoulli.

        """
        Build the Sigmoid Layers
        """
        # prior distribution (conditional distribution on the decoder state)
        self.Prior = Dense(
            self.config['dec_hidden_dim'],
            self.config['action_dim'],
            activation='softmax',
            name='prior_proj'
        )

        # Fake Posterior (Q-function)
        if self.config['decposterior']:
            # conditions on both the encoder state and the decoder state.
            self.Posterior = Dense2(
                self.config['enc_hidden_dim']
                if not self.config['bidirectional']
                else 2 * self.config['enc_hidden_dim'],
                self.config['dec_hidden_dim'],
                self.config['action_dim'],
                activation='softmax',
                name='posterior_proj'
            )
        else:
            self.Posterior = Dense(
                self.config['enc_hidden_dim']
                if not self.config['bidirectional']
                else 2 * self.config['enc_hidden_dim'],
                self.config['action_dim'],
                activation='softmax',
                name='posterior_proj'
            )

        # Action transform to context
        self.context_trans = Dense(
            self.config['action_dim'],
            self.config['dec_contxt_dim'],
            activation='linear',
            name="transform"
        )

        # registration:
        self._add(self.Posterior)
        self._add(self.Prior)
        self._add(self.context_trans)
        self._add(self.encoder)
        self._add(self.decoder)

        # Q-layers:
        self._add_tag(self.encoder, 'q')
        self._add_tag(self.Posterior, 'q')

        # P-layers:
        self._add_tag(self.Prior, 'p')
        self._add_tag(self.decoder, 'p')
        self._add_tag(self.context_trans, 'p')

        # objectives and optimizers
        self.optimizer_p = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5})
        self.optimizer_q = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5})

        logger.info("create Helmholtz RECURRENT neural network. ok")

    def compile_(self, mode='train', contrastive=False):
        # compile the computational graph.
        # INFO: the parameters.
        # mode: 'train' / 'display' / 'policy' / 'all'
        ps = 'params: {\n'
        for p in self.params:
            ps += '{0}: {1}\n'.format(p.name, p.eval().shape)
        ps += '}.'
        logger.info(ps)

        param_num = np.sum([np.prod(p.shape.eval()) for p in self.params])
        logger.info("total number of the parameters of the model: {}".format(param_num))

        if mode == 'train' or mode == 'all':
            self.compile_train()

        if mode == 'display' or mode == 'all':
            self.compile_sample()

        if mode == 'inference' or mode == 'all':
            self.compile_inference()

    """
    Training
    """
    def compile_train(self):
        """
        Build the training function here <:::>

        Categorical variant of HarX: a one-hot latent is drawn at each step
        with the RNG's multinomial op, so the scan's random-state updates must
        be threaded into the compiled functions.
        """
        # get input sentence (x)
        inputs = T.imatrix('inputs')  # padded input word sequence (for training)
        batch_size = inputs.shape[0]

        logger.info(
            """
            The Computational Flow. ---> In a recurrent fashion [= v =] <:::
            Inference-Generation in one scan
            >>>> Encoding without hidden variable. (use backward encoding.)
            """
        )
        embeded, mask \
            = self.decoder.Embed(inputs, True)  # (nb_samples, max_len, embedding_dim)
        encoded = self.encoder.build_encoder(inputs[:, ::-1], return_sequence=True)[:, ::-1, :]
        count = T.cast(T.sum(mask, axis=1), dtype=theano.config.floatX)[:, None]  # (nb_samples,)

        logger.info(
            """
            >>>> Repeat
            """
        )
        L = T.iscalar('repeats')  # self.config['repeats']

        def _repeat(x, dimshuffle=True):
            # Tile each sample L times; optionally move time to the leading axis.
            if x.ndim == 3:
                y = T.extra_ops.repeat(x[:, None, :, :], L, axis=1).reshape((-1, x.shape[1], x.shape[2]))
                if dimshuffle:
                    y = y.dimshuffle(1, 0, 2)
            else:
                y = T.extra_ops.repeat(x[:, None, :], L, axis=1).reshape((-1, x.shape[1]))
                if dimshuffle:
                    y = y.dimshuffle(1, 0)
            return y

        embeded = _repeat(embeded)       # (max_len, nb_samples * L, embedding_dim)
        encoded = _repeat(encoded)       # (max_len, nb_samples * L, enc_hidden_dim)
        target = _repeat(inputs, False)  # (nb_samples * L, max_len)
        mask = _repeat(mask, False)      # (nb_samples * L, max_len)

        init_dec = T.zeros((encoded.shape[1], self.config['dec_hidden_dim']),
                           dtype='float32')  # zero initialization
        # uniform = self.rng.uniform((embeded.shape[0],
        #                             embeded.shape[1],
        #                             self.config['action_dim']))  # uniform distribution pre-sampled.

        logger.info(
            """
            >>>> Recurrence
            """
        )

        def _recurrence(embed_t, enc_t, dec_tm1):
            """
            One decoding step.

            x_t:   (nb_samples, dec_embedd_dim)
            enc_t: (nb_samples, enc_hidden_dim)
            dec_t: (nb_samples, dec_hidden_dim)
            """
            # get q(z_t|dec_t, enc_t); sample z_t; compute the Posterior (inference) prob.
            if self.config['decposterior']:
                q_dis_t = self.Posterior(enc_t, dec_tm1)
            else:
                q_dis_t = self.Posterior(enc_t)
            # one-hot categorical sample from the posterior.
            z_t = self.rng.multinomial(pvals=q_dis_t, dtype='float32')
            log_qzx_t = T.sum(T.log(q_dis_t) * z_t, axis=1)
            # log_qzx_t = T.log(q_dis_t[T.arange(q_dis_t.shape[0]), z_t])
            # z_t = T.cast(u_t <= q_dis_t, dtype='float32')
            # log_qzx_t = T.sum(z_t * T.log(q_dis_t) + (1 - z_t) * T.log(1 - q_dis_t), axis=1)  # (nb_samples * L, )

            # compute the prior probability
            p_dis_t = self.Prior(dec_tm1)
            log_pz0_t = T.sum(T.log(p_dis_t) * z_t, axis=1)
            # log_pz0_t = T.log(p_dis_t[T.arange(p_dis_t.shape[0]), z_t])
            # log_pz0_t = T.sum(z_t * T.log(p_dis_t) + (1 - z_t) * T.log(1 - p_dis_t), axis=1)

            # compute the decoding probability
            context_t = self.context_trans(z_t)
            readout_t = self.decoder.hidden_readout(dec_tm1) + self.decoder.context_readout(context_t)
            for l in self.decoder.output_nonlinear:
                readout_t = l(readout_t)
            pxz_dis_t = self.decoder.output(readout_t)

            # compute recurrence
            dec_t = self.decoder.RNN(embed_t, C=context_t, init_h=dec_tm1, one_step=True)
            return dec_t, z_t, log_qzx_t, log_pz0_t, pxz_dis_t

        # (max_len, nb_samples, ?)
        # scan_update carries the RNG state updates from the multinomial op.
        outputs, scan_update = theano.scan(
            _recurrence,
            sequences=[embeded, encoded],
            outputs_info=[init_dec, None, None, None, None])

        _, z, log_qzx, log_pz0, pxz_dis = outputs

        # summary of scan / dimshuffle / reshape
        def _grab_prob(probs, x):
            # Pick the probability assigned to each target word. # advanced indexing
            assert probs.ndim == 3

            b_size = probs.shape[0]
            max_len = probs.shape[1]
            vocab_size = probs.shape[2]

            probs = probs.reshape((b_size * max_len, vocab_size))
            return probs[T.arange(b_size * max_len), x.flatten(1)].reshape(x.shape)

        log_qzx = T.sum(log_qzx.dimshuffle(1, 0) * mask, axis=-1).reshape((batch_size, L))
        log_pz0 = T.sum(log_pz0.dimshuffle(1, 0) * mask, axis=-1).reshape((batch_size, L))
        log_pxz = T.sum(T.log(_grab_prob(pxz_dis.dimshuffle(1, 0, 2), target)) * mask,
                        axis=-1).reshape((batch_size, L))

        logger.info(
            """
            >>>> Compute the weights [+ _ =]
            """
        )
        log_pxnz = log_pz0 + log_pxz  # log p(X, Z)
        log_wk = log_pxnz - log_qzx   # log[p(X, Z)/q(Z|X)]
        log_bpk = log_pz0 - log_qzx   # log[p(Z)/q(Z|X)]

        log_w_sum = logSumExp(log_wk, axis=1)
        log_bp_sum = logSumExp(log_bpk, axis=1)

        log_wnk = log_wk - log_w_sum
        log_bpnk = log_bpk - log_bp_sum

        # unbiased log-likelihood estimator [+ _ =]
        # Finally come to this place
        nll = T.mean(-(log_w_sum - T.log(L)))
        perplexity = T.exp(T.mean(-(log_w_sum - T.log(L)) / count))
        # perplexity = T.exp(-T.mean((log_w_sum - T.log(L)) / count))

        logger.info(
            """
            >>>> Compute the gradients [+ _ =]
            """
        )
        # loss = weights * log [p(a)p(x|a)/q(a|x)]
        weights = T.exp(log_wnk)
        bp = T.exp(log_bpnk)
        bq = 1. / L
        ess = T.mean(1 / T.sum(weights ** 2, axis=1))

        # monitoring
        self.monitor['hidden state'] = z

        if self.config['variant_control']:
            lossQ = -T.mean(T.sum(log_qzx * (weights - bq), axis=1))   # log q(z|x)
            lossPa = -T.mean(T.sum(log_pz0 * (weights - bp), axis=1))  # log p(z)
            lossPxa = -T.mean(T.sum(log_pxz * weights, axis=1))        # log p(x|z)
            lossP = lossPxa + lossPa

            # L2 regu
            lossP += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['p']])
            lossQ += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['q']])

            updates_p = self.optimizer_p.get_updates(self.tparams['p'], [lossP, weights, bp])
            updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, weights])
        else:
            lossQ = -T.mean(T.sum(log_qzx * weights, axis=1))    # log q(a|x)
            lossPa = -T.mean(T.sum(log_pz0 * weights, axis=1))   # log p(a)
            lossPxa = -T.mean(T.sum(log_pxz * weights, axis=1))  # log p(x|a)
            lossP = lossPxa + lossPa

            # L2 regu
            # NOTE(review): leftover debug print (Python 2 statement).
            print 'L2 ?'
            lossP += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['p']])
            lossQ += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['q']])

            updates_p = self.optimizer_p.get_updates(self.tparams['p'], [lossP, weights])
            updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, weights])

        # include the RNG updates from scan so sampling state advances.
        updates = updates_p + updates_q + scan_update

        logger.info("compiling the compuational graph:: >__< ::training function::")
        train_inputs = [inputs] + [theano.Param(L, default=10)]

        self.train_ = theano.function(train_inputs,
                                      [lossPa, lossPxa, lossQ, perplexity, nll],
                                      updates=updates,
                                      name='train_fun')

        logger.info("compile the computational graph:: >__< :: explore function")
        self.explore_ = theano.function(train_inputs, [log_wk, count],
                                        updates=scan_update,
                                        name='explore_fun')

        # add monitoring:
        self._monitoring()

        # compiling monitoring
        self.compile_monitoring(train_inputs, updates=scan_update)
        logger.info("pre-training functions compile done.")

    def generate_(self, context=None, max_len=None, mode='display'):
        # overwrite the RNNLM generator as there are hidden variables every time step
        args = 
dict(k=self.config['sample_beam'], maxlen=self.config['max_len'] if not max_len else max_len, stochastic=self.config['sample_stoch'] if mode == 'display' else None, argmax=self.config['sample_argmax'] if mode == 'display' else None) class NVTM(Helmholtz): """ Neural Variational Topical Models We use the Neural Variational Inference and Learning (NVIL) to build the learning, instead of using Helmholtz Machine(Reweighted Wake-sleep) """ def __init__(self, config, n_rng, rng, mode = 'Evaluation', dynamic_prior=False, ): super(VAE, self).__init__(config, n_rng, rng) self.mode = mode self.name = 'neural_variational' self.tparams = dict() self.dynamic_prior = dynamic_prior def build_(self): logger.info('Build Helmholtz Recurrent Neural Networks') # backward encoder self.encoder = Encoder(self.config, self.rng, prefix='enc') # feedforward + hidden content decoder self.decoder = Decoder(self.config, self.rng, prefix='dec', embed=self.encoder.Embed if self.config['shared_embed'] else None) # The main difference between VAE and NVIL is that we can use # a more flexible prior instead of Gaussian here. # for example, we use a softmax prior here. """ Build the Prior Layer (Conditional Prior) """ # prior distribution (conditional distribution) self.Prior = Dense( self.config['dec_hidden_dim'], self.config['action_dim'], activation='softmax', name='prior_proj' ) if self.config['decposterior']: # we use both enc/dec net as input. 
# Variational Posterior (Q-function) self.Posterior = Dense2( self.config['enc_hidden_dim'] if not self.config['bidirectional'] else 2 * self.config['enc_hidden_dim'], self.config['dec_hidden_dim'], self.config['action_dim'], activation='softmax', name='posterior_proj' ) # Baseline Estimator self.C_lambda1 = Dense2( self.config['enc_hidden_dim'] if not self.config['bidirectional'] else 2 * self.config['enc_hidden_dim'], self.config['dec_hidden_dim'], 100, activation='tanh', name='baseline-1') self.C_lambda2 = Dense(100, 1, activation='linear', name='baseline-2') else: # Variational Posterior self.Posterior = Dense( self.config['enc_hidden_dim'] if not self.config['bidirectional'] else 2 * self.config['enc_hidden_dim'], self.config['action_dim'], activation='softmax', name='posterior_proj' ) # Baseline Estimator self.C_lambda1 = Dense( self.config['enc_hidden_dim'] if not self.config['bidirectional'] else 2 * self.config['enc_hidden_dim'], 100, activation='tanh', name='baseline-1') self.C_lambda2 = Dense(100, 1, activation='linear', name='baseline-2') # Action transform to context self.context_trans = Dense( self.config['action_dim'], self.config['dec_contxt_dim'], activation='linear', name="transform" ) # registration: self._add(self.Posterior) self._add(self.Prior) self._add(self.context_trans) self._add(self.C_lambda1) self._add(self.C_lambda2) self._add(self.encoder) self._add(self.decoder) # Q-layers: self._add_tag(self.encoder, 'q') self._add_tag(self.Posterior, 'q') # P-layers: self._add_tag(self.Prior, 'p') self._add_tag(self.decoder, 'p') self._add_tag(self.context_trans, 'p') # Lambda-layers self._add_tag(self.C_lambda1, 'l') self._add_tag(self.C_lambda2, 'l') # c/v self.c = shared_scalar(0., dtype='float32') self.v = shared_scalar(1., dtype='float32') # objectives and optimizers self.optimizer_p = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5}) self.optimizer_q = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5}) 
        self.optimizer_l = optimizers.get(self.config['optimizer'], kwargs={'clipnorm': 5})

        logger.info("create Neural Variational Topic Network. ok")

    def compile_(self, mode='train', contrastive=False):
        # compile the computational graph.
        # INFO: the parameters.
        # mode: 'train'/ 'display'/ 'policy' / 'all'
        # (`contrastive` is accepted but not used in this method.)
        ps = 'params: {\n'
        for p in self.params:
            ps += '{0}: {1}\n'.format(p.name, p.eval().shape)
        ps += '}.'
        logger.info(ps)

        param_num = np.sum([np.prod(p.shape.eval()) for p in self.params])
        logger.info("total number of the parameters of the model: {}".format(param_num))

        if mode == 'train' or mode == 'all':
            self.compile_train()

        if mode == 'display' or mode == 'all':
            self.compile_sample()

        if mode == 'inference' or mode == 'all':
            self.compile_inference()

    """
    Training
    """

    def compile_train(self):
        """
        build the training function here <:::>
        """
        # get input sentence (x)
        inputs = T.imatrix('inputs')  # padded input word sequence (for training)
        batch_size = inputs.shape[0]

        logger.info(
            """
            The Computational Flow. ---> In a recurrent fashion [= v =]
            <::: Inference-Generation in one scan >>>>
            Encoding without hidden variable. (use backward encoding.)
            """
        )
        embeded, mask \
            = self.decoder.Embed(inputs, True)  # (nb_samples, max_len, embedding_dim)
        mask = T.cast(mask, dtype='float32')
        # encode the reversed sequence, then flip back so position t summarizes the suffix
        encoded = self.encoder.build_encoder(inputs[:, ::-1], return_sequence=True)[:, ::-1, :]
        L = T.iscalar('repeats')  # self.config['repeats']

        def _repeat(x, dimshuffle=True):
            # Tile every sample L times (multi-sample estimate); optionally move
            # the time axis to the front as expected by theano.scan.
            if x.ndim == 3:
                y = T.extra_ops.repeat(x[:, None, :, :], L, axis=1).reshape((-1, x.shape[1], x.shape[2]))
                if dimshuffle:
                    y = y.dimshuffle(1, 0, 2)
            else:
                y = T.extra_ops.repeat(x[:, None, :], L, axis=1).reshape((-1, x.shape[1]))
                if dimshuffle:
                    y = y.dimshuffle(1, 0)
            return y

        embeded = _repeat(embeded)       # (max_len, nb_samples * L, embedding_dim)
        encoded = _repeat(encoded)       # (max_len, nb_samples * L, enc_hidden_dim)
        target = _repeat(inputs, False)  # (nb_samples * L, max_len)
        mask = _repeat(mask, False)
        count = T.cast(T.sum(mask, axis=1), dtype=theano.config.floatX)[:, None]  # (nb_samples,)
        init_dec = T.zeros((encoded.shape[1], self.config['dec_hidden_dim']), dtype='float32')  # zero initialization

        logger.info(
            """
            >>>> Recurrence
            """
        )

        def _recurrence(embed_t, enc_t, dec_tm1):
            """
            x_t: (nb_samples, dec_embedd_dim)
            enc_t: (nb_samples, enc_hidden_dim)
            dec_t: (nb_samples, dec_hidden_dim)
            """
            # get q(z_t|dec_t, enc_t); sample z_t;
            # compute the Posterior (inference) prob.
            # compute the baseline estimator
            if self.config['decposterior']:
                q_dis_t = self.Posterior(enc_t, dec_tm1)
                c_lmd_t = self.C_lambda2(self.C_lambda1(enc_t, dec_tm1)).flatten(1)
            else:
                q_dis_t = self.Posterior(enc_t)
                c_lmd_t = self.C_lambda2(self.C_lambda1(enc_t)).flatten(1)

            # sampling: one-hot draw of the discrete latent action
            z_t = self.rng.multinomial(pvals=q_dis_t, dtype='float32')
            log_qzx_t = T.sum(T.log(q_dis_t) * z_t, axis=1)  # log q(z_t | x)

            # compute the prior probability  (log p(z_t | dec_{t-1}))
            p_dis_t = self.Prior(dec_tm1)
            log_pz0_t = T.sum(T.log(p_dis_t) * z_t, axis=1)

            # compute the decoding probability
            context_t = self.context_trans(z_t)
            readout_t = self.decoder.hidden_readout(dec_tm1) + self.decoder.context_readout(context_t)
            for l in self.decoder.output_nonlinear:
                readout_t = l(readout_t)
            pxz_dis_t = self.decoder.output(readout_t)

            # compute recurrence
            dec_t = self.decoder.RNN(embed_t, C=context_t, init_h=dec_tm1, one_step=True)
            return dec_t, z_t, log_qzx_t, log_pz0_t, pxz_dis_t, c_lmd_t

        # (max_len, nb_samples, ?)
        outputs, scan_update = theano.scan(
            _recurrence,
            sequences=[embeded, encoded],
            outputs_info=[init_dec, None, None, None, None, None])
        _, z, log_qzx, log_pz0, pxz_dis, c_lmd = outputs

        # summary of scan/ dimshuffle/ reshape
        def _grab_prob(probs, x):
            # pick probs[..., x] per position: probability of each target word
            assert probs.ndim == 3

            b_size = probs.shape[0]
            max_len = probs.shape[1]
            vocab_size = probs.shape[2]

            probs = probs.reshape((b_size * max_len, vocab_size))
            return probs[T.arange(b_size * max_len), x.flatten(1)].reshape(x.shape)  # advanced indexing

        logger.info(
            """
            >>>> Compute the weights [+ _ =]
            """
        )
        # log Q/P and C  (all masked; shape (nb_samples, max_len) after dimshuffle)
        log_qzx = log_qzx.dimshuffle(1, 0) * mask
        log_pz0 = log_pz0.dimshuffle(1, 0) * mask
        log_pxz = T.log(_grab_prob(pxz_dis.dimshuffle(1, 0, 2), target)) * mask
        c_lambda = c_lmd.dimshuffle(1, 0) * mask
        Lb = T.sum(log_pz0 + log_pxz - log_qzx, axis=-1)  # lower bound
        l_lambda = log_pz0 + log_pxz - log_qzx - c_lambda  # learning signal minus baseline

        # NOTE(review): alpha == 0, so c/v effectively track only the current batch.
        alpha = T.cast(0.0, dtype='float32')
        numel = T.sum(mask)
        cb = T.sum(l_lambda) / numel
        vb = T.sum(l_lambda ** 2) / T.sum(mask) - cb ** 2
        c = self.c * alpha + (1 - alpha) * cb  # T.cast(cb, dtype='float32')
        v = self.v * alpha + (1 - alpha) * vb  # T.cast(vb, dtype='float32')
        # normalized learning signal (variance-clipped at 1)
        l_normal = (l_lambda - c) / T.max((1., T.sqrt(v))) * mask
        l_base = T.mean(T.sum(l_normal, axis=1))

        nll = T.mean(-Lb)  # variational lower-bound
        perplexity = T.exp(T.mean(-Lb[:, None] / count))  # perplexity of lower-bound

        logger.info(
            """
            >>>> Compute the gradients [+ _ =]
            """
        )
        # monitoring
        self.monitor['hidden state'] = z

        lossP = -T.mean(T.sum(log_pxz + log_pz0, axis=1))
        lossQ = -T.mean(T.sum(log_qzx * l_normal, axis=1))
        lossL = -T.mean(T.sum(c_lambda * l_normal, axis=1))  # ||L - c - c_lambda||2-> 0

        # lossP = -T.sum(log_pxz + log_pz0) / numel
        # lossQ = -T.sum(log_qzx * l_normal) / numel
        # lossL = -T.sum(c_lambda * l_normal) / numel  # ||L - c - c_lambda||2-> 0
        #
        # # L2 regu
        # print 'L2 ?'
        # lossP += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['p']])
        # lossQ += 0.0001 * T.sum([T.sum(p**2) for p in self.tparams['q']])

        updates_p = self.optimizer_p.get_updates(self.tparams['p'], lossP)
        updates_q = self.optimizer_q.get_updates(self.tparams['q'], [lossQ, l_normal])
        updates_l = self.optimizer_l.get_updates(self.tparams['l'], [lossL, l_normal])
        updates = updates_p + updates_q + updates_l + scan_update
        updates += [(self.c, c), (self.v, v)]

        logger.info("compiling the compuational graph:: >__< ::training function::")
        train_inputs = [inputs] + [theano.Param(L, default=1)]

        self.train_ = theano.function(train_inputs,
                                      [lossL, lossP, lossQ, perplexity, nll, l_base],
                                      updates=updates,
                                      name='train_fun')

        logger.info("compile the computational graph:: >__< :: explore function")
        # same outputs, but only scan/RNG updates -- parameters are not modified
        self.explore_ = theano.function(train_inputs,
                                        [lossL, lossP, lossQ, perplexity, nll, l_base],
                                        updates=scan_update,
                                        name='explore_fun')

        # add monitoring:
        self._monitoring()

        # compiling monitoring
        self.compile_monitoring(train_inputs, updates=scan_update)
        logger.info("pre-training functions compile done.")

    def generate_(self, context=None, max_len=None, mode='display'):
        # overwrite the RNNLM generator as
        # there are hidden variables every time step
        args = dict(k=self.config['sample_beam'],
                    maxlen=self.config['max_len'] if not max_len else max_len,
                    stochastic=self.config['sample_stoch'] if mode == 'display' else None,
                    argmax=self.config['sample_argmax'] if mode == 'display' else None)


================================================
FILE: emolga/utils/__init__.py
================================================
__author__ = 'yinpengcheng'


================================================
FILE: emolga/utils/generic_utils.py
================================================
from __future__ import absolute_import
from matplotlib.ticker import FuncFormatter
import numpy as np
import time
import sys
import six
import matplotlib.pyplot as plt
import matplotlib


def get_from_module(identifier, module_params, module_name, instantiate=False, kwargs=None):
    # Resolve `identifier` (a name string) in `module_params`; optionally
    # instantiate the result (with `kwargs` if given).  Non-string
    # identifiers pass through unchanged.
    if isinstance(identifier, six.string_types):
        res = module_params.get(identifier)
        if not res:
            raise Exception('Invalid ' + str(module_name) + ': ' + str(identifier))
        if instantiate and not kwargs:
            return res()
        elif instantiate and kwargs:
            return res(**kwargs)
        else:
            return res
    return identifier


def make_tuple(*args):
    return args


def printv(v, prefix=''):
    # Recursively pretty-print nested dicts/lists, indenting with '...'.
    # NOTE: mutates dict inputs (deletes the 'name' key).
    if type(v) == dict:
        if 'name' in v:
            print(prefix + '#' + v['name'])
            del v['name']
        prefix += '...'
        for nk, nv in v.items():
            if type(nv) in [dict, list]:
                print(prefix + nk + ':')
                printv(nv, prefix)
            else:
                print(prefix + nk + ':' + str(nv))
    elif type(v) == list:
        prefix += '...'
        for i, nv in enumerate(v):
            print(prefix + '#' + str(i))
            printv(nv, prefix)
    else:
        prefix += '...'
        print(prefix + str(v))


def make_batches(size, batch_size):
    # Split `size` items into (start, end) index pairs of at most `batch_size`.
    nb_batch = int(np.ceil(size/float(batch_size)))
    return [(i*batch_size, min(size, (i+1)*batch_size)) for i in range(0, nb_batch)]


def slice_X(X, start=None, stop=None):
    # Slice a single array or a list of arrays; `start` may also be an
    # index array (then `stop` is ignored and fancy indexing is used).
    if type(X) == list:
        if hasattr(start, '__len__'):
            return [x[start] for x in X]
        else:
            return [x[start:stop] for x in X]
    else:
        if hasattr(start, '__len__'):
            return X[start]
        else:
            return X[start:stop]


class Progbar(object):
    def __init__(self, target, logger, width=30, verbose=1):
        '''
        @param target: total number of steps expected
        '''
        self.width = width
        self.target = target
        self.sum_values = {}       # name -> [weighted sum, step count] for averaging
        self.unique_values = []    # insertion-ordered metric names
        self.start = time.time()
        self.total_width = 0
        self.seen_so_far = 0
        self.verbose = verbose
        self.logger = logger

    def update(self, current, values=[]):
        '''
        @param current: index of current step
        @param values: list of tuples (name, value_for_last_step).
            The progress bar will display averages for these values.
        '''
        # accumulate step-weighted sums so the displayed value is a running average
        for k, v in values:
            if k not in self.sum_values:
                self.sum_values[k] = [v * (current - self.seen_so_far), current - self.seen_so_far]
                self.unique_values.append(k)
            else:
                self.sum_values[k][0] += v * (current - self.seen_so_far)
                self.sum_values[k][1] += (current - self.seen_so_far)
        self.seen_so_far = current

        now = time.time()
        if self.verbose == 1:
            prev_total_width = self.total_width
            sys.stdout.write("\b" * prev_total_width)
            sys.stdout.write("\r")

            numdigits = int(np.floor(np.log10(self.target))) + 1
            barstr = '%%%dd/%%%dd [' % (numdigits, numdigits)
            bar = barstr % (current, self.target)
            prog = float(current)/self.target
            prog_width = int(self.width*prog)
            if prog_width > 0:
                bar += ('.'*(prog_width-1))
                if current < self.target:
                    bar += '(-w-)'
                else:
                    bar += '(-v-)!!'
            bar += ('~' * (self.width-prog_width))
            bar += ']'
            sys.stdout.write(bar)
            self.total_width = len(bar)

            if current:
                time_per_unit = (now - self.start) / current
            else:
                time_per_unit = 0
            eta = time_per_unit*(self.target - current)
            # info = ''
            info = bar
            if current < self.target:
                info += ' - Run-time: %ds - ETA: %ds' % (now - self.start, eta)
            else:
                info += ' - %ds' % (now - self.start)
            for k in self.unique_values:
                # perplexity is accumulated in log-space, so exponentiate the mean
                if k == 'perplexity' or k == 'PPL':
                    info += ' - %s: %.4f' % (k, np.exp(self.sum_values[k][0] / max(1, self.sum_values[k][1])))
                else:
                    info += ' - %s: %.4f' % (k, self.sum_values[k][0] / max(1, self.sum_values[k][1]))

            self.total_width += len(info)
            # pad with spaces so a shorter line fully overwrites the previous one
            if prev_total_width > self.total_width:
                info += ((prev_total_width-self.total_width) * " ")

            # sys.stdout.write(info)
            # sys.stdout.flush()
            self.logger.info(info)

            if current >= self.target:
                sys.stdout.write("\n")

        if self.verbose == 2:
            if current >= self.target:
                info = '%ds' % (now - self.start)
                for k in self.unique_values:
                    info += ' - %s: %.4f' % (k, self.sum_values[k][0] / max(1, self.sum_values[k][1]))
                # sys.stdout.write(info + "\n")
                self.logger.info(info + "\n")

    def add(self, n, values=[]):
        # advance the bar by n steps
        self.update(self.seen_so_far + n, values)

    def clear(self):
        # reset all accumulated state (timer excluded)
        self.sum_values = {}
        self.unique_values = []
        self.total_width = 0
        self.seen_so_far = 0


def print_sample(idx2word, idx):
    # Map indices to words and cut at the end-of-line token.
    # NOTE(review): the end-of-line marker compares against the empty string
    # here -- the original token may have been lost in export; verify.
    def cut_eol(words):
        for i, word in enumerate(words):
            if words[i] == '':
                return words[:i + 1]
        raise Exception("No end-of-line found")

    return cut_eol(map(lambda w_idx : idx2word[w_idx], idx))


def visualize_(subplots, data, w=None, h=None, name=None, display='on',
               size=10, text=None, normal=True, grid=False):
    # Render a 1-D or 2-D array as a heat map on the given (fig, ax) pair.
    fig, ax = subplots
    if data.ndim == 1:
        if w and h:
            # vector visualization
            assert w * h == np.prod(data.shape)
            data = data.reshape((w, h))
        else:
            # pick the most square-like factorization of the length
            L = data.shape[0]
            w = int(np.sqrt(L))
            while L % w > 0:
                w -= 1
            h = L / w
            assert w * h == np.prod(data.shape)
            data = data.reshape((w, h))
    else:
        w = data.shape[0]
        h = data.shape[1]

    if not size:
        size = 30 / np.sqrt(w * h)

    print(data.shape)
    major_ticks = np.arange(0, h, 1)
    ax.set_xticks(major_ticks)
    ax.set_xlim(0, h)
    major_ticks = np.arange(0, w, 1)
    ax.set_ylim(w, -1)
    ax.set_yticks(major_ticks)
    ax.set_aspect('equal')
    if grid:
        pass
    ax.grid(which='both')
    # ax.axis('equal')
    if normal:
        cax = ax.imshow(data, cmap=plt.cm.pink, interpolation='nearest', vmax=1.0, vmin=0.0, aspect='auto')
    else:
        cax = ax.imshow(data, cmap=plt.cm.bone, interpolation='nearest', aspect='auto')
    if name:
        ax.set_title(name)
    else:
        ax.set_title('sample.')

    import matplotlib.ticker as ticker
    # ax.xaxis.set_ticks(np.arange(0, h, 1.))
    # ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
    # ax.yaxis.set_ticks(np.arange(0, w, 1.))
    # ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%0.1f'))
    # ax.set_xticks(np.linspace(0, 1, h))
    # ax.set_yticks(np.linspace(0, 1, w))

    # Move left and bottom spines outward by 10 points
    # ax.spines['left'].set_position(('outward', size))
    # ax.spines['bottom'].set_position(('outward', size))
    # # Hide the right and top spines
    # ax.spines['right'].set_visible(False)
    # ax.spines['top'].set_visible(False)
    # # Only show ticks on the left and bottom spines
    # ax.yaxis.set_ticks_position('left')
    # ax.xaxis.set_ticks_position('bottom')

    if text:
        ax.set_yticks(np.linspace(0, 1, 33) * size * 3.2)
        ax.set_yticklabels([text[s] for s in range(33)])

    # cbar = fig.colorbar(cax)
    if display == 'on':
        plt.show()
    else:
        return ax


def vis_Gaussian(subplot, mean, std, name=None, display='off', size=10):
    # Scatter 10k samples from a 2-D Gaussian with per-axis mean/std.
    ax = subplot
    data = np.random.normal(size=(2, 10000))
    data[0] = data[0] * std[0] + mean[0]
    data[1] = data[1] * std[1] + mean[1]
    ax.scatter(data[0].tolist(), data[1].tolist(), 'r.')
    if display == 'on':
        plt.show()
    else:
        return ax


================================================
FILE: emolga/utils/io_utils.py
================================================
from __future__ import absolute_import
import h5py
import numpy as np
import cPickle
from collections import defaultdict


class HDF5Matrix():
    # Class-level cache of open HDF5 files keyed by path (shared by all
    # instances).  NOTE(review): declared defaultdict(int) but stores h5py
    # File handles assigned in __init__.
    refs = defaultdict(int)

    def __init__(self, datapath, dataset, start, end, normalizer=None):
        if datapath not in list(self.refs.keys()):
            f = h5py.File(datapath)
            self.refs[datapath] = f
        else:
            f = self.refs[datapath]
        self.start = start
        self.end = end
        self.data = f[dataset]
        self.normalizer = normalizer

    def __len__(self):
        return self.end - self.start

    def __getitem__(self, key):
        # Translate key (slice / int / ndarray / list) into absolute dataset
        # indices within the [start, end) window; raise IndexError when the
        # request falls outside the window.
        if isinstance(key, slice):
            if key.stop + self.start <= self.end:
                idx = slice(key.start+self.start, key.stop + self.start)
            else:
                raise IndexError
        elif isinstance(key, int):
            if key + self.start < self.end:
                idx = key+self.start
            else:
                raise IndexError
        elif isinstance(key, np.ndarray):
            if np.max(key) + self.start < self.end:
                idx = (self.start + key).tolist()
            else:
                raise IndexError
        elif isinstance(key, list):
            if max(key) + self.start < self.end:
                idx = [x + self.start for x in key]
            else:
                raise IndexError
        if self.normalizer is not None:
            return self.normalizer(self.data[idx])
        else:
            return self.data[idx]

    @property
    def shape(self):
        return tuple([self.end - self.start, self.data.shape[1]])


def save_array(array, name):
    # Persist a numpy array to a PyTables CArray file.
    import tables
    f = tables.open_file(name, 'w')
    atom = tables.Atom.from_dtype(array.dtype)
    ds = f.createCArray(f.root, 'data', atom, array.shape)
    ds[:] = array
    f.close()


def load_array(name):
    # Inverse of save_array: read the whole CArray into memory.
    import tables
    f = tables.open_file(name)
    array = f.root.data
    a = np.empty(shape=array.shape, dtype=array.dtype)
    a[:] = array[:]
    f.close()
    return a


def save_config():
    pass


def load_config():
    pass


================================================
FILE: emolga/utils/np_utils.py
================================================
from __future__ import absolute_import
import numpy as np
import scipy as sp
from six.moves import range
from six.moves import zip


def to_categorical(y, nb_classes=None):
    '''Convert class vector (integers from 0 to nb_classes)
    to binary class matrix, for use with categorical_crossentropy
    '''
    y = np.asarray(y, dtype='int32')
    if not nb_classes:
        nb_classes = np.max(y)+1
    Y = np.zeros((len(y), nb_classes))
    for i in range(len(y)):
        Y[i, y[i]] = 1.
    return Y


def normalize(a, axis=-1, order=2):
    # L-`order` normalization along `axis`; zero-norm rows are left unscaled.
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)


def binary_logloss(p, y):
    # negative log-likelihood of binary labels y under probabilities p,
    # with probabilities clipped away from 0/1 for numerical stability
    epsilon = 1e-15
    p = sp.maximum(epsilon, p)
    p = sp.minimum(1-epsilon, p)
    res = sum(y * sp.log(p) + sp.subtract(1, y) * sp.log(sp.subtract(1, p)))
    res *= -1.0/len(y)
    return res


def multiclass_logloss(P, Y):
    # NOTE(review): indexes P[i][Y[i]-1], i.e. labels appear to be 1-based here.
    score = 0.
    npreds = [P[i][Y[i]-1] for i in range(len(Y))]
    score = -(1. / len(Y)) * np.sum(np.log(npreds))
    return score


def accuracy(p, y):
    return np.mean([a == b for a, b in zip(p, y)])


def probas_to_classes(y_pred):
    # multi-column input -> argmax; single column -> 0.5 threshold
    if len(y_pred.shape) > 1 and y_pred.shape[1] > 1:
        return categorical_probas_to_classes(y_pred)
    return np.array([1 if p > 0.5 else 0 for p in y_pred])


def categorical_probas_to_classes(p):
    return np.argmax(p, axis=1)


================================================
FILE: emolga/utils/test_utils.py
================================================
import numpy as np


def get_test_data(nb_train=1000, nb_test=500, input_shape=(10,),
                  output_shape=(2,),
                  classification=True, nb_class=2):
    '''
    classification=True overrides output_shape (i.e. output_shape is set to (1,))
    and the output consists in integers in [0, nb_class-1].
    Otherwise: float output with shape output_shape.
    '''
    nb_sample = nb_train + nb_test
    if classification:
        y = np.random.randint(0, nb_class, size=(nb_sample, 1))
        X = np.zeros((nb_sample,) + input_shape)
        for i in range(nb_sample):
            X[i] = np.random.normal(loc=y[i], scale=1.0, size=input_shape)
    else:
        y_loc = np.random.random((nb_sample,))
        X = np.zeros((nb_sample,) + input_shape)
        y = np.zeros((nb_sample,) + output_shape)
        for i in range(nb_sample):
            X[i] = np.random.normal(loc=y_loc[i], scale=1.0, size=input_shape)
            y[i] = np.random.normal(loc=y_loc[i], scale=1.0, size=output_shape)

    return (X[:nb_train], y[:nb_train]), (X[nb_train:], y[nb_train:])


================================================
FILE: emolga/utils/theano_utils.py
================================================
from __future__ import absolute_import
from theano import gof
from theano.tensor import basic as tensor
import numpy as np
import theano
import theano.tensor as T


def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)


def sharedX(X, dtype=theano.config.floatX, name=None):
    return theano.shared(np.asarray(X, dtype=dtype), name=name)


def shared_zeros(shape, dtype=theano.config.floatX, name=None):
    return sharedX(np.zeros(shape), dtype=dtype, name=name)


def shared_scalar(val=0., dtype=theano.config.floatX, name=None):
    return theano.shared(np.cast[dtype](val), name=name)


def shared_ones(shape, dtype=theano.config.floatX, name=None):
    return sharedX(np.ones(shape), dtype=dtype, name=name)


def alloc_zeros_matrix(*dims):
    return T.alloc(np.cast[theano.config.floatX](0.), *dims)


def alloc_ones_matrix(*dims):
    return T.alloc(np.cast[theano.config.floatX](1.), *dims)


def ndim_tensor(ndim):
    # symbolic float tensor of the requested rank (default: matrix)
    if ndim == 1:
        return T.vector()
    elif ndim == 2:
        return T.matrix()
    elif ndim == 3:
        return T.tensor3()
    elif ndim == 4:
        return T.tensor4()
    return T.matrix()


# get int32 tensor
def ndim_itensor(ndim, name=None):
    if ndim == 2:
        return T.imatrix(name)
    elif ndim == 3:
        return T.itensor3(name)
    elif ndim == 4:
        return T.itensor4(name)
    return T.imatrix(name)


# dot-product
def dot(inp, matrix, bias=None):
    """
    Decide the right type of dot product depending on the input
    arguments
    """
    # int input -> embedding-style row lookup; float 3-D input -> batched dot
    if 'int' in inp.dtype and inp.ndim == 2:
        return matrix[inp.flatten()]
    elif 'int' in inp.dtype:
        return matrix[inp]
    elif 'float' in inp.dtype and inp.ndim == 3:
        shape0 = inp.shape[0]
        shape1 = inp.shape[1]
        shape2 = inp.shape[2]
        if bias:
            return (T.dot(inp.reshape((shape0 * shape1, shape2)), matrix) + bias).reshape((shape0, shape1, matrix.shape[1]))
        else:
            return T.dot(inp.reshape((shape0 * shape1, shape2)), matrix).reshape((shape0, shape1, matrix.shape[1]))
    else:
        if bias:
            return T.dot(inp, matrix) + bias
        else:
            return T.dot(inp, matrix)


# Numerically stable log(sum(exp(A))). Can also be used in softmax function.
def logSumExp(x, axis=None, mask=None, status='theano', c=None, err=1e-7):
    """
    Numerically stable log(sum(exp(A))). Can also be used in softmax function.
    c is the additional input when it doesn't require masking but x need.
    """
    if status == 'theano':
        J = T
    else:
        J = np
    if c is None:
        x_max = J.max(x, axis=axis, keepdims=True)
    else:
        x_max = J.max(J.concatenate([c, x], axis=-1), axis=axis, keepdims=True)
    if c is None:
        if not mask:
            l_t = J.sum(J.exp(x - x_max), axis=axis, keepdims=True)
        else:
            l_t = J.sum(J.exp(x - x_max) * mask, axis=axis, keepdims=True)
    else:
        if not mask:
            l_t = J.sum(J.exp(x - x_max), axis=axis, keepdims=True) + \
                  J.sum(J.exp(c - x_max), axis=axis, keepdims=True)
        else:
            l_t = J.sum(J.exp(x - x_max) * mask, axis=axis, keepdims=True) + \
                  J.sum(J.exp(c - x_max), axis=axis, keepdims=True)
    x_t = J.log(J.maximum(l_t, err)) + x_max
    return x_t


def softmax(x):
    # softmax over the last axis for inputs of any rank
    return T.nnet.softmax(x.reshape((-1, x.shape[-1]))).reshape(x.shape)


def masked_softmax(x, mask, err=1e-7):
    assert x.ndim == 2, 'support two-dimension'
    weights = softmax(x)
    weights *= mask
    weights = weights / (T.sum(weights, axis=-1)[:, None] + err) * mask
    return weights


def cosine_sim(k, M):
    # cosine similarity of a single key against each row of memory M
    k_unit = k / (T.sqrt(T.sum(k**2)) + 1e-5)  #
    # T.patternbroadcast(k_unit.reshape((1,k_unit.shape[0])),(True,False))
    k_unit = k_unit.dimshuffle(('x', 0))
    k_unit.name = "k_unit"
    M_lengths = T.sqrt(T.sum(M**2, axis=1)).dimshuffle((0, 'x'))
    M_unit = M / (M_lengths + 1e-5)
    M_unit.name = "M_unit"
    return T.sum(k_unit * M_unit, axis=1)


def cosine_sim2d(k, M):
    # batched cosine similarity
    # k: (nb_samples, memory_width)
    # M: (nb_samples, memory_dim, memory_width)

    # norms of keys and memories
    k_norm = T.sqrt(T.sum(T.sqr(k), 1)) + 1e-5  # (nb_samples,)
    M_norm = T.sqrt(T.sum(T.sqr(M), 2)) + 1e-5  # (nb_samples, memory_dim,)

    k = k[:, None, :]        # (nb_samples, 1, memory_width)
    k_norm = k_norm[:, None] # (nb_samples, 1)

    sim = T.sum(k * M, axis=2)  # (nb_samples, memory_dim,)
    sim /= k_norm * M_norm      # (nb_samples, memory_dim,)
    return sim


def dot_2d(k, M, b=None, g=None):
    # batched (optionally gated/biased) inner product of key against memory rows
    # k: (nb_samples, memory_width)
    # M: (nb_samples, memory_dim, memory_width)

    # norms of keys and memories
    # k_norm = T.sqrt(T.sum(T.sqr(k), 1)) + 1e-5  # (nb_samples,)
    # M_norm = T.sqrt(T.sum(T.sqr(M), 2)) + 1e-5  # (nb_samples, memory_dim,)

    k = k[:, None, :]  # (nb_samples, 1, memory_width)
    value = k * M
    if b is not None:
        b = b[:, None, :]
        value *= b  # (nb_samples, memory_dim,)
    if g is not None:
        g = g[None, None, :]
        value *= g
    sim = T.sum(value, axis=2)
    return sim


def shift_convolve(weight, shift, shift_conv):
    # circular-shift interpolation of an attention weight vector (NTM-style)
    shift = shift.dimshuffle((0, 'x'))
    return T.sum(shift * weight[shift_conv], axis=0)


def shift_convolve2d(weight, shift, shift_conv):
    # batched version of shift_convolve
    return T.sum(shift[:, :, None] * weight[:, shift_conv], axis=1)


================================================
FILE: keyphrase/__init__.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Python File Template
"""
import os

__author__ = "Rui Meng"
__email__ = "rui.meng@pitt.edu"

if __name__ == '__main__':
    pass


================================================
FILE: keyphrase/baseline/evaluate.py
================================================
import math
import logging
import string
import scipy
from nltk.stem.porter import *
import numpy as np
import os
import sys

import keyphrase.config as config
# prepare logging.
from keyphrase.dataset import dataset_utils
import keyphrase.config

# config = keyphrase.config.setup_keyphrase_all()
config = keyphrase.config.setup_keyphrase_baseline()  # load settings.


def load_phrase(file_path, tokenize=True):
    # Load one keyphrase-per-line; either run the project's phrase processing
    # (tokenize=True) or just whitespace-split each line.
    phrases = []
    with open(file_path, 'r') as f:
        # TODO here the ground-truth is already after processing, contains , not good for baseline methods...
        if tokenize:
            phrase_str = ';'.join([l.strip() for l in f.readlines()])
            phrases = dataset_utils.process_keyphrase(phrase_str)
        else:
            phrases = [l.strip().split(' ') for l in f.readlines()]
    return phrases


def evaluate_(text_dir, target_dir, prediction_dir, model_name, dataset_name, do_stem=True):
    '''
    Evaluate one model's predicted keyphrases against the ground truth:
    per-document and corpus-level P/R/F1@{5,10,15}, Bpref and MRR are
    logged, and macro-F1 lists / summary CSV rows are written under
    config['predict_path'].
    '''
    stemmer = PorterStemmer()
    print('Evaluating on %s@%s' % (model_name, dataset_name))

    # Evaluation part
    macro_metrics = []
    macro_matches = []

    doc_names = [name[:name.index('.')] for name in os.listdir(text_dir)]

    number_groundtruth = 0
    number_present_groundtruth = 0

    for doc_name in doc_names:
        logger.info('[FILE]{0}'.format(text_dir+'/'+doc_name+'.txt'))
        with open(text_dir+'/'+doc_name+'.txt', 'r') as f:
            # each token is "word_POS"
            text_tokens = (' '.join(f.readlines())).split()
            text = [t.split('_')[0] for t in text_tokens]
            postag = [t.split('_')[1] for t in text_tokens]

        targets = load_phrase(target_dir+'/'+doc_name+'.txt', True)
        predictions = load_phrase(prediction_dir+'/'+doc_name+'.txt.phrases', False)

        # do processing to baseline predictions
        if (not model_name.startswith('CopyRNN')) and (not model_name.startswith('RNN')):
            predictions = dataset_utils.process_keyphrase(';'.join([' '.join(p) for p in predictions]))

        correctly_matched = np.asarray([0] * len(predictions), dtype='int32')

        print(targets)
        print(predictions)
        print('*' * 100)

        # convert target index into string
        if do_stem:
            stemmed_input = [stemmer.stem(t).strip().lower() for t in text]
            targets = [[stemmer.stem(w).strip().lower() for w in target] for target in targets]

        if 'target_filter' in config:
            present_targets = []

            for target in targets:
                keep = True
                # whether do filtering on groundtruth phrases. if config['target_filter']==None, do nothing
                match = None
                # scan the stemmed document for an exact contiguous occurrence of the phrase
                for i in range(len(stemmed_input) - len(target) + 1):
                    match = None
                    for j in range(len(target)):
                        if target[j] != stemmed_input[i + j]:
                            match = False
                            break
                    if j == len(target) - 1 and match == None:
                        match = True
                        break

                if match == True:
                    # if match and 'appear-only', keep this phrase
                    if config['target_filter'] == 'appear-only':
                        keep = keep and True
                    elif config['target_filter'] == 'non-appear-only':
                        keep = keep and False
                elif match == False:
                    # if not match and 'appear-only', discard this phrase
                    if config['target_filter'] == 'appear-only':
                        keep = keep and False
                    # if not match and 'non-appear-only', keep this phrase
                    elif config['target_filter'] == 'non-appear-only':
                        keep = keep and True

                if not keep:
                    continue

                present_targets.append(target)

            number_groundtruth += len(targets)
            number_present_groundtruth += len(present_targets)

            targets = present_targets

        # strip non-printable characters, then stem the predictions too
        printable = set(string.printable)
        # lines = [filter(lambda x: x in printable, l) for l in lines]
        predictions = [[filter(lambda x: x in printable, w) for w in prediction] for prediction in predictions]
        predictions = [[stemmer.stem(w).strip().lower() for w in prediction] for prediction in predictions]

        for pid, predict in enumerate(predictions):
            # check whether the predicted phrase is correct (match any groundtruth)
            for target in targets:
                if len(target) == len(predict):
                    flag = True
                    for i, w in enumerate(predict):
                        if predict[i] != target[i]:
                            flag = False
                    if flag:
                        correctly_matched[pid] = 1
                        break

        metric_dict = {}
        for number_to_predict in [5, 10, 15]:
            metric_dict['target_number'] = len(targets)
            metric_dict['prediction_number'] = len(predictions)
            metric_dict['correct_number@%d' % number_to_predict] = sum(correctly_matched[:number_to_predict])

            metric_dict['p@%d' % number_to_predict] = float(sum(correctly_matched[:number_to_predict])) / float(number_to_predict)

            if len(targets) != 0:
                metric_dict['r@%d' % number_to_predict] = float(sum(correctly_matched[:number_to_predict])) / float(len(targets))
            else:
                metric_dict['r@%d' % number_to_predict] = 0

            if metric_dict['p@%d' % number_to_predict] + metric_dict['r@%d' % number_to_predict] != 0:
                metric_dict['f1@%d' % number_to_predict] = 2 * metric_dict['p@%d' % number_to_predict] * metric_dict['r@%d' % number_to_predict] / float(metric_dict['p@%d' % number_to_predict] + metric_dict['r@%d' % number_to_predict])
            else:
                metric_dict['f1@%d' % number_to_predict] = 0

            # Compute the binary preference measure (Bpref)
            bpref = 0.
            trunked_match = correctly_matched[:number_to_predict].tolist()  # get the first K prediction to evaluate
            match_indexes = np.nonzero(trunked_match)[0]

            if len(match_indexes) > 0:
                for mid, mindex in enumerate(match_indexes):
                    bpref += 1. - float(mindex - mid) / float(number_to_predict)  # there're mindex elements, and mid elements are correct, before the (mindex+1)-th element
                metric_dict['bpref@%d' % number_to_predict] = float(bpref) / float(len(match_indexes))
            else:
                metric_dict['bpref@%d' % number_to_predict] = 0

            # Compute the mean reciprocal rank (MRR)
            rank_first = 0
            try:
                rank_first = trunked_match.index(1) + 1
            except ValueError:
                pass

            if rank_first > 0:
                metric_dict['mrr@%d' % number_to_predict] = float(1) / float(rank_first)
            else:
                metric_dict['mrr@%d' % number_to_predict] = 0

        macro_metrics.append(metric_dict)
        macro_matches.append(correctly_matched)

        '''
        Print information on each prediction
        '''
        # print stuff
        a = '[SOURCE][{0}]: {1}'.format(len(text), ' '.join(text))
        logger.info(a)
        a += '\n'

        b = '[TARGET]: %d targets\n\t\t' % (len(targets))
        for id, target in enumerate(targets):
            b += ' '.join(target) + '; '
        logger.info(b)
        b += '\n'

        c = '[DECODE]: %d predictions' % (len(predictions))
        for id, predict in enumerate(predictions):
            c += ('\n\t\t[%d][%d]' % (len(predict), sum([len(w) for w in predict]))) + ' '.join(predict)
            if correctly_matched[id] == 1:
                c += ' [correct!]'
                # print(('\n\t\t[%.3f]'% score) + ' '.join(predict) + ' [correct!]')
                # print(('\n\t\t[%.3f]'% score) + ' '.join(predict))
        c += '\n'
        # c = '[DECODE]: {}'.format(' '.join(cut_zero(phrase, idx2word)))
        # if inputs_unk is not None:
        #     k = '[_INPUT]: {}\n'.format(' '.join(cut_zero(inputs_unk.tolist(), idx2word, Lmax=len(idx2word))))
        #     logger.info(k)
        #     a += k
        logger.info(c)
        a += b + c

        for number_to_predict in [5, 10, 15]:
            d = '@%d - Precision=%.4f, Recall=%.4f, F1=%.4f, Bpref=%.4f, MRR=%.4f' % (number_to_predict, metric_dict['p@%d' % number_to_predict], metric_dict['r@%d' % number_to_predict], metric_dict['f1@%d' % number_to_predict], metric_dict['bpref@%d' % number_to_predict], metric_dict['mrr@%d' % number_to_predict])
            logger.info(d)
            a += d + '\n'

        logger.info('*' * 100)

    logger.info('#(Ground-truth Keyphrase)=%d' % number_groundtruth)
    logger.info('#(Present Ground-truth Keyphrase)=%d' % number_present_groundtruth)

    '''
    Export the f@5 and f@10 for significance test
    '''
    for k in [5, 10]:
        with open(config['predict_path'] + '/macro-f@%d-' % (k) + model_name+'-'+dataset_name+'.txt', 'w') as writer:
            writer.write('\n'.join([str(m['f1@%d' % k]) for m in macro_metrics]))

    '''
    Compute the corpus evaluation
    '''
    csv_writer = open(config['predict_path'] + '/evaluate-' + model_name+'-'+dataset_name+'.txt', 'w')

    real_test_size = len(doc_names)
    overall_score = {}

    for k in [5, 10, 15]:
        correct_number = sum([m['correct_number@%d' % k] for m in macro_metrics])
        overall_target_number = sum([m['target_number'] for m in macro_metrics])
        overall_prediction_number = sum([m['prediction_number'] for m in macro_metrics])

        # cap the prediction count at k per document for micro-averaging
        if real_test_size * k < overall_prediction_number:
            overall_prediction_number = real_test_size * k

        # Compute the macro Measures, by averaging the macro-score of each prediction
        overall_score['p@%d' % k] = float(sum([m['p@%d' % k] for m in macro_metrics])) / float(real_test_size)
        overall_score['r@%d' % k] = float(sum([m['r@%d' % k] for m in macro_metrics])) / float(real_test_size)
        overall_score['f1@%d' % k] = float(sum([m['f1@%d' % k] for m in macro_metrics])) / float(real_test_size)

        # Print basic statistics
        logger.info('%s@%s' % (model_name, dataset_name))
        output_str = 'Overall - %s valid testing data=%d, Number of Target=%d/%d, Number of Prediction=%d, Number of Correct=%d' % (
            config['predict_type'], real_test_size,
            overall_target_number, overall_target_number,
            overall_prediction_number, correct_number
        )
        logger.info(output_str)

        # Print macro-average performance
        output_str = 'macro:\t\tP@%d=%f, R@%d=%f, F1@%d=%f' % (
            k, overall_score['p@%d' % k],
            k, overall_score['r@%d' % k],
            k, overall_score['f1@%d' % k]
        )
        logger.info(output_str)
        csv_writer.write(', %f, %f, %f' % (
            overall_score['p@%d' % k],
            overall_score['r@%d' % k],
            overall_score['f1@%d' % k]
        ))

        # Print micro-average performance
        overall_score['micro_p@%d' % k] = correct_number / float(overall_prediction_number)
        overall_score['micro_r@%d' % k] = correct_number / float(overall_target_number)
        if overall_score['micro_p@%d' % k] + overall_score['micro_r@%d' % k] > 0:
            overall_score['micro_f1@%d' % k] = 2 * overall_score['micro_p@%d' % k] * overall_score['micro_r@%d' % k] / float(overall_score['micro_p@%d' % k] + overall_score['micro_r@%d' % k])
        else:
            overall_score['micro_f1@%d' % k] = 0

        output_str = 'micro:\t\tP@%d=%f, R@%d=%f, F1@%d=%f' % (
            k, overall_score['micro_p@%d' % k],
            k, overall_score['micro_r@%d' % k],
            k, overall_score['micro_f1@%d' % k]
        )
        logger.info(output_str)
        csv_writer.write(', %f, %f, %f' % (
            overall_score['micro_p@%d' % k],
            overall_score['micro_r@%d' % k],
            overall_score['micro_f1@%d' % k]
        ))

        # Compute the binary preference measure (Bpref)
        overall_score['bpref@%d' % k] = float(sum([m['bpref@%d' % k] for m in macro_metrics])) / float(real_test_size)
        # Compute the mean reciprocal rank (MRR)
        overall_score['mrr@%d' % k] = float(sum([m['mrr@%d' % k] for m in macro_metrics])) / float(real_test_size)
# --- tail of evaluate_() (function head is above this chunk; indentation reconstructed) ---
        # Per-cutoff Bpref/MRR summary for the current k.
        output_str = '\t\t\tBpref@%d=%f, MRR@%d=%f' % (
            k,
            overall_score['bpref@%d' % k],
            k,
            overall_score['mrr@%d' % k]
        )
        logger.info(output_str)

    csv_writer.close()


def init_logging(logfile):
    # Configure the root logger: INFO+ to both stdout and `logfile`.
    # NOTE(review): returns the `logging` module itself (not a Logger), and
    # re-adds handlers on every call — calling twice duplicates log lines.
    formatter = logging.Formatter('%(asctime)s [%(levelname)s] %(module)s: %(message)s',
                                  datefmt='%m/%d/%Y %H:%M:%S')
    fh = logging.FileHandler(logfile)
    # ch = logging.StreamHandler()
    ch = logging.StreamHandler(sys.stdout)
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    # fh.setLevel(logging.INFO)
    ch.setLevel(logging.INFO)
    logging.getLogger().addHandler(ch)
    logging.getLogger().addHandler(fh)
    logging.getLogger().setLevel(logging.INFO)
    return logging


# Module-level logging bootstrap: one log file per task name + timestamp.
print('Log path: %s' % (
    config['path_experiment'] + '/experiments.{0}.id={1}.log'.format(config['task_name'], config['timemark'])))
logger = init_logging(
    config['path_experiment'] + '/experiments.{0}.id={1}.log'.format(config['task_name'], config['timemark']))
logger = logging.getLogger(__name__)


def evaluate_baselines():
    '''
    evaluate baselines' performance
    :return:
    '''
    # base_dir = '/Users/memray/Project/Keyphrase_Extractor-UTD/'
    # 'TfIdf', 'TextRank', 'SingleRank', 'ExpandRank', 'Maui', 'KEA', 'RNN_present', 'CopyRNN_present_singleword=0', 'CopyRNN_present_singleword=1', 'CopyRNN_present_singleword=2'
    models = ['CopyRNN_present_singleword=1']
    test_sets = config['testing_datasets']

    for model_name in models:
        for dataset_name in test_sets:
            # Layout: <baseline_data_path>/<dataset>/{text,keyphrase}/, predictions
            # under <path>/dataset/keyphrase/prediction/<model>/<dataset>.
            text_dir = config['baseline_data_path'] + dataset_name + '/text/'
            target_dir = config['baseline_data_path'] + dataset_name + '/keyphrase/'
            base_dir = config['path'] + '/dataset/keyphrase/prediction/' + model_name + '/'
            prediction_dir = base_dir + dataset_name

            #if model_name == 'Maui':
            #    prediction_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/maui_output/' + dataset_name
            #if model_name == 'Kea':
            #    prediction_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/kea_output/' + dataset_name

            evaluate_(text_dir, target_dir, prediction_dir,
                      model_name, dataset_name)


def significance_test():
    # Wilcoxon signed-rank test of model1 against each baseline, using the
    # per-document macro F1@k score files written by the evaluation above.
    model1 = 'CopyRNN'
    models = ['TfIdf', 'TextRank', 'SingleRank', 'ExpandRank', 'RNN', 'CopyRNN']
    test_sets = config['testing_datasets']

    def load_result(filepath):
        # One float score per line.
        with open(filepath, 'r') as reader:
            return [float(l.strip()) for l in reader.readlines()]

    for model2 in models:
        print('*'*20 + ' %s Vs. %s ' % (model1, model2) + '*' * 20)
        for dataset_name in test_sets:
            for k in [5, 10]:
                print('Evaluating on %s@%d' % (dataset_name, k))
                filepath = config['predict_path'] + '/macro-f@%d-' % (k) + model1 + '-' + dataset_name + '.txt'
                val1 = load_result(filepath)
                filepath = config['predict_path'] + '/macro-f@%d-' % (k) + model2 + '-' + dataset_name + '.txt'
                val2 = load_result(filepath)
                s_test = scipy.stats.wilcoxon(val1, val2)
                print(s_test)


if __name__ == '__main__':
    evaluate_baselines()
    # significance_test()

================================================
FILE: keyphrase/baseline/export_dataset.py
================================================
import os
import numpy
import shutil

import keyphrase.config
from emolga.dataset.build_dataset import deserialize_from_file
from keyphrase.dataset.keyphrase_test_dataset import load_additional_testing_data


def export_UTD():
    # prepare logging.
    config = keyphrase.config.setup_keyphrase_all()  # load settings.
# --- body of export_UTD() (def is on the previous line; indentation reconstructed) ---
    # Dump each test set as POS-tagged text + gold keyphrases, one file per doc.
    train_set, validation_set, test_sets, idx2word, word2idx = deserialize_from_file(config['dataset'])
    test_sets = load_additional_testing_data(config['testing_datasets'], idx2word, word2idx, config)

    for dataset_name, dataset in test_sets.items():
        print('Exporting %s' % str(dataset_name))

        # keep the first 400 in krapivin
        if dataset_name == 'krapivin':
            dataset['tagged_source'] = dataset['tagged_source'][:400]

        for i, d in enumerate(zip(dataset['tagged_source'], dataset['target_str'])):
            source_postag, target = d
            print('[%d/%d]' % (i, len(dataset['tagged_source'])))
            # word_POS pairs joined by spaces, e.g. "cat_NN sat_VBD".
            output_text = ' '.join([sp[0]+'_'+sp[1] for sp in source_postag])
            output_dir = config['baseline_data_path'] + dataset_name + '/text/'
            if not os.path.exists(output_dir):
                os.makedirs(output_dir)
            with open(output_dir+'/'+str(i)+'.txt', 'w') as f:
                f.write(output_text)
            # One gold keyphrase per line.
            output_text = '\n'.join([' '.join(t) for t in target])
            tag_output_dir = config['baseline_data_path'] + dataset_name + '/keyphrase/'
            if not os.path.exists(tag_output_dir):
                os.makedirs(tag_output_dir)
            with open(tag_output_dir+'/'+str(i)+'.txt', 'w') as f:
                f.write(output_text)


class Document(object):
    # Simple record for one test document: file name, title, body text and
    # its list of gold keyphrases.
    def __init__(self):
        self.name = ''      # file name without '.txt'
        self.title = ''
        self.text = ''
        self.phrases = []   # gold keyphrases (strings)

    def __str__(self):
        return '%s\n\t%s\n\t%s' % (self.title, self.text, str(self.phrases))


def load_text(doclist, textdir):
    # Read every file in `textdir` into a Document (title = line 1,
    # text = lines 3+; line 2 is assumed to be the 'Abstract' header).
    # NOTE(review): `filter(...)` on a string returns a list only on
    # Python 2 — this function looks Python-2 specific.
    for filename in os.listdir(textdir):
        with open(textdir+filename) as textfile:
            doc = Document()
            doc.name = filename[:filename.find('.txt')]
            import string
            printable = set(string.printable)
            # print((filename))
            try:
                lines = textfile.readlines()
                # strip non-printable characters
                lines = [filter(lambda x: x in printable, l) for l in lines]
                title = lines[0].encode('ascii', 'ignore').decode('ascii', 'ignore')  # the 2nd line is abstract title
                text = (' '.join(lines[2:])).encode('ascii', 'ignore').decode('ascii', 'ignore')
                # if lines[1].strip().lower() != 'abstract':
                #     print('Wrong title detected : %s' % (filename))
                doc.title = title
                doc.text = text
                doclist.append(doc)
            except UnicodeDecodeError:
                print('UnicodeDecodeError detected! %s' % filename )
    return doclist


def load_keyphrase(doclist, keyphrasedir):
    # Attach gold phrases to each Document, merging the optional
    # '.keyphrases' and '.keywords' files (deduplicated via a set).
    for doc in doclist:
        phrase_set = set()
        if os.path.exists(keyphrasedir + doc.name + '.keyphrases'):
            with open(keyphrasedir+doc.name+'.keyphrases') as keyphrasefile:
                phrase_set.update([phrase.strip() for phrase in keyphrasefile.readlines()])
        # else:
        #     print(self.keyphrasedir + doc.name + '.keyphrases Not Found')

        if os.path.exists(keyphrasedir + doc.name + '.keywords'):
            with open(keyphrasedir + doc.name + '.keywords') as keyphrasefile:
                phrase_set.update([phrase.strip() for phrase in keyphrasefile.readlines()])
        # else:
        #     print(self.keyphrasedir + doc.name + '.keywords Not Found')

        doc.phrases = list(phrase_set)
    return doclist


def get_doc(text_dir, phrase_dir):
    '''
    :return: a list of dict instead of the Document object
    '''
    doclist = []
    doclist = load_text(doclist, text_dir)
    doclist = load_keyphrase(doclist, phrase_dir)
    for d in doclist:
        print(d)
    return doclist


def export_maui():
    # prepare logging.
    config = keyphrase.config.setup_keyphrase_all()  # load settings.
# --- body of export_maui() (def is on the previous line; indentation reconstructed) ---
    # Each entry: [dataset name, source text dir, gold keyphrase dir, Maui output dir].
    # An empty gold dir ('') means the split has no targets (semeval test).
    data_infos = [['inspec_train',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/INSPEC/train_validation_texts/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/INSPEC/gold_standard_train_validation/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/inspec/train/'],
                  ['inspec_test',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/INSPEC/test_texts/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/INSPEC/gold_standard_test/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/inspec/test/'],
                  ['nus',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/NUS/abstract_introduction_texts/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/NUS/gold_standard_keyphrases/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/nus/'],
                  ['semeval_train',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/SemEval/train+trial/all_texts/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/SemEval/train+trial/gold_standard_keyphrases_3/',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/semeval/train/'],
                  ['semeval_test',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/testing-data/SemEval/test/',
                   '',
                   '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/semeval/test/']
                  ]

    for dataset_name, text_dir, target_dir, output_dir in data_infos:
        print('Exporting %s in %s' % (str(dataset_name), text_dir))
        # NOTE(review): os.listdir is called before the existence check below,
        # so a missing text_dir would raise rather than be created — confirm
        # the dirs always pre-exist.
        file_names = [file_name[: file_name.index('.txt')] for file_name in os.listdir(text_dir)]

        if not os.path.exists(text_dir):
            os.makedirs(text_dir)
        if target_dir!="" and not os.path.exists(target_dir):
            os.makedirs(target_dir)
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        for i, file_name in enumerate(file_names):
            print('Exporting %d. %s - %s' % (i, dataset_name, file_name))
            # Copy the text, renaming to a sequential id.
            print('Text file %s' % (text_dir + str(i) + '.txt'))
            with open(text_dir + file_name + '.txt', 'r') as inf:
                text = inf.read()
            with open(output_dir + str(i) + '.txt', 'w') as outf:
                outf.write(text)

            # Merge gold '.keyphrases' + '.keywords' into one Maui '.key' file
            # (one phrase per line with a '\t1' relevance suffix).
            print('Target file %s' % (target_dir + file_name + '.keyphrases'))
            targets = []
            if target_dir.strip() != '':
                with open(target_dir + file_name + '.keyphrases', 'r') as inf:
                    targets.extend([l.strip() for l in inf.readlines()])
                with open(target_dir + file_name + '.keywords', 'r') as inf:
                    targets.extend([l.strip() for l in inf.readlines()])
            with open(output_dir + str(i) + '.key', 'w') as outf:
                outf.write('\n'.join([t + '\t1' for t in targets]))


def export_krapivin_maui():
    # prepare logging.
    config = keyphrase.config.setup_keyphrase_all()  # load settings.
    train_set, validation_set, test_sets, idx2word, word2idx = deserialize_from_file(config['dataset'])
    test_sets = load_additional_testing_data(config['testing_datasets'], idx2word, word2idx, config)

    # keep the first 400 in krapivin
    dataset = test_sets['krapivin']

    # Docs 401+ become Maui training data; docs 0-399 become test data.
    # NOTE(review): index 400 is skipped by the [401:]/[:400] split — confirm
    # whether that is intentional.
    train_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/krapivin/train/'
    if not os.path.exists(train_dir):
        os.makedirs(train_dir)
    train_texts = dataset['source_str'][401:]
    train_targets = dataset['target_str'][401:]
    for i, (train_text, train_target) in enumerate(zip(train_texts,train_targets)):
        print('train '+ str(i))
        with open(train_dir+str(i)+'.txt', 'w') as f:
            f.write(' '.join(train_text))
        with open(train_dir + str(i) + '.key', 'w') as f:
            f.write('\n'.join([' '.join(t)+'\t1' for t in train_target]))

    test_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/krapivin/test/'
    if not os.path.exists(test_dir):
        os.makedirs(test_dir)
    test_texts = dataset['source_str'][:400]
    test_targets = dataset['target_str'][:400]
    for i, (test_text, test_target) in enumerate(zip(test_texts,test_targets)):
        print('test '+ str(i))
        with open(test_dir+str(i)+'.txt', 'w') as f:
            f.write(' '.join(test_text))
# --- tail of export_krapivin_maui() (continues from the previous line) ---
        with open(test_dir + str(i) + '.key', 'w') as f:
            f.write('\n'.join([' '.join(t)+'\t1' for t in test_target]))


def export_ke20k_testing_maui():
    # Export the ke20k test docs into Maui's <id>.txt / <id>.key layout.
    from keyphrase.dataset import keyphrase_test_dataset
    target_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/ke20k/'
    config = keyphrase.config.setup_keyphrase_all()  # load settings.
    doc_list = keyphrase_test_dataset.testing_data_loader('ke20k', kwargs=dict(basedir = config['path'])).get_docs(False)
    for d in doc_list:
        d_id = d.name[:d.name.find('.txt')]
        print(d_id)
        with open(target_dir+d_id+'.txt', 'w') as textfile:
            textfile.write(d.title+'\n'+d.text)
        with open(target_dir + d_id + '.key', 'w') as phrasefile:
            for p in d.phrases:
                # Maui key format: phrase, tab, relevance (always 1 here).
                phrasefile.write('%s\t1\n' % p)


def export_ke20k_train_maui():
    '''
    just use the validation dataset
    :return:
    '''
    config = keyphrase.config.setup_keyphrase_all()  # load settings.
    target_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/ke20k/train/'
    import emolga,string
    printable = set(string.printable)
    validation_records = emolga.dataset.build_dataset.deserialize_from_file(config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'validation_record_'+str(config['validation_size'])+'.pkl')
    for r_id, r in enumerate(validation_records):
        print(r_id)
        # Strip non-printable characters from each field.
        # NOTE(review): relies on Python 2 `filter` returning a str; on
        # Python 3 these become filter objects and the concatenations below
        # would fail — confirm this module only runs under Python 2.
        r['title'] = filter(lambda x: x in printable, r['title'])
        r['abstract'] = filter(lambda x: x in printable, r['abstract'])
        r['keyword'] = filter(lambda x: x in printable, r['keyword'])
        with open(target_dir+str(r_id)+'.txt', 'w') as textfile:
            textfile.write(r['title']+'\n'+r['abstract'])
        with open(target_dir + str(r_id) + '.key', 'w') as phrasefile:
            # keywords are ';'-separated in the record
            for p in r['keyword'].split(';'):
                phrasefile.write('%s\t1\n' % p)


def prepare_data_cross_validation(input_dir, output_dir, folds=5):
    # Split the <name>.txt/<name>.key pairs in input_dir into k train/test
    # fold directories under output_dir.
    file_names = [ w[:w.index('.')] for w in filter(lambda x: x.endswith('.txt'),os.listdir(input_dir))]
    file_names.sort()
    file_names = numpy.asarray(file_names)
    # NOTE(review): on Python 3 this is float division and the slice indices
    # below would fail; Python 2 floor-division is assumed here.
    fold_size = len(file_names)/folds
for fold in range(folds): start = fold * fold_size end = start + fold_size if (fold == folds-1): end = len(file_names) print('Fold %d' % fold) test_names = file_names[start: end] train_names = file_names[list(filter(lambda x: x < start or x >= end, range(len(file_names))))] # print('test_names: %s' % str(test_names)) # print('train_names: %s' % str(train_names)) train_dir = output_dir + 'train_'+str(fold+1)+'/' if not os.path.exists(train_dir): os.makedirs(train_dir) test_dir = output_dir + 'test_'+str(fold+1)+'/' if not os.path.exists(test_dir): os.makedirs(test_dir) for test_name in test_names: shutil.copyfile(input_dir + test_name + '.txt', test_dir + test_name + '.txt') shutil.copyfile(input_dir + test_name + '.key', test_dir + test_name + '.key') for train_name in train_names: shutil.copyfile(input_dir + test_name + '.txt', train_dir + train_name + '.txt') shutil.copyfile(input_dir + test_name + '.key', train_dir + train_name + '.key') if __name__ == '__main__': # export_krapivin_maui() # export_maui() # input_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/ke20k/' # output_dir = '/Users/memray/Project/seq2seq-keyphrase/dataset/keyphrase/baseline-data/maui/ke20k/cross_validation/' # prepare_data_cross_validation(input_dir, output_dir, folds=5) # export_ke20k_maui() export_ke20k_train_maui() ================================================ FILE: keyphrase/config.py ================================================ import time import os import os.path as path def setup_keyphrase_stable(): config = dict() ''' Meta information ''' config['seed'] = 154316847 # for naming the outputs and logs config['model_name'] = 'CopyRNN' # 'TfIdf', 'TextRank', 'SingleRank', 'ExpandRank', 'Maui', 'Kea', 'RNN', 'CopyRNN' config['task_name'] = 'keyphrase-all.copy' config['timemark'] = time.strftime('%Y%m%d-%H%M%S', time.localtime(time.time())) config['path'] = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)) 
# --- interior of setup_keyphrase_stable() (indentation reconstructed) ---
    #path.realpath(path.curdir)
    config['path_experiment'] = config['path'] + '/Experiment/'+config['task_name']
    config['path_h5'] = config['path_experiment']
    config['path_log'] = config['path_experiment']
    config['casestudy_log'] = config['path_experiment'] + '/case-print.log'

    '''
    Experiment process
    '''
    # do training?
    config['do_train'] = True
    # config['do_train'] = False
    # do quick-testing (while training)?
    config['do_quick_testing'] = True
    # config['do_quick_testing'] = False
    # do validation?
    config['do_validate'] = True
    # config['do_validate'] = False
    # do predicting?
    config['do_predict'] = True
    # config['do_predict'] = False
    # do testing?
    # config['do_evaluate'] = True
    config['do_evaluate'] = False

    '''
    Training settings
    '''
    # Dataset
    config['training_name'] = 'acm-sci-journal_600k'
    # actually still not clean enough, further filtering is done when loading pairs: dataset_utils.load_pairs()
    config['training_dataset']= config['path'] + '/dataset/keyphrase/million-paper/all_title_abstract_keyword_clean.json'
    # config['testing_name'] = 'inspec_all'
    # config['testing_dataset'] = config['path'] + '/dataset/keyphrase/inspec/inspec_all.json'
    config['testing_datasets']= ['kp20k'] # 'inspec', 'nus', 'semeval', 'krapivin', 'kp20k'
    config['preprocess_type'] = 1 # 0 is old type, 1 is new type(keep most punctuation)
    config['data_process_name'] = 'punctuation-20000validation-20000testing/'

    config['validation_size'] = 20000
    config['validation_id'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'validation_id_'+str(config['validation_size'])+'.pkl'
    config['testing_id'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'testing_id_'+str(config['validation_size'])+'.pkl'
    config['dataset'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'all_600k_dataset.pkl'
    config['voc'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'all_600k_voc.pkl' # for manual check

    # Optimization
    config['use_noise'] = False
    config['optimizer'] = 'adam'
    config['clipnorm'] = 0.1
    config['save_updates'] = True
    config['get_instance'] = True

    # size
    config['batch_size'] = 128
    # config['mini_batch_size'] = 64
    config['mini_mini_batch_length'] = 50000 # max length (#words) of each mini-mini batch, up to the GPU memory you have
    config['mode'] = 'RNN'
    config['binary'] = False
    config['voc_size'] = 50000

    # output log place
    if not os.path.exists(config['path_log']):
        os.mkdir(config['path_log'])

    # path to pre-trained model
    # config['trained_model'] = None
    config['trained_model'] = config['path_experiment'] + '/experiments.keyphrase-all.one2one.copy.id=20170106-025508.epoch=4.batch=1000.pkl'
    # config['trained_model'] = config['path_experiment'] + '/experiments.keyphrase-all.one2one.copy.id=20170106-025508.epoch=4.batch=1000.pkl'
    config['weight_json']= config['path_experiment'] + '/model_weight.json'
    config['resume_training'] = False
    config['training_archive']= None

    '''
    Predicting/evaluation settings
    '''
    config['return_encoding'] = True
    config['baseline_data_path'] = config['path'] + '/dataset/keyphrase/baseline-data/'
    # whether to add length penalty on beam search results
    config['normalize_score'] = False
    # whether to keep the longest prediction when many phrases sharing same prefix, like for 'A','AB','ABC' we only keep 'ABC'
    config['keep_longest'] = False
    # whether do filtering on groundtruth? 'appear-only','non-appear-only' and None (do no filtering)
    config['target_filter'] = 'appear-only'
    # whether do filtering on predictions? 'appear-only','non-appear-only' and None (do no filtering)
    config['predict_filter'] = 'appear-only'
    config['noun_phrase_only'] = False
    config['max_len'] = 6
    config['sample_beam'] = 200 #config['voc_size']
    config['sample_stoch'] = False # use beamsearch
    config['sample_argmax'] = False
    config['predict_type'] = 'generative' # type of prediction, extractive or generative
    # config['predict_path'] = config['path_experiment'] + '/predict.' + config['predict_type']+ '.'+ config['timemark'] + '.dataset=%d.len=%d.beam=%d.predict=%s.target=%s.keeplongest=%s.noun_phrase=%s/' % (len(config['testing_datasets']),config['max_len'], config['sample_beam'], config['predict_filter'], config['target_filter'], config['keep_longest'], config['noun_phrase_only'])
    # hard-coded to reuse a specific previous prediction run
    config['predict_path'] = os.path.join(config['path_experiment'], 'predict.generative.20170712-221404.dataset=1.len=6.beam=200.predict=appear-only.target=appear-only.keeplongest=False.noun_phrase=False/')
    if not os.path.exists(config['predict_path']):
        os.mkdir(config['predict_path'])

    '''
    Model settings
    '''
    # Encoder: Model
    config['bidirectional'] = True
    config['enc_use_contxt'] = False
    config['enc_learn_nrm'] = True
    config['enc_embedd_dim'] = 150 # 100
    config['enc_hidden_dim'] = 300 # 150
    config['enc_contxt_dim'] = 0
    config['encoder'] = 'RNN'
    config['pooling'] = False

    # Decoder: dimension
    config['dec_embedd_dim'] = 150 # 100
    config['dec_hidden_dim'] = 300 # 180
    # bidirectional encoder concatenates both directions, doubling the context dim
    config['dec_contxt_dim'] = config['enc_hidden_dim'] \
        if not config['bidirectional'] \
        else 2 * config['enc_hidden_dim']

    # Decoder: CopyNet
    config['copynet'] = True
    # config['copynet'] = False
    config['identity'] = False
    config['location_embed'] = True
    config['coverage'] = True
    config['copygate'] = False

    # Decoder: Model
    config['shared_embed'] = False
    config['use_input'] = True
    config['bias_code'] = True
    config['dec_use_contxt'] = True
    config['deep_out'] = False
    config['deep_out_activ'] = 'tanh' # maxout2
    config['bigram_predict'] = True
    config['context_predict'] = True
    config['dropout'] = 0.5 # 5
    config['leaky_predict'] = False

    # readout width grows with the optional context / previous-word inputs
    config['dec_readout_dim'] = config['dec_hidden_dim']
    if config['dec_use_contxt']:
        config['dec_readout_dim'] += config['dec_contxt_dim']
    if config['bigram_predict']:
        config['dec_readout_dim'] += config['dec_embedd_dim']

    # Decoder: sampling
    config['multi_output'] = False
    config['decode_unk'] = False
    config['explicit_loc'] = False

    # Gradient Tracking !!!
# --- tail of setup_keyphrase_stable(), then setup_keyphrase_train() ---
    config['gradient_check'] = True
    config['gradient_noise'] = True
    config['skip_size'] = 15

    # for w in config:
    #     print('{0} => {1}'.format(w, config[w]))
    # print('setup ok.')
    return config


def setup_keyphrase_train():
    # Configuration for a training run: same layout as setup_keyphrase_stable,
    # but do_predict is off, no pre-trained model, and 100-dim embeddings.
    config = dict()

    '''
    Meta information
    '''
    config['seed'] = 154316847

    # for naming the outputs and logs
    config['model_name'] = 'CopyRNN' # 'TfIdf', 'TextRank', 'SingleRank', 'ExpandRank', 'Maui', 'Kea', 'RNN', 'CopyRNN'
    config['task_name'] = 'keyphrase-all.copy'
    config['timemark'] = time.strftime('%Y%m%d-%H%M%S', time.localtime(time.time()))

    config['path'] = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)) #path.realpath(path.curdir)
    config['path_experiment'] = config['path'] + '/Experiment/'+config['task_name']
    config['path_h5'] = config['path_experiment']
    config['path_log'] = config['path_experiment']
    config['casestudy_log'] = config['path_experiment'] + '/case-print.log'

    '''
    Experiment process
    '''
    # do training?
    config['do_train'] = True
    # config['do_train'] = False
    # do quick-testing (while training)?
    config['do_quick_testing'] = True
    # config['do_quick_testing'] = False
    # do validation?
    config['do_validate'] = True
    # config['do_validate'] = False
    # do predicting?
    # config['do_predict'] = True
    config['do_predict'] = False
    # do testing?
    # config['do_evaluate'] = True
    config['do_evaluate'] = False

    '''
    Training settings
    '''
    # Dataset
    config['training_name'] = 'acm-sci-journal_600k'
    # actually still not clean enough, further filtering is done when loading pairs: dataset_utils.load_pairs()
    config['training_dataset']= config['path'] + '/dataset/keyphrase/million-paper/all_title_abstract_keyword_clean.json'
    # config['testing_name'] = 'inspec_all'
    # config['testing_dataset'] = config['path'] + '/dataset/keyphrase/inspec/inspec_all.json'
    config['testing_datasets']= ['kp20k'] # 'inspec', 'nus', 'semeval', 'krapivin', 'kp20k'
    config['preprocess_type'] = 1 # 0 is old type, 1 is new type(keep most punctuation)
    config['data_process_name'] = 'punctuation-20000validation-20000testing/'

    config['validation_size'] = 20000
    config['validation_id'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'validation_id_'+str(config['validation_size'])+'.pkl'
    config['testing_id'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'testing_id_'+str(config['validation_size'])+'.pkl'
    config['dataset'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'all_600k_dataset.pkl'
    config['voc'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'all_600k_voc.pkl' # for manual check

    # Optimization
    config['use_noise'] = False
    config['optimizer'] = 'adam'
    config['clipnorm'] = 0.1
    config['save_updates'] = True
    config['get_instance'] = True

    # size
    config['batch_size'] = 128
    # config['mini_batch_size'] = 64
    config['mini_mini_batch_length'] = 50000 # max length (#words) of each mini-mini batch, up to the GPU memory you have
    config['mode'] = 'RNN'
    config['binary'] = False
    config['voc_size'] = 50000

    # output log place
    if not os.path.exists(config['path_log']):
        os.mkdir(config['path_log'])

    # path to pre-trained model
    config['trained_model'] = None
    config['weight_json']= config['path_experiment'] + '/model_weight.json'
    config['resume_training'] = False
    config['training_archive']= None

    '''
    Predicting/evaluation settings
    '''
    config['baseline_data_path'] = config['path'] + '/dataset/keyphrase/baseline-data/'
    # whether to add length penalty on beam search results
    config['normalize_score'] = False
    # whether to keep the longest prediction when many phrases sharing same prefix, like for 'A','AB','ABC' we only keep 'ABC'
    config['keep_longest'] = False
    # whether do filtering on groundtruth? 'appear-only','non-appear-only' and None (do no filtering)
    config['target_filter'] = 'appear-only'
    # whether do filtering on predictions? 'appear-only','non-appear-only' and None (do no filtering)
    config['predict_filter'] = 'appear-only'
    config['noun_phrase_only'] = False
    config['max_len'] = 6
    config['sample_beam'] = 200 #config['voc_size']
    config['sample_stoch'] = False # use beamsearch
    config['sample_argmax'] = False
    config['predict_type'] = 'generative' # type of prediction, extractive or generative
    # config['predict_path'] = config['path_experiment'] + '/predict.' + config['predict_type']+ '.'+ config['timemark'] + '.dataset=%d.len=%d.beam=%d.predict=%s.target=%s.keeplongest=%s.noun_phrase=%s/' % (len(config['testing_datasets']),config['max_len'], config['sample_beam'], config['predict_filter'], config['target_filter'], config['keep_longest'], config['noun_phrase_only'])
    # hard-coded to reuse a specific previous prediction run
    config['predict_path'] = os.path.join(config['path_experiment'], 'predict.generative.20170712-221404.dataset=1.len=6.beam=200.predict=appear-only.target=appear-only.keeplongest=False.noun_phrase=False/')
    if not os.path.exists(config['predict_path']):
        os.mkdir(config['predict_path'])

    '''
    Model settings
    '''
    # Encoder: Model
    config['bidirectional'] = True
    config['enc_use_contxt'] = False
    config['enc_learn_nrm'] = True
    config['enc_embedd_dim'] = 100 # 100
    config['enc_hidden_dim'] = 300 # 150
    config['enc_contxt_dim'] = 0
    config['encoder'] = 'RNN'
    config['pooling'] = False

    # Decoder: dimension
    config['dec_embedd_dim'] = 100 # 100
    config['dec_hidden_dim'] = 300 # 180
    # bidirectional encoder concatenates both directions, doubling the context dim
    config['dec_contxt_dim'] = config['enc_hidden_dim'] \
        if not config['bidirectional'] \
        else 2 * config['enc_hidden_dim']

    # Decoder: CopyNet
    config['copynet'] = True
    # config['copynet'] = False
    config['identity'] = False
    config['location_embed'] = True
    config['coverage'] = True
    config['copygate'] = False

    # Decoder: Model
    config['shared_embed'] = False
    config['use_input'] = True
    config['bias_code'] = True
    config['dec_use_contxt'] = True
    config['deep_out'] = False
    config['deep_out_activ'] = 'tanh' # maxout2
    config['bigram_predict'] = True
    config['context_predict'] = True
    config['dropout'] = 0.5 # 5
    config['leaky_predict'] = False

    # readout width grows with the optional context / previous-word inputs
    config['dec_readout_dim'] = config['dec_hidden_dim']
    if config['dec_use_contxt']:
        config['dec_readout_dim'] += config['dec_contxt_dim']
    if config['bigram_predict']:
        config['dec_readout_dim'] += config['dec_embedd_dim']

    # Decoder: sampling
    config['multi_output'] = False
    config['decode_unk'] = False
    config['explicit_loc'] = False

    # Gradient Tracking !!!
# --- tail of setup_keyphrase_train(), then setup_keyphrase_baseline() ---
    config['gradient_check'] = True
    config['gradient_noise'] = True
    config['skip_size'] = 15

    # for w in config:
    #     print('{0} => {1}'.format(w, config[w]))
    # print('setup ok.')
    return config


def setup_keyphrase_baseline():
    # Configuration for baseline evaluation only: training/prediction off,
    # do_evaluate on, extractive prediction with a full-vocabulary beam.
    config = dict()

    # config['seed'] = 3030029828
    config['seed'] = 154316847

    config['task_name'] = 'baseline'
    # config['task_name'] = 'copynet-keyphrase-all.one2one.copy'
    config['timemark'] = time.strftime('%Y%m%d-%H%M%S', time.localtime(time.time()))

    config['use_noise'] = False
    config['optimizer'] = 'adam'
    config['clipnorm'] = 0.1
    config['save_updates'] = True
    config['get_instance'] = True

    config['path'] = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)) #path.realpath(path.curdir)
    config['path_experiment'] = config['path'] + '/Experiment/'+config['task_name']
    config['path_h5'] = config['path_experiment']
    config['path_log'] = config['path_experiment']
    config['casestudy_log'] = config['path_experiment'] + '/case-print.log'

    # do training?
    config['do_train'] = False
    # do predicting?
    # config['do_predict'] = True
    config['do_predict'] = False
    # do testing?
    config['do_evaluate'] = True
    # config['do_evaluate'] = False
    # do validation?
    config['do_validate'] = False

    config['training_name'] = 'acm-sci-journal_600k'
    config['training_dataset']= config['path'] + '/dataset/keyphrase/million-paper/all_title_abstract_keyword_clean.json'
    config['testing_name'] = 'inspec_all'
    config['testing_dataset'] = config['path'] + '/dataset/keyphrase/inspec/inspec_all.json'
    config['testing_datasets']= ['kp20k'] # 'inspec', 'nus', 'semeval', 'krapivin', 'ke20k', 'kdd', 'www', 'umd'
    config['preprocess_type'] = 1 # 0 is old type, 1 is new type(keep most punctuation)
    config['data_process_name'] = 'eos-punctuation-1000validation/'

    config['validation_size'] = 20000
    config['validation_id'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'validation_id_'+str(config['validation_size'])+'.pkl'
    config['testing_id'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'testing_id_'+str(config['validation_size'])+'.pkl'
    config['dataset'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'all_600k_dataset.pkl'
    config['voc'] = config['path'] + '/dataset/keyphrase/'+config['data_process_name']+'all_600k_voc.pkl' # for manual check

    # size
    config['batch_size'] = 100
    config['mini_batch_size'] = 20
    config['mode'] = 'RNN' # NTM
    config['binary'] = False
    config['voc_size'] = 50000

    # output log place
    if not os.path.exists(config['path_log']):
        os.mkdir(config['path_log'])

    # trained_model
    config['trained_model'] = config['path_experiment'] + '/experiments.copynet-keyphrase-all.one2one.copy.id=20161220-070035.epoch=2.batch=20000.pkl'
    # A copy-model
    #   config['path_experiment'] + '/experiments.copynet-keyphrase-all.one2one.copy.id=20161220-070035.epoch=2.batch=20000.pkl'
    # A well-trained no-copy model
    #   config['path_experiment'] + '/experiments.keyphrase-all.one2one.nocopy.id=20161230-000056.epoch=3.batch=1000.pkl'
    # A well-trained model on all data
    #   path.realpath(path.curdir) + '/Experiment/' + 'copynet-keyphrase-all.one2one.nocopy..emb=100.hid=150/experiments.copynet-keyphrase-all.one2one.nocopy.id=20161129-195005.epoch=2.pkl'
    # A well-trained model on acm data
    #   config['path_experiment'] + '/experiments.copynet-keyphrase-all.one2one.nocopy.id=20161129-195005.epoch=2.pkl'

    config['weight_json']= config['path_experiment'] + '/model_weight.json'
    config['resume_training'] = False
    config['training_archive']= None
    #config['path_experiment'] + '/save_training_status.id=20161229-135001.epoch=1.batch=1000.pkl'
    #config['path_experiment'] + '/save_training_status.pkl'

    # # output hdf5 file.
    # config['weights_file'] = config['path'] + '/froslass/model-pool/'
    # if not os.path.exists(config['weights_file']):
    #     os.mkdir(config['weights_file'])

    config['max_len'] = 6
    config['sample_beam'] = config['voc_size']
    config['sample_stoch'] = False # use beamsearch
    config['sample_argmax'] = False
    config['predict_type'] = 'extractive' # type of prediction, extractive or generative
    config['predict_path'] = config['path_experiment'] + '/predict.'+config['timemark']+'.data=5.len=6.beam=all.predict=appear_only.target=appear_only/'
    # config['path_experiment'] + '/predict.20161231-152451.len=6.beam=200.target=appear_only/'
    # '/copynet-keyphrase-all.one2one.nocopy.extractive.predict.pkl'
    if not os.path.exists(config['predict_path']):
        os.mkdir(config['predict_path'])
    # config['path_experiment'] + '/copynet-keyphrase-all.one2one.nocopy.generate.len=8.beam=50.predict.pkl'
    # '/copynet-keyphrase-all.one2one.nocopy.extract.predict.pkl'
    #config['path_experiment'] + '/'+ config['task_name']+ '.' + config['predict_type'] + ('.len={0}.beam={1}'.format(config['max_len'], config['sample_beam'])) + '.predict.pkl'
    # prediction on testing data

    # Evaluation
    config['baseline_data_path'] = config['path'] + '/dataset/keyphrase/baseline-data/'
    config['normalize_score'] = True
    # # config['normalize_score'] = True
    config['predict_filter'] = 'appear-only' # [USELESS]whether do filtering on predictions? 'appear-only','non-appear-only' and None
    config['target_filter'] = 'appear-only' # 'appear-only' # whether do filtering on groundtruth? 'appear-only','non-appear-only' and None
    config['keep_longest'] = False # whether keep the longest phrases only, as there're too many phrases are part of other longer phrases
    config['noun_phrase_only'] = False
    config['number_to_predict'] = 10 # [desperated] the k in P@k,R@k,F1@k

    # Encoder: Model
    config['bidirectional'] = True
    config['enc_use_contxt'] = False
    config['enc_learn_nrm'] = True
    config['enc_embedd_dim'] = 150 # 100
    config['enc_hidden_dim'] = 300 # 150
    config['enc_contxt_dim'] = 0
    config['encoder'] = 'RNN'
    config['pooling'] = False

    # Decoder: dimension
    config['dec_embedd_dim'] = 150 # 100
    config['dec_hidden_dim'] = 300 # 180
    # bidirectional encoder concatenates both directions, doubling the context dim
    config['dec_contxt_dim'] = config['enc_hidden_dim'] \
        if not config['bidirectional'] \
        else 2 * config['enc_hidden_dim']

    # Decoder: CopyNet
    config['copynet'] = True
    config['identity'] = False
    config['location_embed'] = True
    config['coverage'] = True
    config['copygate'] = False

    # Decoder: Model
    config['shared_embed'] = False
    config['use_input'] = True
    config['bias_code'] = True
    config['dec_use_contxt'] = True
    config['deep_out'] = False
    config['deep_out_activ'] = 'tanh' # maxout2
    config['bigram_predict'] = True
    config['context_predict'] = True
    config['dropout'] = 0.5 # 5
    config['leaky_predict'] = False

    # readout width grows with the optional context / previous-word inputs
    config['dec_readout_dim'] = config['dec_hidden_dim']
    if config['dec_use_contxt']:
        config['dec_readout_dim'] += config['dec_contxt_dim']
    if config['bigram_predict']:
        config['dec_readout_dim'] += config['dec_embedd_dim']

    # Decoder: sampling
    config['multi_output'] = False
    config['decode_unk'] = False
    config['explicit_loc'] = False

    # Gradient Tracking !!!
    config['gradient_check'] = True
    config['gradient_noise'] = True
    config['skip_size'] = 15

    # for w in config:
    #     print '{0} => {1}'.format(w, config[w])
    # print 'setup ok.'
return config ================================================ FILE: keyphrase/dataset/__init__.py ================================================ #!/usr/bin/env python # -*- coding: utf-8 -*- """ Python File Template """ import os __author__ = "Rui Meng" __email__ = "rui.meng@pitt.edu" if __name__ == '__main__': pass ================================================ FILE: keyphrase/dataset/dataset_utils.py ================================================ #!/usr/bin/env python # -*- coding: utf-8 -*- """ Python File Template """ import os import nltk import numpy import numpy as np import re import emolga.dataset.build_dataset as db sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') SENTENCEDELIMITER = '' DIGIT = '' __author__ = "Rui Meng" __email__ = "rui.meng@pitt.edu" def prepare_text(record, process_type=1): ''' :param type: 0 is old way, do sentence splitting 1 is new way, keep most of punctuations 2 just return the text, no processing concatenate title and abstract, do sentence tokenization if needed As I keep most of punctuations (including period), actually I should have stopped doing sentence boundary detection ''' if process_type==0: # replace e.g. 
to avoid noise for sentence boundary detection text = record['abstract'].replace('e.g.', 'eg') # pad space before and after certain punctuations [_,.<>()'%] title = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', record['title']) sents = [re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', s) for s in sent_detector.tokenize(text)] text = title + ' ' + SENTENCEDELIMITER + ' ' + (' ' + SENTENCEDELIMITER + ' ').join(sents) elif process_type==1: text = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', record['title']) + ' '+SENTENCEDELIMITER + ' ' + re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', record['abstract']) elif process_type==2: text = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', record['abstract']) return text def get_tokens(text, process_type=1): ''' parse the feed-in text, filtering and tokenization :param text: :param type: 0 is old way, only keep [_<>,], do sentence boundary detection, replace digits to 1 is new way, keep [_<>,\(\)\.\'%], replace digits to , split by [^a-zA-Z0-9_<>,\(\)\.\'%] :return: a list of tokens ''' if process_type == 0: text = re.sub(r'[\r\n\t]', ' ', text) # tokenize by non-letters tokens = filter(lambda w: len(w) > 0, re.split(r'[^a-zA-Z0-9_<>,]', text)) # replace the digit terms with tokens = [w if not re.match('^\d+$', w) else DIGIT for w in tokens] elif process_type == 1: text = text.lower() text = re.sub(r'[\r\n\t]', ' ', text) # tokenize by non-letters tokens = filter(lambda w: len(w) > 0, re.split(r'[^a-zA-Z0-9_<>,\(\)\.\'%]', text)) # replace the digit terms with tokens = [w if not re.match('^\d+$', w) else DIGIT for w in tokens] elif process_type == 2: text = text.lower() text = re.sub(r'[\r\n\t]', ' ', text) # tokenize by non-letters tokens = filter(lambda w: len(w) > 0, re.split(r'[^a-zA-Z0-9_<>,\(\)\.\'%]', text)) return tokens def process_keyphrase(keyword_str): keyphrases = keyword_str.lower() # replace abbreviations keyphrases = re.sub(r'\(.*?\)', ' ', keyphrases) # pad whitespace before and after punctuations keyphrases = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', 
keyphrases) # tokenize with same delimiters keyphrases = [filter(lambda w: len(w) > 0, re.split(r'[^a-zA-Z0-9_<>,\(\)\.\'%]', phrase)) for phrase in keyphrases.split(';')] # replace digit with keyphrases = [[w if not re.match('^\d+$', w) else DIGIT for w in phrase] for phrase in keyphrases] return keyphrases def build_data(data, idx2word, word2idx): Lmax = len(idx2word) # don't keep the original string, or the dataset would be over 2gb instance = dict(source_str=[], target_str=[], source=[], target=[], target_c=[]) # instance = dict(source=[], target=[]) for count, pair in enumerate(data): source, target = pair # if not multi_output: # A = [word2idx[w] for w in source] # B = [word2idx[w] for w in target] # # C = np.asarray([[w == l for w in source] for l in target], dtype='float32') # C = [0 if w not in source else source.index(w) + Lmax for w in target] # else: A = [word2idx[w] if w in word2idx else word2idx[''] for w in source] B = [[word2idx[w] if w in word2idx else word2idx[''] for w in p] for p in target] # C = np.asarray([[w == l for w in source] for l in target], dtype='float32') C = [[0 if w not in source else source.index(w) + Lmax for w in p] for p in target] # actually only source,target are used in model # instance['source_str'] += [source] # instance['target_str'] += [target] instance['source'] += [A] instance['target'] += [B] # instance['target_c'] += [C] # instance['cc_matrix'] += [C] if count % 1000 == 0: print('-------------------- %d ---------------------------' % count) print(source) print(target) print(A) print(B) print(C) return instance def load_pairs(records, process_type=1 ,do_filter=False): wordfreq = dict() filtered_records = [] pairs = [] import string printable = set(string.printable) for id, record in enumerate(records): record['keyword'] = ''.join(list(filter(lambda x: x in printable, record['keyword']))) record['abstract'] = ''.join(list(filter(lambda x: x in printable, record['abstract']))) record['title'] = 
''.join(list(filter(lambda x: x in printable, record['title']))) text = prepare_text(record, process_type) tokens = get_tokens(text, process_type) keyphrases = process_keyphrase(record['keyword']) for w in tokens: if w not in wordfreq: wordfreq[w] = 1 else: wordfreq[w] += 1 for keyphrase in keyphrases: for w in keyphrase: if w not in wordfreq: wordfreq[w] = 1 else: wordfreq[w] += 1 if id % 10000 == 0 and id > 1: print('%d \n\t%s \n\t%s \n\t%s' % (id, text, tokens, keyphrases)) # break fine_tokens = re.split(r'[\.,;]',record['keyword'].lower()) if sum([len(k) for k in keyphrases]) != 0: ratio1 = float(len(record['keyword'])) / float(sum([len(k) for k in keyphrases])) ratio2 = float(sum([len(k) for k in fine_tokens])) / float(len(fine_tokens)) else: ratio1 = 0 ratio2 = 0 if ( do_filter and (ratio1< 3.5)): # usually ratio1 < 3.5 is noise. actually ratio2 is more reasonable, but we didn't use out of consistency print('!' * 100) print('Error found') print('%d - title=%s, \n\ttext=%s, \n\tkeyphrase=%s \n\tkeyphrase after process=%s \n\tlen(keyphrase)=%d, #(tokens in keyphrase)=%d \n\tratio1=%.3f\tratio2=%.3f' % ( id, record['title'], record['abstract'], record['keyword'], keyphrases, len(record['keyword']), sum([len(k) for k in keyphrases]), ratio1, ratio2)) continue pairs.append((tokens, keyphrases)) filtered_records.append(record) return filtered_records, pairs, wordfreq def get_none_phrases(source_text, source_postag, max_len): np_regex = r'^(JJ|JJR|JJS|VBG|VBN)*(NN|NNS|NNP|NNPS|VBG)+$' np_list = [] for i in range(len(source_text)): for j in range(i+1, len(source_text)+1): if j-i > max_len: continue if j-i == 1 and (source_text[i:j]=='' or len(source_text[i:j][0])==1): continue tagseq = ''.join(source_postag[i:j]) if re.match(np_regex, tagseq): np_list.append((source_text[i:j], source_postag[i:j])) print('Text: \t\t %s' % str(source_text)) print('None Phrases:[%d] \n\t\t\t%s' % (len(np_list), str('\n\t\t\t'.join([str(p[0])+'['+str(p[1])+']' for p in np_list])))) 
return np_list if __name__ == '__main__': import keyphrase.config as configs config = configs.setup_keyphrase_all() test_set = db.deserialize_from_file( config['path'] + '/dataset/keyphrase/' + config['data_process_name'] + 'semeval.testing.pkl') for s_index, s_str, s_tag in zip(test_set['source'], test_set['source_str'], [[s[1] for s in d ]for d in test_set['tagged_source']]): get_none_phrases(s_str, s_tag, config['max_len']) ================================================ FILE: keyphrase/dataset/inspec/__init__.py ================================================ #!/usr/bin/env python # -*- coding: utf-8 -*- """ Python File Template """ import os __author__ = "Rui Meng" __email__ = "rui.meng@pitt.edu" if __name__ == '__main__': pass ================================================ FILE: keyphrase/dataset/inspec/inspec_export_json.py ================================================ #!/usr/bin/env python # -*- coding: utf-8 -*- """ Python File Template """ import os import re import json import nltk __author__ = "Rui Meng" __email__ = "rui.meng@pitt.edu" SENTENCEDELIMITER = '.' #'' DIGIT = '' def export_Inspec_tokenized(dir_name, output_name): """ Paper abstracts Inspec (Hulth, 2003)∗ #(Documents)=2,000, #(Tokens/doc) <200, #(Keys/doc)=10 Load data in seperate files and export to a json Only .uncontr is used (used in Hulth, 2003) """ sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') total_length = 0 total_keyphrases = 0 total_occurance_keyphrases = 0 output_list = [] files = os.listdir(dir_name) count = 0 for c, f in enumerate(files): if f.endswith('.abstr'): count += 1 file_no = f[:f.index('.abstr')] text_file_name = f print("No. %s : file=%s, name=%s" % (c, f, file_no)) keyphrase_file_name = file_no+'.uncontr' with open(dir_name+os.sep+text_file_name, 'r') as text_file: text = text_file.read().lower() text = text.replace('\n', SENTENCEDELIMITER, 1) # first line is title. 
text = re.sub('[\t\r\n]', ' ', text) text = (' '+SENTENCEDELIMITER+' ').join(sent_detector.tokenize(text)) text = re.sub('\d+', ' ', text) text = filter(lambda w: len(w)>0, re.split('\W', text)) with open(dir_name+os.sep+keyphrase_file_name, 'r') as keyphrase_file: keyphrases = keyphrase_file.read().lower() keyphrases = re.sub('[\t\r\n]', ' ', keyphrases) keyphrases = re.sub('\d+', ' '+DIGIT+' ', keyphrases) keyphrases = [re.sub('\W+', ' ', w.strip()) for w in keyphrases.split(';')] for k in keyphrases: dict = {} dict['filename']=f dict['name']=k.split() dict['tokens']=text output_list.append(dict) clean_text = ' '.join(text) print(keyphrases) print(text) print(clean_text) print('text length = %s' % len(text)) number_appearence = len(filter(lambda x: clean_text.find(x) > 0, keyphrases)) print('keyphrase occurance = %s/%s' % (number_appearence,len(keyphrases))) print('--------------------------------------') total_length += len(text) total_keyphrases += len(keyphrases) total_occurance_keyphrases += number_appearence print('total documents = %s') % count print('average doc length = %s' % (float(total_length)/count)) print('keyphrase occurance = %s/%s' % (total_occurance_keyphrases, total_keyphrases)) print('averge keyphrase occurance = %s/%s' % (float(total_occurance_keyphrases)/count, float(total_keyphrases)/count)) with open(output_name, 'w') as json_file: json_file.write(json.dumps(output_list)) def export_Inspec(Inspec_input_path, Inspec_output_path): ''' export to json, without any preprocess :param Inspec_folder_name: :return: ''' count = 0 output_list = [] for p, folders, docs in os.walk(Inspec_input_path): for f in docs: if f.endswith('.abstr'): count += 1 file_no = f[:f.index('.abstr')] text_file_name = f print("No. %s : file=%s, name=%s" % (count, f, file_no)) with open(os.path.join(p, f), 'r') as text_file: text = text_file.read() title = text[:text.find('\r')] # first line is title. 
title = re.sub('[\t\r\n]', ' ', title).strip() text = text[text.find('\r'):] text = re.sub('[\t\r\n]', ' ', text).strip() keyphrase_file_name = file_no + '.uncontr' # keyphrase_file_name = file_no + '.contr' with open(os.path.join(p, keyphrase_file_name), 'r') as keyphrase_file: keyphrases = keyphrase_file.read().lower() keyphrases = re.sub('[\t\r\n]', ' ', keyphrases) keyphrases = [re.sub('\W+', ' ', w.strip()) for w in keyphrases.split(';')] keyphrases = ';'.join(keyphrases) dict = {} dict['title'] = title dict['keyword'] = keyphrases dict['abstract'] = text output_list.append(dict) print('\ttitle: %s' % dict['title']) print('\tabstract: %s' % text) print('\tkeyword: %s' % keyphrases) print('text length = %s' % len(text)) print('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n') with open(Inspec_output_path, 'w') as json_file: json_file.write(json.dumps(output_list)) BASE_DIR = os.path.realpath(os.path.curdir)+'/dataset/keyphrase/inspec/' if __name__ == '__main__': os.chdir(BASE_DIR) # Inspec_folder_name = 'test' Inspec_input_name = 'all' # folder Inspec_output_name = 'inspec_all_tokenized.json' # export_Inspec(BASE_DIR+Inspec_input_name, BASE_DIR+Inspec_output_name) export_Inspec_tokenized(BASE_DIR+Inspec_input_name, BASE_DIR+Inspec_output_name) ================================================ FILE: keyphrase/dataset/inspec/key_convert_maui.py ================================================ #!/usr/bin/env python # -*- coding: utf-8 -*- """ Clean the dirty format of keyword files Convert to one keyword per line """ import os import re __author__ = "Rui Meng" __email__ = "rui.meng@pitt.edu" if __name__ == '__main__': UNCONTR_DIR = "/home/memray/Project/keyphrase/maui/src/test/resources/data/Inspec/uncontr/" OUTPUT_DIR = "/home/memray/Project/keyphrase/maui/src/test/resources/data/Inspec/key/" # os.makedirs(UNCONTR_DIR) # os.makedirs(OUTPUT_DIR) for file_name in os.listdir(UNCONTR_DIR): file_no = file_name[:file_name.find('.')] with open(UNCONTR_DIR+file_name) as file: 
str = ' '.join(file.readlines()).replace('\n',' ').replace('\t',' ') keywords = [key.strip() for key in str.split(';')] with open(OUTPUT_DIR+file_no+'.key', 'w') as output_file: for keyword in keywords: keyword = re.sub('\s+',' ',keyword) print(keyword+'\t1\n') output_file.write(keyword+'\t1\n') ================================================ FILE: keyphrase/dataset/json_count.py ================================================ #!/usr/bin/env python # -*- coding: utf-8 -*- """ just check how many data instances in the json """ import os import json __author__ = "Rui Meng" __email__ = "rui.meng@pitt.edu" if __name__ == '__main__': print(os.getcwd()) file_path = '../dataset/json/gradle_test_methodnaming.json' with open(file_path) as json_string: data = json.load(json_string) print(len(data)) ================================================ FILE: keyphrase/dataset/keyphrase_dataset.py ================================================ # coding=utf-8 import json import sys import time import nltk import numpy import numpy as np import keyphrase.config as config from emolga.dataset.build_dataset import * import re from keyphrase_test_dataset import * MULTI_OUTPUT = False TOKENIZE_SENTENCE = True sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') wordfreq = dict() SENTENCEDELIMITER = '' DIGIT = '' ''' desperated an old function for load and parse data ''' def get_tokens(text, type=1): ''' parse the feed-in text, filtering and tokenization :param text: :param type: 0 is old way, only keep [_<>,], do sentence boundary detection, replace digits to 1 is new way, keep [_<>,\(\)\.\'%], replace digits to , split by [^a-zA-Z0-9_<>,\(\)\.\'%] :return: a list of tokens ''' if type == 0: text = re.sub(r'[\r\n\t]', ' ', text) text = text.replace('e.g.', 'eg') sents = [re.sub(r'[_<>,]', ' \g<0> ', s) for s in sent_detector.tokenize(text)] text = (' ' + SENTENCEDELIMITER + ' ').join(sents) text = text.lower() # tokenize by non-letters tokens = filter(lambda w: len(w) > 
0, re.split(r'[^a-zA-Z0-9_<>,]', text)) # replace the digit terms with tokens = [w if not re.match('^\d+$', w) else DIGIT for w in tokens] elif type == 1: text = text.lower() text = re.sub(r'[\r\n\t]', ' ', text) text = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', text) # tokenize by non-letters tokens = filter(lambda w: len(w) > 0, re.split(r'[^a-zA-Z0-9_<>,\(\)\.\'%]', text)) # replace the digit terms with tokens = [w if not re.match('^\d+$', w) else DIGIT for w in tokens] return tokens def load_data(input_path, tokenize_sentence=True): ''' :param input_path: :param tokenize_sentence: :return: ''' global wordfreq # load data set from json pairs = [] f = open(input_path, 'r') records = json.load(f) for id, record in enumerate(records): # record['abstract'] = record['abstract'].encode('utf8') # record['title'] = record['abstract'].encode('utf8') # record['keyword'] = record['abstract'].encode('utf8') # record = json.loads(record) if (tokenize_sentence): text = record['abstract'].replace('e.g.','eg') title = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', record['title']) sents = [re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', s) for s in sent_detector.tokenize(text)] text = title + ' '+SENTENCEDELIMITER+' ' + (' '+SENTENCEDELIMITER+' ').join(sents) else: text = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', record['title'] + ' . 
' + record['abstract']) text = text.lower() # text = re.sub(r'[_<>,()\.\']', ' \g<0> ', text) # tokenize by non-letters text = filter(lambda w: len(w) > 0, re.split(r'[^a-zA-Z0-9_<>,\(\)\.\'%]', text)) # replace the digit terms with text = [w if not re.match('^\d+$', w) else DIGIT for w in text] for w in text: if w not in wordfreq: wordfreq[w] = 1 else: wordfreq[w] += 1 # store the terms of outputs keyphrases = record['keyword'].lower() keyphrases = re.sub(r'\(.*?\)', ' ', keyphrases) keyphrases = re.sub(r'[_<>,\(\)\.\'%]', ' \g<0> ', keyphrases) # tokenize with same delimiters keyphrases = [filter(lambda w: len(w) > 0, re.split(r'[^a-zA-Z0-9_<>,\(\)\.\'%]', phrase)) for phrase in keyphrases.split(';')] # replace digit with keyphrases = [[w if not re.match('^\d+$', w) else DIGIT for w in phrase] for phrase in keyphrases] for keyphrase in keyphrases: for w in keyphrase: if w not in wordfreq: wordfreq[w] = 1 else: wordfreq[w] += 1 pairs.append((text, keyphrases)) if id % 10000 == 0: print('%d \n\t%s \n\t%s' % (id, text, keyphrases)) return pairs def build_dict(wordfreq): word2idx = dict() word2idx[''] = 0 word2idx[''] = 1 start_index = 2 # sort the vocabulary (word, freq) from low to high wordfreq = sorted(wordfreq.items(), key=lambda a: a[1], reverse=True) # create word2idx for w in wordfreq: word2idx[w[0]] = start_index start_index += 1 # create idx2word idx2word = {k: v for v, k in word2idx.items()} Lmax = len(idx2word) # for i in xrange(Lmax): # print idx2word[i].encode('utf-8') return idx2word, word2idx def build_data(data, idx2word, word2idx): Lmax = len(idx2word) instance = dict(source_str=[], target_str=[], source=[], target=[], target_c=[]) for count, pair in enumerate(data): source, target = pair # if not multi_output: # A = [word2idx[w] for w in source] # B = [word2idx[w] for w in target] # # C = np.asarray([[w == l for w in source] for l in target], dtype='float32') # C = [0 if w not in source else source.index(w) + Lmax for w in target] # else: A = 
[word2idx[w] if w in word2idx else word2idx[''] for w in source] B = [[word2idx[w] if w in word2idx else word2idx[''] for w in p] for p in target] # C = np.asarray([[w == l for w in source] for l in target], dtype='float32') C = [[0 if w not in source else source.index(w) + Lmax for w in p] for p in target] # actually only source,target,target_c are used in model instance['source_str'] += [source] instance['target_str'] += [target] instance['source'] += [A] instance['target'] += [B] instance['target_c'] += [C] # instance['cc_matrix'] += [C] if count % 10000 == 0: print '-------------------- %d ---------------------------' % count print source print target print A print B print C return instance def load_data_and_dict(training_dataset, testing_dataset): ''' here dict is built on both training and testing dataset, which may be not suitable (testing data should be unseen) :param training_dataset,testing_dataset: path :return: ''' global wordfreq train_pairs = load_data(training_dataset) test_pairs = load_data(testing_dataset) print('read dataset done.') idx2word, word2idx = build_dict(wordfreq) print('build dicts done.') # use character-based model [on] # use word-based model [off] train_set = build_data(train_pairs, idx2word, word2idx) test_set = build_data(test_pairs, idx2word, word2idx) print('Train pairs: %d' % len(train_pairs)) print('Test pairs: %d' % len(test_pairs)) print('Dict size: %d' % len(idx2word)) return train_set, test_set, idx2word, word2idx def export_data_for_maui(): ''' Export training data for Maui ''' pairs = [] with open(config['training_dataset'], 'r') as f: training_records = json.load(f) # load training dataset print('Loading training dataset') title_dict = dict([(r['title'].strip().lower(), r) for r in training_records]) print('#(Training Data)=%d' % len(title_dict)) # load testing dataset print('Loading testing dataset') testing_names = config['testing_datasets'] # only these three may have overlaps with training data testing_records = {} # 
rule out the ones appear in testing data: 'inspec', 'krapivin', 'nus', 'semeval' print('Filtering testing dataset from training data') for dataset_name in testing_names: print(dataset_name) testing_records[dataset_name] = testing_data_loader(dataset_name, kwargs=dict(basedir=config['path'])).get_docs() for r in testing_records[dataset_name]: title = r['title'].strip().lower() if title in title_dict: title_dict.pop(title) training_records, train_pairs, wordfreq = dataset_utils.load_pairs(title_dict.values(), do_filter=True) print('#(Training Data after Filtering Noises)=%d' % len(training_records)) validation_ids = deserialize_from_file(config['validation_id']) validation_ids = filter(lambda x:x=10: # print('%d, %s' %(len(p), ' '.join([idx2word[i] for i in p]))) # if len(p) in count_dict: # count_dict[len(p)] += 1 # else: # count_dict[len(p)] = 1 # # total_count = sum(count_dict.values()) # # for leng,count in count_dict.items(): # print('%s: %d, %.3f' % (leng, count, float(count)/float(total_count)*100)) # # print('Total phrases: %d'% total_count) ================================================ FILE: keyphrase/dataset/keyphrase_test_dataset.py ================================================ #!/usr/bin/env python # -*- coding: utf-8 -*- """ Python File Template """ from nltk.internals import find_jars_within_path from nltk.tag import StanfordPOSTagger from nltk import word_tokenize import os import re import shutil import nltk import xml.etree.ElementTree as ET import keyphrase.dataset.dataset_utils as utils from emolga.utils.generic_utils import get_from_module from emolga.dataset.build_dataset import deserialize_from_file, serialize_to_file import numpy as np from keyphrase.dataset import dataset_utils __author__ = "Rui Meng" __email__ = "rui.meng@pitt.edu" sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') class Document(object): def __init__(self): self.name = '' self.title = '' self.text = '' self.phrases = [] def __str__(self): return 
'%s\n\t%s\n\t%s' % (self.title, self.text, str(self.phrases)) class DataLoader(object): def __init__(self, **kwargs): self.__dict__.update(kwargs) self.name = self.__class__.__name__ self.basedir = self.basedir self.doclist = [] def get_docs(self, return_dict=True): ''' :return: a list of dict instead of the Document object ''' class_name = self.__class__.__name__.lower() if class_name == 'kdd' or class_name == 'www' or class_name == 'umd': self.load_xml(self.textdir) else: self.load_text(self.textdir) self.load_keyphrase(self.keyphrasedir) doclist = [] for d in self.doclist: newd = {} newd['name'] = d.name newd['abstract'] = re.sub('[\r\n]', ' ', d.text).strip() newd['title'] = re.sub('[\r\n]', ' ', d.title).strip() newd['keyword'] = ';'.join(d.phrases) doclist.append(newd) if return_dict: return doclist else: return self.doclist def __call__(self, idx2word, word2idx, type = 1): self.get_docs() pairs = [] for doc in self.doclist: try: title= utils.get_tokens(doc.title, type) text = utils.get_tokens(doc.text, type) if type == 0: title.append('') elif type == 1: title.append('.') title.extend(text) text = title # trunk, many texts are too long, would lead to out-of-memory if len(text) > 1500: text = text[:1500] keyphrases = [utils.get_tokens(k, type) for k in doc.phrases] pairs.append((text, keyphrases)) except UnicodeDecodeError: print('UnicodeDecodeError detected! 
%s' % doc.name) # print(text) # print(keyphrases) # print('*'*50) dataset = utils.build_data(pairs, idx2word, word2idx) return dataset, self.doclist def load_xml(self, xmldir): ''' for KDD/WWW/UMD only :return: doclist ''' for filename in os.listdir(xmldir): with open(xmldir+filename) as textfile: doc = Document() doc.name = filename[:filename.find('.xml')] import string printable = set(string.printable) # print((filename)) try: lines = textfile.readlines() xml = ''.join([filter(lambda x: x in printable, l) for l in lines]) root = ET.fromstring(xml) doc.title = root.findall("title")[0].text doc.abstract = root.findall("abstract")[0].text doc.phrases = [n.text for n in root.findall("*/tag")] self.doclist.append(doc) except UnicodeDecodeError: print('UnicodeDecodeError detected! %s' % filename ) def load_text(self, textdir): for fid, filename in enumerate(os.listdir(textdir)): with open(textdir+filename) as textfile: doc = Document() doc.name = filename[:filename.find('.txt')] import string printable = set(string.printable) # print((textdir+filename)) try: lines = textfile.readlines() lines = [list(filter(lambda x: x in printable, l)) for l in lines] title = ''.join(lines[0]).encode('ascii', 'ignore').decode('ascii', 'ignore') # the 2nd line is abstract title text = (' '.join([''.join(l).strip() for l in lines[1:]])).encode('ascii', 'ignore').decode('ascii', 'ignore') # if lines[1].strip().lower() != 'abstract': # print('Wrong title detected : %s' % (filename)) doc.title = title doc.text = text self.doclist.append(doc) except UnicodeDecodeError: print('UnicodeDecodeError detected! 
%s' % filename ) def load_keyphrase(self, keyphrasedir): for did,doc in enumerate(self.doclist): phrase_set = set() if os.path.exists(self.keyphrasedir + doc.name + '.keyphrases'): with open(keyphrasedir+doc.name+'.keyphrases') as keyphrasefile: phrase_set.update([phrase.strip() for phrase in keyphrasefile.readlines()]) # else: # print(self.keyphrasedir + doc.name + '.keyphrases Not Found') if os.path.exists(self.keyphrasedir + doc.name + '.keywords'): with open(keyphrasedir + doc.name + '.keywords') as keyphrasefile: phrase_set.update([phrase.strip() for phrase in keyphrasefile.readlines()]) # else: # print(self.keyphrasedir + doc.name + '.keywords Not Found') doc.phrases = list(phrase_set) def load_testing_data_postag(self, word2idx): print('Loading testing dataset %s from %s' % (self.name, self.postag_datadir)) text_file_paths = [self.text_postag_dir + n_ for n_ in os.listdir(self.text_postag_dir)] keyphrase_file_paths = [self.keyphrase_postag_dir + n_ for n_ in os.listdir(self.keyphrase_postag_dir)] def load_postag_text_(path): with open(path, 'r') as f: tokens = ' '.join(f.readlines()).split(' ') text = [t.split('_')[0] for t in tokens] postag = [t.split('_')[1] for t in tokens] return text, postag def load_keyphrase_(path): with open(path, 'r') as f: keyphrase_str = ';'.join([l.strip() for l in f.readlines()]) return dataset_utils.process_keyphrase(keyphrase_str) texts = [load_postag_text_(f_) for f_ in text_file_paths] keyphrases = [load_keyphrase_(f_) for f_ in keyphrase_file_paths] instance = dict(source_str=[], target_str=[], source=[], source_postag=[], target=[], target_c=[]) for (source, postag),target in zip(texts, keyphrases): A = [word2idx[w] if w in word2idx else word2idx[''] for w in source] B = [[word2idx[w] if w in word2idx else word2idx[''] for w in p] for p in target] instance['source_str'] += [source] instance['target_str'] += [target] instance['source'] += [A] instance['source_postag'] += [postag] instance['target'] += [B] return instance 
def load_testing_data(self, word2idx): print('Loading testing dataset %s from %s' % (self.name, self.datadir)) text_file_paths = [self.textdir + n_ for n_ in os.listdir(self.textdir) if n_.endswith('.txt')] # here is problematic. keep phrases in either '.txt' or 'keyphrases', but don't keep both files (kp20k uses both, delete either one would be fine). Keep '.txt' only in the future keyphrase_file_paths = [self.keyphrasedir + n_ for n_ in os.listdir(self.keyphrasedir) if n_.endswith('.txt')] if len(keyphrase_file_paths) == 0: keyphrase_file_paths = [self.keyphrasedir + n_ for n_ in os.listdir(self.keyphrasedir) if n_.endswith('.keyphrases')] def _load_text(path): with open(path, 'r') as f: text = ' '.join(f.readlines()).split(' ') return text def _load_keyphrase(path): with open(path, 'r') as f: keyphrase_str = [l.strip().split(' ') for l in f.readlines()] return keyphrase_str texts = [_load_text(f_) for f_ in text_file_paths] keyphrases = [_load_keyphrase(f_) for f_ in keyphrase_file_paths] instance = dict(source_str=[], target_str=[], source=[], source_postag=[], target=[], target_c=[]) for source, target in zip(texts, keyphrases): A = [word2idx[w] if w in word2idx else word2idx[''] for w in source] B = [[word2idx[w] if w in word2idx else word2idx[''] for w in p] for p in target] instance['source_str'] += [source] instance['target_str'] += [target] instance['source'] += [A] instance['source_postag'] += [] # set to be empty instance['target'] += [B] return instance class INSPEC(DataLoader): def __init__(self, **kwargs): super(INSPEC, self).__init__(**kwargs) self.datadir = self.basedir + '/dataset/keyphrase/testing-data/INSPEC' # self.textdir = self.datadir + '/all_texts/' # self.keyphrasedir = self.datadir + '/gold_standard_keyphrases_2/' self.textdir = self.datadir + '/test_texts/' self.keyphrasedir = self.datadir + '/gold_standard_test/' self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/inspec/' self.text_postag_dir = self.postag_datadir + 
class INSPEC(DataLoader):
    """INSPEC testing dataset (abstracts + controlled/uncontrolled keyphrases)."""
    def __init__(self, **kwargs):
        super(INSPEC, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/INSPEC'
        # self.textdir = self.datadir + '/all_texts/'
        # self.keyphrasedir = self.datadir + '/gold_standard_keyphrases_2/'
        self.textdir = self.datadir + '/test_texts/'
        self.keyphrasedir = self.datadir + '/gold_standard_test/'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/inspec/'
        self.text_postag_dir = self.postag_datadir + 'text/'
        self.keyphrase_postag_dir = self.postag_datadir + 'keyphrase/'


class NUS(DataLoader):
    """NUS testing dataset: full papers with author+reader keyphrases."""
    def __init__(self, **kwargs):
        super(NUS, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/NUS'
        self.textdir = self.datadir + '/all_texts/'
        self.keyphrasedir = self.datadir + '/gold_standard_keyphrases/'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/nus/'
        self.text_postag_dir = self.postag_datadir + 'text/'
        self.keyphrase_postag_dir = self.postag_datadir + 'keyphrase/'

    def export(self):
        '''
        Parse the original dataset into two folders: text and
        gold_standard_keyphrases (multi-word phrases vs single keywords).
        :return: None (writes files as a side effect)
        '''
        originaldir = self.datadir + '/original'
        for paper_id in os.listdir(originaldir):
            if os.path.isfile(originaldir + '/' + paper_id):
                continue
            # copy text to all_texts/
            text_file = originaldir + '/' + paper_id + '/' + paper_id + '.txt'
            shutil.copy2(text_file, self.textdir + '/' + paper_id + '.txt')

            # merge author keyphrases (.kwd) with every reader's KEY/ file
            keyphrases = set()
            keyphrase_files = [originaldir + '/' + paper_id + '/' + paper_id + '.kwd']
            reader_phrase_dir = originaldir + '/' + paper_id + '/KEY/'
            for key_file in os.listdir(reader_phrase_dir):
                keyphrase_files.append(reader_phrase_dir + key_file)
            for key_file in keyphrase_files:
                with open(key_file, 'r') as f:
                    keyphrases.update(set([l.strip() for l in f.readlines()]))

            # write into gold_standard_keyphrases/
            # NOTE(review): writing only when the target file already exists looks
            # suspicious (first export would write nothing) -- confirm intent.
            if os.path.exists(self.keyphrasedir + paper_id + '.keyphrases'):
                with open(self.keyphrasedir + paper_id + '.keyphrases', 'w') as f:
                    for key in list(keyphrases):
                        if key.find(' ') != -1:  # multi-word phrases only
                            f.write(key + '\n')
            if os.path.exists(self.keyphrasedir + paper_id + '.keywords'):
                with open(self.keyphrasedir + paper_id + '.keywords', 'w') as f:
                    for key in list(keyphrases):
                        if key.find(' ') == -1:  # single words only
                            f.write(key + '\n')

    def get_docs(self, only_abstract=True, return_dict=True):
        '''
        Load gold phrases and (title, text) for each NUS paper.
        :param only_abstract: if True, cut the text off at the Introduction heading
        :param return_dict: if True return a list of plain dicts, else Document objects
        :return: list of dicts (name/title/abstract/keyword) or self.doclist
        '''
        for filename in os.listdir(self.keyphrasedir):
            if not filename.endswith('keyphrases'):
                continue
            doc = Document()
            doc.name = filename[:filename.find('.keyphrases')]

            phrase_set = set()
            if os.path.exists(self.keyphrasedir + doc.name + '.keyphrases'):
                with open(self.keyphrasedir + doc.name + '.keyphrases') as keyphrasefile:
                    phrase_set.update([phrase.strip() for phrase in keyphrasefile.readlines()])
            if os.path.exists(self.keyphrasedir + doc.name + '.keywords'):
                with open(self.keyphrasedir + doc.name + '.keywords') as keyphrasefile:
                    phrase_set.update([phrase.strip() for phrase in keyphrasefile.readlines()])
            doc.phrases = list(phrase_set)
            self.doclist.append(doc)

        for d in self.doclist:
            with open(self.textdir + d.name + '.txt', 'r') as f:
                import string
                printable = set(string.printable)
                try:
                    lines = f.readlines()
                    # BUGFIX: `filter()` yields an iterator on Python 3; join the
                    # kept characters back into a str so .encode()/.lower() below work.
                    lines = [''.join(filter(lambda x: x in printable, l)) for l in lines]
                    # 1st line is title
                    title = lines[0].encode('ascii', 'ignore').decode('ascii', 'ignore')
                    # find abstract heading
                    index_abstract = None
                    for id, line in enumerate(lines):
                        if line.lower().strip().endswith('abstract') or line.lower().strip().startswith('abstract'):
                            index_abstract = id
                            break
                    if index_abstract is None:  # BUGFIX: `is None`, not `== None`
                        print('abstract not found: %s' % d.name)
                        index_abstract = 1
                    # BUGFIX: index_introduction was unbound when only_abstract=False,
                    # raising NameError at the slice below; default to end-of-text.
                    index_introduction = len(lines)
                    if only_abstract:
                        for id, line in enumerate(lines):
                            if line.lower().strip().endswith('introduction'):
                                index_introduction = id
                                break
                        if index_introduction == len(lines):
                            print('Introduction not found: %s' % d.name)
                    # 2nd line is abstract title
                    # NOTE(review): slices from a hard-coded 2 rather than
                    # index_abstract -- confirm this matches the file layout.
                    text = (' '.join(lines[2: index_introduction])).encode('ascii', 'ignore').decode('ascii', 'ignore')
                    d.title = title
                    d.text = text
                except UnicodeDecodeError:
                    # BUGFIX: was `self.self.textdir` (AttributeError) and relied on
                    # `%` binding tighter than `+`; also name the file actually opened.
                    print('UnicodeDecodeError detected! %s' % (self.textdir + d.name + '.txt'))

        doclist = []
        for d in self.doclist:
            newd = {}
            newd['name'] = d.name
            newd['abstract'] = re.sub('[\r\n]', ' ', d.text).strip()
            newd['title'] = re.sub('[\r\n]', ' ', d.title).strip()
            newd['keyword'] = ';'.join(d.phrases)
            doclist.append(newd)

        if return_dict:
            return doclist
        else:
            return self.doclist


class SemEval(DataLoader):
    """SemEval-2010 testing dataset; gold phrases ship pre-stemmed in one file."""
    def __init__(self, **kwargs):
        super(SemEval, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/SemEval'
        # self.textdir = self.datadir + '/all_texts/'
        # self.keyphrasedir = self.datadir + '/gold_standard_keyphrases_3/'
        self.textdir = self.datadir + '/test/'
        self.keyphrasedir = self.datadir + '/test_answer/test.combined.stem.final'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/semeval/'
        self.text_postag_dir = self.postag_datadir + 'text/'
        self.keyphrase_postag_dir = self.postag_datadir + 'keyphrase/'
    # A large commented-out legacy get_docs() (duplicating NUS.get_docs with the
    # same self.self bug) was removed here; see version control history.


class KRAPIVIN(DataLoader):
    """Krapivin testing dataset: title on line 2, abstract on line 4 of each file."""
    def __init__(self, **kwargs):
        super(KRAPIVIN, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/KRAPIVIN'
        self.textdir = self.datadir + '/all_texts/'
        self.keyphrasedir = self.datadir + '/gold_standard_keyphrases/'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/krapivin/'
        self.text_postag_dir = self.postag_datadir + 'text/'
        self.keyphrase_postag_dir = self.postag_datadir + 'keyphrase/'

    def load_text(self, textdir):
        # Krapivin files interleave marker lines with content: index 1 holds the
        # title, index 3 the abstract (presumably after --T/--A markers -- TODO confirm).
        for filename in os.listdir(textdir):
            with open(textdir + filename) as textfile:
                doc = Document()
                doc.name = filename[:filename.find('.txt')]
                import string
                printable = set(string.printable)
                try:
                    lines = textfile.readlines()
                    lines = [list(filter(lambda x: x in printable, l)) for l in lines]
                    title = ''.join(lines[1]).encode('ascii', 'ignore').decode('ascii', 'ignore')
                    text = ''.join(lines[3]).encode('ascii', 'ignore').decode('ascii', 'ignore')
                    doc.title = title
                    doc.text = text
                    self.doclist.append(doc)
                except UnicodeDecodeError:
                    print('UnicodeDecodeError detected! %s' % filename)
class KDD(DataLoader):
    # KDD papers, parsed ACM XML; consumed via DataLoader.load_xml.
    def __init__(self, **kwargs):
        super(KDD, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/KDD'
        self.xmldir = self.datadir + '/acmparsed/'
        self.textdir = self.datadir + '/acmparsed/'
        self.keyphrasedir = self.datadir + '/acmparsed/'


class WWW(DataLoader):
    # WWW papers, parsed ACM XML; consumed via DataLoader.load_xml.
    def __init__(self, **kwargs):
        super(WWW, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/WWW'
        self.xmldir = self.datadir + '/acmparsed/'
        self.textdir = self.datadir + '/acmparsed/'
        self.keyphrasedir = self.datadir + '/acmparsed/'


class UMD(DataLoader):
    def __init__(self, **kwargs):
        super(UMD, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/UMD'
        self.xmldir = self.datadir + '/acmparsed/'
        self.textdir = self.datadir + '/contentsubset/'
        self.keyphrasedir = self.datadir + '/gold/'


class DUC(DataLoader):
    # DUC2001 news articles with manually labeled keyphrases.
    def __init__(self, **kwargs):
        super(DUC, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/DUC/'
        self.textdir = self.datadir + '/all_texts/'
        self.keyphrasedir = self.datadir + '/gold_standard_keyphrases/'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/duc/'
        self.text_postag_dir = self.postag_datadir + 'text/'
        self.keyphrase_postag_dir = self.postag_datadir + 'keyphrase/'

    def export_text_phrase(self):
        # Split DUC2001LabeledKeyphrase.txt ('<docid> @ p1;p2;...') into one
        # .keyphrases file per document, then extract title/body from the
        # original news files into per-document .txt files.
        all_phrase_file = self.basedir + '/dataset/keyphrase/testing-data/DUC/DUC2001LabeledKeyphrase.txt'
        duc_set = set()
        with open(all_phrase_file, 'r') as all_pf:
            for line in all_pf.readlines():
                line = line.strip()
                duc_id = line[:line.find('@')].strip()
                phrases = filter(lambda x:len(x.strip()) > 0, line[line.find('@')+1 : ].split(';'))
                # print(duc_id)
                # print(phrases)
                with open(self.keyphrasedir + duc_id + '.keyphrases', 'w') as pf:
                    pf.write('\n'.join(phrases))
                duc_set.add(duc_id)

        # delete the irrelevant files
        count = 0
        for text_file in os.listdir(self.datadir + '/original/'):
            if text_file in duc_set:
                if text_file.startswith('AP'):
                    print('*' * 50)
                    print(text_file)
                count += 1
                with open(self.datadir + '/original/' + text_file, 'r') as tf:
                    source = ' '.join([l.strip() for l in tf.readlines()])
                    # NOTE(review): the patterns below all read r'(.*?)' in this
                    # extracted copy -- the original regexes almost certainly
                    # contained SGML tag pairs (e.g. <TITLE>(.*?)</TITLE>, <HEAD>,
                    # <HL>, <TEXT>) that markup stripping destroyed. Restore them
                    # from the original repository before relying on this method.
                    m = re.search(r'(.*?)', source, flags=re.IGNORECASE)
                    if m == None:
                        m = re.search(r'(.*?)', source, flags=re.IGNORECASE)
                    if m == None:
                        m = re.search(r'(.*?)', source, flags=re.IGNORECASE)
                    if m == None:
                        m = re.search(r'(.*?)', source, flags=re.IGNORECASE)
                    title = m.group(1)
                    title = re.sub('<.*?>', '', title).strip()
                    if text_file.startswith('FT') and title.find('/') > 0:
                        title = title[title.find('/')+1:].strip()
                    m = re.search(r'(.*?)', source, flags=re.IGNORECASE)
                    text = m.group(1)
                    text = re.sub('<.*?>', '', text).strip()
                    if text_file.startswith('AP'):
                        print(title)
                        print(text)
                    with open(self.textdir + text_file + '.txt', 'w') as target_tf:
                        target_tf.write(title + '\n' + text)
            # else:
            #     print('Delete!')
            #     os.remove(self.textdir+text_file)
        print(count)


class KP20k(DataLoader):
    # KP20k baseline split: one text file (title + abstract) and one
    # keyphrase file (one phrase per line) per document, sharing a name.
    def __init__(self, **kwargs):
        super(KP20k, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/baseline-data/kp20k/'
        self.textdir = self.datadir + '/text/'
        self.keyphrasedir = self.datadir + '/keyphrase/'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/kp20k/'
        self.text_postag_dir = self.postag_datadir + 'text/'
        self.keyphrase_postag_dir = self.postag_datadir + 'keyphrase/'

    def get_docs(self, return_dict=True):
        '''
        :return: a list of dict instead of the Document object
        '''
        for fname in os.listdir(self.textdir):
            d = Document()
            d.name = fname
            with open(self.textdir + fname, 'r') as textfile:
                lines = textfile.readlines()
                d.title = lines[0].strip()
                d.text = ' '.join(lines[1:])
            with open(self.keyphrasedir + fname, 'r') as phrasefile:
                d.phrases = [l.strip() for l in phrasefile.readlines()]
            self.doclist.append(d)

        doclist = []
        for d in self.doclist:
            newd = {}
            newd['name'] = d.name
            newd['abstract'] = re.sub('[\r\n]', ' ', d.text).strip()
            newd['title'] = re.sub('[\r\n]', ' ', d.title).strip()
            newd['keyword'] = ';'.join(d.phrases)
            doclist.append(newd)

        if return_dict:
            return doclist
        else:
            return self.doclist


class KP2k_NEW(DataLoader):
    '''
    18,716 docs after filtering (no keyword etc)
    '''
    def __init__(self, **kwargs):
        super(KP2k_NEW, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/new_kp2k_for_theano_model/'
        self.textdir = self.datadir + '/text/'
        self.keyphrasedir = self.datadir + '/keyphrase/'

    def get_docs(self, return_dict=True):
        '''
        Same layout as KP20k: title on line 1, body after, phrases one per line.
        :return: a list of dict instead of the Document object
        '''
        for fname in os.listdir(self.textdir):
            d = Document()
            d.name = fname
            with open(self.textdir + fname, 'r') as textfile:
                lines = textfile.readlines()
                d.title = lines[0].strip()
                d.text = ' '.join(lines[1:])
            with open(self.keyphrasedir + fname, 'r') as phrasefile:
                d.phrases = [l.strip() for l in phrasefile.readlines()]
            self.doclist.append(d)

        doclist = []
        for d in self.doclist:
            newd = {}
            newd['name'] = d.name
            newd['abstract'] = re.sub('[\r\n]', ' ', d.text).strip()
            newd['title'] = re.sub('[\r\n]', ' ', d.title).strip()
            newd['keyword'] = ';'.join(d.phrases)
            doclist.append(newd)

        if return_dict:
            return doclist
        else:
            return self.doclist
class IRBooks(DataLoader):
    # IR-textbook chapters: a single TSV file, one '<id>\t<title> <text>' per line.
    def __init__(self, **kwargs):
        super(IRBooks, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/IRbooks/'
        self.textdir = self.datadir + '/ir_textbook.txt'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/IRbooks/'
        self.text_postag_dir = self.postag_datadir + 'text/'

    def get_docs(self, return_dict=True):
        '''
        No gold phrases exist for this set, so d.phrases stays empty.
        :return: a list of dict instead of the Document object
        '''
        with open(self.textdir, 'r') as textfile:
            for line in textfile.readlines():
                d = Document()
                text_id = line[:line.index('\t')]
                text_content = line[line.index('\t'):]
                d.name = text_id.strip()
                # first whitespace-delimited chunk is the title, rest is text
                d.title = text_content[:text_content.index(' ')].strip()
                d.text = text_content[text_content.index(' '):].strip()
                d.phrases = []
                if len(d.text.split()) > 2000:
                    print('[length=%d]%s - %s' % (len(d.text.split()), d.name, d.title))
                self.doclist.append(d)
                # with open(self.keyphrasedir + fname, 'r') as phrasefile:
                #     d.phrases = [l.strip() for l in phrasefile.readlines()]

        doclist = []
        for d in self.doclist:
            newd = {}
            newd['name'] = d.name
            newd['abstract'] = re.sub('[\r\n]', ' ', d.text).strip()
            newd['title'] = re.sub('[\r\n]', ' ', d.title).strip()
            newd['keyword'] = ';'.join(d.phrases)
            doclist.append(newd)

        if return_dict:
            return doclist
        else:
            return self.doclist


class Quora(DataLoader):
    # Free-text Quora documents; no titles and no gold phrases.
    def __init__(self, **kwargs):
        super(Quora, self).__init__(**kwargs)
        self.datadir = self.basedir + '/dataset/keyphrase/testing-data/Quora/'
        self.textdir = self.datadir + '/'
        self.postag_datadir = self.basedir + '/dataset/keyphrase/baseline-data/Quora/'
        self.text_postag_dir = self.postag_datadir + 'text/'

    def get_docs(self, return_dict=True):
        '''
        :return: a list of dict instead of the Document object
        '''
        for textfile_name in os.listdir(self.textdir):
            with open(self.textdir+textfile_name, 'r') as textfile:
                d = Document()
                d.name = textfile_name[:textfile_name.find('.')].strip()
                d.title = ''
                d.text = ' '.join([l.strip() for l in textfile.readlines()])
                d.phrases = []
                self.doclist.append(d)

        doclist = []
        for d in self.doclist:
            newd = {}
            newd['name'] = d.name
            newd['abstract'] = re.sub('[\r\n]', ' ', d.text).strip()
            newd['title'] = re.sub('[\r\n]', ' ', d.title).strip()
            newd['keyword'] = ';'.join(d.phrases)
            doclist.append(newd)

        if return_dict:
            return doclist
        else:
            return self.doclist


# aliases: lower-case names so testing_data_loader() can resolve them by string
inspec = INSPEC
nus = NUS
semeval = SemEval
krapivin = KRAPIVIN
kdd = KDD
www = WWW
umd = UMD
kp20k = KP20k
kp2k_new = KP2k_NEW
duc = DUC
irbooks = IRBooks
quora = Quora    # for Runhua's data


def testing_data_loader(identifier, kwargs=None):
    '''
    Instantiate a testing-data loader dynamically by its (case-insensitive)
    name, via the module-level aliases above.
    :return: a DataLoader subclass instance
    '''
    data_loader = get_from_module(identifier, globals(), 'data_loader', instantiate=True, kwargs=kwargs)
    return data_loader


def load_additional_testing_data(testing_names, idx2word, word2idx, config, postagging=True, process_type=1):
    # Load (or build and cache as .pkl) each named testing set; optionally run
    # the Stanford POS tagger over the sources and export the tagged text.
    test_sets = {}

    # rule out the ones appear in testing data
    for dataset_name in testing_names:
        if os.path.exists(config['path'] + '/dataset/keyphrase/'+config['data_process_name']+dataset_name+'.testing.pkl'):
            test_set = deserialize_from_file(config['path'] + '/dataset/keyphrase/'+config['data_process_name']+dataset_name+'.testing.pkl')
            print('Loading testing dataset %s from %s' % (dataset_name, config['path'] + '/dataset/keyphrase/'+config['data_process_name']+dataset_name+'.testing.pkl'))
        else:
            print('Creating testing dataset %s: %s' % (dataset_name, config['path'] + '/dataset/keyphrase/' + config[
                'data_process_name'] + dataset_name + '.testing.pkl'))
            dataloader = testing_data_loader(dataset_name, kwargs=dict(basedir=config['path']))
            records = dataloader.get_docs()
            records, pairs, _ = utils.load_pairs(records, process_type=process_type, do_filter=False)
            test_set = utils.build_data(pairs, idx2word, word2idx)
            test_set['record'] = records

            if postagging:
                tagged_sources = get_postag_with_record(records, pairs)
                # keep only the tag of each (word, tag) pair
                test_set['tagged_source'] = [[t[1] for t in s] for s in tagged_sources]

                if hasattr(dataloader, 'text_postag_dir') and dataloader.__getattribute__('text_postag_dir') != None:
                    print('Exporting postagged data to %s' % (dataloader.text_postag_dir))
                    if not os.path.exists(dataloader.text_postag_dir):
                        os.makedirs(dataloader.text_postag_dir)
                    for r_, p_, s_ in zip(records, pairs, tagged_sources):
                        with open(dataloader.text_postag_dir+ '/' + r_['name'] + '.txt', 'w') as f:
                            output_str = ' '.join([w+'_'+t for w,t in s_])
                            f.write(output_str)
                else:
                    print('text_postag_dir not found, no export of postagged data')

            serialize_to_file(test_set, config['path'] + '/dataset/keyphrase/'+config['data_process_name']+dataset_name+'.testing.pkl')

        test_sets[dataset_name] = test_set

    return test_sets
from nltk.stem.porter import *
from keyphrase.dataset import dataset_utils


def check_data():
    # Sanity-check the testing datasets: count how many gold phrases exist and
    # how many actually appear in the (stemmed) source text, honoring
    # config['target_filter'] ('appear-only' / 'non-appear-only' / None).
    from keyphrase.config import setup_keyphrase_stable
    config = setup_keyphrase_stable()
    train_set, validation_set, test_set, idx2word, word2idx = deserialize_from_file(config['dataset'])

    for dataset_name in config['testing_datasets']:
        print('*' * 50)
        print(dataset_name)
        number_groundtruth = 0
        number_present_groundtruth = 0

        loader = testing_data_loader(dataset_name, kwargs=dict(basedir = config['path']))
        if dataset_name == 'nus':
            docs = loader.get_docs(only_abstract = True, return_dict=False)
        else:
            docs = loader.get_docs(return_dict=False)

        stemmer = PorterStemmer()
        for id,doc in enumerate(docs):
            text_tokens = dataset_utils.get_tokens(doc.title.strip()+' '+ doc.text.strip())
            # if len(text_tokens) > 1500:
            #     text_tokens = text_tokens[:1500]
            print('[%d] length= %d' % (id, len(doc.text)))
            stemmed_input = [stemmer.stem(t).strip().lower() for t in text_tokens]

            phrase_str = ';'.join([l.strip() for l in doc.phrases])
            phrases = dataset_utils.process_keyphrase(phrase_str)
            targets = [[stemmer.stem(w).strip().lower() for w in target] for target in phrases]

            present_targets = []
            for target in targets:
                keep = True
                # whether do filtering on groundtruth phrases. if config['target_filter']==None, do nothing
                # brute-force substring search of the stemmed phrase in the stemmed text
                match = None
                for i in range(len(stemmed_input) - len(target) + 1):
                    match = None
                    for j in range(len(target)):
                        if target[j] != stemmed_input[i + j]:
                            match = False
                            break
                    if j == len(target) - 1 and match == None:
                        match = True
                        break

                if match == True:
                    # if match and 'appear-only', keep this phrase
                    if config['target_filter'] == 'appear-only':
                        keep = keep and True
                    elif config['target_filter'] == 'non-appear-only':
                        keep = keep and False
                elif match == False:
                    # if not match and 'appear-only', discard this phrase
                    if config['target_filter'] == 'appear-only':
                        keep = keep and False
                    # if not match and 'non-appear-only', keep this phrase
                    elif config['target_filter'] == 'non-appear-only':
                        keep = keep and True

                if not keep:
                    continue

                present_targets.append(target)

            number_groundtruth += len(targets)
            number_present_groundtruth += len(present_targets)

        print('number_groundtruth='+str(number_groundtruth))
        print('number_present_groundtruth='+str(number_present_groundtruth))

    # legacy debugging snippet kept as a no-op string literal
    '''
    test_set, doclist = loader(idx2word, word2idx, type=0)
    test_data_plain = zip(*(test_set['source'], test_set['target'], doclist))
    for idx in xrange(len(test_data_plain)):  # len(test_data_plain)
        test_s_o, test_t_o, doc = test_data_plain[idx]
        target = doc.phrases
        if len(doc.text) < 50000:
            print('%d - %d : %d \n\tname=%s, \n\ttitle=%s, \n\ttext=%s, \n\tlen(keyphrase)=%d' % (idx, len(test_s_o), len(test_t_o), doc.name, doc.title, doc.text, len(''.join(target))))
            print(doc)
            if (len(target)!=0 and len(''.join(target))/len(target) < 3):
                print('!' * 100)
                print('Error found')
                print('%d - %d : %d name=%s, title=%d, text=%d, len(keyphrase)=%d' % (idx, len(test_s_o), len(test_t_o), doc.name, len(doc.title), len(doc.text), len(''.join(target))))
    '''


def add_padding(data):
    # Right-pad each sample with zeros into an int32 batch matrix; one extra
    # trailing zero column guarantees an end-of-sentence marker for each row.
    shapes = [np.asarray(sample).shape for sample in data]
    lengths = [shape[0] for shape in shapes]

    # make sure there's at least one zero at last to indicate the end of sentence
    max_sequence_length = max(lengths) + 1
    rest_shape = shapes[0][1:]
    padded_batch = np.zeros(
        (len(data), max_sequence_length) + rest_shape,
        dtype='int32')
    for i, sample in enumerate(data):
        padded_batch[i, :len(sample)] = sample
    return padded_batch


def split_into_multiple_and_padding(data_s_o, data_t_o):
    # Flatten (source, [phrase, ...]) pairs into parallel (source, phrase)
    # pairs -- the source is repeated once per target phrase -- then pad both.
    data_s = []
    data_t = []
    for s, t in zip(data_s_o, data_t_o):
        for p in t:
            data_s += [s]
            data_t += [p]

    data_s = add_padding(data_s)
    data_t = add_padding(data_t)
    return data_s, data_t


def get_postag_with_record(records, pairs):
    # POS-tag each source token sequence with the Stanford tagger found under
    # the sibling 'stanford-postagger/' directory.
    path = os.path.dirname(__file__)
    path = path[:path.rfind(os.sep, 0, len(path)-10)+1] + 'stanford-postagger/'
    print(path)
    # jar = '/Users/memray/Project/stanford/stanford-postagger/stanford-postagger.jar'
    jar = path + '/stanford-postagger.jar'
    model = path + '/models/english-bidirectional-distsim.tagger'
    pos_tagger = StanfordPOSTagger(model, jar)
    # model = '/Users/memray/Project/stanford/stanford-postagger/models/english-left3words-distsim.tagger'
    # model = '/Users/memray/Project/stanford/stanford-postagger/models/english-bidirectional-distsim.tagger'

    stanford_dir = jar.rpartition('/')[0]
    stanford_jars = find_jars_within_path(stanford_dir)
    pos_tagger._stanford_jar = ':'.join(stanford_jars)

    tagged_source = []
    # Predict on testing data
    for idx, (record, pair) in enumerate(zip(records, pairs)):  # len(test_data_plain)
        print('*' * 100)
        print('File: '  + record['name'])
        print('Input: ' + str(pair[0]))
        text = pos_tagger.tag(pair[0])
        print('[%d/%d][%d] : %s' % (idx, len(records) , len(pair[0]), str(text)))
        tagged_source.append(text)

    return tagged_source
def get_postag_with_index(sources, idx2word, word2idx):
    # Same as get_postag_with_record, but sources arrive as index sequences
    # and are converted back to words via keyphrase_utils.cut_zero first.
    path = os.path.dirname(__file__)
    path = path[:path.rfind(os.sep, 0, len(path)-10)+1] + 'stanford-postagger/'
    print(path)
    # jar = '/Users/memray/Project/stanford/stanford-postagger/stanford-postagger.jar'
    jar = path + '/stanford-postagger.jar'
    model = path + '/models/english-bidirectional-distsim.tagger'
    pos_tagger = StanfordPOSTagger(model, jar)
    # model = '/Users/memray/Project/stanford/stanford-postagger/models/english-left3words-distsim.tagger'
    # model = '/Users/memray/Project/stanford/stanford-postagger/models/english-bidirectional-distsim.tagger'

    stanford_dir = jar.rpartition('/')[0]
    stanford_jars = find_jars_within_path(stanford_dir)
    pos_tagger._stanford_jar = ':'.join(stanford_jars)

    tagged_source = []
    # Predict on testing data (Python 2 only: xrange)
    for idx in xrange(len(sources)):  # len(test_data_plain)
        test_s_o = sources[idx]
        source_text = keyphrase_utils.cut_zero(test_s_o, idx2word)
        text = pos_tagger.tag(source_text)
        print('[%d/%d] : %s' % (idx, len(sources), str(text)))
        tagged_source.append(text)

    return tagged_source


def check_postag(config):
    # Debug helper: tag every testing source and print the result.
    # NOTE(review): under Python 3, zip() returns an iterator, so the
    # len(test_data_plain) / indexing below only works on Python 2.
    train_set, validation_set, test_set, idx2word, word2idx = deserialize_from_file(config['dataset'])

    path = os.path.dirname(__file__)
    path = path[:path.rfind(os.sep, 0, len(path)-10)+1] + 'stanford-postagger/'
    jar = path + '/stanford-postagger.jar'
    model = path + '/models/english-bidirectional-distsim.tagger'
    pos_tagger = StanfordPOSTagger(model, jar)

    for dataset_name in config['testing_datasets']:
        # override the original test_set
        # test_set = load_testing_data(dataset_name, kwargs=dict(basedir=config['path']))(idx2word, word2idx, config['preprocess_type'])
        test_sets = load_additional_testing_data(config['testing_datasets'], idx2word, word2idx, config)
        test_set = test_sets[dataset_name]

        # print(dataset_name)
        # print('Avg length=%d, Max length=%d' % (np.average([len(s) for s in test_set['source']]), np.max([len(s) for s in test_set['source']])))
        test_data_plain = zip(*(test_set['source'], test_set['target']))
        test_size = len(test_data_plain)

        # Alternatively to setting the CLASSPATH add the jar and model via their path:
        # NOTE(review): hard-coded developer-machine paths shadow the ones above.
        jar = '/Users/memray/Project/stanford/stanford-postagger/stanford-postagger.jar'
        # model = '/Users/memray/Project/stanford/stanford-postagger/models/english-left3words-distsim.tagger'
        model = '/Users/memray/Project/stanford/stanford-postagger/models/english-bidirectional-distsim.tagger'
        pos_tagger = StanfordPOSTagger(model, jar)

        for idx in xrange(len(test_data_plain)):  # len(test_data_plain)
            test_s_o, test_t_o = test_data_plain[idx]
            source = keyphrase_utils.cut_zero(test_s_o, idx2word)
            print(source)

            # Add other jars from Stanford directory
            stanford_dir = jar.rpartition('/')[0]
            stanford_jars = find_jars_within_path(stanford_dir)
            pos_tagger._stanford_jar = ':'.join(stanford_jars)

            text = pos_tagger.tag(source)
            print(text)


if __name__ == '__main__':
    # config = setup_keyphrase_all()
    #
    # loader = testing_data_loader('duc', kwargs=dict(basedir=config['path']))
    # loader.export_text_phrase()
    # docs = loader.get_docs()
    check_data()
    pass
    # check_postag(config)
    # train_set, validation_set, test_sets, idx2word, word2idx = deserialize_from_file(config['dataset'])
    # load_additional_testing_data(config['testing_datasets'], idx2word, word2idx, config)
    # if len(test_t_o) < 3:
    #     # doc.text = re.sub(r'[\r\n\t]', ' ', doc.text)
    #     print('name:\t%s' % doc.name)
    #     print('text:\t%s' % doc.text)
    #     print('phrase:\t%s' % str(doc.phrases))
    # if idx % 100 == 0:
    #     print(test_data_plain[idx])

# ================================================
# FILE: keyphrase/dataset/keyphrase_train_dataset.py
# ================================================
# coding=utf-8
import json
import sys
import time
import nltk
import numpy
import numpy as np
import re

from keyphrase.config import *
from emolga.dataset.build_dataset import *
from keyphrase.dataset import dataset_utils
from keyphrase_test_dataset import DataLoader,testing_data_loader
import dataset_utils as utils
def build_dict(wordfreq):
    """Build index<->word mappings from a {word: frequency} dict.

    Indices 0 and 1 are reserved for the special tokens <eol> and <unk>.
    NOTE(review): the extraction of this file had stripped the angle-bracketed
    token names, leaving two assignments to the same '' key (so index 0 was
    never used) -- restored as <eol>/<unk>; confirm against version control.
    Real words are indexed from 2 in DESCENDING frequency order, so a word's
    index directly reflects its frequency rank (used by unk filtering).

    :param wordfreq: dict mapping word -> occurrence count
    :return: (idx2word, word2idx) pair of mutually inverse dicts
    """
    word2idx = dict()
    word2idx['<eol>'] = 0
    word2idx['<unk>'] = 1

    # sort the vocabulary by frequency, most frequent first
    wordfreq = sorted(wordfreq.items(), key=lambda a: a[1], reverse=True)

    start_index = 2
    for w in wordfreq:
        word2idx[w[0]] = start_index
        start_index += 1

    # invert the mapping: index -> word
    idx2word = {k: v for v, k in word2idx.items()}
    return idx2word, word2idx


def dump_samples_to_json(records, file_path):
    """Temporary helper: export cleaned records, one JSON object per line.

    :param records: iterable of JSON-serializable dicts
    :param file_path: destination path (overwritten)
    """
    with open(file_path, 'w') as out_file:
        for record in records:
            out_file.write(json.dumps(record) + '\n')


def load_data_and_dict(training_dataset):
    """Load the training corpus, drop papers that appear in any testing set,
    and carve out a fixed validation split.

    Note: the dictionary is built on both training and testing data, which may
    be unsuitable (testing data should be unseen).

    :param training_dataset: path to the raw training JSON
    """
    # load training dataset
    print('Loading training dataset')
    f = open(training_dataset, 'r')
    training_records = json.load(f)

    # filter duplicates by (lowercased, stripped) title
    title_dict = dict([(r['title'].strip().lower(), r) for r in training_records])
    print('#(Training Data)=%d' % len(title_dict))

    # rule out papers that appear in the testing datasets
    print('Loading testing dataset')
    testing_names = config['testing_datasets']
    testing_records = {}
    print('Filtering testing dataset from training data')
    for dataset_name in testing_names:
        print(dataset_name)
        testing_records[dataset_name] = testing_data_loader(
            dataset_name, kwargs=dict(basedir=config['path'])).get_docs()
        for r in testing_records[dataset_name]:
            title = r['title'].strip().lower()
            if title in title_dict:
                title_dict.pop(title)

    print('Process the data')
    training_records, train_pairs, wordfreq = dataset_utils.load_pairs(
        title_dict.values(), do_filter=True)
    print('#(Training Data after Filtering Noises)=%d' % len(training_records))

    print('Preparing development data')
    training_records = numpy.asarray(training_records)
    train_pairs = numpy.asarray(train_pairs)

    # keep a reproducible validation split: reuse persisted ids if present
    if 'validation_id' in config and os.path.exists(config['validation_id']):
        validation_ids = deserialize_from_file(config['validation_id'])
    else:
        validation_ids = numpy.random.randint(0, len(training_records), config['validation_size'])
        serialize_to_file(validation_ids, config['validation_id'])

    validation_records = training_records[validation_ids]
    validation_pairs = train_pairs[validation_ids]
    training_records = numpy.delete(training_records, validation_ids, axis=0)
    train_pairs = numpy.delete(train_pairs, validation_ids, axis=0)
    print('#(Training Data after Filtering out Validate & Test data)=%d' % len(train_pairs))

    print('Preparing testing data kp20k')
    # keep a reproducible held-out testing split as well
    if 'testing_id' in config and os.path.exists(config['testing_id']):
        testing_ids = deserialize_from_file(config['testing_id'])
        # --- NOTE(review): everything below this point was destroyed by the
        # extraction of this file (the original line degenerated to the
        # syntactically invalid "filter(lambda x:x=10:").  The original
        # function went on to build the kp20k test split, call
        # build_dict(wordfreq) and return the datasets plus idx2word/word2idx.
        # Restore the tail from version control before using this function.
    raise NotImplementedError('tail of load_data_and_dict was lost in extraction; restore from VCS')
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Check how many non-duplicate and valid (some doesn't contain title/keyword/abstract) items in the data
"""
import json
import re
import os

__author__ = "Rui Meng"
__email__ = "rui.meng@pitt.edu"

if __name__ == '__main__':
    PATH = os.path.realpath(os.path.curdir) + '/dataset/keyphrase/million-paper/'
    data_name = 'ALL'
    data_path = {'ALL': PATH + 'raw_data/' + 'all_title_abstract_keyword.json',
                 'ACM': PATH + 'raw_data/' + 'acm_title_abstract_keyword.json'}
    export_path = {'ALL': PATH + 'all_title_abstract_keyword_clean.json',
                   'ACM': PATH + 'acm_title_abstract_keyword_clean.json'}
    wordlist_path = data_path[data_name]

    # de-duplicate by title while dropping records that miss a required field
    title_map = {}
    count = 0
    no_keyword_abstract = 0
    with open(wordlist_path, 'r') as f:
        for line in f:
            # strip leftover markup, whitespace runs and curly quotes
            line = re.sub(r'<.*?>|\s+|“', ' ', line)
            d = json.loads(line)
            count += 1
            if 'title' not in d or 'keyword' not in d or 'abstract' not in d:
                no_keyword_abstract += 1
                continue
            title_map[d['title']] = d

    print('Total paper = %d' % count)
    print('Remove duplicate = %d' % len(title_map))
    print('No abstract/keyword = %d' % no_keyword_abstract)

    # export the filtered data to file
    with open(export_path[data_name], 'w') as json_file:
        json_file.write(json.dumps(title_map.values()))
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
Load the paper metadata from json, do preprocess (cleanup, tokenization for words and sentences) and export to json
'''
import json
import re
import sys

import nltk

__author__ = "Rui Meng"
__email__ = "rui.meng@pitt.edu"

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')


def load_file(input_path):
    """Load raw paper records (one JSON object per line), clean and tokenize
    them, and de-duplicate by title.

    Each returned record gains:
      * 'tokens'   -- word list of title + sentence-tokenized abstract
      * 'name'     -- list of keyphrases, each a word list
      * 'filename' -- the paper title (used downstream as an identifier)

    :param input_path: path to the raw JSON-lines file
    :return: list of de-duplicated record dicts
    """
    record_dict = {}
    count = 0
    no_keyword_abstract = 0

    with open(input_path, 'r') as f:
        for line in f:
            # clean the string: strip markup, whitespace runs, curly quotes
            line = re.sub(r'<.*?>|\s+|“', ' ', line.lower())
            record = json.loads(line)

            # tokenize title + abstract into one flat word list
            text = record['abstract']
            text = re.sub('[\t\r\n]', ' ', text)
            text = record['title'] + ' ' + ' '.join(sent_detector.tokenize(text)) + ' '
            text = filter(lambda w: len(w) > 0, re.split('\W', text))
            record['tokens'] = text

            # keyphrases: one word list per ';'-separated phrase
            keyphrases = record['keyword']
            record['name'] = [filter(lambda w: len(w) > 0, re.split('\W', phrase))
                              for phrase in keyphrases.split(';')]

            # de-duplicate by title
            record_dict[record['title']] = record
            count += 1
            if len(record['keyword']) == 0 or len(record['abstract']) == 0:
                # BUG FIX: was `+= 0`, so empty records were never counted
                no_keyword_abstract += 1

            record['filename'] = record['title']
            print(record['title'])

    print('Total paper = %d' % count)
    print('Remove duplicate = %d' % len(record_dict))
    print('No abstract/keyword = %d' % no_keyword_abstract)
    return record_dict.values()


'''
Two ways to preprocess, -d 0 will export one abstract to one phrase, -d 1 will export one abstract to multiple phrases
'''
if __name__ == '__main__':
    # NOTE(review): sys.argv is overwritten here, so real CLI args are ignored
    sys.argv = 'dataset/keyphrase/million-paper/all_title_abstract_keyword.json dataset/keyphrase/million-paper/processed_all_title_abstract_keyword_one2many.json 1'.split()
    if len(sys.argv) < 3:
        # was a Python-2 print statement; converted to match the rest of the file
        print('Usage -d 0|1 \n'
              ' -d: format to output, 0 means one abstract to one keyphrase, 1 means one to many keyphrases')
        sys.exit(-1)

    input_file = sys.argv[0]
    output_file = sys.argv[1]

    records = load_file(input_file)
    print(len(records))

    with open(output_file, 'w') as out:
        output_list = []
        if sys.argv[2] == '0':
            # one output record per (abstract, single keyphrase) pair
            for record in records:
                for keyphrase in record['name']:
                    item = {}   # renamed from `dict` -- was shadowing the builtin
                    item['tokens'] = record['tokens']
                    item['name'] = keyphrase
                    item['filename'] = record['filename']
                    output_list.append(item)
        if sys.argv[2] == '1':
            # one output record per abstract, carrying all its keyphrases
            for record in records:
                item = {}
                item['tokens'] = record['tokens']
                item['name'] = record['name']
                item['filename'] = record['filename']
                output_list.append(item)
        print(len(output_list))
        out.write(json.dumps(output_list))
from fuel import datasets
from fuel import transformers
from fuel import schemes

# which experiment configuration to use (from keyphrase.config)
setup = setup_keyphrase_stable


class LoggerWriter:
    """File-like adapter so stream output (e.g. stderr) can be routed to a
    logging method such as ``logging.debug``."""

    def __init__(self, level):
        # `level` is a bound logging method, e.g. log.debug
        self.level = level

    def write(self, message):
        # skip bare newlines so the log is not padded with empty records
        if message != '\n':
            self.level(message)

    def flush(self):
        # nothing is buffered; emit a marker so flush() calls remain visible
        self.level(sys.stderr)


def init_logging(logfile):
    """Attach a file handler and a stdout handler to the root logger.

    :param logfile: path of the log file to append to
    :return: the configured ``logging`` module
    """
    formatter = logging.Formatter(
        '%(asctime)s [%(levelname)s] %(module)s: %(message)s',
        datefmt='%m/%d/%Y %H:%M:%S')
    fh = logging.FileHandler(logfile)
    ch = logging.StreamHandler(sys.stdout)
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    ch.setLevel(logging.INFO)
    logging.getLogger().addHandler(ch)
    logging.getLogger().addHandler(fh)
    logging.getLogger().setLevel(logging.INFO)
    return logging


def output_stream(dataset, batch_size, size=1):
    """Wrap a fuel dataset into a batched example stream.

    Padding/mask transformers are intentionally left out: with multiple output
    sources fuel raises "All dimensions except length must be equal", so
    padding is done manually elsewhere (see ``add_padding``).
    """
    data_stream = dataset.get_example_stream()
    data_stream = transformers.Batch(
        data_stream, iteration_scheme=schemes.ConstantScheme(batch_size))
    return data_stream


def prepare_batch(batch, mask, fix_len=None):
    """Extract ``batch[mask]`` as int32, append a zero column, and trim
    trailing all-zero columns (keeping one zero as end-of-sequence marker).

    :param fix_len: if given, hard-truncate to this many columns instead
    """
    data = batch[mask].astype('int32')
    data = np.concatenate([data, np.zeros((data.shape[0], 1), dtype='int32')], axis=1)

    def cut_zeros(data, fix_len=None):
        if fix_len is not None:
            return data[:, : fix_len]
        # scan right-to-left for the last column containing any word
        for k in range(data.shape[1] - 1, 0, -1):
            if data[:, k].sum() > 0:
                return data[:, : k + 2]
        return data

    return cut_zeros(data, fix_len)


def cc_martix(source, target):
    '''
    return the copy matrix, size = [nb_sample, max_len_target, max_len_source]
    cc[k, j, i] == 1 iff target word j equals source word i in sample k and
    the word is a real word (index > 0, i.e. not padding).
    '''
    # vectorized with broadcasting: replaces the original O(B*T*S) triple
    # Python loop with one C-level comparison, identical output
    match = (source[:, None, :] == target[:, :, None]) & (source[:, None, :] > 0)
    return match.astype('float32')


def unk_filter(data):
    '''
    only keep the top [voc_size] frequent words, replace the rest with the
    <unk> index (1); word indices are ordered from most to least frequent.
    NOTE(review): the mask line was garbled in this file's extraction and has
    been reconstructed as np.less(data, config['voc_size']) -- verify against
    version control.
    :param data: int array of word indices
    :return: a copy of data with rare-word indices replaced by 1
    '''
    if config['voc_size'] == -1:
        return copy.copy(data)
    # mask is 1 for frequent words (index < voc_size), 0 for rare ones;
    # data*mask zeroes the rare words, (1-mask) then sets them to index 1
    mask = np.less(data, config['voc_size'])
    data = copy.copy(data * mask + (1 - mask))
    return data


def add_padding(data):
    """Right-pad a list of index sequences into a single int32 matrix,
    guaranteeing at least one trailing zero as end-of-sequence marker."""
    shapes = [np.asarray(sample).shape for sample in data]
    lengths = [shape[0] for shape in shapes]
    max_sequence_length = max(lengths) + 1
    rest_shape = shapes[0][1:]
    padded_batch = np.zeros((len(data), max_sequence_length) + rest_shape, dtype='int32')
    for i, sample in enumerate(data):
        padded_batch[i, :len(sample)] = sample
    return padded_batch


def split_into_multiple_and_padding(data_s_o, data_t_o):
    """Expand each (source, [targets]) pair into one (source, target) pair per
    target phrase, then pad both sides into matrices.

    :return: (padded sources, padded targets), row-aligned
    """
    data_s = []
    data_t = []
    for s, t in zip(data_s_o, data_t_o):
        for p in t:
            data_s.append(s)
            data_t.append(p)
    return add_padding(data_s), add_padding(data_t)


def build_data(data):
    """Wrap source/target lists into a shuffled fuel IndexableDataset."""
    dataset = datasets.IndexableDataset(indexables=OrderedDict([
        ('source', data['source']),
        ('target', data['target']),
        # ('target_c', data['target_c']),
    ]))
    dataset.example_iteration_scheme \
        = schemes.ShuffledExampleScheme(dataset.num_examples)
    return dataset
dataset = datasets.IndexableDataset(indexables=OrderedDict([('source', data['source']), ('target', data['target']), # ('target_c', data['target_c']), ])) dataset.example_iteration_scheme \ = schemes.ShuffledExampleScheme(dataset.num_examples) return dataset if __name__ == '__main__': # prepare logging. config = setup() # load settings. print('Log path: %s' % (config['path_experiment'] + '/experiments.{0}.id={1}.log'.format(config['task_name'],config['timemark']))) logger = init_logging(config['path_experiment'] + '/experiments.{0}.id={1}.log'.format(config['task_name'],config['timemark'])) n_rng = np.random.RandomState(config['seed']) np.random.seed(config['seed']) rng = RandomStreams(n_rng.randint(2 ** 30)) logger.info('*'*20 + ' config information ' + '*'*20) # print config information for k,v in config.items(): logger.info("\t\t\t\t%s : %s" % (k,v)) logger.info('*' * 50) # data is too large to dump into file, so has to load from raw dataset directly # train_set, test_set, idx2word, word2idx = keyphrase_dataset.load_data_and_dict(config['training_dataset'], config['testing_dataset']) train_set, validation_set, test_sets, idx2word, word2idx = deserialize_from_file(config['dataset']) test_sets = keyphrase_test_dataset.load_additional_testing_data(['inspec'], idx2word, word2idx, config, postagging=False) test_sets = keyphrase_test_dataset.load_additional_testing_data(['kp20k'], idx2word, word2idx, config, postagging=False) logger.info('#(training paper)=%d' % len(train_set['source'])) logger.info('#(training keyphrase)=%d' % sum([len(t) for t in train_set['target']])) logger.info('#(testing paper)=%d' % sum([len(test_set['target']) for test_set in test_sets.values()])) logger.info('Load data done.') if config['voc_size'] == -1: # not use unk config['enc_voc_size'] = max(list(zip(*word2idx.items()))[1]) + 1 config['dec_voc_size'] = config['enc_voc_size'] else: config['enc_voc_size'] = config['voc_size'] config['dec_voc_size'] = config['enc_voc_size'] predictions = 
len(train_set['source']) logger.info('build dataset done. ' + 'dataset size: {} ||'.format(predictions) + 'vocabulary size = {0}/ batch size = {1}'.format( config['dec_voc_size'], config['batch_size'])) # train_data = build_data(train_set) # a fuel IndexableDataset train_data_plain = list(zip(*(train_set['source'], train_set['target']))) train_data_source = np.array(train_set['source']) train_data_target = np.array(train_set['target']) count_has_OOV = 0 count_all = 0 for phrases in train_data_target: for phrase in phrases: count_all += 1 if np.greater(phrase, np.asarray(config['voc_size'])).any(): count_has_OOV += 1 print('%d / %d' % (count_has_OOV, count_all)) # test_data_plain = list(zip(*(test_set['source'], test_set['target']))) # trunk the over-long input in testing data # for test_set in test_sets.values(): # test_set['source'] = [s if len(s)<1000 else s[:1000] for s in test_set['source']] test_data_plain = np.concatenate([list(zip(*(t['source'], t['target']))) for k,t in test_sets.items()]) print('Avg length=%d, Max length=%d' % ( np.average([len(s[0]) for s in test_data_plain]), np.max([len(s[0]) for s in test_data_plain]))) train_size = len(train_data_plain) test_size = len(test_data_plain) tr_idx = n_rng.permutation(train_size)[:2000].tolist() ts_idx = n_rng.permutation(test_size )[:2000].tolist() logger.info('load the data ok.') if config['do_train'] or config['do_predict']: # build the agent if config['copynet']: agent = NRM(config, n_rng, rng, mode=config['mode'], use_attention=True, copynet=config['copynet'], identity=config['identity']) else: agent = NRM0(config, n_rng, rng, mode=config['mode'], use_attention=True, copynet=config['copynet'], identity=config['identity']) agent.build_() agent.compile_('all') logger.info('compile ok.') # load pre-trained model to continue training if config['trained_model'] and os.path.exists(config['trained_model']): logger.info('Trained model exists, loading from %s' % config['trained_model']) 
agent.load(config['trained_model']) # agent.save_weight_json(config['weight_json']) epoch = 0 epochs = 10 valid_param = {} valid_param['early_stop'] = False valid_param['valid_best_score'] = (float(sys.maxsize),float(sys.maxsize)) valid_param['valids_not_improved'] = 0 valid_param['patience'] = 3 # do training? do_train = config['do_train'] # do predicting? do_predict = config['do_predict'] # do testing? do_evaluate = config['do_evaluate'] do_validate = config['do_validate'] if do_train: while epoch < epochs: epoch += 1 loss = [] # train_batches = output_stream(train_data, config['batch_size']).get_epoch_iterator(as_dict=True) if valid_param['early_stop']: break logger.info('\nEpoch = {} -> Training Set Learning...'.format(epoch)) progbar = Progbar(train_size / config['batch_size'], logger) # number of minibatches num_batches = int(float(len(train_data_plain)) / config['batch_size']) name_ordering = np.arange(len(train_data_plain), dtype=np.int32) np.random.shuffle(name_ordering) batch_start = 0 # if it's to resume the previous training, reload the archive and settings before training if config['resume_training'] and epoch == 1: name_ordering, batch_id, loss, valid_param, optimizer_config = deserialize_from_file(config['training_archive']) batch_start += 1 optimizer_config['rng'] = agent.rng optimizer_config['save'] = False optimizer_config['clipnorm'] = config['clipnorm'] print('optimizer_config: %s' % str(optimizer_config)) # agent.optimizer = optimizers.get(config['optimizer'], kwargs=optimizer_config) agent.optimizer.iterations.set_value(optimizer_config['iterations']) agent.optimizer.lr.set_value(optimizer_config['lr']) agent.optimizer.beta_1 = optimizer_config['beta_1'] agent.optimizer.beta_2 = optimizer_config['beta_2'] agent.optimizer.clipnorm = optimizer_config['clipnorm'] # batch_start = 40001 for batch_id in range(batch_start, num_batches): # 1. 
Prepare data data_ids = name_ordering[batch_id * config['batch_size']:min((batch_id + 1) * config['batch_size'], len(train_data_plain))] # obtain mini-batch data data_s = train_data_source[data_ids] data_t = train_data_target[data_ids] # convert one data (with multiple targets) into multiple ones data_s, data_t = split_into_multiple_and_padding(data_s, data_t) # 2. Training ''' As the length of input varies often, it leads to frequent Out-of-Memory on GPU Thus I have to segment each mini batch into mini-mini batches based on their lengths (number of words) It slows down the speed somehow, but avoids the break-down effectively ''' loss_batch = [] mini_data_idx = 0 max_size = config['mini_mini_batch_length'] # max length (#words) of each mini-mini batch stack_size = 0 mini_data_s = [] mini_data_t = [] while mini_data_idx < len(data_s): if len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) >= max_size: logger.error('mini_mini_batch_length is too small. Enlarge it to 2 times') max_size = len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) * 2 config['mini_mini_batch_length'] = max_size # get a new mini-mini batch while mini_data_idx < len(data_s) and stack_size + len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) < max_size: # mini_data_s.append(data_s[mini_data_idx][:300]) # truncate to reduce memory usage mini_data_s.append(data_s[mini_data_idx]) mini_data_t.append(data_t[mini_data_idx]) stack_size += len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) mini_data_idx += 1 mini_data_s = np.asarray(mini_data_s) mini_data_t = np.asarray(mini_data_t) logger.info('Training minibatch %d/%d, avg_len=%.2f' % (mini_data_idx, len(data_s), sum([len(s) for s in mini_data_s])/mini_data_idx)) # fit the mini-mini batch if config['copynet']: # mini_data_s=(batch_size, src_len), mini_data_t=(batch_size, trg_len), data_c=(batch_size, trg_len, src_len) data_c = cc_martix(mini_data_s, mini_data_t) loss_batch += [agent.train_(unk_filter(mini_data_s), 
unk_filter(mini_data_t), data_c)] # loss += [agent.train_guard(unk_filter(mini_data_s), unk_filter(mini_data_t), data_c)] else: loss_batch += [agent.train_(unk_filter(mini_data_s), unk_filter(mini_data_t))] mini_data_s = [] mini_data_t = [] stack_size = 0 # average the training loss and print progress mean_ll = np.average(np.concatenate([l[0] for l in loss_batch])) mean_ppl = np.average(np.concatenate([l[1] for l in loss_batch])) loss.append([mean_ll, mean_ppl]) progbar.update(batch_id, [('loss_reg', mean_ll), ('ppl.', mean_ppl)]) # 3. Quick testing if config['do_quick_testing'] and batch_id % 200 == 0 and batch_id > 1: print_case = '-' * 100 +'\n' logger.info('Echo={} Evaluation Sampling.'.format(batch_id)) print_case += 'Echo={} Evaluation Sampling.\n'.format(batch_id) logger.info('generating [training set] samples') print_case += 'generating [training set] samples\n' for _ in range(1): idx = int(np.floor(n_rng.rand() * train_size)) test_s_o, test_t_o = train_data_plain[idx] if not config['multi_output']: # create pair for each phrase test_s, test_t = split_into_multiple_and_padding([test_s_o], [test_t_o]) inputs_unk = np.asarray(unk_filter(np.asarray(test_s[0], dtype='int32')), dtype='int32') prediction, score = agent.generate_multiple(inputs_unk[None, :]) outs, metrics = agent.evaluate_multiple([test_s[0]], [test_t], [test_s_o], [test_t_o], [prediction], [score], idx2word) print('*' * 50) print('*' * 50) logger.info('generating [testing set] samples') for _ in range(1): idx = int(np.floor(n_rng.rand() * test_size)) test_s_o, test_t_o = test_data_plain[idx] if not config['multi_output']: test_s, test_t = split_into_multiple_and_padding([test_s_o], [test_t_o]) inputs_unk = np.asarray(unk_filter(np.asarray(test_s[0], dtype='int32')), dtype='int32') prediction, score = agent.generate_multiple(inputs_unk[None, :], return_all=False) outs, metrics = agent.evaluate_multiple([test_s[0]], [test_t], [test_s_o], [test_t_o], [prediction], [score], idx2word) print('*' * 50) 
# write examples to log file with open(config['casestudy_log'], 'w+') as print_case_file: print_case_file.write(print_case) # 4. Test on validation data for a few batches, and do early-stopping if needed if do_validate and batch_id % 1000 == 0 and not (batch_id==0 and epoch==1): logger.info('Validate @ epoch=%d, batch=%d' % (epoch, batch_id)) # 1. Prepare data data_s = np.array(validation_set['source']) data_t = np.array(validation_set['target']) # if len(data_s) > 2000: # data_s = data_s[:2000] # data_t = data_t[:2000] # if not multi_output, split one data (with multiple targets) into multiple ones if not config['multi_output']: data_s, data_t = split_into_multiple_and_padding(data_s, data_t) loss_valid = [] # for minibatch_id in range(int(math.ceil(len(data_s)/config['mini_batch_size']))): # mini_data_s = data_s[minibatch_id * config['mini_batch_size']:min((minibatch_id + 1) * config['mini_batch_size'], len(data_s))] # mini_data_t = data_t[minibatch_id * config['mini_batch_size']:min((minibatch_id + 1) * config['mini_batch_size'], len(data_t))] mini_data_idx = 0 max_size = config['mini_mini_batch_length'] stack_size = 0 mini_data_s = [] mini_data_t = [] while mini_data_idx < len(data_s): if len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) >= max_size: logger.error('mini_mini_batch_length is too small. 
Enlarge it to 2 times') max_size = len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) * 2 config['mini_mini_batch_length'] = max_size while mini_data_idx < len(data_s) and stack_size + len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) < max_size: mini_data_s.append(data_s[mini_data_idx]) mini_data_t.append(data_t[mini_data_idx]) stack_size += len(data_s[mini_data_idx]) * len(data_t[mini_data_idx]) mini_data_idx += 1 mini_data_s = np.asarray(mini_data_s) mini_data_t = np.asarray(mini_data_t) if config['copynet']: data_c = cc_martix(mini_data_s, mini_data_t) loss_valid += [agent.validate_(unk_filter(mini_data_s), unk_filter(mini_data_t), data_c)] else: loss_valid += [agent.validate_(unk_filter(mini_data_s), unk_filter(mini_data_t))] if mini_data_idx % 100 == 0: print('\t %d / %d' % (mini_data_idx, math.ceil(len(data_s)))) mini_data_s = [] mini_data_t = [] stack_size = 0 mean_ll = np.average(np.concatenate([l[0] for l in loss_valid])) mean_ppl = np.average(np.concatenate([l[1] for l in loss_valid])) logger.info('\tPrevious best score: \t ll=%f, \t ppl=%f' % (valid_param['valid_best_score'][0], valid_param['valid_best_score'][1])) logger.info('\tCurrent score: \t ll=%f, \t ppl=%f' % (mean_ll, mean_ppl)) if mean_ll < valid_param['valid_best_score'][0]: valid_param['valid_best_score'] = (mean_ll, mean_ppl) logger.info('New best score') valid_param['valids_not_improved'] = 0 else: valid_param['valids_not_improved'] += 1 logger.info('Not improved for %s tests.' % valid_param['valids_not_improved']) # 5. Save model if batch_id % 500 == 0 and batch_id > 1: # save the weights every K rounds agent.save(config['path_experiment'] + '/experiments.{0}.id={1}.epoch={2}.batch={3}.pkl'.format(config['task_name'], config['timemark'], epoch, batch_id)) # save the game(training progress) in case of interrupt! 
optimizer_config = agent.optimizer.get_config() serialize_to_file([name_ordering, batch_id, loss, valid_param, optimizer_config], config['path_experiment'] + '/save_training_status.id={0}.epoch={1}.batch={2}.pkl'.format(config['timemark'], epoch, batch_id)) print(optimizer_config) # agent.save_weight_json(config['path_experiment'] + '/weight.print.id={0}.epoch={1}.batch={2}.json'.format(config['timemark'], epoch, batch_id)) # 6. Stop if exceed patience if valid_param['valids_not_improved'] >= valid_param['patience']: print("Not improved for %s epochs. Stopping..." % valid_param['valids_not_improved']) valid_param['early_stop'] = True break ''' test accuracy and f-score at the end of each epoch ''' if do_predict: for dataset_name in config['testing_datasets']: # override the original test_set # if the dataset does not provide postag, use load_testing_data() test_set = keyphrase_test_dataset.testing_data_loader(dataset_name, kwargs=dict(basedir=config['path'])).load_testing_data(word2idx) # test_set = keyphrase_test_dataset.testing_data_loader(dataset_name, kwargs=dict(basedir=config['path'])).load_testing_data_postag(word2idx) # test_set = test_sets[dataset_name] test_data_plain = list(zip(*(test_set['source_str'], test_set['target_str'], test_set['source'], test_set['target']))) test_size = len(test_data_plain) print(dataset_name) print('Size of test data=%d' % test_size) print('Avg length=%d, Max length=%d' % (np.average([len(s) for s in test_set['source']]), np.max([len(s) for s in test_set['source']]))) # use the first 400 samples in krapivin for testing if dataset_name == 'krapivin': test_data_plain = test_data_plain[:400] test_size = len(test_data_plain) progbar_test = Progbar(test_size, logger) logger.info('Predicting on %s' % dataset_name) input_encodings = [] output_encodings = [] predictions = [] scores = [] test_s_list = [] test_t_list = [] test_s_o_list = [] test_t_o_list = [] start_idx = -1 # start_idx = 10800 # test_set, test_s_list, test_t_list, 
test_s_o_list, test_t_o_list, input_encodings, predictions, scores, output_encodings, idx2word = deserialize_from_file(config['predict_path'] + 'predict.{0}.{1}.id={2}.pkl'.format(config['predict_type'], dataset_name, start_idx)) # Predict on testing data for idx in range(start_idx + 1, len(test_data_plain)): # len(test_data_plain) source_str, target_str, test_s_o, test_t_o = test_data_plain[idx] print('*'*20 + ' ' + str(idx)+ ' ' + '*'*20) # print(source_str) # print('[%d]%s' % (len(test_s_o), str(test_s_o))) # print(target_str) # print(test_t_o) # print('') if not config['multi_output']: test_s, test_t = split_into_multiple_and_padding([test_s_o], [test_t_o]) test_s = test_s[0] test_s_list.append(test_s) test_t_list.append(test_t) test_s_o_list.append(test_s_o) test_t_o_list.append(test_t_o) print('test_s_o=%d, test_t_o=%d, test_s=%d, test_t=%d' % (len(test_s_o), len(test_t_o), len(test_s), len(test_t))) inputs_unk = np.asarray(unk_filter(np.asarray(test_s, dtype='int32')), dtype='int32') # inputs_ = np.asarray(test_s, dtype='int32') # inputs_unk = theano.shared(inputs_unk) if config['return_encoding']: input_encoding, prediction, score, output_encoding = agent.generate_multiple(inputs_unk[None, :], return_all=True, return_encoding=True) input_encodings.append(input_encoding) output_encodings.append(output_encoding) else: prediction, score = agent.generate_multiple(inputs_unk[None, :], return_encoding=False) # print(prediction) for p in prediction: if any([f>=50000 for f in p]): print(p) predictions.append(prediction) scores.append(score) progbar_test.update(idx, []) # temporary save # if idx % 100 == 0: # print('Saving dump to: '+ config['predict_path'] + 'predict.{0}.{1}.id={2}.pkl'.format(config['predict_type'], dataset_name, idx)) # serialize_to_file( # [test_set, test_s_list, test_t_list, test_s_o_list, test_t_o_list, input_encodings, predictions, # scores, output_encodings, idx2word], # config['predict_path'] + 
'predict.{0}.{1}.id={2}.pkl'.format(config['predict_type'], dataset_name, idx)) # store predictions in file serialize_to_file([test_set, test_s_list, test_t_list, test_s_o_list, test_t_o_list, input_encodings, predictions, scores, output_encodings, idx2word], config['predict_path'] + 'predict.{0}.{1}.pkl'.format(config['predict_type'], dataset_name)) ''' Evaluate on Testing Data ''' if do_evaluate: for dataset_name in config['testing_datasets']: print_test = open(config['predict_path'] + '/experiments.{0}.id={1}.testing@{2}.{3}.len={4}.beam={5}.log'.format(config['task_name'],config['timemark'],dataset_name, config['predict_type'], config['max_len'], config['sample_beam']), 'w') test_set, test_s_list, test_t_list, test_s_o_list, test_t_o_list, _, predictions, scores, _, idx2word = deserialize_from_file(config['predict_path']+'predict.{0}.{1}.pkl'.format(config['predict_type'], dataset_name)) # use the first 400 samples in krapivin for testing if dataset_name == 'krapivin': new_test_set = {} for k,v in test_set.items(): new_test_set[k] = v[:400] test_s_list = test_s_list[:400] # numericalized source with padding test_t_list = test_t_list[:400] # numericalized targets with padding test_s_o_list = test_s_o_list[:400] # numericalized source without padding test_t_o_list = test_t_o_list[:400] # numericalized targets without padding predictions = predictions[:400] scores = scores[:400] test_set = new_test_set print_test.write('Evaluating on %s size=%d @ epoch=%d \n' % (dataset_name, test_size, epoch)) logger.info('Evaluating on %s size=%d @ epoch=%d \n' % (dataset_name, test_size, epoch)) do_stem = True if dataset_name == 'semeval': do_stem = False # Evaluation outs, overall_score = keyphrase_utils.evaluate_multiple(config, test_set, test_s_list, test_t_list, test_s_o_list, test_t_o_list, predictions, scores, idx2word, do_stem, model_name=config['task_name'], dataset_name=dataset_name) print_test.write(' '.join(outs)) print_test.write(' '.join(['%s : %s' % (str(k), 
"""Evaluation utilities for keyphrase prediction.

`evaluate_multiple` scores a model's predicted keyphrases against the
ground truth of one test dataset (P/R/F1@k, Bpref@k, MRR@k, micro and
macro averaged), writes per-document predictions and a CSV summary to
disk, and returns human-readable report lines plus the overall scores.
"""
import math
import logging
import re  # used by the prediction filter below; previously leaked in via the nltk wildcard import
from nltk.stem.porter import *
import numpy as np
import os
import copy

import keyphrase.dataset.keyphrase_test_dataset as test_dataset
from dataset import dataset_utils

logger = logging.getLogger(__name__)


def evaluate_multiple(config, test_set, inputs, outputs,
                      original_input, original_outputs,
                      samples, scores, idx2word,
                      do_stem,
                      model_name, dataset_name):
    '''
    Evaluate predicted keyphrases against ground truth, document by document.

    inputs_unk is same as inputs except for filtered out all the low-freq words to 1 (unk)
    return the top few keywords, number is set in config

    :param config: experiment configuration dict (paths, filters, voc_size, ...)
    :param test_set: dict with 'source_str', 'target_str', optional 'source_postag'
    :param inputs: numericalized source documents (one list of word ids per doc)
    :param outputs: numericalized targets (unused here; kept for interface compatibility)
    :param original_input: same as inputs, the vector of one input sentence (unused here)
    :param original_outputs: vectors of corresponding multiple outputs (e.g. keyphrases)
    :param samples: predicted phrase index lists, one list of candidates per doc
    :param scores: beam scores aligned with samples (lower = better; sorted ascending)
    :param idx2word: id -> word vocabulary mapping
    :param do_stem: whether to Porter-stem targets before matching
    :param model_name: used to name the result file
    :param dataset_name: which testing dataset is being evaluated
    :return: (outs, overall_score) - report lines and the corpus-level metric dict
    '''
    stemmer = PorterStemmer()
    # Evaluation accumulators, one entry per document
    outs = []
    micro_metrics = []
    micro_matches = []
    predict_scores = []

    # load stopword list (stemmed, for consistency with stemmed predictions)
    with open(config['path'] + '/dataset/stopword/stopword_en.txt') as stopword_file:
        stopword_set = set([stemmer.stem(w.strip()) for w in stopword_file])

    model_nickname = config['model_name']  # 'TfIdf', 'TextRank', 'SingleRank', 'ExpandRank', 'Maui', 'Kea', 'RNN', 'CopyRNN'
    base_dir = config['path'] + '/dataset/keyphrase/prediction/' + model_nickname + '_' + config['timemark'] + '/'
    prediction_dir = base_dir + dataset_name

    # reload document names from the corpus so per-document output files match the corpus
    loader = test_dataset.testing_data_loader(dataset_name, kwargs=dict(basedir=config['path']))
    docs = loader.get_docs(return_dict=False)
    doc_names = [d.name for d in docs if not d.name.endswith('.DS_Store')]  # annoying macos

    # fall back when POS tags are absent or misaligned; target_str is only a
    # placeholder so the zip below stays aligned (tags are then meaningless)
    if 'source_postag' not in test_set or len(test_set['source_postag']) != len(doc_names):
        print('postag not found')
        test_set['source_postag'] = test_set['target_str']

    assert len(doc_names) == len(test_set['source_str']) == len(inputs) == len(test_set['target_str']) \
        == len(samples) == len(scores) == len(test_set['source_postag'])

    for doc_name, source_str, input_sentence, target_list, predict_list, score_list, postag_list \
            in zip(doc_names, test_set['source_str'], inputs, test_set['target_str'],
                   samples, scores, test_set['source_postag']):
        '''
        enumerate each document, process target/predict/score and measure via p/r/f1
        '''
        target_outputs = []
        original_target_list = copy.copy(target_list)  # no stemming
        predict_indexes = []
        original_predict_outputs = []  # no stemming
        predict_outputs = []
        predict_score = []
        predict_set = set()
        correctly_matched = np.asarray([0] * max(len(target_list), len(predict_list)), dtype='int32')
        is_copied = []

        # stem the original input, do on source_str not the index list input_sentence
        stemmed_input = [stemmer.stem(w) for w in source_str]

        # convert target index into string; optionally filter by appearance in source
        for target in target_list:
            if do_stem:
                target = [stemmer.stem(w) for w in target]

            keep = True
            # whether do filtering on groundtruth phrases. if config['target_filter']==None, do nothing
            if config['target_filter']:
                # substring search: does the (stemmed) target occur verbatim in the source?
                match = None
                for i in range(len(stemmed_input) - len(target) + 1):
                    match = None
                    for j in range(len(target)):
                        if target[j] != stemmed_input[i + j]:
                            match = False
                            break
                    if j == len(target) - 1 and match == None:
                        match = True
                        break

                if match == True:
                    # if match and 'appear-only', keep this phrase
                    if config['target_filter'] == 'appear-only':
                        keep = keep and True
                    elif config['target_filter'] == 'non-appear-only':
                        keep = keep and False
                    else:
                        keep = keep and True
                elif match == False:
                    # if not match and 'appear-only', discard this phrase
                    if config['target_filter'] == 'appear-only':
                        keep = keep and False
                    # if not match and 'non-appear-only', keep this phrase
                    elif config['target_filter'] == 'non-appear-only':
                        keep = keep and True
                    else:
                        keep = keep and True

            if not keep:
                continue

            target_outputs.append(target)

        # check if prediction is noun-phrase, initialize a filter. Be sure this should be after stemming
        if config['noun_phrase_only']:
            stemmed_source = [stemmer.stem(w) for w in source_str]
            noun_phrases = dataset_utils.get_none_phrases(stemmed_source, postag_list, config['max_len'])
            noun_phrase_set = set([' '.join(p[0]) for p in noun_phrases])

        def cut_zero(sample_index, idx2word, source_str):
            """Truncate a prediction at the EOS id (0) and map ids to words.

            Ids >= voc_size are copy-mechanism pointers into the source text.
            Returns (truncated id list, word list).
            """
            sample_index = list(sample_index)
            if 0 in sample_index:
                sample_index = sample_index[:sample_index.index(0)]
            wordlist = []
            find_copy = False
            for w_index in sample_index:
                if w_index >= config['voc_size']:
                    # copied word: index points into the source document
                    wordlist.append(source_str[w_index - config['voc_size']].encode('utf-8'))
                    find_copy = True
                else:
                    wordlist.append(idx2word[w_index].encode('utf-8'))
            if find_copy:
                logger.info('Find copy! - %s - %s' % (' '.join(wordlist), str(sample_index)))
            return sample_index, wordlist

        single_word_maximum = 1  # at most one single-word prediction is admitted per document
        # convert predict index into string, filtering bad candidates
        for id, (predict, score) in enumerate(zip(predict_list, score_list)):
            predict_index, original_predict = cut_zero(predict, idx2word, source_str)
            predict = [stemmer.stem(w) for w in original_predict]

            # filter some not good ones
            keep = True
            if len(predict) == 0:
                keep = False
            number_digit = 0
            for w in predict:
                w = w.strip()
                # NOTE(review): the second literal was likely '<digit>' before this
                # copy of the source lost angle-bracketed tokens - confirm upstream
                if w == '' or w == '':
                    keep = False
                if re.match(r'[_,\(\)\.\'%]', w):
                    keep = False
                    # print('\t\tPunctuations! - %s' % str(predict))

            """
            if w == '':
                number_digit += 1
            """
            """
            if len(predict) >= 1 and (predict[0] in stopword_set or predict[-1] in stopword_set):
                keep = False
            """

            # filter out single-word predictions beyond the allowed quota
            if len(predict) <= 1:
                if single_word_maximum > 0:
                    single_word_maximum -= 1
                else:
                    keep = False

            # whether do filtering on predicted phrases. if config['predict_filter']==None, do nothing
            if config['predict_filter']:
                match = None
                for i in range(len(stemmed_input) - len(predict) + 1):
                    match = None
                    for j in range(len(predict)):
                        if predict[j] != stemmed_input[i + j]:
                            match = False
                            break
                    if j == len(predict) - 1 and match == None:
                        match = True
                        break

                if match == True:
                    # if match and 'appear-only', keep this phrase
                    if config['predict_filter'] == 'appear-only':
                        keep = keep and True
                    elif config['predict_filter'] == 'non-appear-only':
                        keep = keep and False
                elif match == False:
                    # if not match and 'appear-only', discard this phrase
                    if config['predict_filter'] == 'appear-only':
                        keep = keep and False
                    # if not match and 'non-appear-only', keep this phrase
                    elif config['predict_filter'] == 'non-appear-only':
                        keep = keep and True

            """
            # if all are digits, discard
            if number_digit == len(predict):
                keep = False
            """

            # remove duplicates (some predictions are duplicate after stemming)
            key = '-'.join(predict)
            if key in predict_set:
                keep = False

            """
            # if #(word) == #(letter), it predicts like this: h a s k e l
            if sum([len(w) for w in predict]) == len(predict) and len(predict) > 2:
                keep = False
                # print('\t\tall letters! - %s' % str(predict))

            # check if prediction is noun-phrase
            if config['noun_phrase_only']:
                if ' '.join(predict) not in noun_phrase_set:
                    print('Not a NP: %s' % (' '.join(predict)))
                    keep = False
            """

            # discard invalid ones
            if not keep:
                continue

            if any(i_ > config['voc_size'] for i_ in predict_index):
                is_copied.append(1)
            else:
                is_copied.append(0)
            original_predict_outputs.append(original_predict)
            predict_indexes.append(predict_index)
            predict_outputs.append(predict)
            predict_score.append(score)
            predict_set.add(key)

        # whether keep the longest phrases only, as there're too many phrases are part of other longer phrases
        if config['keep_longest']:
            match_phrase_index = []
            for ii, p_ii in enumerate(predict_outputs):  # shorter one
                match_times = 0
                for jj, p_jj in enumerate(predict_outputs):  # longer one
                    if ii == jj or len(p_ii) >= len(p_jj):  # p_jj must be longer than p_ii
                        continue
                    match = None
                    for start in range(len(p_jj) - len(p_ii) + 1):  # iterate the start of long phrase
                        match = None
                        for w_index in range(len(p_ii)):  # iterate the short phrase
                            if (p_ii[w_index] != p_jj[start + w_index]):
                                match = False
                                break
                        if w_index == len(p_ii) - 1 and match == None:
                            match = True
                            match_times += 1
                if match_times == 1:  # p_ii is part of p_jj, discard
                    match_phrase_index.append(ii)
                    # print("Matched pair: %s \t - \t %s" % (str(p_ii), str(p_jj)))

            original_predict_outputs = np.delete(original_predict_outputs, match_phrase_index)
            predict_indexes = np.delete(predict_indexes, match_phrase_index)
            predict_outputs = np.delete(predict_outputs, match_phrase_index)
            predict_score = np.delete(predict_score, match_phrase_index)
            is_copied = np.delete(is_copied, match_phrase_index)

        # check whether the predicted phrase is correct (match any groundtruth)
        for p_id, predict in enumerate(predict_outputs):
            for target in target_outputs:
                if len(target) == len(predict):
                    flag = True
                    for i, w in enumerate(predict):
                        if predict[i] != target[i]:
                            flag = False
                    if flag:
                        correctly_matched[p_id] = 1
                        # print('%s correct!!!' % predict)

        original_predict_outputs = np.asarray(original_predict_outputs)
        predict_indexes = np.asarray(predict_indexes)
        predict_outputs = np.asarray(predict_outputs)
        predict_score = np.asarray(predict_score)
        is_copied = np.asarray(is_copied)

        # normalize the score by phrase length and re-rank accordingly
        # NOTE(review): indentation of the re-ranking below was lost in this copy;
        # it is restored inside the normalize branch - confirm against upstream
        if config['normalize_score']:
            predict_score = np.asarray(
                [math.log(math.exp(score) / len(predict))
                 for predict, score in zip(predict_outputs, predict_score)])
            score_list_index = np.argsort(predict_score)
            original_predict_outputs = original_predict_outputs[score_list_index]
            predict_indexes = predict_indexes[score_list_index]
            predict_outputs = predict_outputs[score_list_index]
            predict_score = predict_score[score_list_index]
            correctly_matched = correctly_matched[score_list_index]
            is_copied = is_copied[score_list_index]

        metric_dict = {}
        metric_dict['appear_target_number'] = len(target_outputs)
        metric_dict['target_number'] = len(target_list)

        '''
        Compute micro metrics
        '''
        for number_to_predict in [5, 10, 15, 20, 30, 40, 50]:
            metric_dict['correct_number@%d' % number_to_predict] = sum(correctly_matched[:number_to_predict])
            metric_dict['p@%d' % number_to_predict] = \
                float(sum(correctly_matched[:number_to_predict])) / float(number_to_predict)
            if len(target_outputs) != 0:
                metric_dict['r@%d' % number_to_predict] = \
                    float(sum(correctly_matched[:number_to_predict])) / float(len(target_outputs))
            else:
                metric_dict['r@%d' % number_to_predict] = 0

            if metric_dict['p@%d' % number_to_predict] + metric_dict['r@%d' % number_to_predict] != 0:
                metric_dict['f1@%d' % number_to_predict] = \
                    2 * metric_dict['p@%d' % number_to_predict] * metric_dict['r@%d' % number_to_predict] / float(
                        metric_dict['p@%d' % number_to_predict] + metric_dict['r@%d' % number_to_predict])
            else:
                metric_dict['f1@%d' % number_to_predict] = 0

            # Compute the binary preference measure (Bpref)
            bpref = 0.
            trunked_match = correctly_matched[:number_to_predict].tolist()  # get the first K prediction to evaluate
            match_indexes = np.nonzero(trunked_match)[0]
            if len(match_indexes) > 0:
                for mid, mindex in enumerate(match_indexes):
                    # there're mindex elements, and mid elements are correct, before the (mindex+1)-th element
                    bpref += 1. - float(mindex - mid) / float(number_to_predict)
                metric_dict['bpref@%d' % number_to_predict] = float(bpref) / float(len(match_indexes))
            else:
                metric_dict['bpref@%d' % number_to_predict] = 0

            # Compute the mean reciprocal rank (MRR)
            rank_first = 0
            try:
                rank_first = trunked_match.index(1) + 1
            except ValueError:
                pass  # no correct prediction in the top K
            if rank_first > 0:
                metric_dict['mrr@%d' % number_to_predict] = float(1) / float(rank_first)
            else:
                metric_dict['mrr@%d' % number_to_predict] = 0

        micro_metrics.append(metric_dict)
        micro_matches.append(correctly_matched)
        predict_scores.append(predict_score)

        '''
        Output keyphrases to prediction folder
        '''
        if not os.path.exists(prediction_dir):
            os.makedirs(prediction_dir)
        with open(prediction_dir + '/' + doc_name + '.txt.phrases', 'w') as output_file:
            output_file.write('\n'.join([' '.join(o_) for o_ in original_predict_outputs]))

        '''
        Print information on each prediction
        '''
        # print stuff
        a = '[SOURCE][{0}]: {1}'.format(len(input_sentence), ' '.join(source_str))
        logger.info(a)
        a += '\n'

        b = '[GROUND-TRUTH]: %d/%d ground-truth phrases\n\t\t' % (len(target_outputs), len(target_list))
        target_output_set = set(['_'.join(t) for t in target_outputs])
        for id, target in enumerate(original_target_list):
            # bracket the targets that survived the appearance filter
            if '_'.join([stemmer.stem(w) for w in target]) in target_output_set:
                b += '[' + ' '.join(target) + ']; '
            else:
                b += ' '.join(target) + '; '
        logger.info(b)
        b += '\n'

        c = '[PREDICTION]: %d/%d predictions\n' % (len(predict_outputs), len(predict_list))
        c += '[Correct@10] = %d\n' % metric_dict['correct_number@10']
        c += '[Correct@50] = %d\n' % metric_dict['correct_number@50']
        for id, (predict, score, predict_index) in enumerate(
                zip(original_predict_outputs, predict_score, predict_indexes)):
            c += ('\n\t\t[%.3f][%d][%d]' % (score, len(predict), sum([len(w) for w in predict]))) + ' '.join(predict)
            if correctly_matched[id] == 1:
                c += ' [correct!]'
            if is_copied[id] == 1:
                c += '[copied!] %s' % str(predict_index)
        c += '\n'
        logger.info(c)
        a += b + c

        for number_to_predict in [5, 10, 15, 20, 30, 40, 50]:
            d = '@%d - Precision=%.4f, Recall=%.4f, F1=%.4f, Bpref=%.4f, MRR=%.4f' % (
                number_to_predict,
                metric_dict['p@%d' % number_to_predict], metric_dict['r@%d' % number_to_predict],
                metric_dict['f1@%d' % number_to_predict], metric_dict['bpref@%d' % number_to_predict],
                metric_dict['mrr@%d' % number_to_predict])
            logger.info(d)
            a += d + '\n'

        logger.info('*' * 100)
        outs.append(a)
        outs.append('*' * 100 + '\n')

    # we could omit the bad data which contains 0 predictions.
    # But for consistency we use all for evaluation
    """
    if config['target_filter'] == 'appear-only':
        real_test_size = sum([1 if m['appear_target_number'] > 0 else 0 for m in micro_metrics])
    elif config['target_filter'] == 'non-appear-only':
        real_test_size = sum([1 if m['appear_target_number'] > 0 else 0 for m in micro_metrics])
    else:
        real_test_size = len(inputs)
    """
    real_test_size = len(inputs)
    logger.info("real_test_size = %d" % real_test_size)

    '''
    Compute the corpus evaluation
    '''
    logger.info('Experiment result: %s' % (config['predict_path'] + '/' + model_name + '-' + dataset_name + '.txt'))
    csv_writer = open(config['predict_path'] + '/' + model_name + '-' + dataset_name + '.txt', 'w')

    overall_score = {}
    for k in [5, 10, 15, 20, 30, 40, 50]:
        correct_number = sum([m['correct_number@%d' % k] for m in micro_metrics])
        appear_target_number = sum([m['appear_target_number'] for m in micro_metrics])
        target_number = sum([m['target_number'] for m in micro_metrics])

        # Compute the Micro Measures, by averaging the micro-score of each prediction
        overall_score['p@%d' % k] = float(sum([m['p@%d' % k] for m in micro_metrics])) / float(real_test_size)
        overall_score['r@%d' % k] = float(sum([m['r@%d' % k] for m in micro_metrics])) / float(real_test_size)
        overall_score['f1@%d' % k] = float(sum([m['f1@%d' % k] for m in micro_metrics])) / float(real_test_size)

        output_str = 'Overall - %s valid testing data=%d, Number of Target=%d/%d, Number of Prediction=%d, Number of Correct=%d' % (
            config['predict_type'], real_test_size,
            appear_target_number, target_number,
            real_test_size * k, correct_number
        )
        outs.append(output_str + '\n')
        logger.info(output_str)
        output_str = 'Micro:\t\tP@%d=%f, R@%d=%f, F1@%d=%f' % (
            k, overall_score['p@%d' % k], k, overall_score['r@%d' % k], k, overall_score['f1@%d' % k]
        )
        outs.append(output_str + '\n')
        logger.info(output_str)
        csv_writer.write('Micro@%d, %f, %f, %f\n' % (
            k, overall_score['p@%d' % k], overall_score['r@%d' % k], overall_score['f1@%d' % k]
        ))

        # Compute the Macro Measures
        # NOTE(review): divides by appear_target_number without a zero guard;
        # raises ZeroDivisionError if no document has an appearing target
        overall_score['macro_p@%d' % k] = correct_number / float(real_test_size * k)
        overall_score['macro_r@%d' % k] = correct_number / float(appear_target_number)
        if overall_score['macro_p@%d' % k] + overall_score['macro_r@%d' % k] > 0:
            overall_score['macro_f1@%d' % k] = 2 * overall_score['macro_p@%d' % k] * overall_score[
                'macro_r@%d' % k] / float(overall_score['macro_p@%d' % k] + overall_score['macro_r@%d' % k])
        else:
            overall_score['macro_f1@%d' % k] = 0

        output_str = 'Macro:\t\tP@%d=%f, R@%d=%f, F1@%d=%f' % (
            k, overall_score['macro_p@%d' % k], k, overall_score['macro_r@%d' % k],
            k, overall_score['macro_f1@%d' % k]
        )
        outs.append(output_str + '\n')
        logger.info(output_str)
        csv_writer.write('Macro@%d, %f, %f, %f\n' % (
            k, overall_score['macro_p@%d' % k], overall_score['macro_r@%d' % k], overall_score['macro_f1@%d' % k]
        ))

        # Compute the binary preference measure (Bpref)
        overall_score['bpref@%d' % k] = float(sum([m['bpref@%d' % k] for m in micro_metrics])) / float(real_test_size)
        # Compute the mean reciprocal rank (MRR)
        overall_score['mrr@%d' % k] = float(sum([m['mrr@%d' % k] for m in micro_metrics])) / float(real_test_size)
        output_str = '\t\t\tBpref@%d=%f, MRR@%d=%f' % (
            k, overall_score['bpref@%d' % k], k, overall_score['mrr@%d' % k]
        )
        outs.append(output_str + '\n')
        logger.info(output_str)

    # evaluate the score cutoff: how do macro metrics change if we keep only
    # predictions whose (negative log-likelihood) score is below a threshold
    for cutoff in range(15):
        overall_predicted_number = 0
        overall_correct_number = 0
        overall_target_number = sum([m['target_number'] for m in micro_metrics])
        for score_list, metric_dict, correctly_matched in zip(predict_scores, micro_metrics, micro_matches):
            # FIX: was len(filter(...)), which fails on Python 3 (filter is lazy);
            # the list comprehension is equivalent on both Python 2 and 3
            predicted_number = len([s for s in score_list if s < cutoff])
            overall_predicted_number += predicted_number
            overall_correct_number += sum(correctly_matched[:predicted_number])

        if overall_predicted_number > 0:
            macro_p = float(overall_correct_number) / float(overall_predicted_number)
        else:
            macro_p = 0
        macro_r = float(overall_correct_number) / float(overall_target_number)
        if macro_p + macro_r > 0:
            macro_f1 = 2. * macro_p * macro_r / (macro_p + macro_r)
        else:
            macro_f1 = 0

        logger.info('Macro,cutoff@%d, correct_number=%d, predicted_number=%d, target_number=%d, p=%f, r=%f, f1=%f' % (
            cutoff, overall_correct_number, overall_predicted_number, overall_target_number,
            macro_p, macro_r, macro_f1
        ))
        csv_writer.write('Macro,cutoff@%d, %f, %f, %f\n' % (
            cutoff, macro_p, macro_r, macro_f1
        ))

    csv_writer.close()
    return outs, overall_score


def export_keyphrase(predictions, text_dir, prediction_dir):
    """Write one '.phrases' file per document in text_dir.

    :param predictions: list of phrase lists, aligned with the sorted listing of text_dir
    :param text_dir: directory whose file names (before the first '.') name the documents
    :param prediction_dir: output path prefix (must already include a trailing separator)
    """
    doc_names = [name[:name.index('.')] for name in os.listdir(text_dir)]
    for name_, prediction_ in zip(doc_names, predictions):
        # FIX: the file was opened in the default read mode and then written to,
        # which raises at the first write; open for writing instead
        with open(prediction_dir + name_ + '.phrases', 'w') as output_file:
            output_file.write('\n'.join(prediction_))
# -*- coding: utf-8 -*-
# Natural Language Toolkit: Interface to the Stanford Part-of-speech and Named-Entity Taggers
#
# Copyright (C) 2001-2016 NLTK Project
# Author: Nitin Madnani
#         Rami Al-Rfou'
# URL: (stripped in this copy of the source)
# For license information, see LICENSE.TXT

"""
A module for interfacing with the Stanford taggers.

Tagger models need to be downloaded from http://nlp.stanford.edu/software
and the STANFORD_MODELS environment variable set (a colon-separated
list of paths).

For more details see the documentation for StanfordPOSTagger and StanfordNERTagger.
"""

import os
import tempfile
from subprocess import PIPE
import warnings

from nltk.internals import find_file, find_jar, config_java, java, _java_options, find_jars_within_path
from nltk.tag.api import TaggerI
from nltk import compat

# Where users are pointed to download the tagger jars/models.
_stanford_url = 'http://nlp.stanford.edu/software'


class StanfordTagger(TaggerI):
    """
    An interface to Stanford taggers. Subclasses must define:

    - ``_cmd`` property: A property that returns the command that will be
      executed.
    - ``_SEPARATOR``: Class constant that represents that character that
      is used to separate the tokens from their tags.
    - ``_JAR`` file: Class constant that represents the jar file name.
    """

    # Subclasses override these; the empty defaults make direct
    # instantiation emit the warning in __init__ below.
    _SEPARATOR = ''
    _JAR = ''

    def __init__(self, model_filename, path_to_jar=None, encoding='utf8',
                 verbose=False, java_options='-mx1000m'):
        # Warn (do not fail) when the abstract base is instantiated directly.
        if not self._JAR:
            warnings.warn('The StanfordTagger class is not meant to be '
                          'instantiated directly. Did you mean StanfordPOSTagger or StanfordNERTagger?')
        # Locate the tagger jar and the model file (model is resolved via
        # the STANFORD_MODELS environment variable).
        self._stanford_jar = find_jar(
            self._JAR, path_to_jar,
            searchpath=(), url=_stanford_url,
            verbose=verbose)
        self._stanford_model = find_file(model_filename,
                                         env_vars=('STANFORD_MODELS',), verbose=verbose)

        # Adding logging jar files to classpath
        stanford_dir = os.path.split(self._stanford_jar)[0]
        self._stanford_jar = tuple(find_jars_within_path(stanford_dir))

        self._encoding = encoding
        self.java_options = java_options

    @property
    def _cmd(self):
        # Subclasses supply the actual java command line.
        raise NotImplementedError

    def tag(self, tokens):
        # This function should return list of tuple rather than list of list
        # (flattens the single-sentence result from tag_sents).
        return sum(self.tag_sents([tokens]), [])

    def tag_sents(self, sentences):
        """Tag multiple sentences by round-tripping them through the Java tagger.

        Writes the sentences to a temp file, invokes the tagger as a
        subprocess, parses its stdout, and restores the default java options.
        """
        encoding = self._encoding
        default_options = ' '.join(_java_options)
        config_java(options=self.java_options, verbose=False)

        # Create a temporary input file
        _input_fh, self._input_file_path = tempfile.mkstemp(text=True)

        cmd = list(self._cmd)
        cmd.extend(['-encoding', encoding])

        # Write the actual sentences to the temporary input file
        _input_fh = os.fdopen(_input_fh, 'wb')
        _input = '\n'.join((' '.join(x) for x in sentences))
        if isinstance(_input, compat.text_type) and encoding:
            _input = _input.encode(encoding)
        _input_fh.write(_input)
        _input_fh.close()

        # Run the tagger and get the output
        stanpos_output, _stderr = java(cmd, classpath=self._stanford_jar,
                                       stdout=PIPE, stderr=PIPE)
        stanpos_output = stanpos_output.decode(encoding)

        # Delete the temporary file
        os.unlink(self._input_file_path)

        # Return java configurations to their default values
        config_java(options=default_options, verbose=False)

        return self.parse_output(stanpos_output, sentences)

    def parse_output(self, text, sentences=None):
        # Output the tagged sentences: one sentence per line, tokens as
        # word<SEPARATOR>tag; rejoin on SEPARATOR so words containing the
        # separator character survive (everything but the last field is the word).
        tagged_sentences = []
        for tagged_sentence in text.strip().split("\n"):
            sentence = []
            for tagged_word in tagged_sentence.strip().split():
                word_tags = tagged_word.strip().split(self._SEPARATOR)
                sentence.append((''.join(word_tags[:-1]), word_tags[-1]))
            tagged_sentences.append(sentence)
        return tagged_sentences


# NOTE(review): the source chunk ends immediately after this class's __init__;
# the original nltk module continues (e.g. StanfordPOSTagger) past this point.
class StanfordNERTagger(StanfordTagger):
    """
    A class for Named-Entity Tagging with Stanford Tagger. The input is the
    paths to:

    - a model trained on training data
    - (optionally) the path to the stanford tagger jar file. If not specified
      here, then this jar file must be specified in the CLASSPATH envinroment
      variable.
    - (optionally) the encoding of the training data (default: UTF-8)

    Example:

        >>> from nltk.tag import StanfordNERTagger
        >>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') # doctest: +SKIP
        >>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) # doctest: +SKIP
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
         ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
         ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    """

    _SEPARATOR = '/'
    _JAR = 'stanford-ner.jar'
    _FORMAT = 'slashTags'

    def __init__(self, *args, **kwargs):
        super(StanfordNERTagger, self).__init__(*args, **kwargs)