Repository: memray/seq2seq-keyphrase
Branch: master
Commit: e8660727a4f1
Files: 51
Total size: 693.0 KB
Directory structure:
gitextract_qitx157j/
├── .gitignore
├── LICENSE
├── README.md
├── emolga/
│ ├── __init__.py
│ ├── basic/
│ │ ├── __init__.py
│ │ ├── activations.py
│ │ ├── initializations.py
│ │ ├── objectives.py
│ │ └── optimizers.py
│ ├── dataset/
│ │ ├── __init__.py
│ │ └── build_dataset.py
│ ├── layers/
│ │ ├── __init__.py
│ │ ├── attention.py
│ │ ├── core.py
│ │ ├── embeddings.py
│ │ ├── gridlstm.py
│ │ ├── ntm_minibatch.py
│ │ └── recurrent.py
│ ├── models/
│ │ ├── __init__.py
│ │ ├── core.py
│ │ ├── covc_encdec.py
│ │ ├── encdec.py
│ │ ├── ntm_encdec.py
│ │ ├── pointers.py
│ │ └── variational.py
│ └── utils/
│ ├── __init__.py
│ ├── generic_utils.py
│ ├── io_utils.py
│ ├── np_utils.py
│ ├── test_utils.py
│ └── theano_utils.py
└── keyphrase/
├── __init__.py
├── baseline/
│ ├── evaluate.py
│ └── export_dataset.py
├── config.py
├── dataset/
│ ├── __init__.py
│ ├── dataset_utils.py
│ ├── inspec/
│ │ ├── __init__.py
│ │ ├── inspec_export_json.py
│ │ └── key_convert_maui.py
│ ├── json_count.py
│ ├── keyphrase_dataset.py
│ ├── keyphrase_test_dataset.py
│ ├── keyphrase_train_dataset.py
│ └── million-paper/
│ ├── clean_export_json.py
│ └── preprocess.py
├── keyphrase_copynet.py
├── keyphrase_utils.py
└── util/
├── __init__.py
├── gpu-test.py
└── stanford-pos-tagger.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# added by memray
/.idea/
/Experiment/
/dataset/
/stanford-postagger/
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
# Translations
*.mo
*.pot
# Django stuff:
*.log
# Sphinx documentation
docs/_build/
# PyBuilder
target/
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2016 Rui Meng
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# seq2seq-keyphrase
### Note: this repository has been deprecated. Please move to our latest code/data/model release for keyphrase generation at [https://github.com/memray/OpenNMT-kpg-release](https://github.com/memray/OpenNMT-kpg-release). Thank you.
Data
==========
Check out all datasets at [https://huggingface.co/memray/](https://huggingface.co/memray/).
Introduction
==========
This is an implementation of [Deep Keyphrase Generation](http://memray.me/uploads/acl17-keyphrase-generation.pdf) based on [CopyNet](https://github.com/MultiPath/CopyNet).
One training dataset (**KP20k**), five testing datasets (**KP20k, Inspec, NUS, SemEval, Krapivin**) and one pre-trained model are provided.
Note that the model is trained on scientific papers (abstracts and keywords) in the Computer Science domain, so it is expected to work well only on CS papers.
Cite
==========
If you use the code or datasets, please cite the following paper:
> Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky and Yu Chi. Deep Keyphrase Generation. 55th Annual Meeting of Association for Computational Linguistics, 2017. [[PDF]](http://memray.me/uploads/acl17-keyphrase-generation.pdf) [[arXiv]](https://arxiv.org/abs/1704.06879)
```
@InProceedings{meng-EtAl:2017:Long,
author = {Meng, Rui and Zhao, Sanqiang and Han, Shuguang and He, Daqing and Brusilovsky, Peter and Chi, Yu},
title = {Deep Keyphrase Generation},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {July},
year = {2017},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
pages = {582--592},
url = {http://aclweb.org/anthology/P17-1054}
}
```
================================================
FILE: emolga/__init__.py
================================================
__author__ = 'yinpengcheng'
================================================
FILE: emolga/basic/__init__.py
================================================
__author__ = 'jiataogu'
================================================
FILE: emolga/basic/activations.py
================================================
import theano.tensor as T
def softmax(x):
return T.nnet.softmax(x.reshape((-1, x.shape[-1]))).reshape(x.shape)
def vector_softmax(x):
return T.nnet.softmax(x.reshape((1, x.shape[0])))[0]
def time_distributed_softmax(x):
import warnings
warnings.warn("time_distributed_softmax is deprecated. Just use softmax!", DeprecationWarning)
return softmax(x)
def softplus(x):
return T.nnet.softplus(x)
def relu(x):
return T.nnet.relu(x)
def tanh(x):
return T.tanh(x)
def sigmoid(x):
return T.nnet.sigmoid(x)
def hard_sigmoid(x):
return T.nnet.hard_sigmoid(x)
def linear(x):
'''
The function returns the variable that is passed in, so all types work
'''
return x
def maxout2(x):
shape = x.shape
if x.ndim == 1:
shape1 = T.cast(shape[0] / 2, 'int32')
shape2 = T.cast(2, 'int32')
x = x.reshape([shape1, shape2])
x = x.max(1)
elif x.ndim == 2:
shape1 = T.cast(shape[1] / 2, 'int32')
shape2 = T.cast(2, 'int32')
x = x.reshape([shape[0], shape1, shape2])
x = x.max(2)
elif x.ndim == 3:
shape1 = T.cast(shape[2] / 2, 'int32')
shape2 = T.cast(2, 'int32')
x = x.reshape([shape[0], shape[1], shape1, shape2])
x = x.max(3)
return x
from emolga.utils.generic_utils import get_from_module
def get(identifier):
return get_from_module(identifier, globals(), 'activation function')
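The `softmax` above flattens its input to 2-D before calling Theano's row-wise softmax, and `maxout2` pairs adjacent units and keeps the larger of each pair. A minimal NumPy sketch of the same reshaping tricks (illustrative only, not part of the repository; it adds max-subtraction for numerical stability):

```python
import numpy as np

def softmax_lastdim(x):
    # Mirror of emolga's softmax: flatten to 2-D, take a row-wise
    # softmax, and restore the original shape. Max-subtraction is
    # added here for numerical stability.
    flat = x.reshape(-1, x.shape[-1])
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).reshape(x.shape)

def maxout2_np(x):
    # Pair adjacent units along the last axis and keep the larger one,
    # halving the dimensionality (the last axis must be even).
    return x.reshape(x.shape[:-1] + (x.shape[-1] // 2, 2)).max(axis=-1)

probs = softmax_lastdim(np.zeros((2, 3, 4)))       # uniform: every entry 0.25
pooled = maxout2_np(np.array([[1., 5., 2., 0.]]))  # -> [[5., 2.]]
```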
================================================
FILE: emolga/basic/initializations.py
================================================
import theano
import theano.tensor as T
import numpy as np
from emolga.utils.theano_utils import sharedX, shared_zeros, shared_ones
def get_fans(shape):
if isinstance(shape, int):
shape = (1, shape)
fan_in = shape[0] if len(shape) == 2 else np.prod(shape[1:])
fan_out = shape[1] if len(shape) == 2 else shape[0]
return fan_in, fan_out
def uniform(shape, scale=0.1):
return sharedX(np.random.uniform(low=-scale, high=scale, size=shape))
def normal(shape, scale=0.05):
return sharedX(np.random.randn(*shape) * scale)
def lecun_uniform(shape):
''' Reference: LeCun 98, Efficient Backprop
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
'''
fan_in, fan_out = get_fans(shape)
scale = np.sqrt(3. / fan_in)
return uniform(shape, scale)
def glorot_normal(shape):
''' Reference: Glorot & Bengio, AISTATS 2010
'''
fan_in, fan_out = get_fans(shape)
s = np.sqrt(2. / (fan_in + fan_out))
return normal(shape, s)
def glorot_uniform(shape):
fan_in, fan_out = get_fans(shape)
s = np.sqrt(6. / (fan_in + fan_out))
return uniform(shape, s)
def he_normal(shape):
''' Reference: He et al., http://arxiv.org/abs/1502.01852
'''
fan_in, fan_out = get_fans(shape)
s = np.sqrt(2. / fan_in)
return normal(shape, s)
def he_uniform(shape):
fan_in, fan_out = get_fans(shape)
s = np.sqrt(6. / fan_in)
return uniform(shape, s)
def orthogonal(shape, scale=1.1):
''' From Lasagne
'''
flat_shape = (shape[0], np.prod(shape[1:]))
a = np.random.normal(0.0, 1.0, flat_shape)
u, _, v = np.linalg.svd(a, full_matrices=False)
# pick the one with the correct shape
q = u if u.shape == flat_shape else v
q = q.reshape(shape)
return sharedX(scale * q[:shape[0], :shape[1]])
def identity(shape, scale=1):
if len(shape) != 2 or shape[0] != shape[1]:
raise Exception("Identity matrix initialization can only be used for 2D square matrices")
else:
return sharedX(scale * np.identity(shape[0]))
def zero(shape):
return shared_zeros(shape)
def one(shape):
return shared_ones(shape)
from emolga.utils.generic_utils import get_from_module
def get(identifier):
return get_from_module(identifier, globals(), 'initialization')
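`glorot_uniform` above samples from U(-s, s) with s = sqrt(6 / (fan_in + fan_out)). A NumPy sketch of the fan computation and the draw (illustrative only, not part of the repository):

```python
import numpy as np

def get_fans(shape):
    # Same convention as above: 2-D weights use (rows, cols) directly;
    # higher-D shapes treat the product of trailing dims as fan_in.
    fan_in = shape[0] if len(shape) == 2 else int(np.prod(shape[1:]))
    fan_out = shape[1] if len(shape) == 2 else shape[0]
    return fan_in, fan_out

def glorot_uniform(shape, rng=np.random):
    # Glorot & Bengio (2010): this scale keeps activation variance
    # roughly constant across layers.
    fan_in, fan_out = get_fans(shape)
    s = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-s, s, size=shape)

W = glorot_uniform((200, 100))
bound = np.sqrt(6.0 / 300)  # every weight lies within (-bound, bound)
```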
================================================
FILE: emolga/basic/objectives.py
================================================
from __future__ import absolute_import
import theano
import theano.tensor as T
import numpy as np
from six.moves import range
if theano.config.floatX == 'float64':
epsilon = 1.0e-9
else:
epsilon = 1.0e-7
def mean_squared_error(y_true, y_pred):
return T.sqr(y_pred - y_true).mean(axis=-1)
def mean_absolute_error(y_true, y_pred):
return T.abs_(y_pred - y_true).mean(axis=-1)
def mean_absolute_percentage_error(y_true, y_pred):
return T.abs_((y_true - y_pred) / T.clip(T.abs_(y_true), epsilon, np.inf)).mean(axis=-1) * 100.
def mean_squared_logarithmic_error(y_true, y_pred):
return T.sqr(T.log(T.clip(y_pred, epsilon, np.inf) + 1.) - T.log(T.clip(y_true, epsilon, np.inf) + 1.)).mean(axis=-1)
def squared_hinge(y_true, y_pred):
return T.sqr(T.maximum(1. - y_true * y_pred, 0.)).mean(axis=-1)
def hinge(y_true, y_pred):
return T.maximum(1. - y_true * y_pred, 0.).mean(axis=-1)
def categorical_crossentropy(y_true, y_pred):
'''Expects a binary class matrix instead of a vector of scalar classes
'''
y_pred = T.clip(y_pred, epsilon, 1.0 - epsilon)
# scale preds so that the class probas of each sample sum to 1
y_pred /= y_pred.sum(axis=-1, keepdims=True)
cce = T.nnet.categorical_crossentropy(y_pred, y_true)
return cce
def binary_crossentropy(y_true, y_pred):
y_pred = T.clip(y_pred, epsilon, 1.0 - epsilon)
bce = T.nnet.binary_crossentropy(y_pred, y_true).mean(axis=-1)
return bce
def poisson_loss(y_true, y_pred):
return T.mean(y_pred - y_true * T.log(y_pred + epsilon), axis=-1)
####################################################
# Variational Auto-encoder
def gaussian_kl_divergence(mean, ln_var):
"""Computes the KL-divergence of Gaussian variables from the standard one.
Given two variable ``mean`` representing :math:`\\mu` and ``ln_var``
representing :math:`\\log(\\sigma^2)`, this function returns a variable
representing the KL-divergence between the given multi-dimensional Gaussian
:math:`N(\\mu, S)` and the standard Gaussian :math:`N(0, I)`
.. math::
D_{\\mathbf{KL}}(N(\\mu, S) \\| N(0, I)),
where :math:`S` is a diagonal matrix such that :math:`S_{ii} = \\sigma_i^2`
and :math:`I` is an identity matrix.
Args:
mean (~chainer.Variable): A variable representing mean of given
gaussian distribution, :math:`\\mu`.
ln_var (~chainer.Variable): A variable representing logarithm of
variance of given gaussian distribution, :math:`\\log(\\sigma^2)`.
Returns:
~chainer.Variable: A variable representing KL-divergence between
given gaussian distribution and the standard gaussian.
"""
var = T.exp(ln_var)
return 0.5 * T.sum(mean * mean + var - ln_var - 1, 1)
# aliases
mse = MSE = mean_squared_error
mae = MAE = mean_absolute_error
mape = MAPE = mean_absolute_percentage_error
msle = MSLE = mean_squared_logarithmic_error
gkl = GKL = gaussian_kl_divergence
from emolga.utils.generic_utils import get_from_module
def get(identifier):
return get_from_module(identifier, globals(), 'objective')
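`gaussian_kl_divergence` above implements the closed form KL(N(mu, diag(sigma^2)) || N(0, I)) = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1). A NumPy check of that formula (illustrative only, not part of the repository):

```python
import numpy as np

def gaussian_kl(mean, ln_var):
    # Closed form: KL(N(mu, diag(sigma^2)) || N(0, I))
    #            = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    var = np.exp(ln_var)
    return 0.5 * np.sum(mean ** 2 + var - ln_var - 1.0, axis=1)

# The standard Gaussian has zero divergence from itself:
kl_zero = gaussian_kl(np.zeros((2, 3)), np.zeros((2, 3)))  # -> [0., 0.]
# Shifting the mean to 1 in each of 2 dims gives 0.5 * 2 = 1.0:
kl_one = gaussian_kl(np.ones((1, 2)), np.zeros((1, 2)))    # -> [1.]
```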
================================================
FILE: emolga/basic/optimizers.py
================================================
from __future__ import absolute_import
import theano
import sys
from theano.sandbox.rng_mrg import MRG_RandomStreams
import theano.tensor as T
import logging
from emolga.utils.theano_utils import shared_zeros, shared_scalar, floatX
from emolga.utils.generic_utils import get_from_module
from six.moves import zip
from copy import copy, deepcopy
logger = logging.getLogger(__name__)
def clip_norm(g, c, n):
if c > 0:
g = T.switch(T.ge(n, c), g * c / n, g)
return g
def kl_divergence(p, p_hat):
return p_hat - p + p * T.log(p / p_hat)
class Optimizer(object):
def __init__(self, **kwargs):
self.__dict__.update(kwargs)
self.updates = []
self.save_parm = []
def add(self, v):
self.save_parm += [v]
def get_state(self):
return [u[0].get_value() for u in self.updates]
def set_state(self, value_list):
assert len(self.updates) == len(value_list)
for u, v in zip(self.updates, value_list):
u[0].set_value(floatX(v))
def get_updates(self, params, loss):
raise NotImplementedError
def get_gradients(self, loss, params):
"""
Consider the situation that gradient is weighted.
"""
if isinstance(loss, list):
grads = T.grad(loss[0], params, consider_constant=loss[1:]) # gradient of loss
else:
grads = T.grad(loss, params)
if hasattr(self, 'clipnorm') and self.clipnorm > 0:
print('use gradient clipping!!')
print('clipnorm = %f' % self.clipnorm)
norm = T.sqrt(sum([T.sum(g ** 2) for g in grads]))
grads = [clip_norm(g, self.clipnorm, norm) for g in grads]
else:
print('not use gradient clipping!!')
return grads
def get_config(self):
return {"name": self.__class__.__name__}
class SGD(Optimizer):
def __init__(self, lr=0.05, momentum=0.9, decay=0.01, nesterov=True, *args, **kwargs):
super(SGD, self).__init__(**kwargs)
self.__dict__.update(locals())
self.iterations = shared_scalar(0)
self.lr = shared_scalar(lr)
self.momentum = shared_scalar(momentum)
def get_updates(self, params, loss):
grads = self.get_gradients(loss, params)
lr = self.lr * (1.0 / (1.0 + self.decay * self.iterations))
self.updates = [(self.iterations, self.iterations + 1.)]
for p, g in zip(params, grads):
m = shared_zeros(p.get_value().shape) # momentum
v = self.momentum * m - lr * g # velocity
self.updates.append((m, v))
if self.nesterov:
new_p = p + self.momentum * v - lr * g
else:
new_p = p + v
self.updates.append((p, new_p)) # apply constraints
return self.updates
def get_config(self):
return {"name": self.__class__.__name__,
"lr": float(self.lr.get_value()),
"momentum": float(self.momentum.get_value()),
"decay": float(self.decay.get_value()),
"nesterov": self.nesterov}
class RMSprop(Optimizer):
def __init__(self, lr=0.001, rho=0.9, epsilon=1e-6, *args, **kwargs):
super(RMSprop, self).__init__(**kwargs)
self.__dict__.update(locals())
self.lr = shared_scalar(lr)
self.rho = shared_scalar(rho)
self.iterations = shared_scalar(0)
def get_updates(self, params, loss):
grads = self.get_gradients(loss, params)
accumulators = [shared_zeros(p.get_value().shape) for p in params]
self.updates = [(self.iterations, self.iterations + 1.)]
for p, g, a in zip(params, grads, accumulators):
new_a = self.rho * a + (1 - self.rho) * g ** 2 # update accumulator
self.updates.append((a, new_a))
new_p = p - self.lr * g / T.sqrt(new_a + self.epsilon)
self.updates.append((p, new_p)) # apply constraints
return self.updates
def get_config(self):
return {"name": self.__class__.__name__,
"lr": float(self.lr.get_value()),
"rho": float(self.rho.get_value()),
"epsilon": self.epsilon}
class Adagrad(Optimizer):
def __init__(self, lr=0.01, epsilon=1e-6, *args, **kwargs):
super(Adagrad, self).__init__(**kwargs)
self.__dict__.update(locals())
self.lr = shared_scalar(lr)
    def get_updates(self, params, loss):
        grads = self.get_gradients(loss, params)
        accumulators = [shared_zeros(p.get_value().shape) for p in params]
        self.updates = []
        for p, g, a in zip(params, grads, accumulators):
            new_a = a + g ** 2  # update accumulator
            self.updates.append((a, new_a))
            new_p = p - self.lr * g / T.sqrt(new_a + self.epsilon)
            self.updates.append((p, new_p))
        return self.updates
def get_config(self):
return {"name": self.__class__.__name__,
"lr": float(self.lr.get_value()),
"epsilon": self.epsilon}
class Adadelta(Optimizer):
'''
Reference: http://arxiv.org/abs/1212.5701
'''
def __init__(self, lr=0.1, rho=0.95, epsilon=1e-6, *args, **kwargs):
super(Adadelta, self).__init__(**kwargs)
self.__dict__.update(locals())
self.lr = shared_scalar(lr)
self.iterations = shared_scalar(0)
def get_updates(self, params, loss):
grads = self.get_gradients(loss, params)
accumulators = [shared_zeros(p.get_value().shape) for p in params]
delta_accumulators = [shared_zeros(p.get_value().shape) for p in params]
# self.updates = []
self.updates = [(self.iterations, self.iterations + 1.)]
for p, g, a, d_a in zip(params, grads, accumulators, delta_accumulators):
new_a = self.rho * a + (1 - self.rho) * g ** 2 # update accumulator
self.updates.append((a, new_a))
# use the new accumulator and the *old* delta_accumulator
update = g * T.sqrt(d_a + self.epsilon) / T.sqrt(new_a +
self.epsilon)
new_p = p - self.lr * update
self.updates.append((p, new_p))
# update delta_accumulator
new_d_a = self.rho * d_a + (1 - self.rho) * update ** 2
self.updates.append((d_a, new_d_a))
return self.updates
def get_config(self):
return {"name": self.__class__.__name__,
"lr": float(self.lr.get_value()),
"rho": self.rho,
"epsilon": self.epsilon}
class Adam(Optimizer): # new Adam is designed for our purpose.
'''
Reference: http://arxiv.org/abs/1412.6980v8
Default parameters follow those provided in the original paper.
We add Gaussian Noise to improve the performance.
'''
def __init__(self, lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8, save=False, rng=None, *args, **kwargs):
print('args=%s' % str(args))
print('kwargs=%s' % str(kwargs))
super(Adam, self).__init__(**kwargs)
self.__dict__.update(locals())
print(locals())
# if 'iterations' in kwargs:
# print('iterations=%s' % str(kwargs['iterations']))
# self.iterations = shared_scalar(kwargs['iterations'], name='iteration')
# else:
# print('iterations not set')
# self.iterations = shared_scalar(0, name='iteration')
self.iterations = shared_scalar(0, name='iteration')
self.lr = shared_scalar(lr, name='lr')
# self.rng = MRG_RandomStreams(use_cuda=True)
self.noise = []
self.forget = dict()
# self.rng = rng
self.beta_1 = beta_1
self.beta_2 = beta_2
self.epsilon = epsilon
self.add(self.iterations)
self.add(self.lr)
def add_noise(self, param):
if param.name not in self.noise:
logger.info('add gradient noise to {}'.format(param))
self.noise += [param.name]
def add_forget(self, param):
if param.name not in self.forget:
logger.info('add forgetting list to {}'.format(param))
self.forget[param.name] = theano.shared(param.get_value())
def get_updates(self, params, loss):
grads = self.get_gradients(loss, params)
self.updates = [(self.iterations, self.iterations + 1.)]
self.pu = []
t = self.iterations + 1
lr_t = self.lr * T.sqrt(1 - self.beta_2**t) / (1 - self.beta_1**t)
for p, g in zip(params, grads):
m = theano.shared(p.get_value() * 0., name=p.name + '_m') # zero init of moment
v = theano.shared(p.get_value() * 0., name=p.name + '_v') # zero init of velocity
self.add(m)
self.add(v)
# g_noise = self.rng.normal(g.shape, 0, T.sqrt(0.005 * t ** (-0.55)), dtype='float32')
# if p.name in self.noise:
# g_deviated = g + g_noise
# else:
# g_deviated = g
g_deviated = g # + g_noise
m_t = (self.beta_1 * m) + (1 - self.beta_1) * g_deviated
v_t = (self.beta_2 * v) + (1 - self.beta_2) * (g_deviated**2)
u_t = -lr_t * m_t / (T.sqrt(v_t) + self.epsilon)
p_t = p + u_t
# # memory reformatting!
# if p.name in self.forget:
# p_t = (1 - p_mem) * p_t + p_mem * self.forget[p.name]
# p_s = (1 - p_fgt) * p_t + p_fgt * self.forget[p.name]
# self.updates.append((self.forget[p.name], p_s))
self.updates.append((m, m_t))
self.updates.append((v, v_t))
self.updates.append((p, p_t)) # apply constraints
self.pu.append((p, p_t - p))
if self.save:
return self.updates, self.pu
return self.updates
def get_config(self):
# print(theano.tensor.cast(self.lr, dtype='float32').eval())
# print(int(theano.tensor.cast(self.iterations, dtype='int32').eval()))
config = {'lr': float(theano.tensor.cast(self.lr, dtype='float32').eval()),
'beta_1': float(self.beta_1),
'beta_2': float(self.beta_2),
'iterations': int(theano.tensor.cast(self.iterations, dtype='int32').eval()),
'noise': self.noise
}
base_config = super(Adam, self).get_config()
return_config = dict(list(base_config.items()) + list(config.items()))
print('Getting config of optimizer: \n\t\t %s' % str(return_config))
return return_config
# aliases
sgd = SGD
rmsprop = RMSprop
adagrad = Adagrad
adadelta = Adadelta
adam = Adam
def get(identifier, kwargs=None):
return get_from_module(identifier, globals(), 'optimizer', instantiate=True,
kwargs=kwargs)
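The Adam class above folds the bias correction into the step size `lr_t` rather than correcting `m_t` and `v_t` directly. A single-step NumPy sketch of that update rule (names illustrative, not part of the repository):

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=1e-4, beta_1=0.9, beta_2=0.999, eps=1e-8):
    # Bias correction folded into the step size, as in the Theano code:
    # lr_t = lr * sqrt(1 - beta_2^t) / (1 - beta_1^t)
    lr_t = lr * np.sqrt(1 - beta_2 ** t) / (1 - beta_1 ** t)
    m_t = beta_1 * m + (1 - beta_1) * g           # first-moment estimate
    v_t = beta_2 * v + (1 - beta_2) * g ** 2      # second-moment estimate
    p_t = p - lr_t * m_t / (np.sqrt(v_t) + eps)   # parameter update
    return p_t, m_t, v_t

p, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
p, m, v = adam_step(p, np.array([0.5]), m, v, t=1)
# On the first step the update magnitude is ~lr regardless of g's scale,
# so p ends up at roughly 1.0 - 1e-4.
```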
================================================
FILE: emolga/dataset/__init__.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Python File Template
"""
import os
__author__ = "Rui Meng"
__email__ = "rui.meng@pitt.edu"
if __name__ == '__main__':
pass
================================================
FILE: emolga/dataset/build_dataset.py
================================================
import json
__author__ = 'jiataogu'
import numpy as np
import numpy.random as rng
try:
    import cPickle as pickle
except ImportError:  # Python 3
    import pickle
import pprint
import sys
import hickle
from collections import OrderedDict
from fuel import datasets
from fuel import transformers
from fuel import schemes
from fuel import streams
def serialize_to_file_json(obj, path):
    f = open(path, 'w')
    json.dump(obj, f)
    f.close()
def serialize_to_file_hdf5(obj, path):
    f = open(path, 'w')
    hickle.dump(obj, f)
    f.close()
def serialize_to_file(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
print('serialize to %s' % path)
f = open(path, 'wb')
pickle.dump(obj, f, protocol=protocol)
f.close()
def show_txt(array, path):
f = open(path, 'w')
for line in array:
f.write(' '.join(line) + '\n')
f.close()
def divide_dataset(dataset, test_size, max_size):
train_set = dict()
test_set = dict()
for w in dataset:
train_set[w] = dataset[w][test_size:max_size].astype('int32')
test_set[w] = dataset[w][:test_size].astype('int32')
return train_set, test_set
def deserialize_from_file_json(path):
f = open(path, 'r')
obj = json.load(f)
f.close()
return obj
def deserialize_from_file_hdf5(path):
f = open(path, 'r')
obj = hickle.load(f)
f.close()
return obj
def deserialize_from_file(path):
f = open(path, 'rb')
obj = pickle.load(f)
f.close()
return obj
def build_fuel(data):
# create fuel dataset.
dataset = datasets.IndexableDataset(indexables=OrderedDict([('data', data)]))
dataset.example_iteration_scheme \
= schemes.ShuffledExampleScheme(dataset.num_examples)
return dataset, len(data)
def obtain_stream(dataset, batch_size, size=1):
if size == 1:
data_stream = dataset.get_example_stream()
data_stream = transformers.Batch(data_stream, iteration_scheme=schemes.ConstantScheme(batch_size))
# add padding and masks to the dataset
        data_stream = transformers.Padding(data_stream, mask_sources=('data',))
return data_stream
else:
data_streams = [dataset.get_example_stream() for _ in range(size)]
data_streams = [transformers.Batch(data_stream, iteration_scheme=schemes.ConstantScheme(batch_size))
for data_stream in data_streams]
        data_streams = [transformers.Padding(data_stream, mask_sources=('data',)) for data_stream in data_streams]
return data_streams
def build_ptb():
path = './ptbcorpus/'
print(path)
# make the dataset and vocabulary
X_train = [l.split() for l in open(path + 'ptb.train.txt').readlines()]
X_test = [l.split() for l in open(path + 'ptb.test.txt').readlines()]
X_valid = [l.split() for l in open(path + 'ptb.valid.txt').readlines()]
X = X_train + X_test + X_valid
idx2word = dict(enumerate(set([w for l in X for w in l]), 1))
idx2word[0] = '<eol>'
word2idx = {v: k for k, v in idx2word.items()}
ixwords_train = [[word2idx[w] for w in l] for l in X_train]
ixwords_test = [[word2idx[w] for w in l] for l in X_test]
ixwords_valid = [[word2idx[w] for w in l] for l in X_valid]
ixwords_tv = [[word2idx[w] for w in l] for l in (X_train + X_valid)]
max_len = max([len(w) for w in X_train])
print(max_len)
# serialization:
# serialize_to_file(ixwords_train, path + 'data_train.pkl')
# serialize_to_file(ixwords_test, path + 'data_test.pkl')
# serialize_to_file(ixwords_valid, path + 'data_valid.pkl')
# serialize_to_file(ixwords_tv, path + 'data_tv.pkl')
# serialize_to_file([idx2word, word2idx], path + 'voc.pkl')
# show_txt(X, 'data.txt')
print('save done.')
def filter_unk(X, min_freq=5):
voc = dict()
for l in X:
for w in l:
if w not in voc:
voc[w] = 1
else:
voc[w] += 1
word2idx = dict()
word2idx['<eol>'] = 0
id2word = dict()
id2word[0] = '<eol>'
at = 1
for w in voc:
if voc[w] > min_freq:
word2idx[w] = at
id2word[at] = w
at += 1
word2idx['<unk>'] = at
id2word[at] = '<unk>'
return word2idx, id2word
def build_msr():
# path = '/home/thoma/Work/Dial-DRL/dataset/MSRSCC/'
path = '/Users/jiataogu/Work/Dial-DRL/dataset/MSRSCC/'
print(path)
X = [l.split() for l in open(path + 'train.txt').readlines()]
word2idx, idx2word = filter_unk(X, min_freq=5)
print('vocabulary size={0}. {1} samples'.format(len(word2idx), len(X)))
mean_len = np.mean([len(w) for w in X])
print('mean len = {}'.format(mean_len))
ixwords = [[word2idx[w]
if w in word2idx
else word2idx['<unk>']
for w in l] for l in X]
print(ixwords[0])
# serialization:
serialize_to_file(ixwords, path + 'data_train.pkl')
if __name__ == '__main__':
build_msr()
# build_ptb()
# build_dataset()
# game = GuessOrder(size=8)
# q = 'Is there any number smaller de than 6 in the last 3 numbers ?'
# print(game.easy_parse(q))
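`filter_unk` above keeps only words seen more than `min_freq` times, reserving index 0 for `<eol>` and the last index for `<unk>`. A pure-Python sketch of the same vocabulary scheme (illustrative only, not part of the repository):

```python
from collections import Counter

def build_vocab(corpus, min_freq=1):
    # corpus: a list of token lists. Index 0 is reserved for <eol>,
    # the last index for <unk>; words seen more than min_freq times
    # get their own index, matching filter_unk's scheme.
    counts = Counter(w for sent in corpus for w in sent)
    word2idx = {'<eol>': 0}
    for w, c in counts.items():
        if c > min_freq:
            word2idx[w] = len(word2idx)
    word2idx['<unk>'] = len(word2idx)
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

corpus = [['a', 'b', 'a'], ['a', 'c']]
w2i, i2w = build_vocab(corpus, min_freq=1)
# 'a' occurs 3 times (> 1) and is kept; 'b' and 'c' fall back to <unk>:
ids = [w2i.get(w, w2i['<unk>']) for w in corpus[0]]  # -> [1, 2, 1]
```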
================================================
FILE: emolga/layers/__init__.py
================================================
__author__ = 'yinpengcheng'
================================================
FILE: emolga/layers/attention.py
================================================
__author__ = 'jiataogu'
from .core import *
"""
Attention Model.
<::: Two kinds of attention models ::::>
-- Linear Transformation
-- Inner Product
"""
class Attention(Layer):
def __init__(self, target_dim, source_dim, hidden_dim,
init='glorot_uniform', name='attention',
coverage=False, max_len=50,
shared=False):
super(Attention, self).__init__()
self.init = initializations.get(init)
self.softmax = activations.get('softmax')
self.tanh = activations.get('tanh')
self.target_dim = target_dim
self.source_dim = source_dim
self.hidden_dim = hidden_dim
self.max_len = max_len
self.coverage = coverage
if coverage:
print('Use Coverage Trick!')
self.Wa = self.init((self.target_dim, self.hidden_dim))
self.Ua = self.init((self.source_dim, self.hidden_dim))
self.va = self.init((self.hidden_dim, 1))
self.Wa.name, self.Ua.name, self.va.name = \
'{}_Wa'.format(name), '{}_Ua'.format(name), '{}_va'.format(name)
self.params = [self.Wa, self.Ua, self.va]
if coverage:
self.Ca = self.init((1, self.hidden_dim))
self.Ca.name = '{}_Ca'.format(name)
self.params += [self.Ca]
def __call__(self, X, S,
Smask=None,
return_log=False,
Cov=None):
assert X.ndim + 1 == S.ndim, 'source should be one more dimension than target.'
# X is the decoder representation of t-1: (nb_samples, hidden_dims)
# S is the context vector, hidden representation of source text: (nb_samples, maxlen_s, context_dim)
        # Smask: mask showing which source positions are valid (1) vs. padding (0): (nb_samples, maxlen_s)
# Cov is the coverage vector (nb_samples, maxlen_s)
if X.ndim == 1:
X = X[None, :]
S = S[None, :, :]
            if Smask is not None:
                Smask = Smask[None, :]
Eng = dot(X[:, None, :], self.Wa) + dot(S, self.Ua) # Concat Attention by Bahdanau et al. 2015 (nb_samples, source_num, hidden_dims)
Eng = self.tanh(Eng)
# location aware by adding previous coverage information, let model learn how to handle coverage
if self.coverage:
Eng += dot(Cov[:, :, None], self.Ca) # (nb_samples, source_num, hidden_dims)
Eng = dot(Eng, self.va)
Eng = Eng[:, :, 0] # 3rd dim is 1, discard it (nb_samples, source_num)
if Smask is not None:
# I want to use mask!
EngSum = logSumExp(Eng, axis=1, mask=Smask)
if return_log:
return (Eng - EngSum) * Smask
else:
return T.exp(Eng - EngSum) * Smask
else:
if return_log:
return T.log(self.softmax(Eng))
else:
return self.softmax(Eng)
class CosineAttention(Layer):
def __init__(self, target_dim, source_dim,
init='glorot_uniform',
use_pipe=True,
name='attention'):
super(CosineAttention, self).__init__()
self.init = initializations.get(init)
self.softmax = activations.get('softmax')
self.softplus = activations.get('softplus')
self.tanh = activations.get('tanh')
self.use_pipe = use_pipe
self.target_dim = target_dim
self.source_dim = source_dim
# pipe
if self.use_pipe:
self.W_key = Dense(self.target_dim, self.source_dim, name='W_key')
else:
assert target_dim == source_dim
self.W_key = Identity(name='W_key')
self._add(self.W_key)
# sharpen
# self.W_beta = Dense(self.target_dim, 1, name='W_beta')
# dio-sharpen
# self.W_beta = Dense(self.target_dim, self.source_dim, name='W_beta')
# self._add(self.W_beta)
# self.gamma = self.init((source_dim, ))
# self.gamma = self.init((target_dim, source_dim))
# self.gamma.name = 'o_gamma'
# self.params += [self.gamma]
def __call__(self, X, S, Smask=None, return_log=False):
assert X.ndim + 1 == S.ndim, 'source should be one more dimension than target.'
if X.ndim == 1:
X = X[None, :]
S = S[None, :, :]
            if Smask is not None:
                Smask = Smask[None, :]
key = self.W_key(X) # (nb_samples, source_dim)
# beta = self.softplus(self.W_beta(X)) # (nb_samples, source_dim)
Eng = dot_2d(key, S) #, g=self.gamma)
# Eng = cosine_sim2d(key, S) # (nb_samples, source_num)
# Eng = T.repeat(beta, Eng.shape[1], axis=1) * Eng
if Smask is not None:
# I want to use mask!
EngSum = logSumExp(Eng, axis=1, mask=Smask)
if return_log:
return (Eng - EngSum) * Smask
else:
return T.exp(Eng - EngSum) * Smask
else:
if return_log:
return T.log(self.softmax(Eng))
else:
return self.softmax(Eng)
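The `Attention` layer above scores each source position with `va^T tanh(X Wa + S Ua)` and normalizes with a masked log-sum-exp so padded positions receive zero weight. A single-example NumPy sketch (shapes and names illustrative, not part of the repository):

```python
import numpy as np

def masked_attention(x, S, mask, Wa, Ua, va):
    # x: (d_t,) decoder state; S: (L, d_s) source states; mask: (L,) 0/1.
    eng = np.tanh(x @ Wa + S @ Ua) @ va      # (L,) unnormalized scores
    eng = np.where(mask > 0, eng, -np.inf)   # drop padded positions
    eng -= eng.max()                         # stabilized softmax
    w = np.exp(eng) * mask
    return w / w.sum()

rng = np.random.default_rng(0)
d_t, d_s, d_h, L = 4, 5, 6, 3
Wa = rng.normal(size=(d_t, d_h))
Ua = rng.normal(size=(d_s, d_h))
va = rng.normal(size=d_h)
alpha = masked_attention(rng.normal(size=d_t), rng.normal(size=(L, d_s)),
                         np.array([1.0, 1.0, 0.0]), Wa, Ua, va)
# alpha sums to 1 and alpha[2] == 0 because position 2 is masked out.
```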
================================================
FILE: emolga/layers/core.py
================================================
# -*- coding: utf-8 -*-
from emolga.utils.theano_utils import *
import emolga.basic.initializations as initializations
import emolga.basic.activations as activations
class Layer(object):
def __init__(self):
self.params = []
self.layers = []
self.monitor = {}
self.watchlist = []
def init_updates(self):
self.updates = []
def _monitoring(self):
# add monitoring variables
for l in self.layers:
for v in l.monitor:
name = v + '@' + l.name
print(name)
self.monitor[name] = l.monitor[v]
def __call__(self, X, *args, **kwargs):
return X
def _add(self, layer):
if layer:
self.layers.append(layer)
self.params += layer.params
def supports_masked_input(self):
        ''' Whether or not this layer respects the output mask of its previous layer in its calculations. If you try
        to attach a layer that does *not* support masked_input to a layer that gives a non-None output_mask(), that is
        an error.'''
return False
def get_output_mask(self, train=None):
'''
For some models (such as RNNs) you want a way of being able to mark some output data-points as
"masked", so they are not used in future calculations. In such a model, get_output_mask() should return a mask
of one less dimension than get_output() (so if get_output is (nb_samples, nb_timesteps, nb_dimensions), then the mask
        is (nb_samples, nb_timesteps)), with a one for every unmasked datapoint, and a zero for every masked one.
If there is *no* masking then it shall return None. For instance if you attach an Activation layer (they support masking)
to a layer with an output_mask, then that Activation shall also have an output_mask. If you attach it to a layer with no
such mask, then the Activation's get_output_mask shall return None.
Some emolga have an output_mask even if their input is unmasked, notably Embedding which can turn the entry "0" into
a mask.
'''
return None
def set_weights(self, weights):
for p, w in zip(self.params, weights):
if p.eval().shape != w.shape:
raise Exception("Layer shape %s not compatible with weight shape %s." % (p.eval().shape, w.shape))
p.set_value(floatX(w))
def get_weights(self):
weights = []
for p in self.params:
weights.append(p.get_value())
return weights
def get_params(self):
return self.params
def set_name(self, name):
for i in range(len(self.params)):
if self.params[i].name is None:
self.params[i].name = '%s_p%d' % (name, i)
else:
self.params[i].name = name + '_' + self.params[i].name
self.name = name
class MaskedLayer(Layer):
'''
If your layer trivially supports masking (by simply copying the input mask to the output), then subclass MaskedLayer
instead of Layer, and make sure that you incorporate the input mask into your calculation of get_output()
'''
def supports_masked_input(self):
return True
class Identity(Layer):
def __init__(self, name='Identity'):
super(Identity, self).__init__()
if name is not None:
self.set_name(name)
def __call__(self, X):
return X
class Dense(Layer):
def __init__(self, input_dim, output_dim, init='glorot_uniform', activation='tanh', name='Dense',
learn_bias=True, negative_bias=False):
super(Dense, self).__init__()
self.init = initializations.get(init)
self.activation = activations.get(activation)
self.input_dim = input_dim
self.output_dim = output_dim
self.linear = (activation == 'linear')
# self.input = T.matrix()
self.W = self.init((self.input_dim, self.output_dim))
if not negative_bias:
self.b = shared_zeros((self.output_dim))
else:
self.b = shared_ones((self.output_dim))
self.learn_bias = learn_bias
if self.learn_bias:
self.params = [self.W, self.b]
else:
self.params = [self.W]
if name is not None:
self.set_name(name)
def set_name(self, name):
self.W.name = '%s_W' % name
self.b.name = '%s_b' % name
def __call__(self, X):
# output = self.activation(T.dot(X, self.W) + 4. * self.b) # why with a 4.0 here? change to 1
output = self.activation(T.dot(X, self.W) + self.b)
return output
def reverse(self, Y):
assert self.linear
output = T.dot((Y - self.b), self.W.T)
return output
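Dense computes activation(X.W + b), and reverse() inverts a linear Dense by multiplying with W transposed after subtracting the bias. A pure-Python single-sample sketch (toy weights for illustration, not the layer's glorot initialization):

```python
import math

# toy 2-in, 1-out Dense with tanh activation
W = [[0.5], [-0.25]]   # shape (input_dim, output_dim)
b = [0.1]

def dense(x, W, b):
    """activation(x . W + b) for one sample, mirroring Dense.__call__."""
    out = [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
           for j in range(len(b))]
    return [math.tanh(o) for o in out]

y = dense([1.0, 2.0], W, b)   # tanh(1*0.5 + 2*(-0.25) + 0.1) = tanh(0.1)
```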
class Dense2(Layer):
def __init__(self, input_dim1, input_dim2, output_dim, init='glorot_uniform', activation='tanh', name='Dense', learn_bias=True):
super(Dense2, self).__init__()
self.init = initializations.get(init)
self.activation = activations.get(activation)
self.input_dim1 = input_dim1
self.input_dim2 = input_dim2
self.output_dim = output_dim
self.linear = (activation == 'linear')
# self.input = T.matrix()
self.W1 = self.init((self.input_dim1, self.output_dim))
self.W2 = self.init((self.input_dim2, self.output_dim))
self.b = shared_zeros((self.output_dim))
self.learn_bias = learn_bias
if self.learn_bias:
self.params = [self.W1, self.W2, self.b]
else:
self.params = [self.W1, self.W2]
if name is not None:
self.set_name(name)
def set_name(self, name):
self.W1.name = '%s_W1' % name
self.W2.name = '%s_W2' % name
self.b.name = '%s_b' % name
def __call__(self, X1, X2):
output = self.activation(T.dot(X1, self.W1) + T.dot(X2, self.W2) + self.b)
return output
class Constant(Layer):
def __init__(self, input_dim, output_dim, init=None, activation='tanh', name='Bias'):
super(Constant, self).__init__()
assert input_dim == output_dim, 'Bias Layer needs to have the same input/output nodes.'
self.init = initializations.get(init)
self.activation = activations.get(activation)
self.input_dim = input_dim
self.output_dim = output_dim
self.b = shared_zeros(self.output_dim)
self.params = [self.b]
if name is not None:
self.set_name(name)
def set_name(self, name):
self.b.name = '%s_b' % name
def __call__(self, X=None):
output = self.activation(self.b)
if X is not None:  # truth-testing a symbolic Theano variable raises; compare to None instead
L = X.shape[0]
output = T.extra_ops.repeat(output[None, :], L, axis=0)
return output
class MemoryLinear(Layer):
def __init__(self, input_dim, input_wdth, init='glorot_uniform',
activation='tanh', name='Bias', has_input=True):
super(MemoryLinear, self).__init__()
self.init = initializations.get(init)
self.activation = activations.get(activation)
self.input_dim = input_dim
self.input_wdth = input_wdth
self.b = self.init((self.input_dim, self.input_wdth))
self.params = [self.b]
if has_input:
self.P = self.init((self.input_dim, self.input_wdth))
self.params += [self.P]
if name is not None:
self.set_name(name)
def __call__(self, X=None):
out = self.b[None, :, :]
if X is not None:  # truth-testing a symbolic Theano variable raises; compare to None instead
out += self.P[None, :, :] * X
return self.activation(out)
class Dropout(MaskedLayer):
"""
Hinton's dropout.
"""
def __init__(self, rng=None, p=1., name=None):
super(Dropout, self).__init__()
self.p = p
self.rng = rng
def __call__(self, X, train=True):
if self.p > 0.:
retain_prob = 1. - self.p
if train:
X *= self.rng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX)
else:
X *= retain_prob
return X
class Activation(MaskedLayer):
"""
Apply an activation function to an output.
"""
def __init__(self, activation):
super(Activation, self).__init__()
self.activation = activations.get(activation)
def __call__(self, X):
return self.activation(X)
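The Dropout layer above uses the classical (non-inverted) scheme: during training units are multiplied by a sampled 0/1 mask, and at test time activations are scaled by the retain probability so expectations match. A pure-Python sketch (the mask is passed in for determinism here; the real layer samples it from a Theano RNG):

```python
def dropout(xs, p, train, mask=None):
    """Classical (non-inverted) dropout, as in the Dropout layer above.
    train=True: multiply by a 0/1 mask (supplied here for determinism).
    train=False: scale every activation by the retain probability 1 - p."""
    retain = 1.0 - p
    if train:
        return [x * m for x, m in zip(xs, mask)]
    return [x * retain for x in xs]

train_out = dropout([2.0, 4.0, 6.0], p=0.5, train=True, mask=[1, 0, 1])
test_out = dropout([2.0, 4.0, 6.0], p=0.5, train=False)
```

Note that inverted dropout (scaling by 1/retain at training time) is the more common modern variant; this codebase scales at test time instead.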
================================================
FILE: emolga/layers/embeddings.py
================================================
# -*- coding: utf-8 -*-
from .core import Layer
from emolga.utils.theano_utils import *
import emolga.basic.initializations as initializations
class Embedding(Layer):
'''
Turn positive integers (indexes) into dense vectors of fixed size.
e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
@input_dim: size of vocabulary (highest input integer + 1)
@output_dim: size of dense representation
'''
def __init__(self, input_dim, output_dim, init='uniform', name=None):
super(Embedding, self).__init__()
self.init = initializations.get(init)
self.input_dim = input_dim
self.output_dim = output_dim
self.W = self.init((self.input_dim, self.output_dim))
self.params = [self.W]
if name is not None:
self.set_name(name)
def get_output_mask(self, X):
'''
T.ones_like(X): return an array of ones with the shape and type of the input.
T.eq(X, 0): X == 0?
1 - T.eq(X, 0): X != 0?
:return: an array indicating which entries of X are nonzero
'''
return T.ones_like(X) * (1 - T.eq(X, 0))
def __call__(self, X, mask_zero=False, context=None):
'''
return the embedding of X
:param X: a set of words, all the X have same length due to padding
shape=[nb_sample, max_len]
:param mask_zero: whether return the mask of X, a list of [0,1] showing which x!=0
:param context:
:return
emb_X: embedding of X, shape = [nb_sample, max_len, emb_dim]
X_mask: mask of X, shape=[nb_sample, max_len]
'''
if context is None:
out = self.W[X]
else:
assert context.ndim == 3
flag = False
if X.ndim == 1:
flag = True
X = X[:, None]
b_size = context.shape[0]
EMB = T.repeat(self.W[None, :, :], b_size, axis=0)
EMB = T.concatenate([EMB, context], axis=1)
m_size = EMB.shape[1]
e_size = EMB.shape[2]
maxlen = X.shape[1]
EMB = EMB.reshape((b_size * m_size, e_size))
Z = (T.arange(b_size)[:, None] * m_size + X).reshape((b_size * maxlen,))
out = EMB[Z] # (b_size * maxlen, e_size)
if not flag:
out = out.reshape((b_size, maxlen, e_size))
else:
out = out.reshape((b_size, e_size))
if mask_zero:
return out, T.cast(self.get_output_mask(X), dtype='float32')
else:
return out
class Zero(Layer):
def __call__(self, X):
out = T.zeros(X.shape)
return out
class Bias(Layer):
def __call__(self, X):
tmp = X.flatten()
tmp = tmp.dimshuffle(0, 'x')
return tmp
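The Embedding layer maps padded index sequences to vectors and, with mask_zero=True, also returns a mask that is 0 exactly where the input index is 0 (the padding symbol). A pure-Python sketch of that lookup-plus-mask behaviour (toy weights; the real layer indexes a learned Theano matrix):

```python
# toy 4-word vocabulary, 2-dim embeddings; row 0 is the padding entry
W = [[0.0, 0.0], [0.25, 0.1], [0.6, -0.2], [0.3, 0.9]]

def embed(batch, W):
    """Return (embeddings, mask); mask[i][j] == 0 iff batch[i][j] == 0,
    mirroring Embedding.__call__ with mask_zero=True."""
    emb = [[W[idx] for idx in seq] for seq in batch]
    mask = [[0.0 if idx == 0 else 1.0 for idx in seq] for seq in batch]
    return emb, mask

emb, mask = embed([[1, 2, 0], [3, 0, 0]], W)
```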
================================================
FILE: emolga/layers/gridlstm.py
================================================
__author__ = 'jiataogu'
"""
The file is the implementation of Grid-LSTM
In this stage we only support 2D LSTM with Pooling.
"""
from .recurrent import *
from .attention import Attention
import logging
import copy
logger = logging.getLogger(__name__)
class Grid(Recurrent):
"""
Grid Cell for Grid-LSTM
===================================================
LSTM
[h', m'] = LSTM(x, h, m):
gi = sigmoid(Wi * x + Ui * h + Vi * m) # Vi is peep-hole
gf = sigmoid(Wf * x + Uf * h + Vf * m)
go = sigmoid(Wo * x + Uo * h + Vo * m)
gc = tanh(Wc * x +Uc * h)
m' = gf @ m + gi @ gc (@ denotes element-wise multiplication.)
h' = go @ tanh(m')
===================================================
Grid
(here is an example for 2D Grid LSTM with priority dimension = 1)
-------------
| c' d' | Grid Block and Grid Updates.
| a a'|
| | [d' c'] = LSTM_d([b, d], c)
| b b'| [a' b'] = LSTM_t([b, d'], a)
| c d |
-------------
===================================================
For details, please refer to:
"Grid Long Short-Term Memory", http://arxiv.org/abs/1507.01526
"""
def __init__(self,
output_dims,
input_dims, # [0, ... 0], 0 represents no external inputs.
priority=1,
peephole=True,
init='glorot_uniform', inner_init='orthogonal',
forget_bias_init='one',
activation='tanh', inner_activation='sigmoid',
use_input=False,
name=None, weights=None,
identity_connect=None
):
super(Grid, self).__init__()
assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM'
assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.'
"""
Initialization.
"""
self.input_dims = input_dims
self.output_dims = output_dims
self.N = len(output_dims)
self.priority = priority
self.peephole = peephole
self.use_input = use_input
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.forget_bias_init = initializations.get(forget_bias_init)
self.activation = activations.get(activation)
self.inner_activation = activations.get(inner_activation)
self.identity_connect = identity_connect
self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now!
"""
Others info.
"""
if weights is not None:
self.set_weights(weights)
if name is not None:
self.set_name(name)
def build(self):
"""
Build the model weights
"""
logger.info("Building GridPool-LSTM !!")
self.W = dict()
self.U = dict()
self.V = dict()
self.b = dict()
# ******************************************************************************************
for k in xrange(self.N): # N-Grids (for 2 dimensions, 0 is for time; 1 is for depth.)
axis = self.axies[k]
# input layers:
if self.input_dims[k] > 0 and self.use_input:
# use the data information.
self.W[axis + '#i'], self.W[axis + '#f'], \
self.W[axis + '#o'], self.W[axis + '#c'] \
= [self.init((self.input_dims[k], self.output_dims[k])) for _ in xrange(4)]
# hidden layers:
for j in xrange(self.N): # every hidden states inputs.
pos = self.axies[j]
if k == j:
self.U[axis + pos + '#i'], self.U[axis + pos + '#f'], \
self.U[axis + pos + '#o'], self.U[axis + pos + '#c'] \
= [self.inner_init((self.output_dims[j], self.output_dims[k])) for _ in xrange(4)]
else:
self.U[axis + pos + '#i'], self.U[axis + pos + '#f'], \
self.U[axis + pos + '#o'], self.U[axis + pos + '#c'] \
= [self.init((self.output_dims[j], self.output_dims[k])) for _ in xrange(4)]
# bias layers:
self.b[axis + '#i'], self.b[axis + '#o'], self.b[axis + '#c'] \
= [shared_zeros(self.output_dims[k]) for _ in xrange(3)]
self.b[axis + '#f'] = self.forget_bias_init(self.output_dims[k])
# peep-hole layers:
if self.peephole:
self.V[axis + '#i'], self.V[axis + '#f'], self.V[axis + '#o'] \
= [self.init(self.output_dims[k]) for _ in xrange(3)]
# *****************************************************************************************
# set names for these weights
for A, n in zip([self.W, self.U, self.b, self.V], ['W', 'U', 'b', 'V']):
for w in A:
A[w].name = n + '_' + w
# set parameters
self.params = [self.W[s] for s in self.W] + \
[self.U[s] for s in self.U] + \
[self.b[s] for s in self.b] + \
[self.V[s] for s in self.V]
def lstm_(self, k, H, m, x, identity=False):
"""
LSTM
[h', m'] = LSTM(x, h, m):
gi = sigmoid(Wi * x + Ui * h + Vi * m) # Vi is peep-hole
gf = sigmoid(Wf * x + Uf * h + Vf * m)
go = sigmoid(Wo * x + Uo * h + Vo * m)
gc = tanh(Wc * x +Uc * h)
m' = gf @ m + gi @ gc (@ denotes element-wise multiplication.)
h' = go @ tanh(m')
"""
assert len(H) == self.N, 'we have to use all the hidden states in Grid LSTM'
axis = self.axies[k]
# *************************************************************************
# bias energy
ei, ef, eo, ec = [self.b[axis + p] for p in ['#i', '#f', '#o', '#c']]
# hidden energy
for j in xrange(self.N):
pos = self.axies[j]
ei += T.dot(H[j], self.U[axis + pos + '#i'])
ef += T.dot(H[j], self.U[axis + pos + '#f'])
eo += T.dot(H[j], self.U[axis + pos + '#o'])
ec += T.dot(H[j], self.U[axis + pos + '#c'])
# input energy (if any)
if self.input_dims[k] > 0 and self.use_input:
ei += T.dot(x, self.W[axis + '#i'])
ef += T.dot(x, self.W[axis + '#f'])
eo += T.dot(x, self.W[axis + '#o'])
ec += T.dot(x, self.W[axis + '#c'])
# peep-hole connections
if self.peephole:
ei += m * self.V[axis + '#i'][None, :]
ef += m * self.V[axis + '#f'][None, :]
eo += m * self.V[axis + '#o'][None, :]
# *************************************************************************
# compute the gates.
i = self.inner_activation(ei)
f = self.inner_activation(ef)
o = self.inner_activation(eo)
c = self.activation(ec)
# update the memory and hidden states.
m_new = f * m + i * c
h_new = o * self.activation(m_new)
return h_new, m_new
def grid_(self,
hs_i,
ms_i,
xs_i,
priority=1,
identity=None):
"""
===================================================
Grid (2D as an example)
-------------
| c' d' | Grid Block and Grid Updates.
| a a'|
| | [d' c'] = LSTM_d([b, d], c)
| b b'| [a' b'] = LSTM_t([b, d'], a) priority
| c d |
-------------
a = my | b = hy | c = mx | d = hx
===================================================
Currently masking is not considered in GridLSTM.
"""
# compute LSTM updates for non-priority dimensions
H_new = hs_i
M_new = ms_i
for k in xrange(self.N):
if k == priority:
continue
m = ms_i[k]
x = xs_i[k]
H_new[k], M_new[k] \
= self.lstm_(k, hs_i, m, x)
if identity is not None:
if identity[k]:
H_new[k] += hs_i[k]
# compute LSTM updates along the priority dimension
if priority >= 0:
hs_ii = H_new
H_new[priority], M_new[priority] \
= self.lstm_(priority, hs_ii, ms_i[priority], xs_i[priority])
if identity is not None:
if identity[priority]:
H_new[priority] += hs_ii[priority]
return H_new, M_new
class SequentialGridLSTM(Grid):
"""
For details, please refer to:
"Grid Long Short-Term Memory",
http://arxiv.org/abs/1507.01526
SequentialGridLSTM is a typical 2D-GridLSTM,
which has one flexible dimension (time) and one fixed dimension (depth).
Input information is added along the x-axis.
def __init__(self,
# parameters for Grid.
output_dims,
input_dims, # [0, ... 0], 0 represents no external inputs.
priority=1,
peephole=True,
init='glorot_uniform', inner_init='orthogonal',
forget_bias_init='one',
activation='tanh', inner_activation='sigmoid',
use_input=False,
name=None, weights=None,
identity_connect=None,
# parameters for 2D-GridLSTM
depth=5,
learn_init=False,
pooling=True,
attention=False,
shared=True,
dropout=0,
rng=None,
):
super(Grid, self).__init__()
assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM'
assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.'
assert input_dims[1] == 0, 'we have no y-axis inputs here.'
assert shared, 'we share the weights in this stage.'
assert not (attention and pooling), 'attention and pooling cannot be set at the same time.'
"""
Initialization.
"""
logger.info(":::: Sequential Grid-Pool LSTM ::::")
self.input_dims = input_dims
self.output_dims = output_dims
self.N = len(output_dims)
self.depth = depth
self.dropout = dropout
self.priority = priority
self.peephole = peephole
self.use_input = use_input
self.pooling = pooling
self.attention = attention
self.learn_init = learn_init
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.forget_bias_init = initializations.get(forget_bias_init)
self.activation = activations.get(activation)
self.relu = activations.get('relu')
self.inner_activation = activations.get(inner_activation)
self.identity_connect = identity_connect
self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now!
if self.identity_connect is not None:
logger.info('Identity Connection: {}'.format(self.identity_connect))
"""
Build the model weights.
"""
# build the centroid grid.
self.build()
# input projection layer (projected to time-axis) [x]
self.Ph = Dense(input_dims[0], output_dims[0], name='Ph')
self.Pm = Dense(input_dims[0], output_dims[0], name='Pm')
self._add(self.Ph)
self._add(self.Pm)
# learn init for depth-axis hidden states/memory cells. [y]
if self.learn_init:
self.M0 = self.init((depth, output_dims[1]))
if self.pooling:
self.H0 = self.init(output_dims[1])
else:
self.H0 = self.init((depth, output_dims[1]))
self.M0.name, self.H0.name = 'M0', 'H0'
self.params += [self.M0, self.H0]
# if we use attention instead of max-pooling
if self.pooling:
self.PP = Dense(output_dims[1] + input_dims[0], output_dims[1], # init='orthogonal',
name='PP', activation='linear')
self._add(self.PP)
if self.attention:
self.A = Attention(target_dim=input_dims[0],
source_dim=output_dims[1],
hidden_dim=200, name='attender')
self._add(self.A)
# if self.dropout > 0:
# logger.info(">>>>>> USE DropOut !! <<<<<<")
# self.D = Dropout(rng=rng, p=self.dropout, name='Dropout')
"""
Others info.
"""
if weights is not None:
self.set_weights(weights)
if name is not None:
self.set_name(name)
def _step(self, *args):
# since depth is not determined, we cannot decide the number of inputs
# for one time step.
# if pooling is True:
# args = [raw_input] + (sequence)
# [hy] + [my]*depth (output_info)
#
inputs = args[0]
Hy_tm1 = [args[k] for k in range(1, 1 + self.depth)]
My_tm1 = [args[k] for k in range(1 + self.depth, 1 + 2 * self.depth)]
# x_axis input projection (get hx_t, mx_t)
hx_t = self.Ph(inputs) # (nb_samples, output_dim0)
mx_t = self.Pm(inputs) # (nb_samples, output_dim0)
# build computation path from bottom to top.
Hx_t = [hx_t]
Mx_t = [mx_t]
Hy_t = []
My_t = []
for d in xrange(self.depth):
hs_i = [Hx_t[-1], Hy_tm1[d]]
ms_i = [Mx_t[-1], My_tm1[d]]
xs_i = [inputs, T.zeros_like(inputs)]
hs_o, ms_o = self.grid_(hs_i, ms_i, xs_i, priority=self.priority, identity=self.identity_connect)
Hx_t += [hs_o[0]]
Hy_t += [hs_o[1]]
Mx_t += [ms_o[0]]
My_t += [ms_o[1]]
hx_out = Hx_t[-1]
mx_out = Mx_t[-1]
# get the output (output_y, output_x)
# MAX-Pooling
if self.pooling:
# hy_t = T.max([self.PP(hy) for hy in Hy_t], axis=0)
hy_t = T.max([self.PP(T.concatenate([hy, inputs], axis=-1)) for hy in Hy_t], axis=0)
Hy_t = [hy_t] * self.depth
if self.attention:
HHy_t = T.concatenate([hy[:, None, :] for hy in Hy_t], axis=1) # (nb_samples, n_depth, out_dim1)
annotation = self.A(inputs, HHy_t) # (nb_samples, n_depth)
hy_t = T.sum(HHy_t * annotation[:, :, None], axis=1) # (nb_samples, out_dim1)
Hy_t = [hy_t] * self.depth
R = Hy_t + My_t + [hx_out, mx_out]
return tuple(R)
def __call__(self, X, init_H=None, init_M=None,
return_sequence=False, one_step=False,
return_info='hy', train=True):
# training/testing flag
self.train = train
# masking is currently not supported.
if X.ndim == 2:
X = X[:, None, :]
# one step
if one_step:
assert init_H is not None, 'previous state must be provided!'
assert init_M is not None, 'previous cell must be provided!'
X = X.dimshuffle((1, 0, 2))
if init_H is None:
if self.learn_init:
init_m = T.repeat(self.M0[:, None, :], X.shape[1], axis=1)
if self.pooling:
init_h = T.repeat(self.H0[None, :], self.depth, axis=0)
else:
init_h = self.H0
init_h = T.repeat(init_h[:, None, :], X.shape[1], axis=1)
init_H = []
init_M = []
for j in xrange(self.depth):
init_H.append(init_h[j])
init_M.append(init_m[j])
else:
init_H = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * self.depth
init_M = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * self.depth
# computational graph !
if not one_step:
sequences = [X]
outputs_info = init_H + init_M + [None, None]
outputs, _ = theano.scan(
self._step,
sequences=sequences,
outputs_info=outputs_info
)
else:
outputs = self._step(*([X[0]] + init_H + init_M))
if return_info == 'hx':
if return_sequence:
return outputs[0].dimshuffle((1, 0, 2))
return outputs[-2][-1]
elif return_info == 'hy':
assert self.pooling or self.attention, 'y-axis hidden states are only used in the ``Pooling Mode".'
if return_sequence:
return outputs[2].dimshuffle((1, 0, 2))
return outputs[2][-1]
elif return_info == 'hxhy':
assert self.pooling or self.attention, 'y-axis hidden states are only used in the ``Pooling Mode".'
if return_sequence:
return outputs[-2].dimshuffle((1, 0, 2)), outputs[2].dimshuffle((1, 0, 2)) # x-y
return outputs[-2][-1], outputs[2][-1]
class PyramidGridLSTM2D(Grid):
"""
A variant version of Sequential LSTM where we introduce a Pyramid structure.
"""
def __init__(self,
# parameters for Grid.
output_dims,
input_dims, # [0, ... 0], 0 represents no external inputs.
priority=1,
peephole=True,
init='glorot_uniform', inner_init='orthogonal',
forget_bias_init='one',
activation='tanh', inner_activation='sigmoid',
use_input=True,
name=None, weights=None,
identity_connect=None,
# parameters for 2D-GridLSTM
depth=5,
learn_init=False,
shared=True,
dropout=0
):
super(Grid, self).__init__()
assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM'
assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.'
assert output_dims[0] == output_dims[1], 'Here we only support square model.'
assert shared, 'we share the weights in this stage.'
assert use_input, 'use input and add them in the middle'
"""
Initialization.
"""
logger.info(":::: Sequential Grid-Pool LSTM ::::")
self.input_dims = input_dims
self.output_dims = output_dims
self.N = len(output_dims)
self.depth = depth
self.dropout = dropout
self.priority = priority
self.peephole = peephole
self.use_input = use_input
self.learn_init = learn_init
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.forget_bias_init = initializations.get(forget_bias_init)
self.activation = activations.get(activation)
self.relu = activations.get('relu')
self.inner_activation = activations.get(inner_activation)
self.identity_connect = identity_connect
self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now!
"""
Build the model weights.
"""
# build the centroid grid.
self.build()
# # input projection layer (projected to time-axis) [x]
# self.Ph = Dense(input_dims[0], output_dims[0], name='Ph')
# self.Pm = Dense(input_dims[0], output_dims[0], name='Pm')
#
# self._add(self.Ph)
# self._add(self.Pm)
# learn init/
if self.learn_init:
self.hx0 = self.init((1, output_dims[0]))
self.hy0 = self.init((1, output_dims[1]))
self.mx0 = self.init((1, output_dims[0]))
self.my0 = self.init((1, output_dims[1]))
self.hx0.name, self.hy0.name = 'hx0', 'hy0'
self.mx0.name, self.my0.name = 'mx0', 'my0'
self.params += [self.hx0, self.hy0, self.mx0, self.my0]
"""
Others info.
"""
if weights is not None:
self.set_weights(weights)
if name is not None:
self.set_name(name)
def _step(self, *args):
inputs = args[0]
hx_tm1 = args[1]
mx_tm1 = args[2]
hy_tm1 = args[3]
my_tm1 = args[4]
# zero constant inputs.
pre_info = [[[T.zeros_like(hx_tm1)
for _ in xrange(self.depth)]
for _ in xrange(self.depth)]
for _ in xrange(4)] # hx, mx, hy, my
pre_inputs = [[T.zeros_like(inputs)
for _ in xrange(self.depth)]
for _ in xrange(self.depth)]
for kk in xrange(self.depth):
pre_inputs[kk][kk] = inputs
pre_info[0][0][0] = hx_tm1
pre_info[1][0][0] = mx_tm1
pre_info[2][0][0] = hy_tm1
pre_info[3][0][0] = my_tm1
for step_x in xrange(self.depth):
for step_y in xrange(self.depth):
# input hidden/memory/input information
print(pre_info[0][-1][-1], pre_info[2][-1][-1])
hs_i = [pre_info[0][step_x][step_y],
pre_info[2][step_x][step_y]]
ms_i = [pre_info[1][step_x][step_y],
pre_info[3][step_x][step_y]]
xs_i = [pre_inputs[step_x][step_y],
pre_inputs[step_x][step_y]]
# compute grid-lstm
hs_o, ms_o = self.grid_(hs_i, ms_i, xs_i, priority =-1)
# output hidden/memory information
if (step_x == self.depth - 1) and (step_y == self.depth - 1):
hx_t, mx_t, hy_t, my_t = hs_o[0], ms_o[0], hs_o[1], ms_o[1]
return hx_t, mx_t, hy_t, my_t
if step_x + 1 < self.depth:
pre_info[0][step_x + 1][step_y] = hs_o[0]
pre_info[1][step_x + 1][step_y] = ms_o[0]
if step_y + 1 < self.depth:
pre_info[2][step_x][step_y + 1] = hs_o[1]
pre_info[3][step_x][step_y + 1] = ms_o[1]
def __call__(self, X, init_x=None, init_y=None,
return_sequence=False, one_step=False):
# masking is currently not supported.
if X.ndim == 2:
X = X[:, None, :]
# one step
if one_step:
assert init_x is not None, 'previous x must be provided!'
assert init_y is not None, 'previous y must be provided!'
X = X.dimshuffle((1, 0, 2))
if init_x is None:
if self.learn_init:
init_mx = T.repeat(self.mx0, X.shape[1], axis=0)
init_my = T.repeat(self.my0, X.shape[1], axis=0)
init_hx = T.repeat(self.hx0, X.shape[1], axis=0)
init_hy = T.repeat(self.hy0, X.shape[1], axis=0)
init_input = [init_hx, init_mx, init_hy, init_my]
else:
init_x = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[0]), 1)] * 2
init_y = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * 2
init_input = init_x + init_y
else:
init_input = init_x + init_y
if not one_step:
sequence = [X]
output_info = init_input
outputs, _ = theano.scan(
self._step,
sequences=sequence,
outputs_info=output_info
)
else:
outputs = self._step(*([X[0]] + init_x + init_y))
if return_sequence:
hxs = outputs[0].dimshuffle((1, 0, 2))
hys = outputs[2].dimshuffle((1, 0, 2))
hs = T.concatenate([hxs, hys], axis=-1)
return hs
else:
hx = outputs[0][-1]
hy = outputs[2][-1]
h = T.concatenate([hx, hy], axis=-1)
return h
class PyramidLSTM(Layer):
"""
A more flexible Pyramid LSTM structure!
"""
def __init__(self,
# parameters for Grid.
output_dims,
input_dims, # [0, ... 0], 0 represents no external inputs.
priority=1,
peephole=True,
init='glorot_uniform', inner_init='orthogonal',
forget_bias_init='one',
activation='tanh', inner_activation='sigmoid',
use_input=True,
name=None, weights=None,
identity_connect=None,
# parameters for 2D-GridLSTM
depth=5,
learn_init=False,
shared=True,
dropout=0
):
super(PyramidLSTM, self).__init__()
assert len(output_dims) == 2, 'in this stage, we only support 2D Grid-LSTM'
assert len(input_dims) == len(output_dims), '# of inputs must match # of outputs.'
assert output_dims[0] == output_dims[1], 'Here we only support square model.'
assert shared, 'we share the weights in this stage.'
assert use_input, 'use input and add them in the middle'
"""
Initialization.
"""
logger.info(":::: Sequential Grid-Pool LSTM ::::")
self.N = len(output_dims)
self.depth = depth
self.dropout = dropout
self.priority = priority
self.peephole = peephole
self.use_input = use_input
self.learn_init = learn_init
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.forget_bias_init = initializations.get(forget_bias_init)
self.activation = activations.get(activation)
self.relu = activations.get('relu')
self.inner_activation = activations.get(inner_activation)
self.identity_connect = identity_connect
self.axies = {0: 'x', 1: 'y', 2: 'z', 3: 'w'} # only support at most 4D now!
"""
Build the model weights.
"""
# build the centroid grid (3 grid versions)
self.grids = [Grid(output_dims,
input_dims,
-1,
peephole,
init, inner_init,
forget_bias_init,
activation, inner_activation, use_input,
name='Grid*{}'.format(k)
) for k in xrange(3)]
for k in xrange(3):
self.grids[k].build()
self._add(self.grids[k])
# # input projection layer (projected to time-axis) [x]
# self.Ph = Dense(input_dims[0], output_dims[0], name='Ph')
# self.Pm = Dense(input_dims[0], output_dims[0], name='Pm')
#
# self._add(self.Ph)
# self._add(self.Pm)
# learn init/
if self.learn_init:
self.hx0 = self.init((1, output_dims[0]))
self.hy0 = self.init((1, output_dims[1]))
self.mx0 = self.init((1, output_dims[0]))
self.my0 = self.init((1, output_dims[1]))
self.hx0.name, self.hy0.name = 'hx0', 'hy0'
self.mx0.name, self.my0.name = 'mx0', 'my0'
self.params += [self.hx0, self.hy0, self.mx0, self.my0]
"""
Others info.
"""
if weights is not None:
self.set_weights(weights)
if name is not None:
self.set_name(name)
def _step(self, *args):
inputs = args[0]
hx_tm1 = args[1]
mx_tm1 = args[2]
hy_tm1 = args[3]
my_tm1 = args[4]
# zero constant inputs.
pre_info = [[[T.zeros_like(hx_tm1)
for _ in xrange(self.depth)]
for _ in xrange(self.depth)]
for _ in xrange(4)] # hx, mx, hy, my
pre_inputs = [[T.zeros_like(inputs)
for _ in xrange(self.depth)]
for _ in xrange(self.depth)]
for kk in xrange(self.depth):
pre_inputs[kk][kk] = inputs
pre_info[0][0][0] = hx_tm1
pre_info[1][0][0] = mx_tm1
pre_info[2][0][0] = hy_tm1
pre_info[3][0][0] = my_tm1
for step_x in xrange(self.depth):
for step_y in xrange(self.depth):
# input hidden/memory/input information
print(pre_info[0][-1][-1], pre_info[2][-1][-1])
hs_i = [pre_info[0][step_x][step_y],
pre_info[2][step_x][step_y]]
ms_i = [pre_info[1][step_x][step_y],
pre_info[3][step_x][step_y]]
xs_i = [pre_inputs[step_x][step_y],
pre_inputs[step_x][step_y]]
# compute grid-lstm
if (step_x + step_y + 1) < self.depth:
hs_o, ms_o = self.grids[0].grid_(hs_i, ms_i, xs_i, priority =-1)
elif (step_x + step_y + 1) == self.depth:
hs_o, ms_o = self.grids[1].grid_(hs_i, ms_i, xs_i, priority =-1)
else:
hs_o, ms_o = self.grids[2].grid_(hs_i, ms_i, xs_i, priority =-1)
# output hidden/memory information
if (step_x == self.depth - 1) and (step_y == self.depth - 1):
hx_t, mx_t, hy_t, my_t = hs_o[0], ms_o[0], hs_o[1], ms_o[1]
return hx_t, mx_t, hy_t, my_t
if step_x + 1 < self.depth:
pre_info[0][step_x + 1][step_y] = hs_o[0]
pre_info[1][step_x + 1][step_y] = ms_o[0]
if step_y + 1 < self.depth:
pre_info[2][step_x][step_y + 1] = hs_o[1]
pre_info[3][step_x][step_y + 1] = ms_o[1]
def __call__(self, X, init_x=None, init_y=None,
return_sequence=False, one_step=False):
# masking is currently not supported.
if X.ndim == 2:
X = X[:, None, :]
# one step
if one_step:
assert init_x is not None, 'previous x must be provided!'
assert init_y is not None, 'previous y must be provided!'
X = X.dimshuffle((1, 0, 2))
if init_x is None:
if self.learn_init:
init_mx = T.repeat(self.mx0, X.shape[1], axis=0)
init_my = T.repeat(self.my0, X.shape[1], axis=0)
init_hx = T.repeat(self.hx0, X.shape[1], axis=0)
init_hy = T.repeat(self.hy0, X.shape[1], axis=0)
init_input = [init_hx, init_mx, init_hy, init_my]
else:
init_x = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[0]), 1)] * 2
init_y = [T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dims[1]), 1)] * 2
init_input = init_x + init_y
else:
init_input = init_x + init_y
if not one_step:
sequence = [X]
output_info = init_input
outputs, _ = theano.scan(
self._step,
sequences=sequence,
outputs_info=output_info
)
else:
outputs = self._step(*([X[0]] + init_x + init_y))
if return_sequence:
hxs = outputs[0].dimshuffle((1, 0, 2))
hys = outputs[2].dimshuffle((1, 0, 2))
hs = T.concatenate([hxs, hys], axis=-1)
return hs
else:
hx = outputs[0][-1]
hy = outputs[2][-1]
h = T.concatenate([hx, hy], axis=-1)
return h
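The LSTM cell used inside each grid dimension follows the gate equations given in the Grid docstring. A scalar pure-Python sketch of one update, with peephole terms omitted for brevity (the scalar weights are illustrative, not the model's learned matrices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x, h, m, W, U, b):
    """One step of [h', m'] = LSTM(x, h, m) with scalar weights.
    W, U, b are dicts keyed by gate name: 'i', 'f', 'o', 'c'."""
    gi = sigmoid(W['i'] * x + U['i'] * h + b['i'])   # input gate
    gf = sigmoid(W['f'] * x + U['f'] * h + b['f'])   # forget gate
    go = sigmoid(W['o'] * x + U['o'] * h + b['o'])   # output gate
    gc = math.tanh(W['c'] * x + U['c'] * h + b['c']) # candidate
    m_new = gf * m + gi * gc         # memory: forget old, write candidate
    h_new = go * math.tanh(m_new)    # hidden: gated view of the memory
    return h_new, m_new

W = {k: 0.5 for k in 'ifoc'}
U = {k: 0.1 for k in 'ifoc'}
b = {k: 0.0 for k in 'ifoc'}
h, m = lstm_cell(x=1.0, h=0.0, m=0.0, W=W, U=U, b=b)
```

In the Grid block, this cell is applied once per dimension, with all hidden states concatenated as input and the priority dimension updated last using the others' fresh outputs.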
================================================
FILE: emolga/layers/ntm_minibatch.py
================================================
__author__ = 'jiataogu'
import theano
import theano.tensor as T
import scipy.linalg as sl
import numpy as np
from .core import *
from .recurrent import *
import copy
"""
This implementation supports both minibatch learning and on-line training.
We need a minibatch version for Neural Turing Machines.
"""
class Reader(Layer):
"""
"Reader Head" of the Neural Turing Machine.
"""
def __init__(self, input_dim, memory_width, shift_width, shift_conv,
init='glorot_uniform', inner_init='orthogonal',
name=None):
super(Reader, self).__init__()
self.input_dim = input_dim
self.memory_dim = memory_width
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.tanh = activations.get('tanh')
self.sigmoid = activations.get('sigmoid')
self.softplus = activations.get('softplus')
self.vec_softmax = activations.get('vector_softmax')
self.softmax = activations.get('softmax')
"""
Reader Params.
"""
self.W_key = self.init((input_dim, memory_width))
self.W_shift = self.init((input_dim, shift_width))
self.W_beta = self.init(input_dim)
self.W_gama = self.init(input_dim)
self.W_g = self.init(input_dim)
self.b_key = shared_zeros(memory_width)
self.b_shift = shared_zeros(shift_width)
self.b_beta = theano.shared(floatX(0))
self.b_gama = theano.shared(floatX(0))
self.b_g = theano.shared(floatX(0))
self.shift_conv = shift_conv
# add params and set names.
self.params = [self.W_key, self.W_shift, self.W_beta, self.W_gama, self.W_g,
self.b_key, self.b_shift, self.b_beta, self.b_gama, self.b_g]
self.W_key.name, self.W_shift.name, self.W_beta.name, \
self.W_gama.name, self.W_g.name = 'W_key', 'W_shift', 'W_beta', \
'W_gama', 'W_g'
self.b_key.name, self.b_shift.name, self.b_beta.name, \
self.b_gama.name, self.b_g.name = 'b_key', 'b_shift', 'b_beta', \
'b_gama', 'b_g'
def __call__(self, X, w_temp, m_temp):
# input dimensions
# X: (nb_samples, input_dim)
# w_temp: (nb_samples, memory_dim)
# m_temp: (nb_samples, memory_dim, memory_width) ::tensor_memory
key = dot(X, self.W_key, self.b_key) # (nb_samples, memory_width)
shift = self.softmax(
dot(X, self.W_shift, self.b_shift)) # (nb_samples, shift_width)
beta = self.softplus(dot(X, self.W_beta, self.b_beta))[:, None] # (nb_samples, 1)
gamma = self.softplus(dot(X, self.W_gama, self.b_gama)) + 1. # (nb_samples,)
gamma = gamma[:, None] # (nb_samples, 1)
g = self.sigmoid(dot(X, self.W_g, self.b_g))[:, None] # (nb_samples, 1)
signal = [key, shift, beta, gamma, g]
w_c = self.softmax(
beta * cosine_sim2d(key, m_temp)) # (nb_samples, memory_dim) //content-based addressing
w_g = g * w_c + (1 - g) * w_temp # (nb_samples, memory_dim) //history interpolation
w_s = shift_convolve2d(w_g, shift, self.shift_conv) # (nb_samples, memory_dim) //convolutional shift
w_p = w_s ** gamma # (nb_samples, memory_dim) //sharpening
w_t = w_p / T.sum(w_p, axis=1)[:, None] # (nb_samples, memory_dim)
return w_t
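# The addressing pipeline in Reader.__call__ above (content-based lookup,
# history interpolation, circular shift, sharpening) can be sketched for a
# single sample in plain NumPy. `address` and its arguments are illustrative
# names for this sketch only, not part of this module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def address(key, beta, g, shift, gamma, w_prev, memory):
    """One NTM addressing step for a single sample (NumPy sketch)."""
    # content-based addressing: cosine similarity against every memory row
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w_c = softmax(beta * sim)
    # interpolate with the previous weights via the gate g
    w_g = g * w_c + (1.0 - g) * w_prev
    # circular convolutional shift
    n = len(w_g)
    w_s = np.array([sum(w_g[(i - j) % n] * shift[j] for j in range(len(shift)))
                    for i in range(n)])
    # sharpening and renormalisation
    w = w_s ** gamma
    return w / w.sum()
```

# With a sharp key (large beta) and an identity shift, the weights collapse
# onto the best-matching memory row.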
class Writer(Reader):
"""
"Writer head" of the Neural Turing Machine
"""
def __init__(self, input_dim, memory_width, shift_width, shift_conv,
init='glorot_uniform', inner_init='orthogonal',
name=None):
super(Writer, self).__init__(input_dim, memory_width, shift_width, shift_conv,
init, inner_init, name)
"""
Writer Params.
"""
self.W_erase = self.init((input_dim, memory_width))
self.W_add = self.init((input_dim, memory_width))
self.b_erase = shared_zeros(memory_width)
self.b_add = shared_zeros(memory_width)
# add params and set names.
self.params += [self.W_erase, self.W_add, self.b_erase, self.b_add]
self.W_erase.name, self.W_add.name = 'W_erase', 'W_add'
self.b_erase.name, self.b_add.name = 'b_erase', 'b_add'
def get_fixer(self, X):
erase = self.sigmoid(dot(X, self.W_erase, self.b_erase)) # (nb_samples, memory_width)
add = self.sigmoid(dot(X, self.W_add, self.b_add)) # (nb_samples, memory_width)
return erase, add
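# Writer.get_fixer produces the erase/add vectors used in the standard NTM
# write (erase first, then add; cf. Controller._write below). A single-sample
# NumPy sketch; `write` is an illustrative name, not part of this module:

```python
import numpy as np

def write(memory, w, erase, add):
    """Standard NTM write for one sample.
    memory: (memory_dim, memory_width); w: (memory_dim,);
    erase, add: (memory_width,)."""
    m_erased = memory * (1.0 - np.outer(w, erase))  # wipe the addressed rows
    return m_erased + np.outer(w, add)              # then add new content

# fully erasing one slot zeroes exactly that row
m = write(np.ones((3, 4)), np.array([0.0, 1.0, 0.0]), np.ones(4), np.zeros(4))
```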
class Controller(Recurrent):
"""
Controller used in Neural Turing Machine.
- Core cell (Memory)
- Reader head
- Writer head
This is a simple RNN version; the original Neural Turing Machine uses an LSTM cell.
"""
def __init__(self,
input_dim,
memory_dim,
memory_width,
hidden_dim,
shift_width=3,
init='glorot_uniform',
inner_init='orthogonal',
name=None,
readonly=False,
curr_input=False,
recurrence=False,
memorybook=None
):
super(Controller, self).__init__()
# Initialization of the dimensions.
self.input_dim = input_dim
self.memory_dim = memory_dim
self.memory_width = memory_width
self.hidden_dim = hidden_dim
self.shift_width = shift_width
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.tanh = activations.get('tanh')
self.softmax = activations.get('softmax')
self.vec_softmax = activations.get('vector_softmax')
self.readonly = readonly
self.curr_input = curr_input
self.recurrence = recurrence
self.memorybook = memorybook
"""
Controller Module.
"""
# hidden projection:
self.W_in = self.init((input_dim, hidden_dim))
self.b_in = shared_zeros(hidden_dim)
self.W_rd = self.init((memory_width, hidden_dim))
self.W_in.name = 'W_in'
self.b_in.name = 'b_in'
self.W_rd.name = 'W_rd'
self.params = [self.W_in, self.b_in, self.W_rd]
# use recurrence:
if self.recurrence:
self.W_hh = self.inner_init((hidden_dim, hidden_dim))
self.W_hh.name = 'W_hh'
self.params += [self.W_hh]
# Shift convolution
shift_conv = sl.circulant(np.arange(memory_dim)).T[
np.arange(-(shift_width // 2), (shift_width // 2) + 1)][::-1]
# use the current input for weights.
if self.curr_input:
controller_size = self.input_dim + self.hidden_dim
else:
controller_size = self.hidden_dim
# write head
if not readonly:
self.writer = Writer(controller_size, memory_width, shift_width, shift_conv, name='writer')
self.writer.set_name('writer')
self._add(self.writer)
# read head
self.reader = Reader(controller_size, memory_width, shift_width, shift_conv, name='reader')
self.reader.set_name('reader')
self._add(self.reader)
# ***********************************************************
# reserved for None initialization (we don't use these often)
self.memory_init = self.init((memory_dim, memory_width))
self.w_write_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX))
self.w_read_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX))
self.contr_init = self.tanh(np.random.rand(1, hidden_dim).astype(theano.config.floatX))
if name is not None:
self.set_name(name)
def _controller(self, input_t, read_t, controller_tm1=None):
# input_t : (nb_sample, input_dim)
# read_t : (nb_sample, memory_width)
# controller_tm1: (nb_sample, hidden_dim)
if self.recurrence:
return self.tanh(dot(input_t, self.W_in) +
dot(controller_tm1, self.W_hh) +
dot(read_t, self.W_rd) +
self.b_in)
else:
return self.tanh(dot(input_t, self.W_in) +
dot(read_t, self.W_rd) +
self.b_in)
@staticmethod
def _read(w_read, memory):
# w_read : (nb_sample, memory_dim)
# memory : (nb_sample, memory_dim, memory_width)
# return dot(w_read, memory)
return T.sum(w_read[:, :, None] * memory, axis=1)
@staticmethod
def _write(w_write, memory, erase, add):
# w_write: (nb_sample, memory_dim)
# memory : (nb_sample, memory_dim, memory_width)
# erase/add: (nb_sample, memory_width)
w_write = w_write[:, :, None]
erase = erase[:, None, :]
add = add[:, None, :]
m_erased = memory * (1 - w_write * erase)
memory_t = m_erased + w_write * add # (nb_sample, memory_dim, memory_width)
return memory_t
def _step(self, input_t, mask_t,
memory_tm1,
w_write_tm1, w_read_tm1,
controller_tm1):
# input_t: (nb_sample, input_dim)
# memory_tm1: (nb_sample, memory_dim, memory_width)
# w_write_tm1: (nb_sample, memory_dim)
# w_read_tm1: (nb_sample, memory_dim)
# controller_tm1: (nb_sample, hidden_dim)
# read the memory
if self.curr_input:
info = T.concatenate((controller_tm1, input_t), axis=1)
w_read_t = self.reader(info, w_read_tm1, memory_tm1)
read_tm1 = self._read(w_read_t, memory_tm1)
else:
read_tm1 = self._read(w_read_tm1, memory_tm1) # (nb_sample, memory_width)
# get the new controller (hidden states.)
if self.recurrence:
controller_t = self._controller(input_t, read_tm1, controller_tm1)
else:
controller_t = self._controller(input_t, read_tm1) # (nb_sample, controller_size)
# update the memory cell (if needed)
if not self.readonly:
if self.curr_input:
infow = T.concatenate((controller_t, input_t), axis=1)
w_write_t = self.writer(infow, w_write_tm1, memory_tm1) # (nb_sample, memory_dim)
erase_t, add_t = self.writer.get_fixer(infow) # (nb_sample, memory_width)
else:
w_write_t = self.writer(controller_t, w_write_tm1, memory_tm1)
erase_t, add_t = self.writer.get_fixer(controller_t)
memory_t = self._write(w_write_t, memory_tm1, erase_t, add_t) # (nb_sample, memory_dim, memory_width)
else:
w_write_t = w_write_tm1
memory_t = memory_tm1
# get the next reading weights.
if not self.curr_input:
w_read_t = self.reader(controller_t, w_read_tm1, memory_t) # (nb_sample, memory_dim)
# apply the mask: padded steps (mask=0) keep the previous state
memory_t = memory_t * mask_t[:, :, None] + memory_tm1 * (1 - mask_t[:, :, None])
w_read_t = w_read_t * mask_t + w_read_tm1 * (1 - mask_t)
w_write_t = w_write_t * mask_t + w_write_tm1 * (1 - mask_t)
controller_t = controller_t * mask_t + controller_tm1 * (1 - mask_t)
return memory_t, w_write_t, w_read_t, controller_t
def __call__(self, X, mask=None, M=None, init_ww=None,
init_wr=None, init_c=None, return_sequence=False,
one_step=False, return_full=False):
# the recurrent cell only works on 3-D tensors.
if X.ndim == 2:
X = X[:, None, :]
nb_samples = X.shape[0]
# mask
if mask is None:
mask = T.alloc(1., X.shape[0], 1)
padded_mask = self.get_padded_shuffled_mask(mask, pad=0)
X = X.dimshuffle((1, 0, 2))
# ***********************************************************************
# initialization states
if M is None:
memory_init = T.repeat(self.memory_init[None, :, :], nb_samples, axis=0)
else:
memory_init = M
if init_wr is None:
w_read_init = T.repeat(self.w_read_init, nb_samples, axis=0)
else:
w_read_init = init_wr
if init_ww is None:
w_write_init = T.repeat(self.w_write_init, nb_samples, axis=0)
else:
w_write_init = init_ww
if init_c is None:
contr_init = T.repeat(self.contr_init, nb_samples, axis=0)
else:
contr_init = init_c
# ************************************************************************
outputs_info = [memory_init, w_write_init, w_read_init, contr_init]
if one_step:
seq = [X[0], padded_mask[0]]
outputs = self._step(*(seq + outputs_info))
return outputs
else:
seq = [X, padded_mask]
outputs, _ = theano.scan(
self._step,
sequences=seq,
outputs_info=outputs_info,
name='controller_recurrence'
)
self.monitor['memory_info'] = outputs[0]
self.monitor['write_weights'] = outputs[1]
self.monitor['read_weights'] = outputs[2]
if not return_full:
if return_sequence:
return outputs[-1].dimshuffle((1, 0, 2))
return outputs[-1][-1]
else:
if return_sequence:
return [a.dimshuffle((1, 0, 2)) for a in outputs]
return [a[-1] for a in outputs]
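# The broadcasting expression in Controller._read is just a batched weighted
# sum over memory rows. A quick NumPy check (illustrative, not part of this
# module) that it matches an einsum formulation:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.random((2, 5))        # (nb_sample, memory_dim)
M = rng.random((2, 5, 3))     # (nb_sample, memory_dim, memory_width)

# T.sum(w_read[:, :, None] * memory, axis=1) translated to NumPy:
read_bcast = np.sum(w[:, :, None] * M, axis=1)
# equivalent batched matrix-vector product
read_einsum = np.einsum('bn,bnw->bw', w, M)
```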
class AttentionReader(Layer):
"""
"Reader Head" of the Neural Turing Machine.
"""
def __init__(self, input_dim, memory_width, shift_width, shift_conv,
init='glorot_uniform', inner_init='orthogonal',
name=None):
super(AttentionReader, self).__init__()
self.input_dim = input_dim
self.memory_dim = memory_width
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.tanh = activations.get('tanh')
self.sigmoid = activations.get('sigmoid')
self.softplus = activations.get('softplus')
self.vec_softmax = activations.get('vector_softmax')
self.softmax = activations.get('softmax')
"""
Reader Params.
"""
self.W_key = self.init((input_dim, memory_width))
self.W_lock = self.inner_init((memory_width, memory_width))
self.W_shift = self.init((input_dim, shift_width))
self.W_beta = self.init(input_dim)
self.W_gama = self.init(input_dim)
self.W_g = self.init(input_dim)
# self.v = self.init(memory_width)
self.b_key = shared_zeros(memory_width)
self.b_shift = shared_zeros(shift_width)
self.b_beta = theano.shared(floatX(0))
self.b_gama = theano.shared(floatX(0))
self.b_g = theano.shared(floatX(0))
self.shift_conv = shift_conv
# add params and set names.
self.params = [self.W_key, self.W_shift, self.W_beta, self.W_gama, self.W_g,
self.b_key, self.b_shift, self.b_beta, self.b_gama, self.b_g,
self.W_lock]
self.W_key.name, self.W_shift.name, self.W_beta.name, \
self.W_gama.name, self.W_g.name = 'W_key', 'W_shift', 'W_beta', \
'W_gama', 'W_g'
self.W_lock.name = 'W_lock'
self.b_key.name, self.b_shift.name, self.b_beta.name, \
self.b_gama.name, self.b_g.name = 'b_key', 'b_shift', 'b_beta', \
'b_gama', 'b_g'
def __call__(self, X, w_temp, m_temp):
# input dimensions
# X: (nb_samples, input_dim)
# w_temp: (nb_samples, memory_dim)
# m_temp: (nb_samples, memory_dim, memory_width) ::tensor_memory
key = dot(X, self.W_key, self.b_key) # (nb_samples, memory_width)
lock = dot(m_temp, self.W_lock) # (nb_samples, memory_dim, memory_width)
shift = self.softmax(
dot(X, self.W_shift, self.b_shift)) # (nb_samples, shift_width)
beta = self.softplus(dot(X, self.W_beta, self.b_beta))[:, None] # (nb_samples, 1)
gamma = self.softplus(dot(X, self.W_gama, self.b_gama)) + 1. # (nb_samples,)
gamma = gamma[:, None] # (nb_samples, 1)
g = self.sigmoid(dot(X, self.W_g, self.b_g))[:, None] # (nb_samples, 1)
signal = [key, shift, beta, gamma, g]
energy = T.sum(key[:, None, :] * lock, axis=2)
# energy = T.tensordot(key[:, None, :] + lock, self.v, [2, 0])
w_c = self.softmax(beta * energy)
# w_c = self.softmax(
# beta * cosine_sim2d(key, m_temp)) # (nb_samples, memory_dim) //content-based addressing
w_g = g * w_c + (1 - g) * w_temp # (nb_samples, memory_dim) //history interpolation
w_s = shift_convolve2d(w_g, shift, self.shift_conv) # (nb_samples, memory_dim) //convolutional shift
w_p = w_s ** gamma # (nb_samples, memory_dim) //sharpening
w_t = w_p / T.sum(w_p, axis=1)[:, None] # (nb_samples, memory_dim)
return w_t
class AttentionWriter(AttentionReader):
"""
"Writer head" of the Neural Turing Machine
"""
def __init__(self, input_dim, memory_width, shift_width, shift_conv,
init='glorot_uniform', inner_init='orthogonal',
name=None):
super(AttentionWriter, self).__init__(input_dim, memory_width, shift_width, shift_conv,
init, inner_init, name)
"""
Writer Params.
"""
self.W_erase = self.init((input_dim, memory_width))
self.W_add = self.init((input_dim, memory_width))
self.b_erase = shared_zeros(memory_width)
self.b_add = shared_zeros(memory_width)
# add params and set names.
self.params += [self.W_erase, self.W_add, self.b_erase, self.b_add]
self.W_erase.name, self.W_add.name = 'W_erase', 'W_add'
self.b_erase.name, self.b_add.name = 'b_erase', 'b_add'
def get_fixer(self, X):
erase = self.sigmoid(dot(X, self.W_erase, self.b_erase)) # (nb_samples, memory_width)
add = self.sigmoid(dot(X, self.W_add, self.b_add)) # (nb_samples, memory_width)
return erase, add
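# BernoulliController._write (further down) departs from the standard
# erase-then-add rule: its additive term is gated by (1 - erase), so a fully
# erased cell resets to zero even when `add` is non-zero. A single-sample
# NumPy sketch with illustrative names:

```python
import numpy as np

def bernoulli_write(memory, w, erase, add):
    # entry (i, j): m_ij * (1 - w_i * e_j) + a_j * w_i * (1 - e_j)
    return memory * (1.0 - np.outer(w, erase)) + np.outer(w, add * (1.0 - erase))

m = np.full((3, 4), 0.5)
w = np.array([0.0, 1.0, 0.0])
# full erase wipes the addressed row, despite `add` being all ones
out = bernoulli_write(m, w, np.ones(4), np.ones(4))
```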
class BernoulliController(Recurrent):
"""
Controller used in Neural Turing Machine.
- Core cell (Memory): binary memory
- Reader head
- Writer head
This is a simple RNN version; the original Neural Turing Machine uses an LSTM cell.
"""
def __init__(self,
input_dim,
memory_dim,
memory_width,
hidden_dim,
shift_width=3,
init='glorot_uniform',
inner_init='orthogonal',
name=None,
readonly=False,
curr_input=False,
recurrence=False,
memorybook=None
):
super(BernoulliController, self).__init__()
# Initialization of the dimensions.
self.input_dim = input_dim
self.memory_dim = memory_dim
self.memory_width = memory_width
self.hidden_dim = hidden_dim
self.shift_width = shift_width
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.tanh = activations.get('tanh')
self.softmax = activations.get('softmax')
self.vec_softmax = activations.get('vector_softmax')
self.sigmoid = activations.get('sigmoid')
self.readonly = readonly
self.curr_input = curr_input
self.recurrence = recurrence
self.memorybook = memorybook
"""
Controller Module.
"""
# hidden projection:
self.W_in = self.init((input_dim, hidden_dim))
self.b_in = shared_zeros(hidden_dim)
self.W_rd = self.init((memory_width, hidden_dim))
self.W_in.name = 'W_in'
self.b_in.name = 'b_in'
self.W_rd.name = 'W_rd'
self.params = [self.W_in, self.b_in, self.W_rd]
# use recurrence:
if self.recurrence:
self.W_hh = self.inner_init((hidden_dim, hidden_dim))
self.W_hh.name = 'W_hh'
self.params += [self.W_hh]
# Shift convolution
shift_conv = sl.circulant(np.arange(memory_dim)).T[
np.arange(-(shift_width // 2), (shift_width // 2) + 1)][::-1]
# use the current input for weights.
if self.curr_input:
controller_size = self.input_dim + self.hidden_dim
else:
controller_size = self.hidden_dim
# write head
if not readonly:
self.writer = AttentionWriter(controller_size, memory_width, shift_width, shift_conv, name='writer')
self.writer.set_name('writer')
self._add(self.writer)
# read head
self.reader = AttentionReader(controller_size, memory_width, shift_width, shift_conv, name='reader')
self.reader.set_name('reader')
self._add(self.reader)
# ***********************************************************
# reserved for None initialization (we don't use these often)
self.memory_init = self.sigmoid(self.init((memory_dim, memory_width)))
self.w_write_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX))
self.w_read_init = self.softmax(np.random.rand(1, memory_dim).astype(theano.config.floatX))
self.contr_init = self.tanh(np.random.rand(1, hidden_dim).astype(theano.config.floatX))
if name is not None:
self.set_name(name)
def _controller(self, input_t, read_t, controller_tm1=None):
# input_t : (nb_sample, input_dim)
# read_t : (nb_sample, memory_width)
# controller_tm1: (nb_sample, hidden_dim)
if self.recurrence:
return self.tanh(dot(input_t, self.W_in) +
dot(controller_tm1, self.W_hh) +
dot(read_t, self.W_rd) +
self.b_in)
else:
return self.tanh(dot(input_t, self.W_in) +
dot(read_t, self.W_rd) +
self.b_in)
@staticmethod
def _read(w_read, memory):
# w_read : (nb_sample, memory_dim)
# memory : (nb_sample, memory_dim, memory_width)
# return dot(w_read, memory)
return T.sum(w_read[:, :, None] * memory, axis=1)
@staticmethod
def _write(w_write, memory, erase, add):
# w_write: (nb_sample, memory_dim)
# memory : (nb_sample, memory_dim, memory_width)
# erase/add: (nb_sample, memory_width)
w_write = w_write[:, :, None]
erase = erase[:, None, :] # erase is a gate.
add = add[:, None, :] # add is a bias
# m_erased = memory * (1 - w_write * erase)
# memory_t = m_erased + w_write * add # (nb_sample, memory_dim, memory_width)
memory_t = memory * (1 - w_write * erase) + \
add * w_write * (1 - erase)
return memory_t
def _step(self, input_t, mask_t,
memory_tm1,
w_write_tm1, w_read_tm1,
controller_tm1):
# input_t: (nb_sample, input_dim)
# memory_tm1: (nb_sample, memory_dim, memory_width)
# w_write_tm1: (nb_sample, memory_dim)
# w_read_tm1: (nb_sample, memory_dim)
# controller_tm1: (nb_sample, hidden_dim)
# read the memory
if self.curr_input:
info = T.concatenate((controller_tm1, input_t), axis=1)
w_read_t = self.reader(info, w_read_tm1, memory_tm1)
read_tm1 = self._read(w_read_t, memory_tm1)
else:
read_tm1 = self._read(w_read_tm1, memory_tm1) # (nb_sample, memory_width)
# get the new controller (hidden states.)
if self.recurrence:
controller_t = self._controller(input_t, read_tm1, controller_tm1)
else:
controller_t = self._controller(input_t, read_tm1) # (nb_sample, controller_size)
# update the memory cell (if needed)
if not self.readonly:
if self.curr_input:
infow = T.concatenate((controller_t, input_t), axis=1)
w_write_t = self.writer(infow, w_write_tm1, memory_tm1) # (nb_sample, memory_dim)
erase_t, add_t = self.writer.get_fixer(infow) # (nb_sample, memory_width)
else:
w_write_t = self.writer(controller_t, w_write_tm1, memory_tm1)
erase_t, add_t = self.writer.get_fixer(controller_t)
memory_t = self._write(w_write_t, memory_tm1, erase_t, add_t) # (nb_sample, memory_dim, memory_width)
else:
w_write_t = w_write_tm1
memory_t = memory_tm1
# get the next reading weights.
if not self.curr_input:
w_read_t = self.reader(controller_t, w_read_tm1, memory_t) # (nb_sample, memory_dim)
# apply the mask: padded steps (mask=0) keep the previous state
memory_t = memory_t * mask_t[:, :, None] + memory_tm1 * (1 - mask_t[:, :, None])
w_read_t = w_read_t * mask_t + w_read_tm1 * (1 - mask_t)
w_write_t = w_write_t * mask_t + w_write_tm1 * (1 - mask_t)
controller_t = controller_t * mask_t + controller_tm1 * (1 - mask_t)
return memory_t, w_write_t, w_read_t, controller_t
def __call__(self, X, mask=None, M=None, init_ww=None,
init_wr=None, init_c=None, return_sequence=False,
one_step=False, return_full=False):
# the recurrent cell only works on 3-D tensors.
if X.ndim == 2:
X = X[:, None, :]
nb_samples = X.shape[0]
# mask
if mask is None:
mask = T.alloc(1., X.shape[0], 1)
padded_mask = self.get_padded_shuffled_mask(mask, pad=0)
X = X.dimshuffle((1, 0, 2))
# ***********************************************************************
# initialization states
if M is None:
memory_init = T.repeat(self.memory_init[None, :, :], nb_samples, axis=0)
else:
memory_init = M
if init_wr is None:
w_read_init = T.repeat(self.w_read_init, nb_samples, axis=0)
else:
w_read_init = init_wr
if init_ww is None:
w_write_init = T.repeat(self.w_write_init, nb_samples, axis=0)
else:
w_write_init = init_ww
if init_c is None:
contr_init = T.repeat(self.contr_init, nb_samples, axis=0)
else:
contr_init = init_c
# ************************************************************************
outputs_info = [memory_init, w_write_init, w_read_init, contr_init]
if one_step:
seq = [X[0], padded_mask[0]]
outputs = self._step(*(seq + outputs_info))
return outputs
else:
seq = [X, padded_mask]
outputs, _ = theano.scan(
self._step,
sequences=seq,
outputs_info=outputs_info,
name='controller_recurrence'
)
self.monitor['memory_info'] = outputs
if not return_full:
if return_sequence:
return outputs[-1].dimshuffle((1, 0, 2))
return outputs[-1][-1]
else:
if return_sequence:
return [a.dimshuffle((1, 0, 2)) for a in outputs]
return [a[-1] for a in outputs]
================================================
FILE: emolga/layers/recurrent.py
================================================
# -*- coding: utf-8 -*-
from abc import abstractmethod
from .core import *
class Recurrent(MaskedLayer):
"""
Recurrent Neural Network
"""
@staticmethod
def get_padded_shuffled_mask(mask, pad=0):
"""
Reshape the mask to match the dimension order of the (shuffled) inputs:
[1] expand the 2-D matrix to 3-D: (nb_samples, max_sent_len, 1)
[2] dimshuffle to (max_sent_len, nb_samples, 1)
Entries are 1 where x is a word and 0 where it is padding.
:param mask: shape=(nb_samples, max_sent_len); non-zero marks a word, zero marks padding
"""
# mask is (n_samples, time)
assert mask, 'mask cannot be None'
# pad a dim of 1 to the right, (nb_samples, max_sent_len, 1)
mask = T.shape_padright(mask)
# mask = T.addbroadcast(mask, -1), make the new dim broadcastable
mask = T.addbroadcast(mask, mask.ndim-1)
# change the order of dims, to match the dim of inputs outside
mask = mask.dimshuffle(1, 0, 2) # (max_sent_len, nb_samples, 1)
if pad > 0:
# left-pad in time with 0
padding = alloc_zeros_matrix(pad, mask.shape[1], 1)
mask = T.concatenate([padding, mask], axis=0)
return mask.astype('int8')
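# The reshaping that get_padded_shuffled_mask performs can be mirrored in
# NumPy; `padded_shuffled_mask` is an illustrative stand-in, not part of this
# module:

```python
import numpy as np

def padded_shuffled_mask(mask, pad=0):
    """(nb_samples, max_sent_len) -> (max_sent_len + pad, nb_samples, 1),
    zero-padded on the left along the time axis."""
    m = mask[:, :, None].transpose(1, 0, 2)  # (time, batch, 1)
    if pad > 0:
        m = np.concatenate([np.zeros((pad,) + m.shape[1:]), m], axis=0)
    return m.astype('int8')

mask = np.array([[1, 1, 0],
                 [1, 0, 0]])
out = padded_shuffled_mask(mask, pad=1)
```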
class GRU(Recurrent):
"""
Gated Recurrent Unit - Cho et al. 2014
Acts as a spatio-temporal projection,
turning a sequence of vectors into a single vector.
Eats inputs with shape:
(nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)
and returns outputs with shape:
if not return_sequences:
(nb_samples, output_dim)
if return_sequences:
(nb_samples, max_sample_length, output_dim)
z_t = sigmoid(W_z*x + U_z*h_t-1 + b_z)
r_t = sigmoid(W_r*x + U_r*h_t-1 + b_r)
hh_t = tanh(W_h*x + U_h*(r_t*h_t-1) + b_h)
h_t = z_t * h_t-1 + (1 - z_t) * hh_t
The dot products involving x are independent of time,
so they can be computed outside the recurrent loop (in advance):
x_z = dot(X, self.W_z, self.b_z)
x_r = dot(X, self.W_r, self.b_r)
x_h = dot(X, self.W_h, self.b_h)
References:
On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
http://www.aclweb.org/anthology/W14-4012
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
http://arxiv.org/pdf/1412.3555v1.pdf
"""
def __init__(self,
input_dim,
output_dim=128,
context_dim=None,
init='glorot_uniform', inner_init='orthogonal',
activation='tanh', inner_activation='sigmoid',
name=None, weights=None):
super(GRU, self).__init__()
"""
Standard GRU.
"""
self.input_dim = input_dim
self.output_dim = output_dim
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.activation = activations.get(activation)
self.inner_activation = activations.get(inner_activation)
# W is a matrix to map input x_t
self.W_z = self.init((self.input_dim, self.output_dim))
self.W_r = self.init((self.input_dim, self.output_dim))
self.W_h = self.init((self.input_dim, self.output_dim))
# U is a matrix to map hidden state of last time h_t-1
self.U_z = self.inner_init((self.output_dim, self.output_dim))
self.U_r = self.inner_init((self.output_dim, self.output_dim))
self.U_h = self.inner_init((self.output_dim, self.output_dim))
# bias terms
self.b_z = shared_zeros(self.output_dim)
self.b_r = shared_zeros(self.output_dim)
self.b_h = shared_zeros(self.output_dim)
# set names
self.W_z.name, self.U_z.name, self.b_z.name = 'Wz', 'Uz', 'bz'
self.W_r.name, self.U_r.name, self.b_r.name = 'Wr', 'Ur', 'br'
self.W_h.name, self.U_h.name, self.b_h.name = 'Wh', 'Uh', 'bh'
self.params = [
self.W_z, self.U_z, self.b_z,
self.W_r, self.U_r, self.b_r,
self.W_h, self.U_h, self.b_h,
]
"""
GRU with context inputs.
"""
if context_dim is not None:
self.context_dim = context_dim
self.C_z = self.init((self.context_dim, self.output_dim))
self.C_r = self.init((self.context_dim, self.output_dim))
self.C_h = self.init((self.context_dim, self.output_dim))
self.C_z.name, self.C_r.name, self.C_h.name = 'Cz', 'Cr', 'Ch'
self.params += [self.C_z, self.C_r, self.C_h]
if weights is not None:
self.set_weights(weights)
if name is not None:
self.set_name(name)
def _step(self,
xz_t, xr_t, xh_t, mask_t,
h_tm1,
u_z, u_r, u_h):
"""
One step computation of GRU for a batch of data at time t
sequences=[x_z, x_r, x_h, padded_mask],
outputs_info=init_h,
non_sequences=[self.U_z, self.U_r, self.U_h]
:param xz_t, xr_t, xh_t:
value of x of time t after gate z/r/h (computed beforehand)
shape=(n_samples, output_emb_dim)
:param mask_t: mask of time t, indicates whether t-th token is a word, shape=(n_samples, 1)
:param h_tm1: hidden value (output) of last time, shape=(nb_samples, output_emb_dim)
:param u_z, u_r, u_h:
mapping matrix for hidden state of time t-1
shape=(output_emb_dim, output_emb_dim)
:return: h_t: output, hidden state of time t, shape=(nb_samples, output_emb_dim)
"""
# h_mask_tm1 = mask_tm1 * h_tm1
# Here we use a GroundHog-like formulation.
# z/r: activation values of the update/reset gates, shape=(n_samples, output_emb_dim)
z = self.inner_activation(xz_t + T.dot(h_tm1, u_z))
r = self.inner_activation(xr_t + T.dot(h_tm1, u_r))
hh_t = self.activation(xh_t + T.dot(r * h_tm1, u_h))
h_t = z * h_tm1 + (1 - z) * hh_t
# mask_t mixes h_t and h_tm1: at padding positions (mask=0) the update
# is dropped (0*h_t) and the previous state is kept (1*h_tm1)
h_t = mask_t * h_t + (1 - mask_t) * h_tm1
return h_t
def _step_gate(self,
xz_t, xr_t, xh_t, mask_t,
h_tm1,
u_z, u_r, u_h):
"""
One step computation of GRU
:returns
h_t: output, hidden state of time t, shape=(n_samples, output_emb_dim)
z: value of the update gate (after activation), shape=(n_samples, output_emb_dim)
r: value of the reset gate (after activation), shape=(n_samples, output_emb_dim)
"""
# h_mask_tm1 = mask_tm1 * h_tm1
# Here we use a GroundHog-like formulation.
z = self.inner_activation(xz_t + T.dot(h_tm1, u_z))
r = self.inner_activation(xr_t + T.dot(h_tm1, u_r))
hh_t = self.activation(xh_t + T.dot(r * h_tm1, u_h))
h_t = z * h_tm1 + (1 - z) * hh_t
h_t = mask_t * h_t + (1 - mask_t) * h_tm1
return h_t, z, r
def __call__(self, X, mask=None, C=None, init_h=None,
return_sequence=False, one_step=False,
return_gates=False):
"""
:param X: input sequence, a list of word vectors, shape=(n_samples, max_sent_len, input_emb_dim)
:param mask: input mask, shows x is a word (!=0) or not(==0), shape=(n_samples, max_sent_len)
:param C: context vector; None for the encoder
:param init_h: initial hidden state
:param return_sequence: if True, return the encoding at every time step; otherwise only the final state
:param one_step: run only one step of computation instead of the full theano.scan() loop
:param return_gates: whether to return the gate states as well
:return:
"""
# the recurrent cell only works on 3-D tensors
if X.ndim == 2: # otherwise X.ndim == 3, shape=(n_samples, max_sent_len, input_emb_dim)
X = X[:, None, :]
if mask is not None:
mask = mask[:, None]
# mask, shape=(n_samples, max_sent_len)
if mask is None: # sampling or beam-search
mask = T.alloc(1., X.shape[0], 1)
# one step
if one_step:
assert init_h, 'previous state must be provided!'
# reshape the mask to shape=(max_sent_len, n_samples, 1)
padded_mask = self.get_padded_shuffled_mask(mask, pad=0)
X = X.dimshuffle((1, 0, 2)) # X: (max_sent_len, nb_samples, input_emb_dim)
# compute the gate values at each time in advance
# shape of W = (input_emb_dim, output_emb_dim)
x_z = dot(X, self.W_z, self.b_z) # x_z: (max_sent_len, nb_samples, output_emb_dim)
x_r = dot(X, self.W_r, self.b_r) # x_r: (max_sent_len, nb_samples, output_emb_dim)
x_h = dot(X, self.W_h, self.b_h) # x_h: (max_sent_len, nb_samples, output_emb_dim)
"""
GRU with constant context. (no attention here.)
"""
if C is not None:
assert C.ndim == 2
ctx_step = C.dimshuffle('x', 0, 1) # C: (nb_samples, context_dim)
x_z += dot(ctx_step, self.C_z)
x_r += dot(ctx_step, self.C_r)
x_h += dot(ctx_step, self.C_h)
"""
GRU with additional initial/previous state.
"""
if init_h is None:
init_h = T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1)
if not return_gates:
if one_step:
seq = [x_z, x_r, x_h, padded_mask] # FIXME: in one_step mode these are full (1, nb_samples, dim) tensors, not time slices (cf. X[0] elsewhere)
outputs_info = [init_h]
non_seq = [self.U_z, self.U_r, self.U_h]
outputs = self._step(*(seq + outputs_info + non_seq))
else:
outputs, _ = theano.scan(
self._step,
sequences=[x_z, x_r, x_h, padded_mask],
outputs_info=init_h,
non_sequences=[self.U_z, self.U_r, self.U_h]
)
# return hidden state of all times, shape=(nb_samples, max_sent_len, input_emb_dim)
if return_sequence:
return outputs.dimshuffle((1, 0, 2))
# hidden state of last time, shape=(nb_samples, output_emb_dim)
return outputs[-1]
else:
if one_step:
seq = [x_z, x_r, x_h, padded_mask] # FIXME: in one_step mode these are full (1, nb_samples, dim) tensors, not time slices (cf. X[0] elsewhere)
outputs_info = [init_h]
non_seq = [self.U_z, self.U_r, self.U_h]
outputs, zz, rr = self._step_gate(*(seq + outputs_info + non_seq))
else:
outputx, _ = theano.scan(
self._step_gate,
sequences=[x_z, x_r, x_h, padded_mask],
outputs_info=[init_h, None, None],
non_sequences=[self.U_z, self.U_r, self.U_h]
)
outputs, zz, rr = outputx
if return_sequence:
return outputs.dimshuffle((1, 0, 2)), zz.dimshuffle((1, 0, 2)), rr.dimshuffle((1, 0, 2))
return outputs[-1], zz[-1], rr[-1]
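# The per-step GRU update implemented in GRU._step above (note this code's
# convention: z gates the *old* state, h_t = z*h_tm1 + (1-z)*hh_t) can be
# sketched in NumPy; all names here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(xz, xr, xh, h_tm1, u_z, u_r, u_h, mask=1.0):
    """One GRU step mirroring GRU._step (NumPy sketch)."""
    z = sigmoid(xz + h_tm1 @ u_z)                 # update gate
    r = sigmoid(xr + h_tm1 @ u_r)                 # reset gate
    hh = np.tanh(xh + (r * h_tm1) @ u_h)          # candidate state
    h_t = z * h_tm1 + (1.0 - z) * hh
    # masked (padding) steps carry the previous state forward unchanged
    return mask * h_t + (1.0 - mask) * h_tm1

rng = np.random.default_rng(0)
d = 4
h0 = rng.standard_normal((1, d))
args = [rng.standard_normal((1, d)) for _ in range(3)]   # precomputed x_z/x_r/x_h
mats = [rng.standard_normal((d, d)) for _ in range(3)]   # U_z/U_r/U_h
h1 = gru_step(*args, h0, *mats, mask=1.0)
h_pad = gru_step(*args, h0, *mats, mask=0.0)
```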
class JZS3(Recurrent):
"""
Evolved recurrent neural network architectures from the evaluation of thousands
of models, serving as alternatives to LSTMs and GRUs. See Jozefowicz et al. 2015.
This corresponds to the `MUT3` architecture described in the paper.
Takes inputs with shape:
(nb_samples, max_sample_length (samples shorter than this are padded with zeros at the end), input_dim)
and returns outputs with shape:
if not return_sequences:
(nb_samples, output_dim)
if return_sequences:
(nb_samples, max_sample_length, output_dim)
References:
An Empirical Exploration of Recurrent Network Architectures
http://www.jmlr.org/proceedings/papers/v37/jozefowicz15.pdf
"""
def __init__(self,
input_dim,
output_dim=128,
context_dim=None,
init='glorot_uniform', inner_init='orthogonal',
activation='tanh', inner_activation='sigmoid',
name=None, weights=None):
super(JZS3, self).__init__()
"""
Standard model
"""
self.input_dim = input_dim
self.output_dim = output_dim
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.activation = activations.get(activation)
self.inner_activation = activations.get(inner_activation)
self.W_z = self.init((self.input_dim, self.output_dim))
self.U_z = self.inner_init((self.output_dim, self.output_dim))
self.b_z = shared_zeros(self.output_dim)
self.W_r = self.init((self.input_dim, self.output_dim))
self.U_r = self.inner_init((self.output_dim, self.output_dim))
self.b_r = shared_zeros(self.output_dim)
self.W_h = self.init((self.input_dim, self.output_dim))
self.U_h = self.inner_init((self.output_dim, self.output_dim))
self.b_h = shared_zeros(self.output_dim)
# set names
self.W_z.name, self.U_z.name, self.b_z.name = 'Wz', 'Uz', 'bz'
self.W_r.name, self.U_r.name, self.b_r.name = 'Wr', 'Ur', 'br'
self.W_h.name, self.U_h.name, self.b_h.name = 'Wh', 'Uh', 'bh'
self.params = [
self.W_z, self.U_z, self.b_z,
self.W_r, self.U_r, self.b_r,
self.W_h, self.U_h, self.b_h,
]
"""
context inputs.
"""
if context_dim is not None:
self.context_dim = context_dim
self.C_z = self.init((self.context_dim, self.output_dim))
self.C_r = self.init((self.context_dim, self.output_dim))
self.C_h = self.init((self.context_dim, self.output_dim))
self.C_z.name, self.C_r.name, self.C_h.name = 'Cz', 'Cr', 'Ch'
self.params += [self.C_z, self.C_r, self.C_h]
if weights is not None:
self.set_weights(weights)
if name is not None:
self.set_name(name)
def _step(self,
xz_t, xr_t, xh_t, mask_t,
h_tm1,
u_z, u_r, u_h):
# h_mask_tm1 = mask_tm1 * h_tm1
z = self.inner_activation(xz_t + T.dot(T.tanh(h_tm1), u_z))
r = self.inner_activation(xr_t + T.dot(h_tm1, u_r))
hh_t = self.activation(xh_t + T.dot(r * h_tm1, u_h))
h_t = (hh_t * z + h_tm1 * (1 - z)) * mask_t + (1 - mask_t) * h_tm1
return h_t
def __call__(self, X, mask=None, C=None, init_h=None, return_sequence=False, one_step=False):
# the recurrent cell only works on 3D tensors; lift a 2D input to (nb_samples, 1, input_dim)
if X.ndim == 2:
X = X[:, None, :]
# mask
if mask is None: # sampling or beam-search
mask = T.alloc(1., X.shape[0], X.shape[1])
# one step
if one_step:
assert init_h is not None, 'previous state must be provided!'
padded_mask = self.get_padded_shuffled_mask(mask, pad=0)
X = X.dimshuffle((1, 0, 2))
x_z = dot(X, self.W_z, self.b_z)
x_r = dot(X, self.W_r, self.b_r)
x_h = dot(X, self.W_h, self.b_h)
"""
JZS3 with a constant context vector (no attention here).
"""
if C is not None:
assert C.ndim == 2
ctx_step = C.dimshuffle('x', 0, 1) # C: (nb_samples, context_dim)
x_z += dot(ctx_step, self.C_z)
x_r += dot(ctx_step, self.C_r)
x_h += dot(ctx_step, self.C_h)
"""
JZS3 with additional initial/previous state.
"""
if init_h is None:
init_h = T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1)
if one_step:
seq = [x_z, x_r, x_h, padded_mask]
outputs_info = [init_h]
non_seq = [self.U_z, self.U_r, self.U_h]
outputs = self._step(*(seq + outputs_info + non_seq))
else:
outputs, updates = theano.scan(
self._step,
sequences=[x_z, x_r, x_h, padded_mask],
outputs_info=init_h,
non_sequences=[self.U_z, self.U_r, self.U_h],
)
if return_sequence:
return outputs.dimshuffle((1, 0, 2))
return outputs[-1]
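The MUT3 recurrence implemented by `_step` above can be sketched in plain NumPy. This is a minimal sketch with hypothetical small dimensions; the weight names mirror `W_z`/`U_z` etc., and the mask convention keeps `h_tm1` wherever `mask_t == 0`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mut3_step(x_t, h_tm1, mask_t, params):
    """One MUT3 (JZS3) step; x_t: (batch, in_dim), h_tm1: (batch, out_dim)."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    # the update gate sees tanh(h_tm1) -- the MUT3 peculiarity
    z = sigmoid(x_t @ W_z + np.tanh(h_tm1) @ U_z + b_z)
    r = sigmoid(x_t @ W_r + h_tm1 @ U_r + b_r)
    hh = np.tanh(x_t @ W_h + (r * h_tm1) @ U_h + b_h)
    h_t = hh * z + h_tm1 * (1.0 - z)
    # masked positions keep the previous state
    m = mask_t[:, None]
    return h_t * m + (1.0 - m) * h_tm1

rng = np.random.RandomState(0)
in_dim, out_dim, batch = 3, 4, 2

def make(shape):
    return rng.randn(*shape) * 0.1

params = [make((in_dim, out_dim)), make((out_dim, out_dim)), np.zeros(out_dim),
          make((in_dim, out_dim)), make((out_dim, out_dim)), np.zeros(out_dim),
          make((in_dim, out_dim)), make((out_dim, out_dim)), np.zeros(out_dim)]
h0 = np.zeros((batch, out_dim))
x = rng.randn(batch, in_dim)
mask = np.array([1.0, 0.0])  # second sample is padding
h1 = mut3_step(x, h0, mask, params)
```

The masked second sample stays at its previous (zero) state, matching the `h_t * mask_t + (1 - mask_t) * h_tm1` line in `_step`.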
class LSTM(Recurrent):
def __init__(self,
input_dim=0,
output_dim=128,
context_dim=None,
init='glorot_uniform', inner_init='orthogonal',
forget_bias_init='one',
activation='tanh', inner_activation='sigmoid',
name=None, weights=None):
super(LSTM, self).__init__()
"""
Standard model
"""
self.input_dim = input_dim
self.output_dim = output_dim
self.init = initializations.get(init)
self.inner_init = initializations.get(inner_init)
self.forget_bias_init = initializations.get(forget_bias_init)
self.activation = activations.get(activation)
self.inner_activation = activations.get(inner_activation)
# input gate param.
self.W_i = self.init((self.input_dim, self.output_dim))
self.U_i = self.inner_init((self.output_dim, self.output_dim))
self.b_i = shared_zeros(self.output_dim)
# forget gate param.
self.W_f = self.init((self.input_dim, self.output_dim))
self.U_f = self.inner_init((self.output_dim, self.output_dim))
self.b_f = self.forget_bias_init(self.output_dim) # forget gate needs one bias.
# output gate param.
self.W_o = self.init((self.input_dim, self.output_dim))
self.U_o = self.inner_init((self.output_dim, self.output_dim))
self.b_o = shared_zeros(self.output_dim)
# memory param.
self.W_c = self.init((self.input_dim, self.output_dim))
self.U_c = self.inner_init((self.output_dim, self.output_dim))
self.b_c = shared_zeros(self.output_dim)
# set names
self.W_i.name, self.U_i.name, self.b_i.name = 'Wi', 'Ui', 'bi'
self.W_f.name, self.U_f.name, self.b_f.name = 'Wf', 'Uf', 'bf'
self.W_o.name, self.U_o.name, self.b_o.name = 'Wo', 'Uo', 'bo'
self.W_c.name, self.U_c.name, self.b_c.name = 'Wc', 'Uc', 'bc'
self.params = [
self.W_i, self.U_i, self.b_i,
self.W_f, self.U_f, self.b_f,
self.W_o, self.U_o, self.b_o,
self.W_c, self.U_c, self.b_c,
]
"""
context inputs.
"""
if context_dim is not None:
self.context_dim = context_dim
self.C_i = self.init((self.context_dim, self.output_dim))
self.C_f = self.init((self.context_dim, self.output_dim))
self.C_o = self.init((self.context_dim, self.output_dim))
self.C_c = self.init((self.context_dim, self.output_dim))
self.C_i.name, self.C_f.name, self.C_o.name, self.C_c.name = 'Ci', 'Cf', 'Co', 'Cc'
self.params += [self.C_i, self.C_f, self.C_o, self.C_c]
if weights is not None:
self.set_weights(weights)
if name is not None:
self.set_name(name)
def _step(self,
xi_t, xf_t, xo_t, xc_t, mask_t,
h_tm1, c_tm1,
u_i, u_f, u_o, u_c):
# h_mask_tm1 = mask_tm1 * h_tm1
i = self.inner_activation(xi_t + T.dot(h_tm1, u_i)) # input gate
f = self.inner_activation(xf_t + T.dot(h_tm1, u_f)) # forget gate
o = self.inner_activation(xo_t + T.dot(h_tm1, u_o)) # output gate
c = self.activation(xc_t + T.dot(h_tm1, u_c)) # memory updates
# update the memory cell.
c_t = f * c_tm1 + i * c
h_t = o * self.activation(c_t)
# masking
c_t = c_t * mask_t + (1 - mask_t) * c_tm1
h_t = h_t * mask_t + (1 - mask_t) * h_tm1
return h_t, c_t
def input_embed(self, X, C=None):
x_i = dot(X, self.W_i, self.b_i)
x_f = dot(X, self.W_f, self.b_f)
x_o = dot(X, self.W_o, self.b_o)
x_c = dot(X, self.W_c, self.b_c)
"""
LSTM with a constant context vector (no attention here).
"""
if C is not None:
assert C.ndim == 2
ctx_step = C.dimshuffle('x', 0, 1) # C: (nb_samples, context_dim)
x_i += dot(ctx_step, self.C_i)
x_f += dot(ctx_step, self.C_f)
x_o += dot(ctx_step, self.C_o)
x_c += dot(ctx_step, self.C_c)
return x_i, x_f, x_o, x_c
def __call__(self, X, mask=None, C=None, init_h=None, init_c=None, return_sequence=False, one_step=False):
# the recurrent cell only works on 3D tensors; lift a 2D input to (nb_samples, 1, input_dim)
if X.ndim == 2:
X = X[:, None, :]
# mask
if mask is None: # sampling or beam-search
mask = T.alloc(1., X.shape[0], X.shape[1])
# one step
if one_step:
assert init_h is not None, 'previous state must be provided!'
padded_mask = self.get_padded_shuffled_mask(mask, pad=0)
X = X.dimshuffle((1, 0, 2))
x_i, x_f, x_o, x_c = self.input_embed(X, C)
"""
LSTM with additional initial/previous state.
"""
if init_h is None:
init_h = T.unbroadcast(alloc_zeros_matrix(X.shape[1], self.output_dim), 1)
if init_c is None:
init_c = init_h
if one_step:
seq = [x_i, x_f, x_o, x_c, padded_mask]
outputs_info = [init_h, init_c]
non_seq = [self.U_i, self.U_f, self.U_o, self.U_c]
outputs = self._step(*(seq + outputs_info + non_seq))
else:
outputs, updates = theano.scan(
self._step,
sequences=[x_i, x_f, x_o, x_c, padded_mask],
outputs_info=[init_h, init_c],
non_sequences=[self.U_i, self.U_f, self.U_o, self.U_c],
)
if return_sequence:
return outputs[0].dimshuffle((1, 0, 2)), outputs[1].dimshuffle((1, 0, 2)) # H, C
return outputs[0][-1], outputs[1][-1]
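The masked LSTM step above can be mirrored in NumPy. A minimal sketch with toy dimensions, assuming the same convention as `_step`: input projections `x_i..x_c` are precomputed, and masked positions carry the previous `h`/`c` forward unchanged:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_gates, h_tm1, c_tm1, U, mask_t):
    """One masked LSTM step.

    x_gates: dict of precomputed input projections 'i', 'f', 'o', 'c', each (batch, out_dim)
    U: dict of recurrent matrices 'i', 'f', 'o', 'c', each (out_dim, out_dim)
    """
    i = sigmoid(x_gates['i'] + h_tm1 @ U['i'])   # input gate
    f = sigmoid(x_gates['f'] + h_tm1 @ U['f'])   # forget gate
    o = sigmoid(x_gates['o'] + h_tm1 @ U['o'])   # output gate
    c = np.tanh(x_gates['c'] + h_tm1 @ U['c'])   # candidate memory
    c_t = f * c_tm1 + i * c
    h_t = o * np.tanh(c_t)
    # masking: padded positions keep the previous state and cell
    m = mask_t[:, None]
    return h_t * m + (1 - m) * h_tm1, c_t * m + (1 - m) * c_tm1

rng = np.random.RandomState(1)
batch, dim = 2, 4
x_gates = {k: rng.randn(batch, dim) for k in 'ifoc'}
U = {k: rng.randn(dim, dim) * 0.1 for k in 'ifoc'}
h0 = np.zeros((batch, dim))
c0 = np.zeros((batch, dim))
mask = np.array([1.0, 0.0])  # second sample is padding
h1, c1 = lstm_step(x_gates, h0, c0, U, mask)
```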
================================================
FILE: emolga/models/__init__.py
================================================
__author__ = 'jiataogu'
================================================
FILE: emolga/models/core.py
================================================
import json
import numpy
from keyphrase import config
__author__ = 'jiataogu'
import theano
import logging
import deepdish as dd
from emolga.dataset.build_dataset import serialize_to_file, deserialize_from_file, serialize_to_file_json
from emolga.utils.theano_utils import floatX
logger = logging.getLogger(__name__)
class Model(object):
def __init__(self):
self.layers = []
self.params = []
self.monitor = {}
self.watchlist = []
def _add(self, layer):
if layer:
self.layers.append(layer)
self.params += layer.params
def _monitoring(self):
# add monitoring variables
for l in self.layers:
for v in l.monitor:
name = v + '@' + l.name
print(name)
self.monitor[name] = l.monitor[v]
def compile_monitoring(self, inputs, updates=None):
logger.info('compile monitoring')
for i, v in enumerate(self.monitor):
self.watchlist.append(v)
logger.info('monitoring [{0}]: {1}'.format(i, v))
self.watch = theano.function(inputs,
[self.monitor[v] for v in self.watchlist],
updates=updates
)
logger.info('done.')
def set_weights(self, weights):
if hasattr(self, 'save_parm'):
params = self.params + self.save_parm
else:
params = self.params
for p, w in zip(params, weights):
# print(p.name)
if p.eval().shape != w.shape:
raise Exception("Layer shape %s not compatible with weight shape %s." % (p.eval().shape, w.shape))
p.set_value(floatX(w))
def get_weights(self):
weights = []
for p in self.params:
weights.append(p.get_value())
if hasattr(self, 'save_parm'):
for v in self.save_parm:
weights.append(v.get_value())
return weights
def set_name(self, name):
for i in range(len(self.params)):
if self.params[i].name is None:
self.params[i].name = '%s_p%d' % (name, i)
else:
self.params[i].name = name + '@' + self.params[i].name
self.name = name
def save(self, filename):
if hasattr(self, 'save_parm'):
params = self.params + self.save_parm
else:
params = self.params
ps = 'save: <\n'
for p in params:
ps += '{0}: {1}\n'.format(p.name, p.eval().shape)
ps += '> to ... {}'.format(filename)
# logger.info(ps)
# the hdf5 module seems to behave abnormally, so serialize manually instead
# dd.io.save(filename, self.get_weights())
serialize_to_file(self.get_weights(), filename)
def load(self, filename):
logger.info('load the weights.')
# the hdf5 module seems to behave abnormally, so deserialize manually instead
# weights = dd.io.load(filename)
weights = deserialize_from_file(filename)
# print(len(weights))
self.set_weights(weights)
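`save`/`load` round-trip the weight list through `serialize_to_file` / `deserialize_from_file` from `emolga.dataset.build_dataset`. A minimal stand-in using `pickle` (an assumption for illustration; the project's actual on-disk format may differ):

```python
import os
import pickle
import tempfile

# hypothetical weight list, standing in for Model.get_weights()
weights = [[1.0, 2.0], [3.0]]

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:   # analogous to serialize_to_file(weights, filename)
    pickle.dump(weights, f)
with open(path, 'rb') as f:   # analogous to deserialize_from_file(filename)
    restored = pickle.load(f)
```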
def save_weight_json(self, filename):
'''
Write weights to a file for manual inspection (not very useful: most parameters lie in (-1, 1))
:param filename:
:return:
'''
print('Save weights into file: %s' % filename)
with open(filename, 'w') as f:
for p in self.params:
f.write(json.dumps(p.get_value().tolist()) + '\n')
def load_weight_json(self, filename):
'''
Read weights from a JSON file and report how many parameters exceed 1/5/10 in absolute value
:param filename:
:return:
'''
count_1 = 0
count_5 = 0
count_10 = 0
total = 0
max_abs = 0.
with open(filename, 'r') as f:
for line in f:
arr = numpy.array(json.loads(line)).ravel()
for e in arr:
total += 1
if abs(e) > 1:
count_1 += 1
if abs(e) > 5:
count_5 += 1
if abs(e) > 10:
count_10 += 1
print(e)
if abs(e) > max_abs:
max_abs = abs(e)
print('new max = %f' % max_abs)
print('total = %d' % total)
print('count > 1/5/10 = %d / %d / %d' % (count_1, count_5, count_10))
print('max = %f' % max_abs)
if __name__ == '__main__':
config = config.setup_keyphrase_all() # setup_keyphrase_inspec
agent = Model()
agent.load_weight_json(config['weight_json'])
================================================
FILE: emolga/models/covc_encdec.py
================================================
__author__ = 'jiataogu, memray'
import theano
import logging
import copy
import emolga.basic.objectives as objectives
import emolga.basic.optimizers as optimizers
from theano.compile.nanguardmode import NanGuardMode
from emolga.utils.generic_utils import visualize_
from emolga.layers.core import Dropout, Dense, Dense2, Identity
from emolga.layers.recurrent import *
from emolga.layers.ntm_minibatch import Controller
from emolga.layers.embeddings import *
from emolga.layers.attention import *
from emolga.models.core import Model
from nltk.stem.porter import *
import math
logger = logging.getLogger(__name__)
RNN = GRU # change it here for other RNN models.
err = 1e-7
class Encoder(Model):
"""
Recurrent Neural Network-based Encoder
It is used to compute the context vector.
"""
def __init__(self,
config, rng, prefix='enc',
mode='Evaluation', embed=None, use_context=False):
super(Encoder, self).__init__()
self.config = config
self.rng = rng
self.prefix = prefix
self.mode = mode
self.name = prefix
self.use_context = use_context
self.return_embed = False
self.return_sequence = False
"""
Create all elements of the Encoder's Computational graph
"""
# create Embedding layers
logger.info("{}_create embedding layers.".format(self.prefix))
if embed:
self.Embed = embed
else:
self.Embed = Embedding(
self.config['enc_voc_size'],
self.config['enc_embedd_dim'],
name="{}_embed".format(self.prefix))
self._add(self.Embed)
if self.use_context:
self.Initializer = Dense(
config['enc_contxt_dim'],
config['enc_hidden_dim'],
activation='tanh',
name="{}_init".format(self.prefix)
)
self._add(self.Initializer)
"""
Encoder Core
"""
# create RNN cells
if not self.config['bidirectional']:
logger.info("{}_create RNN cells.".format(self.prefix))
self.RNN = RNN(
self.config['enc_embedd_dim'],
self.config['enc_hidden_dim'],
None if not use_context
else self.config['enc_contxt_dim'],
name="{}_cell".format(self.prefix)
)
self._add(self.RNN)
else:
logger.info("{}_create forward RNN cells.".format(self.prefix))
self.forwardRNN = RNN(
self.config['enc_embedd_dim'],
self.config['enc_hidden_dim'],
None if not use_context
else self.config['enc_contxt_dim'],
name="{}_fw_cell".format(self.prefix)
)
self._add(self.forwardRNN)
logger.info("{}_create backward RNN cells.".format(self.prefix))
self.backwardRNN = RNN(
self.config['enc_embedd_dim'],
self.config['enc_hidden_dim'],
None if not use_context
else self.config['enc_contxt_dim'],
name="{}_bw_cell".format(self.prefix)
)
self._add(self.backwardRNN)
logger.info("create encoder ok.")
def build_encoder(self, source, context=None, return_embed=False,
return_sequence=False,
return_gates=False,
clean_mask=False):
"""
Build the Encoder Computational Graph
For the copynet default configurations (with attention)
return_embed=True,
return_sequence=True,
return_gates=True,
clean_mask=False
Input:
source : source text, a list of indexes, shape=[nb_sample, max_len]
context: None
Return:
For Attention model:
return_sequence=True: to return the embedding at each time, not just the end state
return_embed=True:
X_out: a list of vectors [nb_sample, max_len, 2*enc_hidden_dim], encoding of each time state (concatenate both forward and backward RNN)
X: embedding of text X [nb_sample, max_len, enc_embedd_dim]
X_mask: mask, an array showing which elements in X are not 0 [nb_sample, src_max_len]
X_tail: encoding of the end of X; may not be meaningful for the bidirectional model (head+tail) [nb_sample, 2*enc_hidden_dim]
nb_sample: number of samples, defined by batch size
max_len: max length of a sentence (all inputs share the same length after padding)
"""
# clean_mask means we set the hidden states of masked positions to 0.
# this sometimes helps downstream computations.
# note that this option only takes effect when return_sequence=True.
# we recommend leaving at least one masked position at the end of the encoded sequence.
# Initial state
Init_h = None
if self.use_context:
Init_h = self.Initializer(context)
# word embedding
if not self.config['bidirectional']:
X, X_mask = self.Embed(source, True)
if return_gates:
X_out, Z, R = self.RNN(X, X_mask, C=context, init_h=Init_h,
return_sequence=return_sequence,
return_gates=True)
else:
X_out = self.RNN(X, X_mask, C=context, init_h=Init_h,
return_sequence=return_sequence,
return_gates=False)
if return_sequence:
X_tail = X_out[:, -1]
if clean_mask:
X_out = X_out * X_mask[:, :, None]
else:
X_tail = X_out
else:
source2 = source[:, ::-1]
'''
Get the embedding of inputs
shape(X)=[nb_sample, max_len, emb_dim]
shape(X_mask)=[nb_sample, max_len]
'''
X, X_mask = self.Embed(source , mask_zero=True)
X2, X2_mask = self.Embed(source2, mask_zero=True)
'''
Get the output after RNN
return_sequence=True
'''
if not return_gates:
'''
X_out: hidden states at all time steps, shape=(nb_samples, max_sent_len, enc_hidden_dim)
'''
X_out1 = self.backwardRNN(X, X_mask, C=context, init_h=Init_h, return_sequence=return_sequence)
X_out2 = self.forwardRNN(X2, X2_mask, C=context, init_h=Init_h, return_sequence=return_sequence)
else:
'''
X_out: hidden states at all time steps, shape=(nb_samples, max_sent_len, enc_hidden_dim)
Z: update gate value, shape=(n_samples, 1)
R: reset gate value, shape=(n_samples, 1)
'''
X_out1, Z1, R1 = self.backwardRNN(X, X_mask, C=context, init_h=Init_h,
return_sequence=return_sequence,
return_gates=True)
X_out2, Z2, R2 = self.forwardRNN(X2, X2_mask, C=context, init_h=Init_h,
return_sequence=return_sequence,
return_gates=True)
Z = T.concatenate([Z1, Z2[:, ::-1, :]], axis=2)
R = T.concatenate([R1, R2[:, ::-1, :]], axis=2)
if not return_sequence:
X_out = T.concatenate([X_out1, X_out2], axis=1)
X_tail = X_out
else:
X_out = T.concatenate([X_out1, X_out2[:, ::-1, :]], axis=2)
X_tail = T.concatenate([X_out1[:, -1], X_out2[:, -1]], axis=1)
if clean_mask:
X_out = X_out * X_mask[:, :, None]
X_mask = T.cast(X_mask, dtype='float32')
if not return_gates:
if return_embed:
return X_out, X, X_mask, X_tail
return X_out
else:
if return_embed:
return X_out, X, X_mask, X_tail, Z, R
return X_out, Z, R
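The bidirectional branch above runs one RNN over the source and a second over the reversed source, then flips the second stream back before concatenating along the feature axis, so both directions align per time step. A NumPy sketch of that alignment, using a toy stand-in recurrence (a running mean, purely illustrative) instead of a real RNN:

```python
import numpy as np

def toy_rnn(seq):
    """Stand-in for a recurrent encoder: running mean over time, (batch, T, d) -> (batch, T, d)."""
    return np.cumsum(seq, axis=1) / (np.arange(seq.shape[1])[None, :, None] + 1)

x = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)  # (nb_sample, max_len, dim)
fwd = toy_rnn(x)              # pass over the original order
bwd = toy_rnn(x[:, ::-1, :])  # pass over the reversed sequence

# re-reverse the backward states so both streams align per time step, then concat features
h = np.concatenate([fwd, bwd[:, ::-1, :]], axis=2)
# head+tail summary, as in X_tail above
tail = np.concatenate([fwd[:, -1], bwd[:, -1]], axis=1)
```

At time step 0, the backward half of `h` is the backward RNN's final state, i.e. a summary of the whole sequence read right-to-left.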
def compile_encoder(self, with_context=False, return_embed=False, return_sequence=False):
source = T.imatrix()
self.return_embed = return_embed
self.return_sequence = return_sequence
if with_context:
context = T.matrix()
self.encode = theano.function([source, context],
self.build_encoder(source, context,
return_embed=return_embed,
return_sequence=return_sequence),
allow_input_downcast=True)
self.gtenc = theano.function([source, context],
self.build_encoder(source, context,
return_embed=return_embed,
return_sequence=return_sequence,
return_gates=True),
allow_input_downcast=True)
else:
"""
return
X_out: a list of vectors [nb_sample, max_len, 2*enc_hidden_dim], encoding of each time state (concatenate both forward and backward RNN)
X: embedding of text X [nb_sample, max_len, enc_embedd_dim]
X_mask: mask, an array showing which elements in X are not 0 [nb_sample, max_len]
X_tail: encoding of the end of X; may not be meaningful for the bidirectional model (head+tail) [nb_sample, 2*enc_hidden_dim]
"""
self.encode = theano.function([source],
self.build_encoder(source, None,
return_embed=return_embed,
return_sequence=return_sequence),
allow_input_downcast=True)
"""
return
Z: value of the update gate, shape=(nb_sample, 1)
R: value of the reset gate, shape=(nb_sample, 1)
"""
self.gtenc = theano.function([source],
self.build_encoder(source, None,
return_embed=return_embed,
return_sequence=return_sequence,
return_gates=True),
allow_input_downcast=True)
class Decoder(Model):
"""
Recurrent Neural Network-based Decoder.
It is used for:
(1) Evaluation: compute the probability P(Y|X)
(2) Prediction: sample the best result based on P(Y|X)
(3) Beam-search
(4) Scheduled Sampling (how to implement it?)
"""
def __init__(self,
config, rng, prefix='dec',
mode='RNN', embed=None,
highway=False):
"""
mode = RNN: use a RNN Decoder
"""
super(Decoder, self).__init__()
self.config = config
self.rng = rng
self.prefix = prefix
self.name = prefix
self.mode = mode
self.highway = highway
self.init = initializations.get('glorot_uniform')
self.sigmoid = activations.get('sigmoid')
# use standard drop-out for input & output.
# I believe it should not be used for the context vector.
self.dropout = config['dropout']
if self.dropout > 0:
logger.info('Use standard-dropout!!!!')
self.D = Dropout(rng=self.rng, p=self.dropout, name='{}_Dropout'.format(prefix))
"""
Create all elements of the Decoder's computational graph.
"""
# create Embedding layers
logger.info("{}_create embedding layers.".format(self.prefix))
if embed:
self.Embed = embed
else:
self.Embed = Embedding(
self.config['dec_voc_size'],
self.config['dec_embedd_dim'],
name="{}_embed".format(self.prefix))
self._add(self.Embed)
# create Initialization Layers
logger.info("{}_create initialization layers.".format(self.prefix))
if not config['bias_code']:
self.Initializer = Zero()
else:
self.Initializer = Dense(
config['dec_contxt_dim'],
config['dec_hidden_dim'],
activation='tanh',
name="{}_init".format(self.prefix)
)
# create RNN cells
logger.info("{}_create RNN cells.".format(self.prefix))
if 'location_embed' in self.config:
if config['location_embed']:
dec_embedd_dim = 2 * self.config['dec_embedd_dim']
else:
dec_embedd_dim = self.config['dec_embedd_dim']
else:
dec_embedd_dim = self.config['dec_embedd_dim']
self.RNN = RNN(
dec_embedd_dim,
self.config['dec_hidden_dim'],
self.config['dec_contxt_dim'],
name="{}_cell".format(self.prefix)
)
self._add(self.Initializer)
self._add(self.RNN)
# HighWay Gating
if highway:
logger.info("HIGHWAY CONNECTION~~~!!!")
assert self.config['context_predict']
assert self.config['dec_contxt_dim'] == self.config['dec_hidden_dim']
self.C_x = self.init((self.config['dec_contxt_dim'],
self.config['dec_hidden_dim']))
self.H_x = self.init((self.config['dec_hidden_dim'],
self.config['dec_hidden_dim']))
self.b_x = initializations.get('zero')(self.config['dec_hidden_dim'])
self.C_x.name = '{}_Cx'.format(self.prefix)
self.H_x.name = '{}_Hx'.format(self.prefix)
self.b_x.name = '{}_bx'.format(self.prefix)
self.params += [self.C_x, self.H_x, self.b_x]
# create readout layers
logger.info("_create Readout layers")
# 1. hidden layers readout.
self.hidden_readout = Dense(
self.config['dec_hidden_dim'],
self.config['output_dim']
if self.config['deep_out']
else self.config['dec_voc_size'],
activation='linear',
name="{}_hidden_readout".format(self.prefix)
)
# 2. previous word readout
self.prev_word_readout = None
if self.config['bigram_predict']:
self.prev_word_readout = Dense(
dec_embedd_dim,
self.config['output_dim']
if self.config['deep_out']
else self.config['dec_voc_size'],
activation='linear',
name="{}_prev_word_readout".format(self.prefix),
learn_bias=False
)
# 3. context readout
self.context_readout = None
if self.config['context_predict']:
if not self.config['leaky_predict']:
self.context_readout = Dense(
self.config['dec_contxt_dim'],
self.config['output_dim']
if self.config['deep_out']
else self.config['dec_voc_size'],
activation='linear',
name="{}_context_readout".format(self.prefix),
learn_bias=False
)
else:
assert self.config['dec_contxt_dim'] == self.config['dec_hidden_dim']
self.context_readout = self.hidden_readout
# option: deep output (maxout)
if self.config['deep_out']:
self.activ = Activation(config['deep_out_activ'])
# self.dropout = Dropout(rng=self.rng, p=config['dropout'])
self.output_nonlinear = [self.activ] # , self.dropout]
self.output = Dense(
self.config['output_dim'] / 2
if config['deep_out_activ'] == 'maxout2'
else self.config['output_dim'],
self.config['dec_voc_size'],
activation='softmax',
name="{}_output".format(self.prefix),
learn_bias=False
)
else:
self.output_nonlinear = []
self.output = Activation('softmax')
# registration:
self._add(self.hidden_readout)
if not self.config['leaky_predict']:
self._add(self.context_readout)
self._add(self.prev_word_readout)
self._add(self.output)
if self.config['deep_out']:
self._add(self.activ)
# self._add(self.dropout)
logger.info("create decoder ok.")
@staticmethod
def _grab_prob(probs, X, block_unk=False):
assert probs.ndim == 3
batch_size = probs.shape[0]
max_len = probs.shape[1]
vocab_size = probs.shape[2]
probs = probs.reshape((batch_size * max_len, vocab_size))
return probs[T.arange(batch_size * max_len), X.flatten(1)].reshape(X.shape) # advanced indexing
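`_grab_prob` uses advanced indexing to pull out, for every (sample, time) pair, the probability the model assigned to the gold word. The same trick in NumPy, with toy numbers:

```python
import numpy as np

def grab_prob(probs, X):
    """NumPy analogue of `_grab_prob`.

    probs: (batch, max_len, vocab) distributions; X: (batch, max_len) gold word indices.
    Returns the (batch, max_len) probability of each gold word.
    """
    batch, max_len, vocab = probs.shape
    flat = probs.reshape(batch * max_len, vocab)
    # advanced indexing: one row index and one column index per (sample, time) pair
    return flat[np.arange(batch * max_len), X.ravel()].reshape(X.shape)

probs = np.zeros((2, 3, 5))
probs[0, :, 1] = 0.9  # sample 0 always puts 0.9 on word 1
probs[1, :, 4] = 0.5  # sample 1 always puts 0.5 on word 4
X = np.array([[1, 1, 1], [4, 4, 4]])
picked = grab_prob(probs, X)
```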
"""
Build the decoder for evaluation
"""
def prepare_xy(self, target):
# Word embedding
Y, Y_mask = self.Embed(target, True) # (nb_samples, max_len, embedding_dim)
if self.config['use_input']:
X = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, Y.shape[2]), Y[:, :-1, :]], axis=1)
else:
X = 0 * Y
# option ## drop words.
X_mask = T.concatenate([T.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1)
Count = T.cast(T.sum(X_mask, axis=1), dtype=theano.config.floatX)
return X, X_mask, Y, Y_mask, Count
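`prepare_xy` implements standard teacher forcing: the decoder input at step t is the gold embedding at step t-1, with an all-zero vector standing in for the first input (like `<BOS>`), and the mask shifted the same way. A NumPy sketch of the shift with toy values:

```python
import numpy as np

# Y: embedded targets (batch, trg_len, emb_dim)
Y = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)

# decoder input X is Y shifted right by one step, zero vector first
zeros = np.zeros((Y.shape[0], 1, Y.shape[2]))
X = np.concatenate([zeros, Y[:, :-1, :]], axis=1)

# the input mask is shifted the same way: position 0 is always valid
Y_mask = np.array([[1, 1, 0], [1, 0, 0]], dtype=float)
X_mask = np.concatenate([np.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1)

count = X_mask.sum(axis=1)  # effective decoded length per sample
```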
def build_decoder(self, target, context=None,
return_count=False,
train=True):
"""
Build the Decoder Computational Graph
For training/testing
"""
X, X_mask, Y, Y_mask, Count = self.prepare_xy(target)
# input drop-out if any.
if self.dropout > 0:
X = self.D(X, train=train)
# Initial state of RNN
Init_h = self.Initializer(context)
if not self.highway:
X_out = self.RNN(X, X_mask, C=context, init_h=Init_h, return_sequence=True)
# Readout
readout = self.hidden_readout(X_out)
if self.dropout > 0:
readout = self.D(readout, train=train)
if self.config['context_predict']:
readout += self.context_readout(context).dimshuffle(0, 'x', 1)
else:
X = X.dimshuffle((1, 0, 2))
X_mask = X_mask.dimshuffle((1, 0))
def _recurrence(x, x_mask, prev_h, c):
# compute the highway gate for context vector.
xx = dot(c, self.C_x, self.b_x) + dot(prev_h, self.H_x) # highway gate.
xx = self.sigmoid(xx)
cy = xx * c # the path without using RNN
x_out = self.RNN(x, mask=x_mask, C=c, init_h=prev_h, one_step=True)
hx = (1 - xx) * x_out
cy = T.cast(cy, 'float32')
x_out = T.cast(x_out, 'float32')
hx = T.cast(hx, 'float32')
return x_out, hx, cy
outputs, _ = theano.scan(
_recurrence,
sequences=[X, X_mask],
outputs_info=[Init_h, None, None],
non_sequences=[context]
)
# hidden readout + context readout
readout = self.hidden_readout( outputs[1].dimshuffle((1, 0, 2)))
if self.dropout > 0:
readout = self.D(readout, train=train)
readout += self.context_readout(outputs[2].dimshuffle((1, 0, 2)))
# return to normal size.
X = X.dimshuffle((1, 0, 2))
X_mask = X_mask.dimshuffle((1, 0))
if self.config['bigram_predict']:
readout += self.prev_word_readout(X)
for l in self.output_nonlinear:
readout = l(readout)
prob_dist = self.output(readout) # (nb_samples, max_len, vocab_size)
# log_old = T.sum(T.log(self._grab_prob(prob_dist, target)), axis=1)
log_prob = T.sum(T.log(self._grab_prob(prob_dist, target) + err) * X_mask, axis=1)
log_ppl = log_prob / Count
if return_count:
return log_prob, Count
else:
return log_prob, log_ppl
"""
Sample one step
"""
def _step_sample(self, prev_word, prev_stat, context):
# word embedding (note that for the first word, embedding should be all zero)
if self.config['use_input']:
X = T.switch(
prev_word[:, None] < 0,
alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim']),
self.Embed(prev_word)
)
else:
X = alloc_zeros_matrix(prev_word.shape[0], self.config['dec_embedd_dim'])
if self.dropout > 0:
X = self.D(X, train=False)
# apply one step of RNN
if not self.highway:
X_proj = self.RNN(X, C=context, init_h=prev_stat, one_step=True)
next_stat = X_proj
# compute the readout probability distribution and sample it
# here the readout is a matrix, different from the learner.
readout = self.hidden_readout(next_stat)
if self.dropout > 0:
readout = self.D(readout, train=False)
if self.config['context_predict']:
readout += self.context_readout(context)
else:
xx = dot(context, self.C_x, self.b_x) + dot(prev_stat, self.H_x) # highway gate.
xx = self.sigmoid(xx)
X_proj = self.RNN(X, C=context, init_h=prev_stat, one_step=True)
next_stat = X_proj
readout = self.hidden_readout((1 - xx) * X_proj)
if self.dropout > 0:
readout = self.D(readout, train=False)
readout += self.context_readout(xx * context)
if self.config['bigram_predict']:
readout += self.prev_word_readout(X)
for l in self.output_nonlinear:
readout = l(readout)
next_prob = self.output(readout)
next_sample = self.rng.multinomial(pvals=next_prob).argmax(1)
return next_prob, next_sample, next_stat
"""
Build the sampler for sampling/greedy search/beam search
"""
def build_sampler(self):
"""
Build a sampler which only steps once.
Typically it only generates one word at a time.
"""
logger.info("build sampler ...")
if self.config['sample_stoch'] and self.config['sample_argmax']:
logger.info("use argmax search!")
elif self.config['sample_stoch'] and (not self.config['sample_argmax']):
logger.info("use stochastic sampling!")
elif self.config['sample_beam'] > 1:
logger.info("use beam search! (beam_size={})".format(self.config['sample_beam']))
# initial state of our Decoder.
context = T.matrix() # theano variable.
init_h = self.Initializer(context)
logger.info('compile the function: get_init_state')
self.get_init_state \
= theano.function([context], init_h, name='get_init_state', allow_input_downcast=True)
logger.info('done.')
# word sampler: 1 x 1
prev_word = T.vector('prev_word', dtype='int32')
prev_stat = T.matrix('prev_state', dtype='float32')
next_prob, next_sample, next_stat \
= self._step_sample(prev_word, prev_stat, context)
# next word probability
logger.info('compile the function: sample_next')
inputs = [prev_word, prev_stat, context]
outputs = [next_prob, next_sample, next_stat]
self.sample_next = theano.function(inputs, outputs, name='sample_next', allow_input_downcast=True)
logger.info('done')
pass
"""
Build a Stochastic Sampler which can use SCAN to work on GPU.
However it cannot be used in Beam-search.
"""
def build_stochastic_sampler(self):
context = T.matrix()
init_h = self.Initializer(context)
logger.info('compile the function: sample')
pass
"""
Generate samples, either with stochastic sampling or beam-search!
"""
def get_sample(self, context, k=1, maxlen=30, stochastic=True, argmax=False, fixlen=False):
# beam size
if k > 1:
assert not stochastic, 'Beam search does not support stochastic sampling!!'
# fix length cannot use beam search
# if fixlen:
# assert k == 1
# prepare for searching
sample = []
score = []
if stochastic:
score = 0
live_k = 1
dead_k = 0
hyp_samples = [[]] * live_k
hyp_scores = np.zeros(live_k).astype(theano.config.floatX)
hyp_states = []
# get initial state of decoder RNN with context
next_state = self.get_init_state(context)
next_word = -1 * np.ones((1,)).astype('int32') # indicator for the first target word (bos target)
# Start searching!
for ii in range(maxlen):
# print(next_word)
ctx = np.tile(context, [live_k, 1])
next_prob, next_word, next_state \
= self.sample_next(next_word, next_state, ctx)  # one decoding step over all live hypotheses
if stochastic:
# using stochastic sampling (or greedy sampling.)
if argmax:
nw = next_prob[0].argmax()
next_word[0] = nw
else:
nw = next_word[0]
sample.append(nw)
score += next_prob[0, nw]
if (not fixlen) and (nw == 0): # sample reached the end
break
else:
# using beam-search
# scores can only be computed over the flattened (hypotheses x vocabulary) array
cand_scores = hyp_scores[:, None] - np.log(next_prob)
cand_flat = cand_scores.flatten()
ranks_flat = cand_flat.argsort()[:(k - dead_k)]
# fetch the best results.
voc_size = next_prob.shape[1]
trans_index = ranks_flat // voc_size  # integer division: '/' would yield float indices under Python 3
word_index = ranks_flat % voc_size
costs = cand_flat[ranks_flat]
# get the new hyp samples
new_hyp_samples = []
new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX)
new_hyp_states = []
for idx, [ti, wi] in enumerate(zip(trans_index, word_index)):
new_hyp_samples.append(hyp_samples[ti] + [wi])
new_hyp_scores[idx] = copy.copy(costs[idx])
new_hyp_states.append(copy.copy(next_state[ti]))
# check the finished samples
new_live_k = 0
hyp_samples = []
hyp_scores = []
hyp_states = []
for idx in range(len(new_hyp_samples)):
if (new_hyp_samples[idx][-1] == 0) and (not fixlen):  # hypothesis is finished once its last word is <eos> (index 0)
sample.append(new_hyp_samples[idx])
score.append(new_hyp_scores[idx])
dead_k += 1
else:
new_live_k += 1
hyp_samples.append(new_hyp_samples[idx])
hyp_scores.append(new_hyp_scores[idx])
hyp_states.append(new_hyp_states[idx])
hyp_scores = np.array(hyp_scores)
live_k = new_live_k
if new_live_k < 1:
break
if dead_k >= k:
break
next_word = np.array([w[-1] for w in hyp_samples])
next_state = np.array(hyp_states)
pass
pass
# end.
if not stochastic:
# dump every remaining one
if live_k > 0:
for idx in range(live_k):
sample.append(hyp_samples[idx])
score.append(hyp_scores[idx])
return sample, score
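The beam-search expansion inside `get_sample` scores every (live hypothesis, next word) pair at once by flattening, then recovers which hypothesis and which word each top candidate came from with `//` and `%`. A self-contained NumPy sketch of one selection round, with toy probabilities:

```python
import numpy as np

def select_candidates(hyp_scores, next_prob, k):
    """One round of the flattened beam selection.

    hyp_scores: (live_k,) accumulated negative log-probabilities
    next_prob:  (live_k, vocab) next-word distributions
    Returns (hypothesis index, word index, cost) for the k best expansions.
    """
    cand_scores = hyp_scores[:, None] - np.log(next_prob)
    cand_flat = cand_scores.flatten()
    ranks_flat = cand_flat.argsort()[:k]
    voc_size = next_prob.shape[1]
    trans_index = ranks_flat // voc_size  # which live hypothesis to extend
    word_index = ranks_flat % voc_size    # which word extends it
    return trans_index, word_index, cand_flat[ranks_flat]

hyp_scores = np.array([0.0, 1.0])
next_prob = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
ti, wi, costs = select_candidates(hyp_scores, next_prob, k=2)
```

Here the two cheapest expansions are word 0 of hypothesis 0 and word 1 of hypothesis 1, exactly the bookkeeping `get_sample` performs before rebuilding `hyp_samples`/`hyp_states`.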
class DecoderAtt(Decoder):
"""
Recurrent Neural Network-based Decoder [for CopyNet-b Only]
with Attention Mechanism
"""
def __init__(self,
config, rng, prefix='dec',
mode='RNN', embed=None,
copynet=False, identity=False):
super(DecoderAtt, self).__init__(
config, rng, prefix,
mode, embed, False)
self.init = initializations.get('glorot_uniform')
self.copynet = copynet
self.identity = identity
# attention reader
self.attention_reader = Attention(
self.config['dec_hidden_dim'],
self.config['dec_contxt_dim'],
1000,
name='source_attention',
coverage=self.config['coverage']
)
self._add(self.attention_reader)
# if use copynet
if self.copynet:
if not self.identity:
self.Is = Dense(
self.config['dec_contxt_dim'],
self.config['dec_embedd_dim'],
name='in-trans'
)
else:
assert self.config['dec_contxt_dim'] == self.config['dec_embedd_dim']
self.Is = Identity(name='ini')
self.Os = Dense(
self.config['dec_readout_dim']
if not self.config['location_embed']
else self.config['dec_readout_dim'] + self.config['dec_embedd_dim'],
self.config['dec_contxt_dim'],
name='out-trans'
)
if self.config['copygate']:
self.Gs = Dense(
self.config['dec_readout_dim'] + self.config['dec_embedd_dim'],
1,
name='copy-gate',
activation='linear',
learn_bias=True,
negative_bias=True
)
self._add(self.Gs)
if self.config['location_embed']:
self._add(self.Is)
self._add(self.Os)
logger.info('adjust decoder ok.')
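`prepare_xy` below consumes a precomputed copy matrix `cc_matrix`. A hypothetical sketch (NumPy, not this repo's dataset code) of how such a matrix can be built from index sequences, under the definition given in the docstring (cc[j][k]=1 iff the j-th target word equals the k-th source word):

```python
import numpy as np

def build_copy_matrix(target_ids, source_ids):
    # cc[j][k] = 1 iff the j-th target word matches the k-th source word
    cc = np.zeros((len(target_ids), len(source_ids)), dtype='float32')
    for j, t in enumerate(target_ids):
        for k, s in enumerate(source_ids):
            if t == s:
                cc[j, k] = 1.
    return cc
```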
"""
Build the decoder for evaluation
"""
def prepare_xy(self, target, cc_matrix):
'''
create target input for decoder (append a zero to the head of each sequence)
:param target: indexes of target words
:param cc_matrix: copy-matrix, (batch_size, trg_len, src_len), cc_matrix[i][j][k]=1 if j-th word in target matches the k-th word in source
:return:
X: embedding of target sequences(batch_size, trg_len, embedding_dim)
X_mask: if x is a real word or padding (batch_size, trg_len)
LL: simply the copy-matrix (batch_size, trg_len, src_len)
XL_mask: XL_mask[i][j]=1 if the j-th target word has a matching (copyable) word in the source (batch_size, trg_len)
Y_mask: original mask of target sequences (batch_size, trg_len)
Count: number of real words in each target sequence, i.e. its original length; size=(batch_size,)
'''
# target: (nb_samples, index_seq)
# cc_matrix: (nb_samples, maxlen_t, maxlen_s)
# context: (nb_samples)
# create the embedding of target words and their masks
Y, Y_mask = self.Embed(target, True) # (batch_size, trg_len, embedding_dim), (batch_size, trg_len)
# append a zero array to the beginning of input
# first word of each target sequence to be zero (just like <BOS>) as the initial input of decoder
# create a zero array and concatenate it with Y: (batch_size, 1, embedding_dim) + (batch_size, maxlen_t - 1, embedding_dim)
# since the last position of Y is guaranteed to be a <pad>, the last word can safely be dropped (Y[:, :-1, :])
X = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, Y.shape[2]), Y[:, :-1, :]], axis=1)
# LL = T.concatenate([alloc_zeros_matrix(Y.shape[0], 1, cc_matrix.shape[2]), cc_matrix[:, :-1, :]], axis=1)
# LL is the copy matrix
LL = cc_matrix
# copy mask: XL_mask[i][j]=1 indicates the j-th target word has copyable/matching words in the source text (batch_size, trg_len)
XL_mask = T.cast(T.gt(T.sum(LL, axis=2), 0), dtype='float32')
# 'use_input' enables teacher forcing; if disabled, zero out the decoder input
if not self.config['use_input']:
X *= 0
# create the mask of target input, append an [1] array to show <BOS>, size=(batch_size, trg_len)
X_mask = T.concatenate([T.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1)
# how many real words (non-zero/non-padding) in each target sequence, size=(batch_size, 1)
Count = T.cast(T.sum(X_mask, axis=1), dtype=theano.config.floatX)
return X, X_mask, LL, XL_mask, Y_mask, Count
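As a standalone illustration (NumPy with hypothetical shapes, not the Theano code above), the shift performed by prepare_xy: the embedded target Y is shifted right by one step, with an all-zero vector acting as <BOS>, and the mask gains a leading column of ones.

```python
import numpy as np

Y = np.arange(2 * 3 * 4, dtype='float32').reshape(2, 3, 4)   # (batch, trg_len, emb)
Y_mask = np.array([[1, 1, 0], [1, 0, 0]], dtype='float32')   # (batch, trg_len)

bos = np.zeros((Y.shape[0], 1, Y.shape[2]), dtype='float32') # zero vector as <BOS>
X = np.concatenate([bos, Y[:, :-1, :]], axis=1)              # shifted decoder input
X_mask = np.concatenate([np.ones((Y.shape[0], 1)), Y_mask[:, :-1]], axis=1)
Count = X_mask.sum(axis=1)                                   # real words per sequence
```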
"""
The most different part. Be cautious!!
Very different from traditional RNN search.
"""
def build_decoder(self,
target,
cc_matrix,
context,
c_mask,
return_count=False,
train=True):
"""
Build the Computational Graph ::> Context is essential
target: (batch_size, trg_len)
cc_matrix: (batch_size, trg_len, src_len), cc_matrix[i][j][k]=1 if in the i-th sample, the j-th word in target matches the k-th word in source
context: (batch_size, src_len, 2 * enc_hidden_dim), encoding of each time step (concatenate both forward and backward RNN encodings)
context: (nb_samples, max_len, contxt_dim)
c_mask: (batch_size, src_len) mask, X_mask[i][j]=1 means j-th word of sample i in X is not 0 (index of <pad>)
"""
assert c_mask is not None, 'c_mask must be supplied for this decoder.'
assert context.ndim == 3, 'context must have 3 dimensions.'
# A bridge layer transforming context vector of encoder_dim to decoder_dim, Is=(2 * enc_hidden_dim, dec_embedd_dim) if it's bidirectional
context_A = self.Is(context) # (nb_samples, max_src_len, dec_embedd_dim)
'''
X: embedding of target sequences(batch_size, trg_len, embedding_dim)
X_mask: if x is a real word or padding (batch_size, trg_len)
LL: simply the copy-matrix (batch_size, trg_len, src_len)
XL_mask: XL_mask[i][j]=1 if the j-th target word has a matching (copyable) word in the source (batch_size, trg_len)
Y_mask: original mask of target sequences (batch_size, trg_len)
Count: number of real words in each target sequence, i.e. its original length; size=(batch_size,)
'''
X, X_mask, LL, XL_mask, Y_mask, Count = self.prepare_xy(target, cc_matrix)
# input drop-out if any.
if self.dropout > 0:
X = self.D(X, train=train)
# Initial state of RNN
Init_h = self.Initializer(context[:, 0, :]) # initialize the hidden state from the encoding at position 0 (the backward RNN's summary of the whole source)
Init_a = T.zeros((context.shape[0], context.shape[1]), dtype='float32') # (batch_size, src_len)
coverage = T.zeros((context.shape[0], context.shape[1]), dtype='float32') # (batch_size, src_len)
# permute to make dim of trg_len first
X = X.dimshuffle((1, 0, 2)) # (trg_len, batch_size, embedding_dim)
X_mask = X_mask.dimshuffle((1, 0)) # (trg_len, batch_size)
LL = LL.dimshuffle((1, 0, 2)) # (trg_len, batch_size, src_len)
XL_mask = XL_mask.dimshuffle((1, 0)) # (trg_len, batch_size)
def _recurrence(x, x_mask, ll, xl_mask, prev_h, prev_a, cov, cc, cm, ca):
"""
x: (nb_samples, embed_dims) embedding of word in target sequence of current time step
x_mask: (nb_samples, ) if x is a real word (1) or padding (0)
ll: (nb_samples, maxlen_s) if x can be copied from the i-th word in source sequence (1) or not (0)
xl_mask:(nb_samples, ) if x has any copyable word in source sequence
-----------------------------------------
prev_h: (nb_samples, hidden_dims) hidden vector of previous step
prev_a: (nb_samples, maxlen_s) a distribution of source telling which words are copy-attended in the previous step (initialized with zero)
cov: (nb_samples, maxlen_s) a coverage vector telling which parts have been covered, implemented by attention (initialized with zero)
-----------------------------------------
cc: (nb_samples, maxlen_s, context_dim) original context, encoding of source text
cm: (nb_samples, maxlen_s) mask, (batch_size, src_len), X_mask[i][j]=1 means j-th word of sample i in X is not <pad>
ca: (nb_samples, maxlen_s, decoder_dim) converted context_A, the context vector transformed by the bridge layer Is()
"""
'''
Generative Decoding
'''
# Compute the attention and get the context weight, <pad> in source is masked
prob = self.attention_reader(prev_h, cc, Smask=cm, Cov=cov)
ncov = cov + prob
# Compute the weighted context vector
cxt = T.sum(cc * prob[:, :, None], axis=1)
# Input feeding: obtain the new input by concatenating current input word x and previous attended context
x_in = T.concatenate([x, T.sum(ca * prev_a[:, :, None], axis=1)], axis=-1)
# compute the current hidden states of the RNN. hidden state of last time, shape=(nb_samples, output_emb_dim)
next_h = self.RNN(x_in, mask=x_mask, C=cxt, init_h=prev_h, one_step=True)
# compute the current readout vector.
r_in = [next_h]
if self.config['context_predict']:
r_in += [cxt]
if self.config['bigram_predict']:
r_in += [x_in]
# readout the word logits
r_in = T.concatenate(r_in, axis=-1) # shape=(nb_samples, output_emb_dim)
r_out = self.hidden_readout(next_h) # obtain the generative logits, (nb_samples, voc_size)
if self.config['context_predict']:
r_out += self.context_readout(cxt)
if self.config['bigram_predict']:
r_out += self.prev_word_readout(x_in)
# Get the generate-mode output = tanh(r_out), note it's not logit nor prob
for l in self.output_nonlinear:
r_out = l(r_out)
'''
Copying Decoding
'''
# Eq.8, key=h_j*W_c. Os layer=tanh(dec_readout_dim+dec_embedd_dim, dec_contxt_dim), dec_readout_dim=output_emb_dim + enc_context_dim + dec_embed_dim
key = self.Os(r_in) # output=(nb_samples, dec_contxt_dim) :: key for locating where to copy
# Eq.8, compute the copy attention weights
# (nb_samples, 1, dec_contxt_dim) * (nb_samples, src_maxlen, cxt_dim) -> sum(nb_samples, src_maxlen, 1) -> (nb_samples, src_maxlen)
Eng = T.sum(key[:, None, :] * cc, axis=-1)
# Copy gating, determine the contribution from generative and copying
if self.config['copygate']:
gt = self.sigmoid(self.Gs(r_in)) # Gs=(dec_readout_dim + dec_embedd_dim, 1), output=(nb_samples, 1)
# add the gate in log space for numerical stability; r_out and Eng are unnormalized scores that are jointly normalized by logSumExp below
r_out += T.log(gt.flatten()[:, None])
Eng += T.log(1 - gt.flatten()[:, None])
# r_out *= gt.flatten()[:, None]
# Eng *= 1 - gt.flatten()[:, None]
# compute the logSumExp of both generative and copying probs, <pad> in source is masked
EngSum = logSumExp(Eng, axis=-1, mask=cm, c=r_out)
# (nb_samples, vocab_size + maxlen_s): T.exp(r_out - EngSum) is generate_prob, T.exp(Eng - EngSum) * cm is copy_prob
next_p = T.concatenate([T.exp(r_out - EngSum), T.exp(Eng - EngSum) * cm], axis=-1)
'''
self.config['dec_voc_size'] = 50000
next_b: the first dec_voc_size (50000) entries of next_p are p_generate
next_c: the entries after dec_voc_size are p_copy
'''
next_c = next_p[:, self.config['dec_voc_size']:] * ll # copy_prob, mask off (ignore) the non-copyable words: (nb_samples, maxlen_s) * (nb_samples, maxlen_s) = (nb_samples, maxlen_s)
next_b = next_p[:, :self.config['dec_voc_size']] # generate_prob
sum_a = T.sum(next_c, axis=1, keepdims=True) # sum of copy_prob, telling how helpful the copy part is; size=(nb_samples, 1)
next_a = (next_c / (sum_a + err)) * xl_mask[:, None] # normalize copy_prob for numerical stability, (nb_samples, maxlen_s); zeroed if no source word can be copied
next_c = T.cast(next_c, 'float32')
next_a = T.cast(next_a, 'float32')
return next_h, next_a, ncov, sum_a, next_b
outputs, _ = theano.scan(
_recurrence,
sequences=[X, X_mask, LL, XL_mask],
outputs_info=[Init_h, Init_a, coverage, None, None],
non_sequences=[context, c_mask, context_A]
)
'''
shuffle (trg_len, batch_size, x) -> (batch_size, trg_len, x)
X_out: hidden vector of each decoding step (not useful for computing error)
source_prob: normalized copy_prob distribution of each decoding step (not useful for computing error)
coverages: coverage vector (not useful for computing error)
source_sum: generate_prob distribution of each decoding step
prob_dist: sum of copy_prob of each decoding step
'''
X_out, source_prob, coverages, source_sum, prob_dist = [z.dimshuffle((1, 0, 2)) for z in outputs]
X = X.dimshuffle((1, 0, 2))
X_mask = X_mask.dimshuffle((1, 0))
XL_mask = XL_mask.dimshuffle((1, 0))
# unk masking
U_mask = T.ones_like(target, dtype='float32') * (1 - T.eq(target, 1))
U_mask += (1 - U_mask) * (1 - XL_mask)
# The most different part is here !!
# self._grab_prob(prob_dist, target) computes the error of generative part, source_sum.sum(axis=-1) gives the error of copying part
log_prob = T.sum(T.log(
T.clip(self._grab_prob(prob_dist, target) * U_mask + source_sum.sum(axis=-1) + err, 1e-7, 1.0)
) * X_mask, dtype='float32', axis=1)
log_ppl = log_prob / (Count + err)
if return_count:
return log_prob, Count
else:
return log_prob, log_ppl
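The score-merging step inside _recurrence (and again in _step_sample below) can be sketched in plain NumPy. This is a hedged illustration with invented names, not the model code: generative logits `r_out` over the vocabulary and copy scores `Eng` over source positions are normalized jointly by a single log-sum-exp, so the concatenated output is one distribution over (vocab + source positions), with masked source positions receiving zero probability.

```python
import numpy as np

def merge_generate_copy(r_out, Eng, c_mask):
    # joint log-sum-exp over vocab scores and (masked) copy scores
    scores = np.concatenate([r_out, np.where(c_mask > 0, Eng, -np.inf)], axis=-1)
    m = scores.max(axis=-1, keepdims=True)
    log_Z = m + np.log(np.exp(scores - m).sum(axis=-1, keepdims=True))
    # concatenated distribution: generate part, then copy part (masked)
    next_p = np.concatenate([np.exp(r_out - log_Z), np.exp(Eng - log_Z) * c_mask], axis=-1)
    return next_p
```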
"""
Sample one step
"""
def _step_sample(self,
prev_word,
prev_stat,
prev_loc,
prev_cov,
context,
c_mask,
context_A):
"""
Get the probability of next word, sec 3.2 and 3.3
:param prev_word : index of previous words, size=(1, live_k)
:param prev_stat : output encoding of last time, size=(1, live_k, output_dim)
:param prev_loc : information needed for copy-based predicting
:param prev_cov : information needed for copy-based predicting
:param context : encoding of source text, shape = [live_k, sent_len, 2*output_dim]
:param c_mask : mask of source text, shape = [live_k, sent_len]
:param context_A: an identity layer (do nothing but return the context)
:returns:
next_prob : probabilities of next word, shape=(1, voc_size+sent_len)
next_prob0[:voc_size] is generative probability
next_prob0[voc_size:voc_size+sent_len] is copy probability
next_sample : only useful for stochastic
next_stat : output (decoding) vector after time t
ncov : updated coverage vector
next_stat : returned a second time (the caller binds it to `alpha`)
"""
assert c_mask is not None, 'we need the source mask.'
# word embedding (note that for the first word, embedding should be all zero)
# if prev_word[:, None] < 0 (only the starting symbol index=-1)
# then return zeros
# return alloc_zeros_matrix(prev_word.shape[0], 2 * self.config['dec_embedd_dim']),
# else return embedding of the previous words
# return self.Embed(prev_word)
X = T.switch(
prev_word[:, None] < 0,
alloc_zeros_matrix(prev_word.shape[0], 2 * self.config['dec_embedd_dim']),
T.concatenate([self.Embed(prev_word),
T.sum(context_A * prev_loc[:, :, None], axis=1)
], axis=-1)
)
if self.dropout > 0:
X = self.D(X, train=False)
# apply one step of RNN
Probs = self.attention_reader(prev_stat, context, c_mask, Cov=prev_cov)
ncov = prev_cov + Probs
cxt = T.sum(context * Probs[:, :, None], axis=1)
X_proj, zz, rr = self.RNN(X, C=cxt,
init_h=prev_stat,
one_step=True,
return_gates=True)
next_stat = X_proj
# compute the readout probability distribution and sample it
# here the readout is a matrix, different from the learner.
readin = [next_stat]
if self.config['context_predict']:
readin += [cxt]
if self.config['bigram_predict']:
readin += [X]
# if gating
# if self.config['copygate']:
# gt = self.sigmoid(self.Gs(readin)) # (nb_samples, dim)
# readin *= 1 - gt
# readout = self.hidden_readout(next_stat * gt[:, :self.config['dec_hidden_dim']])
# if self.config['context_predict']:
# readout += self.context_readout(
# cxt * gt[:, self.config['dec_hidden_dim']:
# self.config['dec_hidden_dim'] + self.config['dec_contxt_dim']])
# if self.config['bigram_predict']:
# readout += self.prev_word_readout(
# X * gt[:, -2 * self.config['dec_embedd_dim']:])
# else:
readout = self.hidden_readout(next_stat)
if self.config['context_predict']:
readout += self.context_readout(cxt)
if self.config['bigram_predict']:
readout += self.prev_word_readout(X)
for l in self.output_nonlinear:
readout = l(readout)
readin = T.concatenate(readin, axis=-1)
key = self.Os(readin)
Eng = T.sum(key[:, None, :] * context, axis=-1)
# # gating
if self.config['copygate']:
gt = self.sigmoid(self.Gs(readin)) # (nb_samples, 1)
readout += T.log(gt.flatten()[:, None])
Eng += T.log(1 - gt.flatten()[:, None])
EngSum = logSumExp(Eng, axis=-1, mask=c_mask, c=readout)
next_prob = T.concatenate([T.exp(readout - EngSum), T.exp(Eng - EngSum) * c_mask], axis=-1)
next_sample = self.rng.multinomial(pvals=next_prob).argmax(1)
return next_prob, next_sample, next_stat, ncov, next_stat
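The OOV-resolution convention used by the sampler (mirroring `process_` inside `get_sample` below) can be demonstrated standalone. An assumed-behavior NumPy sketch, with illustrative names: a predicted index >= voc_size encodes "copy the word at position index - voc_size of the source", so it is mapped back to a real word id and the copied position is recorded in a mask.

```python
import numpy as np

def resolve_copied_words(prev_words, sources, voc_size):
    copy_mask = np.zeros_like(sources, dtype='float32')
    resolved = prev_words.copy()
    for i, w in enumerate(prev_words):
        if w >= voc_size:                  # positional (copied) prediction
            pos = w - voc_size
            copy_mask[i, pos] = 1.
            resolved[i] = sources[i, pos]  # real word id from the source
        else:                              # in-vocab word: mark matching positions
            copy_mask[i] = (sources[i] == w)
    return copy_mask, resolved
```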
def build_sampler(self):
"""
Build a sampler which only steps once.
It decodes one word at a time.
"""
logger.info("build sampler ...")
if self.config['sample_stoch'] and self.config['sample_argmax']:
logger.info("use argmax search!")
elif self.config['sample_stoch'] and (not self.config['sample_argmax']):
logger.info("use stochastic sampling!")
elif self.config['sample_beam'] > 1:
logger.info("use beam search! (beam_size={})".format(self.config['sample_beam']))
# initial state of our Decoder.
context = T.tensor3() # theano variable. shape=(n_sample, sent_len, 2*output_dim)
c_mask = T.matrix() # mask of the input sentence.
context_A = self.Is(context) # a bridge layer
init_h = self.Initializer(context[:, 0, :])
init_a = T.zeros((context.shape[0], context.shape[1]))
cov = T.zeros((context.shape[0], context.shape[1]))
logger.info('compile the function: get_init_state')
self.get_init_state \
= theano.function([context], [init_h, init_a, cov], name='get_init_state', allow_input_downcast=True)
logger.info('done.')
# word sampler: 1 x 1
prev_word = T.vector('prev_word', dtype='int32')
prev_stat = T.matrix('prev_state', dtype='float32')
prev_a = T.matrix('prev_a', dtype='float32')
prev_cov = T.matrix('prev_cov', dtype='float32')
next_prob, next_sample, next_stat, ncov, alpha \
= self._step_sample(prev_word,
prev_stat,
prev_a,
prev_cov,
context,
c_mask,
context_A)
# next word probability
logger.info('compile the function: sample_next')
inputs = [prev_word, prev_stat, prev_a, prev_cov, context, c_mask]
outputs = [next_prob, next_sample, next_stat, ncov, alpha]
self.sample_next = theano.function(inputs, outputs, name='sample_next', allow_input_downcast=True)
logger.info('done')
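The extractive candidate filter built inside `get_sample` below can be sketched on its own. A small standalone illustration (plain Python, faithfully reproducing the loop bounds in the code, including the fact that the last source token never starts an n-gram): every contiguous n-gram of the source (up to maxlen) is hashed as a '-'-joined string so beam hypotheses can be restricted to word sequences that actually occur in the document.

```python
def build_ngram_set(source_ids, maxlen):
    sequence_set = set()
    for i in range(len(source_ids)):        # loop over start positions
        for j in range(1, maxlen):          # loop over n-gram lengths
            if i + j > len(source_ids) - 1:
                break
            # hash the n-gram as a '-'-joined string of word indices
            sequence_set.add('-'.join(str(s) for s in source_ids[i:i + j]))
    return sequence_set
```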
"""
Generate samples, either with stochastic sampling or beam-search!
[:-:] I have to think over how to modify the BEAM-Search!!
"""
def get_sample(self,
context, # the RNN encoding of source text at each time step, shape = [1, sent_len, 2*output_dim]
c_mask, # shape = [1, sent_len]
sources, # shape = [1, sent_len]
k=1, maxlen=30, stochastic=True, # k = config['sample_beam'], maxlen = config['max_len']
argmax=False, fixlen=False,
return_attend=False,
type='extractive',
generate_ngram=True
):
# beam size
if k > 1:
assert not stochastic, 'Beam search does not support stochastic sampling!!'
# fix length cannot use beam search
# if fixlen:
# assert k == 1
# prepare for searching
Lmax = self.config['dec_voc_size']
sample = [] # predicted sequences
attention_probs = [] # per-step attention/copy probabilities of the predicted words
attend = []
score = [] # scores of predicted sequences
state = [] # the output encodings of predicted sequences
if stochastic:
score = 0
live_k = 1
dead_k = 0
hyp_samples = [[]] * live_k
hyp_scores = np.zeros(live_k).astype(theano.config.floatX)
hyp_attention_probs = [[]] * live_k
hyp_attends = [[]] * live_k
# get initial state of decoder RNN with encoding
# feed in the encoding at time=0 (position 0 holds the backward RNN's summary of the whole source), apply tanh(W*x+b) and output next_state, shape=[1, output_dim]
# copy_word_prob and coverage are zeros[context.shape]
previous_state, copy_word_prob, coverage = self.get_init_state(context)
# indicator for the first target word (bos target), starts with [-1]
previous_word = -1 * np.ones((1,)).astype('int32')
# if aim is extractive, then set the initial beam size to be voc_size
if type == 'extractive':
input = sources[0]
input_set = set(input)
sequence_set = set()
if generate_ngram:
for i in range(len(input)): # loop over start
for j in range(1, maxlen): # loop over length
if i+j > len(input)-1:
break
hash_token = [str(s) for s in input[i:i+j]]
sequence_set.add('-'.join(hash_token))
logger.info("Possible n-grams: %d" % len(sequence_set))
# Start searching!
for ii in range(maxlen):
# make live_k copies of context, c_mask and source, to predict next words at once.
# np.tile(context, [live_k, 1, 1]) means copying along the axis=0
context_copies = np.tile(context, [live_k, 1, 1]) # shape = [live_k, sent_len, 2*output_dim]
c_mask_copies = np.tile(c_mask, [live_k, 1]) # shape = [live_k, sent_len]
source_copies = np.tile(sources, [live_k, 1]) # shape = [live_k, sent_len]
# process word
def process_():
"""
copy_mask[i] indicates which words in source have been copied (whether the previous_word[i] appears in source text)
size = size(source_copies) = [live_k, sent_len]
Caution: word2idx['<eol>'] = 0, word2idx['<unk>'] = 1
"""
copy_mask = np.zeros((source_copies.shape[0], source_copies.shape[1]), dtype='float32')
for i in range(previous_word.shape[0]): # loop over the previous_words, index of previous words, size=(1, live_k)
# Note that the model predicts an OOV word as voc_size + position_in_source
# if the previously predicted word is OOV (previous_word[i] >= Lmax):
#     it encodes the position of a word in the source text (previous_word[i] = voc_size + position_in_source)
#     1. set copy_mask to 1 at that position, marking the copied word;
#     2. set previous_word[i] to the real index of this word (source_copies[i][previous_word[i] - Lmax])
# else:
#     it is an in-vocabulary word, but may still be copied from the source:
#     check which words in source_copies[i] equal previous_word[i]
if previous_word[i] >= Lmax:
copy_mask[i][previous_word[i] - Lmax] = 1.
previous_word[i] = source_copies[i][previous_word[i] - Lmax]
else:
copy_mask[i] = (source_copies[i] == previous_word[i, None])
# for k in range(sss.shape[1]):
# ll[i][k] = (sss[i][k] == next_word[i])
return copy_mask, previous_word
copy_mask, previous_word = process_()
copy_flag = (np.sum(copy_mask, axis=1, keepdims=True) > 0) # boolean indicates if any copy available
# get the copy probability (eq 6 in paper?)
next_a = copy_word_prob * copy_mask # keep the copied ones
next_a = next_a / (err + np.sum(next_a, axis=1, keepdims=True)) * copy_flag # normalize
'''
Get the probability of next word, sec 3.2 and 3.3
Return:
next_prob0 : probabilities of next word, shape=(live_k, voc_size+sent_len)
next_prob0[:, :voc_size] is generative probability
next_prob0[:, voc_size:voc_size+sent_len] is copy probability
next_word : only useful for stochastic
next_state : output (decoding) vector after time t
coverage : updated coverage vector
alpha : a copy of next_state; only useful if return_attend
Inputs:
previous_word : index of previous words, size=(1, live_k)
previous_state : output encoding of last time, size=(1, live_k, output_dim)
next_a, coverage : information needed for copy-based predicting
encoding_copies : shape = [live_k, sent_len, 2*output_dim]
c_mask_copies : shape = [live_k, sent_len]
if copying is disabled, only previous_word, previous_state, context_copies and c_mask_copies are needed for predicting
'''
next_prob0, next_word, next_state, coverage, alpha \
= self.sample_next(previous_word, previous_state, next_a, coverage, context_copies, c_mask_copies)
if not self.config['decode_unk']: # eliminate the probability of <unk>
next_prob0[:, 1] = 0.
next_prob0 /= np.sum(next_prob0, axis=1, keepdims=True)
def merge_():
# merge the probabilities, p(w) = p_generate(w)+p_copy(w)
temple_prob = copy.copy(next_prob0)
source_prob = copy.copy(next_prob0[:, Lmax:])
for i in range(next_prob0.shape[0]): # loop over all the previous words
for j in range(source_copies.shape[1]): # loop over all the source words
if (source_copies[i, j] < Lmax) and (source_copies[i, j] != 1): # if word source_copies[i, j] in voc and not a unk
temple_prob[i, source_copies[i, j]] += source_prob[i, j] # add the copy prob to generative prob
temple_prob[i, Lmax + j] = 0. # set the corresponding copy prob to be 0
return temple_prob, source_prob
# if word in voc, add the copy prob to generative prob and keep generate prob only, else keep the copy prob only
generate_word_prob, copy_word_prob = merge_()
next_prob0[:, Lmax:] = 0. # [not strictly necessary] zero out the latter (copy) part, so next_prob0 effectively becomes generate_word_prob
# print('0', next_prob0[:, 3165])
# print('01', next_prob[:, 3165])
# # print(next_prob[0, Lmax:])
# print(ss_prob[0, :])
if stochastic:
# using stochastic sampling (or greedy sampling.)
if argmax:
nw = generate_word_prob[0].argmax()
next_word[0] = nw
else:
nw = self.rng.multinomial(pvals=generate_word_prob).argmax(1)
sample.append(nw)
score += generate_word_prob[0, nw]
if (not fixlen) and (nw == 0): # sample reached the end
break
else:
'''
using beam-search, keep the top (k-dead_k) results (dead_k is disabled by memray)
it can only be computed in a flattened way!
'''
# accumulate the negative log-probability of the new word into the sequence cost (this is why longer sequences accumulate higher costs)
# add a 1e-10 to avoid log(0)
# size(hyp_scores)=[live_k,1], size(generate_word_prob)=[live_k,voc_size+sent_len]
cand_scores = hyp_scores[:, None] - np.log(generate_word_prob + 1e-10)
cand_flat = cand_scores.flatten()
ranks_flat = cand_flat.argsort()[:(k - dead_k)] # get the index of top k predictions
# recover(stack) the flat results, fetch the best results.
voc_size = generate_word_prob.shape[1]
sequence_index = ranks_flat // voc_size # flat_index // voc_size recovers the hypothesis index (integer division; plain '/' yields floats under Python 3)
next_word_index = ranks_flat % voc_size # flat_index % voc_size recovers the word index
costs = cand_flat[ranks_flat]
# get the new hyp samples
new_hyp_samples = []
new_hyp_attention_probs = []
new_hyp_attends = []
new_hyp_scores = np.zeros(k - dead_k).astype(theano.config.floatX)
new_hyp_states = []
new_hyp_coverage = []
new_hyp_copy_word_prob = []
for idx, [ti, wi] in enumerate(zip(sequence_index, next_word_index)):
ti = int(ti)
wi = int(wi)
new_hyp_samples.append(hyp_samples[ti] + [wi])
new_hyp_scores[idx] = copy.copy(costs[idx])
new_hyp_states.append(copy.copy(next_state[ti]))
new_hyp_coverage.append(copy.copy(coverage[ti]))
new_hyp_copy_word_prob.append(copy.copy(copy_word_prob[ti]))
# record per-step probabilities of the predicted word: (pre-merge, merged) probs, or (copy, attention) distributions if return_attend
if not return_attend:
# probability of current predicted word (generative part and both generative/copying part)
new_hyp_attention_probs.append(hyp_attention_probs[ti] + [[next_prob0[ti][wi], generate_word_prob[ti][wi]]])
else:
# copying probability and attention probability of current predicted word
new_hyp_attention_probs.append(hyp_attention_probs[ti] + [(copy_word_prob[ti], alpha[ti])])
# check the finished samples
new_live_k = 0
hyp_samples = []
hyp_scores = []
hyp_states = []
hyp_coverage = []
hyp_attention_probs = []
hyp_copy_word_prob = []
for idx in range(len(new_hyp_samples)):
# [bug] change to new_hyp_samples[idx][-1] == 0
# if (new_hyp_states[idx][-1] == 0) and (not fixlen):
if (new_hyp_samples[idx][-1] == 0 and not fixlen):
'''
predict an <eos>, this sequence is done
put successful prediction into result list
'''
# worth noting that if the word index is voc_size or larger, it denotes an OOV word copied from the source
sample.append(new_hyp_samples[idx])
attention_probs.append(new_hyp_attention_probs[idx])
score.append(new_hyp_scores[idx])
state.append(new_hyp_states[idx])
# dead_k += 1
if new_hyp_samples[idx][-1] != 0:
'''
sequence prediction not complete
put into candidate list for next round prediction
'''
# limit predictions must appear in text
if type == 'extractive':
if new_hyp_samples[idx][-1] not in input_set:
continue
if generate_ngram:
if '-'.join([str(s) for s in new_hyp_samples[idx]]) not in sequence_set:
continue
new_live_k += 1
hyp_samples.append(new_hyp_samples[idx])
hyp_scores.append(new_hyp_scores[idx])
hyp_states.append(new_hyp_states[idx])
hyp_coverage.append(new_hyp_coverage[idx])
hyp_attention_probs.append(new_hyp_attention_probs[idx])
hyp_copy_word_prob.append(new_hyp_copy_word_prob[idx])
method set_name (line 168) | def set_name(self, name):
method __call__ (line 173) | def __call__(self, X1, X2):
class Constant (line 178) | class Constant(Layer):
method __init__ (line 179) | def __init__(self, input_dim, output_dim, init=None, activation='tanh'...
method set_name (line 195) | def set_name(self, name):
method __call__ (line 198) | def __call__(self, X=None):
class MemoryLinear (line 206) | class MemoryLinear(Layer):
method __init__ (line 207) | def __init__(self, input_dim, input_wdth, init='glorot_uniform',
method __call__ (line 226) | def __call__(self, X=None):
class Dropout (line 233) | class Dropout(MaskedLayer):
method __init__ (line 237) | def __init__(self, rng=None, p=1., name=None):
method __call__ (line 242) | def __call__(self, X, train=True):
class Activation (line 252) | class Activation(MaskedLayer):
method __init__ (line 256) | def __init__(self, activation):
method __call__ (line 260) | def __call__(self, X):
FILE: emolga/layers/embeddings.py
class Embedding (line 8) | class Embedding(Layer):
method __init__ (line 17) | def __init__(self, input_dim, output_dim, init='uniform', name=None):
method get_output_mask (line 31) | def get_output_mask(self, X):
method __call__ (line 40) | def __call__(self, X, mask_zero=False, context=None):
class Zero (line 85) | class Zero(Layer):
method __call__ (line 86) | def __call__(self, X):
class Bias (line 91) | class Bias(Layer):
method __call__ (line 92) | def __call__(self, X):
FILE: emolga/layers/gridlstm.py
class Grid (line 13) | class Grid(Recurrent):
method __init__ (line 41) | def __init__(self,
method build (line 86) | def build(self):
method lstm_ (line 140) | def lstm_(self, k, H, m, x, identity=False):
method grid_ (line 195) | def grid_(self,
class SequentialGridLSTM (line 243) | class SequentialGridLSTM(Grid):
method __init__ (line 253) | def __init__(self,
method _step (line 362) | def _step(self, *args):
method __call__ (line 413) | def __call__(self, X, init_H=None, init_M=None,
class PyramidGridLSTM2D (line 476) | class PyramidGridLSTM2D(Grid):
method __init__ (line 480) | def __init__(self,
method _step (line 565) | def _step(self, *args):
method __call__ (line 618) | def __call__(self, X, init_x=None, init_y=None,
class PyramidLSTM (line 669) | class PyramidLSTM(Layer):
method __init__ (line 673) | def __init__(self,
method _step (line 768) | def _step(self, *args):
method __call__ (line 826) | def __call__(self, X, init_x=None, init_y=None,
FILE: emolga/layers/ntm_minibatch.py
class Reader (line 17) | class Reader(Layer):
method __init__ (line 22) | def __init__(self, input_dim, memory_width, shift_width, shift_conv,
method __call__ (line 67) | def __call__(self, X, w_temp, m_temp):
class Writer (line 93) | class Writer(Reader):
method __init__ (line 98) | def __init__(self, input_dim, memory_width, shift_width, shift_conv,
method get_fixer (line 119) | def get_fixer(self, X):
class Controller (line 125) | class Controller(Recurrent):
method __init__ (line 134) | def __init__(self,
method _controller (line 216) | def _controller(self, input_t, read_t, controller_tm1=None):
method _read (line 231) | def _read(w_read, memory):
method _write (line 239) | def _write(w_write, memory, erase, add):
method _step (line 252) | def _step(self, input_t, mask_t,
method __call__ (line 302) | def __call__(self, X, mask=None, M=None, init_ww=None,
class AttentionReader (line 369) | class AttentionReader(Layer):
method __init__ (line 374) | def __init__(self, input_dim, memory_width, shift_width, shift_conv,
method __call__ (line 424) | def __call__(self, X, w_temp, m_temp):
class AttentionWriter (line 454) | class AttentionWriter(AttentionReader):
method __init__ (line 459) | def __init__(self, input_dim, memory_width, shift_width, shift_conv,
method get_fixer (line 480) | def get_fixer(self, X):
class BernoulliController (line 487) | class BernoulliController(Recurrent):
method __init__ (line 496) | def __init__(self,
method _controller (line 579) | def _controller(self, input_t, read_t, controller_tm1=None):
method _read (line 594) | def _read(w_read, memory):
method _write (line 602) | def _write(w_write, memory, erase, add):
method _step (line 618) | def _step(self, input_t, mask_t,
method __call__ (line 668) | def __call__(self, X, mask=None, M=None, init_ww=None,
FILE: emolga/layers/recurrent.py
class Recurrent (line 6) | class Recurrent(MaskedLayer):
method get_padded_shuffled_mask (line 12) | def get_padded_shuffled_mask(mask, pad=0):
class GRU (line 37) | class GRU(Recurrent):
method __init__ (line 71) | def __init__(self,
method _step (line 132) | def _step(self,
method _step_gate (line 165) | def _step_gate(self,
method __call__ (line 185) | def __call__(self, X, mask=None, C=None, init_h=None,
class JZS3 (line 278) | class JZS3(Recurrent):
method __init__ (line 298) | def __init__(self,
method _step (line 359) | def _step(self,
method __call__ (line 370) | def __call__(self, X, mask=None, C=None, init_h=None, return_sequence=...
class LSTM (line 425) | class LSTM(Recurrent):
method __init__ (line 426) | def __init__(self,
method _step (line 500) | def _step(self,
method input_embed (line 520) | def input_embed(self, X, C=None):
method __call__ (line 539) | def __call__(self, X, mask=None, C=None, init_h=None, init_c=None, ret...
FILE: emolga/models/core.py
class Model (line 18) | class Model(object):
method __init__ (line 19) | def __init__(self):
method _add (line 25) | def _add(self, layer):
method _monitoring (line 30) | def _monitoring(self):
method compile_monitoring (line 38) | def compile_monitoring(self, inputs, updates=None):
method set_weights (line 50) | def set_weights(self, weights):
method get_weights (line 62) | def get_weights(self):
method set_name (line 73) | def set_name(self, name):
method save (line 81) | def save(self, filename):
method load (line 96) | def load(self, filename):
method save_weight_json (line 105) | def save_weight_json(self, filename):
method load_weight_json (line 117) | def load_weight_json(self, filename):
FILE: emolga/models/covc_encdec.py
class Encoder (line 24) | class Encoder(Model):
method __init__ (line 30) | def __init__(self,
method build_encoder (line 104) | def build_encoder(self, source, context=None, return_embed=False,
method compile_encoder (line 214) | def compile_encoder(self, with_context=False, return_embed=False, retu...
class Decoder (line 259) | class Decoder(Model):
method __init__ (line 269) | def __init__(self,
method _grab_prob (line 435) | def _grab_prob(probs, X, block_unk=False):
method prepare_xy (line 448) | def prepare_xy(self, target):
method build_decoder (line 463) | def build_decoder(self, target, context=None,
method _step_sample (line 544) | def _step_sample(self, prev_word, prev_stat, context):
method build_sampler (line 598) | def build_sampler(self):
method build_stochastic_sampler (line 640) | def build_stochastic_sampler(self):
method get_sample (line 651) | def get_sample(self, context, k=1, maxlen=30, stochastic=True, argmax=...
class DecoderAtt (line 762) | class DecoderAtt(Decoder):
method __init__ (line 767) | def __init__(self,
method prepare_xy (line 827) | def prepare_xy(self, target, cc_matrix):
method build_decoder (line 875) | def build_decoder(self,
method _step_sample (line 1051) | def _step_sample(self,
method build_sampler (line 1155) | def build_sampler(self):
method get_sample (line 1209) | def get_sample(self,
class FnnDecoder (line 1491) | class FnnDecoder(Model):
method __init__ (line 1492) | def __init__(self, config, rng, prefix='fnndec'):
method _grab_prob (line 1520) | def _grab_prob(probs, X):
method build_decoder (line 1530) | def build_decoder(self, target, context):
method build_sampler (line 1538) | def build_sampler(self):
method get_sample (line 1545) | def get_sample(self, context, argmax=True):
class RNNLM (line 1557) | class RNNLM(Model):
method __init__ (line 1562) | def __init__(self,
method build_ (line 1573) | def build_(self):
method compile_ (line 1590) | def compile_(self, mode='train', contrastive=False):
method compile_train (line 1616) | def compile_train(self):
method compile_train_CE (line 1659) | def compile_train_CE(self):
method compile_sample (line 1662) | def compile_sample(self):
method compile_inference (line 1668) | def compile_inference(self):
method default_context (line 1671) | def default_context(self):
method generate_ (line 1679) | def generate_(self, context=None, max_len=None, mode='display'):
class AutoEncoder (line 1707) | class AutoEncoder(RNNLM):
method __init__ (line 1712) | def __init__(self,
method build_ (line 1723) | def build_(self):
method compile_train (line 1769) | def compile_train(self, mode='train'):
class NRM (line 1813) | class NRM(Model):
method __init__ (line 1818) | def __init__(self,
method build_ (line 1835) | def build_(self, lr=None, iterations=None):
method compile_ (line 1863) | def compile_(self, mode='all', contrastive=False):
method compile_train (line 1886) | def compile_train(self):
method compile_sample (line 1950) | def compile_sample(self):
method compile_inference (line 1959) | def compile_inference(self):
method generate_ (line 1962) | def generate_(self, inputs, mode='display', return_attend=False, retur...
method generate_multiple (line 2006) | def generate_multiple(self, inputs, mode='display', return_attend=Fals...
method evaluate_ (line 2060) | def evaluate_(self, inputs, outputs, idx2word, inputs_unk=None, encode...
method evaluate_multiple (line 2089) | def evaluate_multiple(self, inputs, outputs,
method analyse_ (line 2377) | def analyse_(self, inputs, outputs, idx2word, inputs_unk=None, return_...
method analyse_cover (line 2434) | def analyse_cover(self, inputs, outputs, idx2word, inputs_unk=None, re...
FILE: emolga/models/encdec.py
class Encoder (line 177) | class Encoder(Model):
method __init__ (line 183) | def __init__(self,
method build_encoder (line 257) | def build_encoder(self, source, context=None, return_embed=False, retu...
method compile_encoder (line 319) | def compile_encoder(self, with_context=False, return_embed=False, retu...
class Decoder (line 337) | class Decoder(Model):
method __init__ (line 347) | def __init__(self,
method _grab_prob (line 505) | def _grab_prob(probs, X):
method prepare_xy (line 530) | def prepare_xy(self, target):
method build_decoder (line 545) | def build_decoder(self, target, context=None,
method _step_sample (line 620) | def _step_sample(self, prev_word, prev_stat, context):
method build_sampler (line 678) | def build_sampler(self):
method build_stochastic_sampler (line 721) | def build_stochastic_sampler(self):
method get_sample (line 732) | def get_sample(self, context, k=1, maxlen=30, stochastic=True, argmax=...
class DecoderAtt (line 843) | class DecoderAtt(Decoder):
method __init__ (line 848) | def __init__(self,
method prepare_xy (line 889) | def prepare_xy(self, target, context=None):
method build_decoder (line 924) | def build_decoder(self,
method build_representer (line 1035) | def build_representer(self,
method _step_sample (line 1152) | def _step_sample(self, prev_word, prev_stat, context, c_mask):
method build_sampler (line 1217) | def build_sampler(self):
method get_sample (line 1256) | def get_sample(self, encoding, c_mask, inputs,
class FnnDecoder (line 1448) | class FnnDecoder(Model):
method __init__ (line 1449) | def __init__(self, config, rng, prefix='fnndec'):
method _grab_prob (line 1477) | def _grab_prob(probs, X):
method build_decoder (line 1487) | def build_decoder(self, target, context):
method build_sampler (line 1495) | def build_sampler(self):
method get_sample (line 1502) | def get_sample(self, context, argmax=True):
class RNNLM (line 1514) | class RNNLM(Model):
method __init__ (line 1519) | def __init__(self,
method build_ (line 1530) | def build_(self):
method compile_ (line 1547) | def compile_(self, mode='train', contrastive=False):
method compile_train (line 1573) | def compile_train(self):
method compile_train_CE (line 1617) | def compile_train_CE(self):
method compile_sample (line 1620) | def compile_sample(self):
method compile_inference (line 1626) | def compile_inference(self):
method default_context (line 1629) | def default_context(self):
method generate_ (line 1637) | def generate_(self, context=None, max_len=None, mode='display'):
class AutoEncoder (line 1665) | class AutoEncoder(RNNLM):
method __init__ (line 1670) | def __init__(self,
method build_ (line 1681) | def build_(self):
method compile_train (line 1727) | def compile_train(self, mode='train'):
class NRM (line 1770) | class NRM(Model):
method __init__ (line 1775) | def __init__(self,
method build_ (line 1792) | def build_(self):
method compile_ (line 1815) | def compile_(self, mode='all', contrastive=False):
method compile_train (line 1838) | def compile_train(self):
method compile_sample (line 1902) | def compile_sample(self):
method compile_inference (line 1911) | def compile_inference(self):
method generate_ (line 1914) | def generate_(self, inputs, mode='display', return_all=False):
method generate_multiple (line 1951) | def generate_multiple(self, inputs, mode='display', return_all=True, a...
method evaluate_ (line 2030) | def evaluate_(self, inputs, outputs, idx2word, inputs_unk=None):
method evaluate_multiple (line 2057) | def evaluate_multiple(self, inputs, outputs,
method analyse_ (line 2281) | def analyse_(self, inputs, outputs, idx2word):
method analyse_cover (line 2298) | def analyse_cover(self, inputs, outputs, idx2word):
FILE: emolga/models/ntm_encdec.py
class RecurrentBase (line 20) | class RecurrentBase(Model):
method __init__ (line 24) | def __init__(self, config, model='RNN', prefix='enc', use_contxt=True,...
method get_context (line 105) | def get_context(self, context):
method loop (line 127) | def loop(self, X, X_mask, info=None, return_sequence=False, return_ful...
method step (line 135) | def step(self, X, prev_info):
method build_ (line 153) | def build_(self):
method get_init (line 209) | def get_init(self, context):
method get_next_state (line 228) | def get_next_state(self, prev_X, prev_info):
class Encoder (line 255) | class Encoder(Model):
method __init__ (line 261) | def __init__(self,
method build_encoder (line 306) | def build_encoder(self, source, context=None):
class Decoder (line 347) | class Decoder(Model):
method __init__ (line 357) | def __init__(self,
method _grab_prob (line 462) | def _grab_prob(probs, X):
method prepare_xy (line 475) | def prepare_xy(self, target):
method build_decoder (line 490) | def build_decoder(self, target, context=None, return_count=False):
method _step_embed (line 527) | def _step_embed(self, prev_word):
method _step_sample (line 540) | def _step_sample(self, X, next_stat, context):
method build_sampler (line 563) | def build_sampler(self):
method get_sample (line 595) | def get_sample(self, context, k=1, maxlen=30, stochastic=True, argmax=...
class RNNLM (line 717) | class RNNLM(Model):
method __init__ (line 722) | def __init__(self,
method build_ (line 733) | def build_(self):
method compile_ (line 749) | def compile_(self, mode='train', contrastive=False):
method compile_train (line 775) | def compile_train(self):
method compile_train_CE (line 815) | def compile_train_CE(self):
method compile_sample (line 818) | def compile_sample(self):
method compile_inference (line 823) | def compile_inference(self):
method default_context (line 826) | def default_context(self):
method generate_ (line 834) | def generate_(self, context=None, mode='display', max_len=None):
class Helmholtz (line 862) | class Helmholtz(RNNLM):
method __init__ (line 871) | def __init__(self,
method build_ (line 882) | def build_(self):
method compile_train (line 951) | def compile_train(self):
method compile_sample (line 1053) | def compile_sample(self):
method compile_inference (line 1077) | def compile_inference(self):
method default_context (line 1100) | def default_context(self):
class BinaryHelmholtz (line 1105) | class BinaryHelmholtz(RNNLM):
method __init__ (line 1114) | def __init__(self,
method build_ (line 1125) | def build_(self):
method compile_train (line 1175) | def compile_train(self):
method compile_sample (line 1277) | def compile_sample(self):
method compile_inference (line 1300) | def compile_inference(self):
method default_context (line 1323) | def default_context(self):
class AutoEncoder (line 1328) | class AutoEncoder(RNNLM):
method __init__ (line 1334) | def __init__(self,
method build_ (line 1345) | def build_(self):
method compile_train (line 1368) | def compile_train(self, mode='train'):
method compile_sample (line 1404) | def compile_sample(self):
FILE: emolga/models/pointers.py
class PtrDecoder (line 19) | class PtrDecoder(Model):
method __init__ (line 23) | def __init__(self,
method grab_prob (line 64) | def grab_prob(probs, X):
method grab_source (line 75) | def grab_source(source, target):
method build_decoder (line 91) | def build_decoder(self,
method _step_sample (line 139) | def _step_sample(self, prev_idx, prev_stat,
method build_sampler (line 156) | def build_sampler(self):
method get_sample (line 199) | def get_sample(self, context, inputs, source, smask,
class PointerDecoder (line 316) | class PointerDecoder(Model):
method __init__ (line 321) | def __init__(self,
method grab_prob (line 370) | def grab_prob(probs, X):
method grab_source (line 381) | def grab_source(source, target):
method build_decoder (line 397) | def build_decoder(self,
method _step_sample (line 473) | def _step_sample(self,
method build_sampler (line 495) | def build_sampler(self):
method get_sample (line 540) | def get_sample(self, context, inputs, source, smask,
class MemNet (line 616) | class MemNet(Model):
method __init__ (line 621) | def __init__(self,
method __call__ (line 653) | def __call__(self, key, memory=None, mem_mask=None, out_memory=None):
class PtrNet (line 675) | class PtrNet(Model):
method __init__ (line 679) | def __init__(self, config, n_rng, rng,
method build_ (line 689) | def build_(self, encoder=None):
method build_train (line 743) | def build_train(self, memory=None, out_memory=None, compile_train=Fals...
method build_sampler (line 881) | def build_sampler(self, memory=None, out_mem=None):
method build_predict_sampler (line 925) | def build_predict_sampler(self):
method generate_ (line 973) | def generate_(self, inputs, context, source, smask):
FILE: emolga/models/variational.py
class VAE (line 19) | class VAE(RNNLM):
method __init__ (line 33) | def __init__(self,
method _add_tag (line 45) | def _add_tag(self, layer, tag):
method build_ (line 52) | def build_(self):
method compile_train (line 115) | def compile_train(self):
method compile_sample (line 176) | def compile_sample(self):
method compile_inference (line 192) | def compile_inference(self):
method default_context (line 208) | def default_context(self):
class Helmholtz (line 212) | class Helmholtz(VAE):
method __init__ (line 220) | def __init__(self,
method build_ (line 235) | def build_(self):
method dynamic (line 301) | def dynamic(self):
method compile_ (line 330) | def compile_(self, mode='train', contrastive=False):
method compile_train (line 356) | def compile_train(self):
method build_dynamics (line 477) | def build_dynamics(self, states, action, Y):
method compile_sample (line 487) | def compile_sample(self):
method compile_inference (line 523) | def compile_inference(self):
method evaluate_ (line 539) | def evaluate_(self, inputs):
method compile_train_CE (line 564) | def compile_train_CE(self):
class HarX (line 695) | class HarX(Helmholtz):
method __init__ (line 705) | def __init__(self,
method build_ (line 720) | def build_(self):
method compile_ (line 798) | def compile_(self, mode='train', contrastive=False):
method compile_train (line 824) | def compile_train(self):
method generate_ (line 1015) | def generate_(self, context=None, max_len=None, mode='display'):
class THarX (line 1023) | class THarX(Helmholtz):
method __init__ (line 1033) | def __init__(self,
method build_ (line 1048) | def build_(self):
method compile_ (line 1126) | def compile_(self, mode='train', contrastive=False):
method compile_train (line 1152) | def compile_train(self):
method generate_ (line 1350) | def generate_(self, context=None, max_len=None, mode='display'):
class NVTM (line 1358) | class NVTM(Helmholtz):
method __init__ (line 1364) | def __init__(self,
method build_ (line 1376) | def build_(self):
method compile_ (line 1492) | def compile_(self, mode='train', contrastive=False):
method compile_train (line 1518) | def compile_train(self):
method generate_ (line 1701) | def generate_(self, context=None, max_len=None, mode='display'):
FILE: emolga/utils/generic_utils.py
function get_from_module (line 10) | def get_from_module(identifier, module_params, module_name, instantiate=...
function make_tuple (line 24) | def make_tuple(*args):
function printv (line 27) | def printv(v, prefix=''):
function make_batches (line 49) | def make_batches(size, batch_size):
function slice_X (line 54) | def slice_X(X, start=None, stop=None):
class Progbar (line 67) | class Progbar(object):
method __init__ (line 68) | def __init__(self, target, logger, width=30, verbose=1):
method update (line 83) | def update(self, current, values=[]):
method add (line 158) | def add(self, n, values=[]):
method clear (line 161) | def clear(self):
function print_sample (line 168) | def print_sample(idx2word, idx):
function visualize_ (line 178) | def visualize_(subplots, data, w=None, h=None, name=None,
function vis_Gaussian (line 255) | def vis_Gaussian(subplot, mean, std, name=None, display='off', size=10):
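`generic_utils.py` mirrors the early-Keras helper module it was derived from. A minimal sketch of what `make_batches(size, batch_size)` plausibly returns, assuming the usual Keras convention of contiguous `(start, end)` index pairs (the exact return shape is inferred from the signature, not confirmed by the listing):

```python
def make_batches(size, batch_size):
    # Split `size` samples into contiguous (start, end) index ranges;
    # the final batch may be smaller than batch_size.
    nb_batch = (size + batch_size - 1) // batch_size  # ceiling division
    return [(i * batch_size, min(size, (i + 1) * batch_size))
            for i in range(nb_batch)]

print(make_batches(10, 4))  # last range covers the 2 leftover samples
```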
FILE: emolga/utils/io_utils.py
class HDF5Matrix (line 8) | class HDF5Matrix():
method __init__ (line 11) | def __init__(self, datapath, dataset, start, end, normalizer=None):
method __len__ (line 22) | def __len__(self):
method __getitem__ (line 25) | def __getitem__(self, key):
method shape (line 52) | def shape(self):
function save_array (line 56) | def save_array(array, name):
function load_array (line 65) | def load_array(name):
function save_config (line 75) | def save_config():
function load_config (line 79) | def load_config():
FILE: emolga/utils/np_utils.py
function to_categorical (line 8) | def to_categorical(y, nb_classes=None):
function normalize (line 21) | def normalize(a, axis=-1, order=2):
function binary_logloss (line 27) | def binary_logloss(p, y):
function multiclass_logloss (line 36) | def multiclass_logloss(P, Y):
function accuracy (line 43) | def accuracy(p, y):
function probas_to_classes (line 47) | def probas_to_classes(y_pred):
function categorical_probas_to_classes (line 53) | def categorical_probas_to_classes(p):
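`np_utils.py` carries the usual label-encoding helpers. `to_categorical(y, nb_classes=None)` is the standard one-hot encoder; a self-contained sketch of its conventional behavior (class count inferred from the data when `nb_classes` is omitted):

```python
import numpy as np

def to_categorical(y, nb_classes=None):
    # One-hot encode a vector of integer class labels into an
    # (n_samples, nb_classes) matrix.
    y = np.asarray(y, dtype='int64')
    if nb_classes is None:
        nb_classes = int(y.max()) + 1
    Y = np.zeros((len(y), nb_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

print(to_categorical([0, 2, 1], nb_classes=3))
```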
FILE: emolga/utils/test_utils.py
function get_test_data (line 4) | def get_test_data(nb_train=1000, nb_test=500, input_shape=(10,), output_...
FILE: emolga/utils/theano_utils.py
function floatX (line 10) | def floatX(X):
function sharedX (line 14) | def sharedX(X, dtype=theano.config.floatX, name=None):
function shared_zeros (line 18) | def shared_zeros(shape, dtype=theano.config.floatX, name=None):
function shared_scalar (line 22) | def shared_scalar(val=0., dtype=theano.config.floatX, name=None):
function shared_ones (line 26) | def shared_ones(shape, dtype=theano.config.floatX, name=None):
function alloc_zeros_matrix (line 30) | def alloc_zeros_matrix(*dims):
function alloc_ones_matrix (line 34) | def alloc_ones_matrix(*dims):
function ndim_tensor (line 38) | def ndim_tensor(ndim):
function ndim_itensor (line 51) | def ndim_itensor(ndim, name=None):
function dot (line 62) | def dot(inp, matrix, bias=None):
function logSumExp (line 87) | def logSumExp(x, axis=None, mask=None, status='theano', c=None, err=1e-7):
function softmax (line 121) | def softmax(x):
function masked_softmax (line 125) | def masked_softmax(x, mask, err=1e-7):
function cosine_sim (line 133) | def cosine_sim(k, M):
function cosine_sim2d (line 144) | def cosine_sim2d(k, M):
function dot_2d (line 160) | def dot_2d(k, M, b=None, g=None):
function shift_convolve (line 182) | def shift_convolve(weight, shift, shift_conv):
function shift_convolve2d (line 187) | def shift_convolve2d(weight, shift, shift_conv):
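Besides shared-variable constructors, `theano_utils.py` lists numerically careful reductions such as `logSumExp`. The underlying trick is standard: shift by the maximum before exponentiating so the largest term becomes `exp(0) = 1` and nothing overflows. A NumPy analogue of that trick (the listed `mask`, `status`, and error arguments are omitted here; this is an illustration, not the repository's Theano code):

```python
import numpy as np

def log_sum_exp(x, axis=None):
    # Stable log(sum(exp(x))): subtract the max so the largest
    # exponentiated term is exactly 1, preventing float overflow.
    c = np.max(x, axis=axis, keepdims=True)
    out = c + np.log(np.sum(np.exp(x - c), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis) if axis is not None else out.item()

# Naive exp(1000) overflows; the shifted version returns 1000 + log(2).
print(log_sum_exp(np.array([1000.0, 1000.0])))
```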
FILE: keyphrase/baseline/evaluate.py
function load_phrase (line 19) | def load_phrase(file_path, tokenize=True):
function evaluate_ (line 30) | def evaluate_(text_dir, target_dir, prediction_dir, model_name, dataset_...
function init_logging (line 304) | def init_logging(logfile):
function evaluate_baselines (line 330) | def evaluate_baselines():
function significance_test (line 356) | def significance_test():
FILE: keyphrase/baseline/export_dataset.py
function export_UTD (line 10) | def export_UTD():
class Document (line 43) | class Document(object):
method __init__ (line 44) | def __init__(self):
method __str__ (line 50) | def __str__(self):
function load_text (line 53) | def load_text(doclist, textdir):
function load_keyphrase (line 83) | def load_keyphrase(doclist, keyphrasedir):
function get_doc (line 101) | def get_doc(text_dir, phrase_dir):
function export_maui (line 114) | def export_maui():
function export_krapivin_maui (line 169) | def export_krapivin_maui():
function export_ke20k_testing_maui (line 203) | def export_ke20k_testing_maui():
function export_ke20k_train_maui (line 219) | def export_ke20k_train_maui():
function prepare_data_cross_validation (line 245) | def prepare_data_cross_validation(input_dir, output_dir, folds=5):
FILE: keyphrase/config.py
function setup_keyphrase_stable (line 7) | def setup_keyphrase_stable():
function setup_keyphrase_train (line 192) | def setup_keyphrase_train():
function setup_keyphrase_baseline (line 374) | def setup_keyphrase_baseline():
FILE: keyphrase/dataset/dataset_utils.py
function prepare_text (line 23) | def prepare_text(record, process_type=1):
function get_tokens (line 44) | def get_tokens(text, process_type=1):
function process_keyphrase (line 76) | def process_keyphrase(keyword_str):
function build_data (line 90) | def build_data(data, idx2word, word2idx):
function load_pairs (line 127) | def load_pairs(records, process_type=1 ,do_filter=False):
function get_none_phrases (line 179) | def get_none_phrases(source_text, source_postag, max_len):
FILE: keyphrase/dataset/inspec/inspec_export_json.py
function export_Inspec_tokenized (line 19) | def export_Inspec_tokenized(dir_name, output_name):
function export_Inspec (line 88) | def export_Inspec(Inspec_input_path, Inspec_output_path):
FILE: keyphrase/dataset/keyphrase_dataset.py
function get_tokens (line 28) | def get_tokens(text, type=1):
function load_data (line 62) | def load_data(input_path, tokenize_sentence=True):
function build_dict (line 126) | def build_dict(wordfreq):
function build_data (line 148) | def build_data(data, idx2word, word2idx):
function load_data_and_dict (line 182) | def load_data_and_dict(training_dataset, testing_dataset):
function export_data_for_maui (line 209) | def export_data_for_maui():
FILE: keyphrase/dataset/keyphrase_test_dataset.py
class Document (line 30) | class Document(object):
method __init__ (line 31) | def __init__(self):
method __str__ (line 37) | def __str__(self):
class DataLoader (line 41) | class DataLoader(object):
method __init__ (line 42) | def __init__(self, **kwargs):
method get_docs (line 48) | def get_docs(self, return_dict=True):
method __call__ (line 73) | def __call__(self, idx2word, word2idx, type = 1):
method load_xml (line 106) | def load_xml(self, xmldir):
method load_text (line 134) | def load_text(self, textdir):
method load_keyphrase (line 164) | def load_keyphrase(self, keyphrasedir):
method load_testing_data_postag (line 181) | def load_testing_data_postag(self, word2idx):
method load_testing_data (line 215) | def load_testing_data(self, word2idx):
class INSPEC (line 250) | class INSPEC(DataLoader):
method __init__ (line 251) | def __init__(self, **kwargs):
class NUS (line 264) | class NUS(DataLoader):
method __init__ (line 265) | def __init__(self, **kwargs):
method export (line 275) | def export(self):
method get_docs (line 312) | def get_docs(self, only_abstract=True, return_dict=True):
class SemEval (line 397) | class SemEval(DataLoader):
method __init__ (line 398) | def __init__(self, **kwargs):
class KRAPIVIN (line 488) | class KRAPIVIN(DataLoader):
method __init__ (line 489) | def __init__(self, **kwargs):
method load_text (line 499) | def load_text(self, textdir):
class KDD (line 528) | class KDD(DataLoader):
method __init__ (line 529) | def __init__(self, **kwargs):
class WWW (line 536) | class WWW(DataLoader):
method __init__ (line 537) | def __init__(self, **kwargs):
class UMD (line 544) | class UMD(DataLoader):
method __init__ (line 545) | def __init__(self, **kwargs):
class DUC (line 552) | class DUC(DataLoader):
method __init__ (line 553) | def __init__(self, **kwargs):
method export_text_phrase (line 563) | def export_text_phrase(self):
class KP20k (line 621) | class KP20k(DataLoader):
method __init__ (line 622) | def __init__(self, **kwargs):
method get_docs (line 632) | def get_docs(self, return_dict=True):
class KP2k_NEW (line 661) | class KP2k_NEW(DataLoader):
method __init__ (line 665) | def __init__(self, **kwargs):
method get_docs (line 671) | def get_docs(self, return_dict=True):
class IRBooks (line 700) | class IRBooks(DataLoader):
method __init__ (line 701) | def __init__(self, **kwargs):
method get_docs (line 709) | def get_docs(self, return_dict=True):
class Quora (line 745) | class Quora(DataLoader):
method __init__ (line 746) | def __init__(self, **kwargs):
method get_docs (line 754) | def get_docs(self, return_dict=True):
function testing_data_loader (line 796) | def testing_data_loader(identifier, kwargs=None):
function load_additional_testing_data (line 805) | def load_additional_testing_data(testing_names, idx2word, word2idx, conf...
function check_data (line 848) | def check_data():
function add_padding (line 941) | def add_padding(data):
function split_into_multiple_and_padding (line 955) | def split_into_multiple_and_padding(data_s_o, data_t_o):
function get_postag_with_record (line 967) | def get_postag_with_record(records, pairs):
function get_postag_with_index (line 995) | def get_postag_with_index(sources, idx2word, word2idx):
function check_postag (line 1022) | def check_postag(config):
FILE: keyphrase/dataset/keyphrase_train_dataset.py
function build_dict (line 18) | def build_dict(wordfreq):
function dump_samples_to_json (line 41) | def dump_samples_to_json(records, file_path):
function load_data_and_dict (line 53) | def load_data_and_dict(training_dataset):
FILE: keyphrase/dataset/million-paper/preprocess.py
function load_file (line 17) | def load_file(input_path):
FILE: keyphrase/keyphrase_copynet.py
class LoggerWriter (line 49) | class LoggerWriter:
method __init__ (line 50) | def __init__(self, level):
method write (line 55) | def write(self, message):
method flush (line 61) | def flush(self):
function init_logging (line 68) | def init_logging(logfile):
function output_stream (line 86) | def output_stream(dataset, batch_size, size=1):
function prepare_batch (line 98) | def prepare_batch(batch, mask, fix_len=None):
function cc_martix (line 115) | def cc_martix(source, target):
function unk_filter (line 127) | def unk_filter(data):
function add_padding (line 144) | def add_padding(data):
function split_into_multiple_and_padding (line 160) | def split_into_multiple_and_padding(data_s_o, data_t_o):
function build_data (line 172) | def build_data(data):
FILE: keyphrase/keyphrase_utils.py
function evaluate_multiple (line 14) | def evaluate_multiple(config, test_set, inputs, outputs,
function export_keyphrase (line 557) | def export_keyphrase(predictions, text_dir, prediction_dir):
FILE: keyphrase/util/stanford-pos-tagger.py
class StanfordTagger (line 32) | class StanfordTagger(TaggerI):
method __init__ (line 46) | def __init__(self, model_filename, path_to_jar=None, encoding='utf8', ...
method _cmd (line 66) | def _cmd(self):
method tag (line 70) | def tag(self, tokens):
method tag_sents (line 75) | def tag_sents(self, sentences):
method parse_output (line 108) | def parse_output(self, text, sentences=None):
class StanfordNERTagger (line 119) | class StanfordNERTagger(StanfordTagger):
method __init__ (line 142) | def __init__(self, *args, **kwargs):
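Each row of the symbol index above follows the fixed pattern `kind name (line N) | signature`. A minimal sketch of parsing such rows (the helper name `parse_row` and the dict field names are illustrative, not part of the extraction format):

```python
import re

# Symbol-index row format: "<kind> <name> (line <N>) | <signature>"
# Rows that don't match (e.g. "FILE: ..." separators) yield None.
ROW = re.compile(r"^(function|method|class)\s+(\S+)\s+\(line (\d+)\)\s+\|\s+(.*)$")

def parse_row(row):
    m = ROW.match(row)
    if m is None:
        return None
    kind, name, line, signature = m.groups()
    return {"kind": kind, "name": name, "line": int(line), "signature": signature}

entry = parse_row("function unk_filter (line 127) | def unk_filter(data):")
print(entry["name"], entry["line"])  # unk_filter 127
```

Truncated signatures in the index (those ending in `...`) parse the same way; the ellipsis simply stays in the `signature` field.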
Condensed preview — 51 files, each showing path, character count, and a content snippet.
[
{
"path": ".gitignore",
"chars": 794,
"preview": "# added by memray\n/.idea/\n/Experiment/\n/dataset/\n/stanford-postagger/\n.DS_Store\n\n# Byte-compiled / optimized / DLL files"
},
{
"path": "LICENSE",
"chars": 1065,
"preview": "MIT License\n\nCopyright (c) 2016 Rui Meng\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\no"
},
{
"path": "README.md",
"chars": 1790,
"preview": "# seq2seq-keyphrase\n### Note: this repository has been deprecated. Please move to our latest code/data/model release for"
},
{
"path": "emolga/__init__.py",
"chars": 28,
"preview": "__author__ = 'yinpengcheng'\n"
},
{
"path": "emolga/basic/__init__.py",
"chars": 24,
"preview": "__author__ = 'jiataogu'\n"
},
{
"path": "emolga/basic/activations.py",
"chars": 1456,
"preview": "import theano.tensor as T\n\n\ndef softmax(x):\n return T.nnet.softmax(x.reshape((-1, x.shape[-1]))).reshape(x.shape)\n\n\nd"
},
{
"path": "emolga/basic/initializations.py",
"chars": 2300,
"preview": "import theano\nimport theano.tensor as T\nimport numpy as np\n\nfrom emolga.utils.theano_utils import sharedX, shared_zeros,"
},
{
"path": "emolga/basic/objectives.py",
"chars": 3130,
"preview": "from __future__ import absolute_import\nimport theano\nimport theano.tensor as T\nimport numpy as np\nfrom six.moves import "
},
{
"path": "emolga/basic/optimizers.py",
"chars": 11061,
"preview": "from __future__ import absolute_import\nimport theano\nimport sys\n\nfrom theano.sandbox.rng_mrg import MRG_RandomStreams\nim"
},
{
"path": "emolga/dataset/__init__.py",
"chars": 180,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\n\nimport os\n\n__author__ = \"Rui Meng\"\n__email_"
},
{
"path": "emolga/dataset/build_dataset.py",
"chars": 5230,
"preview": "import json\n\n__author__ = 'jiataogu'\nimport numpy as np\nimport numpy.random as rng\nimport cPickle as pickle\nimport pprin"
},
{
"path": "emolga/layers/__init__.py",
"chars": 28,
"preview": "__author__ = 'yinpengcheng'\n"
},
{
"path": "emolga/layers/attention.py",
"chars": 5305,
"preview": "__author__ = 'jiataogu'\nfrom .core import *\n\"\"\"\nAttention Model.\n <::: Two kinds of attention models ::::>\n -- Lin"
},
{
"path": "emolga/layers/core.py",
"chars": 8439,
"preview": "# -*- coding: utf-8 -*-\n\nfrom emolga.utils.theano_utils import *\nimport emolga.basic.initializations as initializations\n"
},
{
"path": "emolga/layers/embeddings.py",
"chars": 2855,
"preview": "# -*- coding: utf-8 -*-\n\nfrom .core import Layer\nfrom emolga.utils.theano_utils import *\nimport emolga.basic.initializat"
},
{
"path": "emolga/layers/gridlstm.py",
"chars": 33060,
"preview": "__author__ = 'jiataogu'\n\"\"\"\nThe file is the implementation of Grid-LSTM\nIn this stage we only support 2D LSTM with Pooli"
},
{
"path": "emolga/layers/ntm_minibatch.py",
"chars": 28501,
"preview": "__author__ = 'jiataogu'\nimport theano\nimport theano.tensor as T\n\nimport scipy.linalg as sl\nimport numpy as np\nfrom .core"
},
{
"path": "emolga/layers/recurrent.py",
"chars": 22899,
"preview": "# -*- coding: utf-8 -*-\nfrom abc import abstractmethod\nfrom .core import *\n\n\nclass Recurrent(MaskedLayer):\n \"\"\"\n "
},
{
"path": "emolga/models/__init__.py",
"chars": 24,
"preview": "__author__ = 'jiataogu'\n"
},
{
"path": "emolga/models/core.py",
"chars": 4730,
"preview": "import json\n\nimport numpy\n\nfrom keyphrase import config\n\n__author__ = 'jiataogu'\nimport theano\nimport logging\nimport dee"
},
{
"path": "emolga/models/covc_encdec.py",
"chars": 108092,
"preview": "__author__ = 'jiataogu, memray'\nimport theano\nimport logging\nimport copy\nimport emolga.basic.objectives as objectives\nim"
},
{
"path": "emolga/models/encdec.py",
"chars": 94600,
"preview": "import math\n\n__author__ = 'jiataogu, memray'\nimport theano\n\nimport logging\nimport copy\nimport emolga.basic.objectives as"
},
{
"path": "emolga/models/ntm_encdec.py",
"chars": 53418,
"preview": "__author__ = 'jiataogu'\n\nimport theano\ntheano.config.exception_verbosity = 'high'\n\nimport logging\nimport copy\n\nimport em"
},
{
"path": "emolga/models/pointers.py",
"chars": 36210,
"preview": "__author__ = 'jiataogu'\nimport theano\nimport logging\nimport copy\n\nfrom emolga.layers.recurrent import *\nfrom emolga.laye"
},
{
"path": "emolga/models/variational.py",
"chars": 63827,
"preview": "__author__ = 'jiataogu'\nimport theano\n# theano.config.exception_verbosity = 'high'\nimport logging\n\nimport emolga.basic.o"
},
{
"path": "emolga/utils/__init__.py",
"chars": 28,
"preview": "__author__ = 'yinpengcheng'\n"
},
{
"path": "emolga/utils/generic_utils.py",
"chars": 8427,
"preview": "from __future__ import absolute_import\nfrom matplotlib.ticker import FuncFormatter\nimport numpy as np\nimport time\nimport"
},
{
"path": "emolga/utils/io_utils.py",
"chars": 2110,
"preview": "from __future__ import absolute_import\nimport h5py\nimport numpy as np\nimport cPickle\nfrom collections import defaultdict"
},
{
"path": "emolga/utils/np_utils.py",
"chars": 1395,
"preview": "from __future__ import absolute_import\nimport numpy as np\nimport scipy as sp\nfrom six.moves import range\nfrom six.moves "
},
{
"path": "emolga/utils/test_utils.py",
"chars": 1091,
"preview": "import numpy as np\n\n\ndef get_test_data(nb_train=1000, nb_test=500, input_shape=(10,), output_shape=(2,),\n "
},
{
"path": "emolga/utils/theano_utils.py",
"chars": 5533,
"preview": "from __future__ import absolute_import\n\nfrom theano import gof\nfrom theano.tensor import basic as tensor\nimport numpy as"
},
{
"path": "keyphrase/__init__.py",
"chars": 180,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\n\nimport os\n\n__author__ = \"Rui Meng\"\n__email_"
},
{
"path": "keyphrase/baseline/evaluate.py",
"chars": 16209,
"preview": "import math\nimport logging\nimport string\n\nimport scipy\nfrom nltk.stem.porter import *\nimport numpy as np\n\nimport os\nimpo"
},
{
"path": "keyphrase/baseline/export_dataset.py",
"chars": 12902,
"preview": "import os\n\nimport numpy\nimport shutil\n\nimport keyphrase.config\nfrom emolga.dataset.build_dataset import deserialize_from"
},
{
"path": "keyphrase/config.py",
"chars": 22930,
"preview": "import time\n\nimport os\nimport os.path as path\n\n\ndef setup_keyphrase_stable():\n config = dict()\n '''\n Meta infor"
},
{
"path": "keyphrase/dataset/__init__.py",
"chars": 180,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\n\nimport os\n\n__author__ = \"Rui Meng\"\n__email_"
},
{
"path": "keyphrase/dataset/dataset_utils.py",
"chars": 8456,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\n\nimport os\nimport nltk\nimport numpy\nimport n"
},
{
"path": "keyphrase/dataset/inspec/__init__.py",
"chars": 180,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\n\nimport os\n\n__author__ = \"Rui Meng\"\n__email_"
},
{
"path": "keyphrase/dataset/inspec/inspec_export_json.py",
"chars": 5370,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\n\nimport os\nimport re\nimport json\n\nimport nlt"
},
{
"path": "keyphrase/dataset/inspec/key_convert_maui.py",
"chars": 1013,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nClean the dirty format of keyword files\nConvert to one keyword per lin"
},
{
"path": "keyphrase/dataset/json_count.py",
"chars": 399,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\njust check how many data instances in the json\n\"\"\"\n\nimport os\nimport j"
},
{
"path": "keyphrase/dataset/keyphrase_dataset.py",
"chars": 12333,
"preview": "# coding=utf-8\nimport json\nimport sys\nimport time\n\nimport nltk\nimport numpy\nimport numpy as np\n\nimport keyphrase.config "
},
{
"path": "keyphrase/dataset/keyphrase_test_dataset.py",
"chars": 44569,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\nfrom nltk.internals import find_jars_within_"
},
{
"path": "keyphrase/dataset/keyphrase_train_dataset.py",
"chars": 9286,
"preview": "# coding=utf-8\nimport json\nimport sys\nimport time\n\nimport nltk\nimport numpy\nimport numpy as np\nimport re\n\nfrom keyphrase"
},
{
"path": "keyphrase/dataset/million-paper/clean_export_json.py",
"chars": 1670,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nCheck how many non-duplicate and valid (some doesn't contain title/key"
},
{
"path": "keyphrase/dataset/million-paper/preprocess.py",
"chars": 3319,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n'''\nLoad the paper metadata from json, do preprocess (cleanup, tokenizatio"
},
{
"path": "keyphrase/keyphrase_copynet.py",
"chars": 31597,
"preview": "import os\n# set environment variables in advance of importing theano as well as any possible module\n# os.environ['PATH']"
},
{
"path": "keyphrase/keyphrase_utils.py",
"chars": 25321,
"preview": "import math\n\nimport logging\nfrom nltk.stem.porter import *\nimport numpy as np\nimport os\nimport copy\n\nimport keyphrase.da"
},
{
"path": "keyphrase/util/__init__.py",
"chars": 180,
"preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\"\"\"\nPython File Template \n\"\"\"\n\nimport os\n\n__author__ = \"Rui Meng\"\n__email_"
},
{
"path": "keyphrase/util/gpu-test.py",
"chars": 823,
"preview": "import numpy\nimport time\nimport os\nos.environ['PATH'] = \"/usr/local/cuda-8.0/bin:/usr/local/cuda-8.0/lib64:\" + os.enviro"
},
{
"path": "keyphrase/util/stanford-pos-tagger.py",
"chars": 5093,
"preview": "# -*- coding: utf-8 -*-\n# Natural Language Toolkit: Interface to the Stanford Part-of-speech and Named-Entity Taggers\n#\n"
}
]
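The condensed preview above is a JSON array of objects with `path`, `chars`, and `preview` fields. A minimal sketch of consuming it — the two inline records mirror real entries from the array and stand in for the full downloaded file:

```python
import json

# Tiny inline sample in the preview's record format:
# [{"path": ..., "chars": ..., "preview": ...}, ...]
sample = '''
[
  {"path": "emolga/basic/activations.py", "chars": 1456, "preview": "import theano.tensor as T"},
  {"path": "keyphrase/keyphrase_copynet.py", "chars": 31597, "preview": "import os"}
]
'''

records = json.loads(sample)

# Total size of the sampled files, in characters.
total_chars = sum(r["chars"] for r in records)

# Largest file by character count.
largest = max(records, key=lambda r: r["chars"])

print(total_chars)      # 33053
print(largest["path"])  # keyphrase/keyphrase_copynet.py
```

Replacing `sample` with the contents of the downloaded `.json` file gives the same summary over all 51 records.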
About this extraction
This page contains the full source code of the memray/seq2seq-keyphrase GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 51 files (693.0 KB, approximately 170.8k tokens) and a symbol index with 584 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, a GitHub repo-to-text converter by Nikandr Surkov.